phadsconsult.com - GrazzMean

info server
Uname: Linux web3.us.cloudlogin.co 5.10.226-xeon-hst #2 SMP Fri Sep 13 12:28:44 UTC 2024 x86_64
Software: Apache
PHP version: 8.1.31 [ PHP INFO ] PHP os: Linux
Server Ip: 162.210.96.117
Your Ip: 18.218.67.239
User: edustar (269686) | Group: tty (888)
Safe Mode: OFF
Disable Function:
NONE
upload mass deface mass delete console
name : dbjoin
#!/usr/bin/perl -w

#
# dbjoin.pm
# Copyright (C) 1991-2022 by John Heidemann <johnh@isi.edu>
#
# This program is distributed under terms of the GNU general
# public license, version 2.  See the file COPYING
# in $dblibdir for details.
#


=head1 NAME

dbjoin - join two tables on common columns

=head1 SYNOPSIS

    dbjoin [-Sid] --input table1.fsdb --input table2.fsdb [-nNrR] column [column...]

OR

    cat table1.fsdb  | dbjoin [-Sid] --input table2.fsdb [-nNrR] column [column...]

=head1 DESCRIPTION

Does a natural, inner join on TABLE1 and TABLE2 
the specified columns.  With the C<-a> option, or 
with C<-t outer> it will do a natural, full outer join.

(Database review:
inner joints output records only when there are matches in both tables
and will omit records that do not match.
Outer joins output all records from both tables, 
filling with the empty value as needed.
Right (left) outer joins keep all elements of the right (left)
table, even those that don't match in the other table.)

By default for non-hash joins, data will be sorted lexically,
but the usual sorting options can be mixed with the column
specification.

Because two tables are required,
input is typically in files.
Standard input is accessible by the file "-".

If only one input is given, the first (left) input
is taken from stdin.

=head1 RESOURCE REQUIREMENTS AND PERFORMANCE

Joins can be expensive. 
Most databases have a query optimizer that 
knows something about the data and so can select
algorithms for efficent operation,
in Fsdb, I<you> are that optimizer.

For I<non-hash joins>:
If data is already sorted, dbjoin will run more efficiently
by telling dbjoin the data is sorted with the C<-S>.

The resource requirements L<dbjoin> vary.
If input data is sorted and C<-S> is given,
then memory consumption is bounded by the 
the sum of the largest number of records in either dataset
with the same value in the join column,
and there is no disk consumption.
If data is not sorted, then L<dbjoin> requires
disk storage the size of both input files.

One can minimize memory consumption by making sure
each record of table1 matches relatively few records in table2.
Typically this means that table2 should be the smaller.
For example, given two files: people.fsdb (schema: name iso_country_code)
and countries.fsdb (schema: iso_country_code full_country_name),
then

    dbjoin -i people.fsdb -i countries.fsdb iso_country_code

will require less memory than

    dbjoin -i countries.fsdb -i people.fsdb iso_country_code

if there are many people per country (as one would expect).
If warning "lots of matching rows accumulating in memory" appears,
this is the cause and try swapping join order.

For I<hash joins>
(that is, with C<-m righthash> or C<-m lefthash>):
all of the right table (the second input) or the left (the first)
is loaded into memory (and "hashed").
The other table need not be sorted.
Runtime is O(n), but memory is O(size of hashed table).


=head1 OPTIONS

=over 4

=item B<-a> or B<--all>

Perform a I<full outer join>,
include non-matches (each record which doesn't match at
all will appear once).
Default is an I<inner join>.

=item B<-t TYPE> or B<--type TYPE>

Explicitly specify the join type.
TYPE must be inner, outer, left (outer), right (outer).
(Recall tha inner join requires data on both sides,
outer joins keep all records from both sides for outer,
or all of the first or second input for left and right outer joins.)
Default: inner.

=item B<-m METHOD> or B<--method METHOD>

Select join method (algorithm).
Choices are merge, righthash, and lefthash.
With hashs, that side is cached.
(Warning: the lefthas method is about caching the left input,
and it does I<not> imply a left-type join,
which keeps even unmatched left keys.)
Default: merge.

=item B<-S> or B<--pre-sorted>

assume (and verify) data is already sorted

=item B<-e E> or B<--empty E>

give value E as the value for empty (null) records

=item B<-T TmpDir>

where to put tmp files.
Also uses environment variable TMPDIR, if -T is 
not specified.
Default is /tmp.

=back

Sort specification options (can be interspersed with column names):

=over 4

=item B<-r> or B<--descending>

sort in reverse order (high to low)

=item B<-R> or B<--ascending>

sort in normal order (low to high)

=item B<-n> or B<--numeric>

sort numerically

=item B<-N> or B<--lexical>

sort lexicographically

=back


=for comment
begin_standard_fsdb_options

This module also supports the standard fsdb options:

=over 4

=item B<-d>

Enable debugging output.

=item B<-i> or B<--input> InputSource

Read from InputSource, typically a file name, or C<-> for standard input,
or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects.

=item B<-o> or B<--output> OutputDestination

Write to OutputDestination, typically a file name, or C<-> for standard output,
or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects.

=item B<--autorun> or B<--noautorun>

By default, programs process automatically,
but Fsdb::Filter objects in Perl do not run until you invoke
the run() method.
The C<--(no)autorun> option controls that behavior within Perl.

=item B<--help>

Show help.

=item B<--man>

Show full manual.

=back

=for comment
end_standard_fsdb_options


=head1 SAMPLE USAGE

=head2 Input:

    #fsdb sid cid
    1 10
    2 11
    1 12
    2 12

And in the file F<DATA/classes>:

    #fsdb cid cname
    10 pascal
    11 numanal
    12 os

=head2 Command:

    cat DATA/reg.fsdb | dbsort -n cid | dbjoin -i - -i DATA/classes -n cid

=head2 Output:

    #fsdb      cid     sid     cname
    10      1       pascal
    11      2       numanal
    12      1       os
    12      2       os
    # - COMMENTS:
    #  | /home/johnh/BIN/DB/dbsort -n cid
    # DATA/classes COMMENTS:
    # joined comments:
    #  | /home/johnh/BIN/DB/dbjoin - DATA/classes cid

=head1 SEE ALSO

L<Fsdb>.


=cut


# WARNING: This code is derived from dbjoin.pm; that is the master copy.

use Fsdb::Filter::dbjoin;
my $f = new Fsdb::Filter::dbjoin(@ARGV);
$f->setup_run_finish;  # or could just --autorun
exit 0;


=head1 AUTHOR and COPYRIGHT

Copyright (C) 1991-2022 by John Heidemann <johnh@isi.edu>

This program is distributed under terms of the GNU general
public license, version 2.  See the file COPYING
with the distribution for details.

=cut

1;
GrazzMean Shell