Skip to content

join

"The join utility performs an 'equality join' on the specified files and writes the result to the standard output." - man join

join is a unix tool that is similar to a sql join, combining two files by joining on a column. The macOS man page has more example than the GNU man page, but as usual the syntaxes aren't exactly the same.

Examples

The following example use the following dates.txt file as file 1. Notice it is missing data between December 25 and December 31 (inclusive).

2022-12-21 9
2022-12-22 2
2022-12-23 1
2022-12-24 5
2023-01-01 6
2023-01-02 6
2023-01-03 2

We also use a loop that produces dates that contain the dates missing from dates.txt but do not go back as far in time:

$ for OFFSET in {10..0} ; do date -d "-$OFFSET days" "+%F offset=$OFFSET" ; done
2022-12-24 offset=10
2022-12-25 offset=9
2022-12-26 offset=8
2022-12-27 offset=7
2022-12-28 offset=6
2022-12-29 offset=5
2022-12-30 offset=4
2022-12-31 offset=3
2023-01-01 offset=2
2023-01-02 offset=1
2023-01-03 offset=0

Show only lines with common columns

$ join dates.txt <(for OFFSET in {10..0} ; do date -d "-$OFFSET days" "+%F $OFFSET" ; done)
2022-12-24 5 10
2023-01-01 6 2
2023-01-02 6 1
2023-01-03 2 0

Show all lines, as long as data is present in file 1

$ join -a 1 dates.txt <(for OFFSET in {10..0} ; do date -d "-$OFFSET days" "+%F offset=$OFFSET" ; done)
2022-12-21 9
2022-12-22 2
2022-12-23 1
2022-12-24 5 offset=10
2023-01-01 6 offset=2
2023-01-02 6 offset=1
2023-01-03 2 offset=0

Show all lines, as long as data is present in file 2

$ join -a 2 dates.txt <(for OFFSET in {10..0} ; do date -d "-$OFFSET days" "+%F offset=$OFFSET" ; done)
2022-12-24 5 offset=10
2022-12-25 offset=9
2022-12-26 offset=8
2022-12-27 offset=7
2022-12-28 offset=6
2022-12-29 offset=5
2022-12-30 offset=4
2022-12-31 offset=3
2023-01-01 6 offset=2
2023-01-02 6 offset=1
2023-01-03 2 offset=0

Only show certain columns in the output

We can specify which columns we want to see in the output, which includes the ability to join on a column that is not shown the output:

$ join -o 1.2,2.2 dates.txt <(for OFFSET in {10..0} ; do gdate -d "-$OFFSET days" "+%F offset=$OFFSET" ; done)
5 offset=10
6 offset=2
6 offset=1
2 offset=0

Fill in missing data an arbitrary string

We use -e Null to indicate we want to fill in missing values with Null. In order for -e to work right, e have to specify the output columns with -o. We also have to specify -a 2 to indicate we want to see all lines from file 2. Because we are showing all lines in file 2, we use -o 2.1 instead of -o 1.1 so that the date column is not populated with Null values.

$ join -e Null -o 2.1,1.2,2.2 -a 2 dates.txt <(for OFFSET in {10..0} ; do gdate -d "-$OFFSET days" "+%F offset=$OFFSET" ; done)
2022-12-24 5 offset=10
2022-12-25 Null offset=9
2022-12-26 Null offset=8
2022-12-27 Null offset=7
2022-12-28 Null offset=6
2022-12-29 Null offset=5
2022-12-30 Null offset=4
2022-12-31 Null offset=3
2023-01-01 6 offset=2
2023-01-02 6 offset=1
2023-01-03 2 offset=0

A practical example of the above is making sure you have an entire sequence filled in, for example when graphing a set where no entry is created for days that have no data points. This ensures we're seeing a complete set, not just the days that have data:

$ join -e 0 -o 2.1,1.2 -a 2 dates.txt <(for OFFSET in {13..0} ; do gdate -d "-$OFFSET days" "+%F offset=$OFFSET" ; done) | termgraph --format='{:.0f}'

2022-12-21: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9
2022-12-22: ▇▇▇▇▇▇▇▇▇▇▇ 2
2022-12-23: ▇▇▇▇▇ 1
2022-12-24: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5
2022-12-25:  0
2022-12-26:  0
2022-12-27:  0
2022-12-28:  0
2022-12-29:  0
2022-12-30:  0
2022-12-31:  0
2023-01-01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
2023-01-02: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
2023-01-03: ▇▇▇▇▇▇▇▇▇▇▇ 2