docs/src/questions-about-joins.md
This section describes behavior before Miller 5.1.0. As of 5.1.0, -u is the default.
For example, the right file here has nine records, and the left file should add in the hostname column -- so the join output should also have 9 records:
The issue is that Miller's join, by default (before 5.1.0), took input sorted (lexically ascending) by the sort keys on both the left and right files. This design decision was made intentionally to parallel the Unix/Linux system join command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that is requires sorted input.
The solution (besides pre-sorting the input files on the join keys) is to simply use mlr join -u (which is now the default). This loads the left file entirely into memory (while the right file is still streamed one line at a time) and does all possible joins without requiring sorted input:
<pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> ipaddr hostname timestamp bytes 10.3.1.27 zenith.west.our.org 1448762579 4568 10.3.1.18 nadir.east.our.org 1448762578 8729 10.4.5.94 apoapsis.east.our.org 1448762579 17445 10.3.1.27 zenith.west.our.org 1448762589 12 10.3.1.18 nadir.east.our.org 1448762588 44558 10.4.5.94 apoapsis.east.our.org 1448762589 8899 10.3.1.27 zenith.west.our.org 1448762599 0 10.3.1.18 nadir.east.our.org 1448762598 73425 10.4.5.94 apoapsis.east.our.org 1448762599 12200 </pre>General advice is to make sure the left-file is relatively small, e.g. containing name-to-number mappings, while saving large amounts of data for the right file.
Suppose you have the following two data files:
<pre class="pre-non-highlight-non-pair"> id,code 3,0000ff 2,00ff00 4,ff0000 </pre> <pre class="pre-non-highlight-non-pair"> id,color 4,red 2,green </pre>Joining on color the results are as expected:
<pre class="pre-highlight-in-pair"> <b>mlr --csv join -j id -f data/color-codes.csv data/color-names.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id,code,color 4,ff0000,red 2,00ff00,green </pre>However, if we ask for left-unpaireds, since there's no color column, we get a row not having the same column names as the other:
To fix this, we can use unsparsify:
<pre class="pre-highlight-in-pair"> <b>mlr --csv join --ul -j id -f data/color-codes.csv \</b> <b> then unsparsify --fill-with "" \</b> <b> data/color-names.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id,code,color 4,ff0000,red 2,00ff00,green 3,0000ff, </pre>Thanks to @aborruso for the tip!
See also the record-heterogeneity page.
Suppose we have the following data:
<pre class="pre-highlight-in-pair"> <b>cat multi-join/input.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id,task 10,chop 20,puree 20,wash 30,fold 10,bake 20,mix 10,knead 30,clean </pre>And we want to augment the id column with lookups from the following data files:
We can run the input file through multiple join commands in a then-chain: