Try od -xcv and/or cat -e on your file to check for non-printable characters.
If you're using a Miller version older than 5.0.0, which is when the line-ending-autodetect feature was introduced (try mlr --version on your system to find out), please see http://johnkerl.org/miller-releases/miller-4.5.0/doc/index.html.
Check the field separators of the data, e.g. with the command-line head program. Example: for CSV, Miller's default field separator is comma; if your data is tab-delimited, e.g. aTABbTABc, then Miller won't find three fields named a, b, and c but rather just one named aTABbTABc. Solution in this case: mlr --fs tab {remaining arguments ...}.
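The tab-delimited case can be sketched as follows. The file name demo_tabs.csv is hypothetical, and the example assumes mlr is on your PATH; it uses --ifs tab (input side only) with JSON output so the field names are easy to see.

```shell
# Hypothetical sample file: three tab-separated columns.
printf 'a\tb\tc\n1\t2\t3\n' > demo_tabs.csv

# Peek at the first lines; od -c makes the tabs visible as \t.
head -n 2 demo_tabs.csv | od -c

# Default comma separator: Miller sees a single field per line.
mlr --icsv --ojson cat demo_tabs.csv

# Tab as the input field separator: Miller sees fields a, b, and c.
mlr --icsv --ifs tab --ojson cat demo_tabs.csv
```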
Also try od -xcv and/or cat -e on your file to check for non-printable characters.
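As a small illustration of what these tools show, here is a sketch using a hypothetical file (demo_crlf.csv) that contains carriage returns and a stray tab:

```shell
# Hypothetical sample: CSV with CR/LF line endings and a tab inside one value.
printf 'a,b,c\r\n1,\t2,3\r\n' > demo_crlf.csv

# od -c prints each byte; carriage returns show up as \r and tabs as \t.
od -c demo_crlf.csv

# cat -e marks each line end with $; a carriage return appears as ^M before it.
cat -e demo_crlf.csv
```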
Use the file command to see if there are CR/LF terminators (in this case, there are not):
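A rough sketch of that check, using two hypothetical files for contrast. The exact wording printed by file varies between versions, so no expected output is shown.

```shell
# Hypothetical samples: one with LF line endings, one with CR/LF line endings.
printf 'a,b,c\n1,2,3\n' > demo_lf.csv
printf 'a,b,c\r\n1,2,3\r\n' > demo_dos.csv

# For the second file, file typically mentions CRLF line terminators.
file demo_lf.csv demo_dos.csv
```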
Look at the file to find names of fields:
<pre class="pre-highlight-in-pair">
<b>cat data/colours.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
</pre>

Extract a few fields:

<pre class="pre-highlight-non-pair">
<b>mlr --csv cut -f KEY,PL,TO data/colours.csv</b>
</pre>

Use XTAB output format to get a sharper picture of where records/fields are being split:

<pre class="pre-highlight-in-pair">
<b>mlr --icsv --oxtab cat data/colours.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz

KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
</pre>

Using XTAB output format makes it clearer that KEY;DE;...;TR is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (--fs semicolon):
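One way this step might look is sketched below. It recreates data/colours.csv from the listing above and assumes mlr is on your PATH. It uses --ifs semicolon rather than --fs semicolon so that the separator change applies to the CSV input only (an assumption made here to keep the XTAB output at its defaults).

```shell
# Recreate data/colours.csv as listed on this page.
mkdir -p data
cat > data/colours.csv <<'EOF'
KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
EOF

# Semicolon as the input field separator; each field now gets its own XTAB line.
mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
```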
Using the new field-separator, retry the cut:
<pre class="pre-highlight-in-pair">
<b>mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
KEY;PL;TO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru
</pre>

Miller records are ordered lists of key-value pairs. For NIDX format, DKVP format when keys are missing, or CSV/CSV-lite format with --implicit-csv-header, Miller will sequentially assign keys of the form 1, 2, etc. But these are not integer array indices: they're just field names taken from the initial field ordering in the input data, when it was originally read from the input file(s).
Example: columns rate,shape,flag were requested but they appear here in the order shape,flag,rate:
The issue is that Miller's cut, by default, outputs cut fields in the order they appear in the input data. This design decision was made intentionally to parallel the Unix/Linux system cut command, which has the same semantics.
The solution is to use the -o option:
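A sketch contrasting the two behaviours, assuming mlr is on your PATH. The file demo_example.csv is a hypothetical stand-in for example.csv, built from three rows that appear in the filter output later on this page.

```shell
# Hypothetical stand-in for example.csv.
cat > demo_example.csv <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
EOF

# Default: fields come out in input order (shape,flag,rate).
mlr --csv cut -f rate,shape,flag demo_example.csv

# With -o: fields come out in the order requested (rate,shape,flag).
mlr --csv cut -o -f rate,shape,flag demo_example.csv
```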
The awk-like built-in variable NR is incremented for each input record:
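For instance, a minimal sketch assuming mlr is on your PATH. The file demo_nr.csv is a hypothetical stand-in for example.csv, and its red row is illustrative.

```shell
# Hypothetical stand-in for example.csv.
cat > demo_nr.csv <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
yellow,circle,true,8,73,63.9785,4.2370
EOF

# Append a field nr holding the record number of each input record.
mlr --csv put '$nr = NR' demo_nr.csv
```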
However, this is the record number within the original input stream -- not after any filtering you may have done:
<pre class="pre-highlight-in-pair">
<b>mlr --csv filter '$color == "yellow"' then put '$nr = NR' example.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate,nr
yellow,triangle,true,1,11,43.6498,9.8870,1
yellow,circle,true,8,73,63.9785,4.2370,8
yellow,circle,true,9,87,63.5058,8.3350,9
</pre>

There are two good options here. One is to use the cat verb with -n:
The other is to keep your own counter within the put DSL:
The difference is a matter of taste (although mlr cat -n puts the counter first).
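Both options can be sketched side by side, assuming mlr is on your PATH. The file demo_count.csv is a hypothetical stand-in for example.csv (its red row is illustrative), and the out-of-stream variable name @nr is a choice made for this sketch.

```shell
# Hypothetical stand-in for example.csv.
cat > demo_count.csv <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
EOF

# Option 1: cat -n numbers the records that survive the filter; n comes first.
mlr --csv filter '$color == "yellow"' then cat -n demo_count.csv

# Option 2: an out-of-stream variable counts records inside put; nr comes last.
mlr --csv filter '$color == "yellow"' then put '
  begin { @nr = 0 }
  @nr += 1;
  $nr = @nr
' demo_count.csv
```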
If your data has records appearing multiple times, you can use uniq to show and/or count the unique records.
If you want to look at partial uniqueness -- for example, show only the first record for each unique combination of the account_id and account_status fields -- you might use mlr head -n 1 -g account_id,account_status. Please also see head.
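A sketch of both approaches, assuming mlr is on your PATH. The file demo_accounts.csv and its contents are hypothetical; only the account_id and account_status field names come from the text above.

```shell
# Hypothetical sample data with one fully duplicated record.
cat > demo_accounts.csv <<'EOF'
account_id,account_status,amount
1001,open,5
1001,open,5
1001,closed,7
1002,open,9
EOF

# Full uniqueness: distinct records, each with a count of its occurrences.
mlr --csv uniq -a -c demo_accounts.csv

# Partial uniqueness: first record for each account_id/account_status pair.
mlr --csv head -n 1 -g account_id,account_status demo_accounts.csv
```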
Suppose you have a method (in whatever language) which is printing things of the form
<pre class="pre-non-highlight-non-pair">
outer=1
outer=2
outer=3
</pre>

and then calls another method which prints things of the form

<pre class="pre-non-highlight-non-pair">
middle=10
middle=11
middle=12
middle=20
middle=21
middle=30
middle=31
</pre>

and then, perhaps, that second method calls a third method which prints things of the form

<pre class="pre-non-highlight-non-pair">
inner1=100,inner2=101
inner1=120,inner2=121
inner1=200,inner2=201
inner1=210,inner2=211
inner1=300,inner2=301
inner1=312
inner1=313,inner2=314
</pre>

with the result that your program's output is

<pre class="pre-non-highlight-non-pair">
outer=1
middle=10
inner1=100,inner2=101
middle=11
middle=12
inner1=120,inner2=121
outer=2
middle=20
inner1=200,inner2=201
middle=21
inner1=210,inner2=211
outer=3
middle=30
inner1=300,inner2=301
middle=31
inner1=312
inner1=313,inner2=314
</pre>

The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to have the context of which outers they belong to, you can modify your software to pass all those through your methods. Alternatively, don't refactor your code just to handle some ad-hoc log-data formatting -- instead, use the following to rectangularize the data. The idea is to use an out-of-stream variable to accumulate fields across records. Clear that variable when you see an outer ID; accumulate fields; emit output when you see the inner IDs.
<pre class="pre-highlight-in-pair">
<b>mlr --from data/rect.txt put -q '</b>
<b>  is_present($outer) {</b>
<b>    unset @r</b>
<b>  }</b>
<b>  for (k, v in $*) {</b>
<b>    @r[k] = v</b>
<b>  }</b>
<b>  is_present($inner1) {</b>
<b>    emit @r</b>
<b>  }'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
outer=1,middle=10,inner1=100,inner2=101
outer=1,middle=12,inner1=120,inner2=121
outer=2,middle=20,inner1=200,inner2=201
outer=2,middle=21,inner1=210,inner2=211
outer=3,middle=30,inner1=300,inner2=301
outer=3,middle=31,inner1=312,inner2=301
outer=3,middle=31,inner1=313,inner2=314
</pre>

Note that the accumulator is cleared only on outer lines, so the record with inner1=312, which has no inner2 of its own, carries over inner2=301 from the preceding record. Likewise, middle=11 never appears in the output because no inner line follows it.

See also the record-heterogeneity page; see in particular the regularize verb for a way to do this with much less keystroking.