Verbs are the building blocks of how you can use Miller to process your data. When you type
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint sort -n quantity then head -n 4 example.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape    flag k index quantity rate
red    circle   true 3 16    13.8103  2.9010
yellow triangle true 1 11    43.6498  9.8870
yellow circle   true 9 87    63.5058  8.3350
yellow circle   true 8 73    63.9785  4.2370
</pre>

the sort and head bits are verbs. See the Miller command
structure page for context.
At the command line, you can use mlr -l and mlr -L for information much
like what's on this page.
Whereas the Unix toolkit is made of the separate executables cat, tail, cut,
sort, etc., Miller has subcommands, or verbs, such as mlr cat, mlr tail, mlr cut, and
mlr sort, invoked as follows:
These fall into categories as follows:

* Analogs of their Unix-toolkit namesakes, discussed below as well as in Unix-toolkit Context: cat, cut, grep, head, join, sort, tac, tail, top, uniq.
* awk-like functionality: filter, put, sec2gmt, sec2gmtdate, step, tee.
* Statistically oriented: bar, bootstrap, decimate, histogram, least-frequent, most-frequent, sample, shuffle, stats1, stats2.
* Particularly oriented toward Record Heterogeneity, although all Miller commands can handle heterogeneous records: group-by, group-like, having-fields.
* Drawing from other sources (see also How Original Is Miller?): check, count-distinct, label, merge-fields, nest, nothing, regularize, rename, reorder, reshape, seqgen. For example, count-distinct is SQL-ish, and rename can be done by sed (which does it faster: see Performance).

Map list of values to alternating key/value pairs.
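
The pairing rule, including the odd-length case that appears in this verb's examples, can be sketched in Python. This is only an illustration, not Miller's implementation: the function names `altkv` and `to_dkvp` are hypothetical, and the rule for keying a leftover value by its 1-up pair number is an assumption inferred from the `4=g` example output.

```python
def altkv(values):
    """Pair up a list of values as alternating keys and values."""
    pairs = {}
    # Walk the list two items at a time: even positions are keys,
    # odd positions are the corresponding values.
    for i in range(0, len(values) - 1, 2):
        pairs[values[i]] = values[i + 1]
    # Assumption: a leftover value is keyed by its 1-up pair number,
    # which reproduces the 4=g seen in the example output.
    if len(values) % 2 == 1:
        pairs[str(len(values) // 2 + 1)] = values[-1]
    return pairs


def to_dkvp(pairs):
    """Format a dict as a comma-separated key=value line."""
    return ",".join(f"{k}={v}" for k, v in pairs.items())


print(to_dkvp(altkv("a,b,c,d,e,f".split(","))))    # a=b,c=d,e=f
print(to_dkvp(altkv("a,b,c,d,e,f,g".split(","))))  # a=b,c=d,e=f,4=g
```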
<pre class="pre-highlight-in-pair">
<b>mlr altkv -h</b>
</pre>
<pre class="pre-non-highlight-in-pair">
Usage: mlr altkv [options]
Given fields with values of the form a,b,c,d,e,f emits a=b,c=d,e=f pairs.
Options:
-h|--help Show this message.
</pre>

<pre class="pre-highlight-in-pair">
<b>echo 'a,b,c,d,e,f' | mlr altkv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a=b,c=d,e=f
</pre>

<pre class="pre-highlight-in-pair">
<b>echo 'a,b,c,d,e,f,g' | mlr altkv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a=b,c=d,e=f,4=g
</pre>

Cheesy bar-charting.
<pre class="pre-highlight-in-pair"> <b>mlr bar -h</b> </pre> <pre class="pre-non-highlight-in-pair"> Usage: mlr bar [options] Replaces a numeric field with a number of asterisks, allowing for cheesy bar plots. These align best with --opprint or --oxtab output format. Options: -f {a,b,c} Field names to convert to bars. --lo {lo} Lower-limit value for min-width bar: default '0.000000'. --hi {hi} Upper-limit value for max-width bar: default '100.000000'. -w {n} Bar-field width: default '40'. --auto Automatically computes limits, ignoring --lo and --hi. Holds all records in memory before producing any output. -c {character} Fill character: default '*'. -x {character} Out-of-bounds character: default '#'. -b {character} Blank character: default '.'. Nominally the fill, out-of-bounds, and blank characters will be strings of length 1. However you can make them all longer if you so desire. -h|--help Show this message. </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint cat data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y pan pan 1 0.346791 0.726802 eks pan 2 0.758679 0.522151 wye wye 3 0.204603 0.338318 eks wye 4 0.381399 0.134188 wye pan 5 0.573288 0.863624 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint bar --lo 0 --hi 1 -f x,y data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y pan pan 1 *************........................... *****************************........... eks pan 2 ******************************.......... ********************.................... wye wye 3 ********................................ *************........................... eks wye 4 ***************......................... *****................................... wye pan 5 **********************.................. **********************************...... 
</pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint bar --lo 0.4 --hi 0.6 -f x,y data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y pan pan 1 #....................................... ***************************************# eks pan 2 ***************************************# ************************................ wye wye 3 #....................................... #....................................... eks wye 4 #....................................... #....................................... wye pan 5 **********************************...... ***************************************# </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint bar --auto -f x,y -w 20 data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y pan pan 1 [0.204603]*****...............[0.758679] [0.134188]****************....[0.863624] eks pan 2 [0.204603]*******************#[0.758679] [0.134188]**********..........[0.863624] wye wye 3 [0.204603]#...................[0.758679] [0.134188]*****...............[0.863624] eks wye 4 [0.204603]******..............[0.758679] [0.134188]#...................[0.863624] wye pan 5 [0.204603]*************.......[0.758679] [0.134188]*******************#[0.863624] </pre>The canonical use for bootstrap sampling is to put error bars on statistical quantities, such as mean. For example:
<!--- hard-coded, not live-code, since random sampling would generate different data on each doc run which would needlessly complicate git diff -->

<pre class="pre-highlight-in-pair">
<b>mlr --c2p stats1 -a mean,count -f u -g color data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  u_mean              u_count
yellow 0.4971291160651098  1413
red    0.49255964641241273 4641
purple 0.49400496322241666 1142
green  0.5048610595130744  1109
blue   0.5177171537414964  1470
orange 0.49053241584158375 303
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --c2p bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  u_mean              u_count
red    0.49183858109559747 4655
yellow 0.487271566995769   1418
green  0.5018994641860465  1075
orange 0.5005396620689654  290
blue   0.5309761257817928  1439
purple 0.4917481873438798  1201
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --c2p bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  u_mean              u_count
yellow 0.4809714157857651  1419
blue   0.5057790647530039  1498
red    0.49114305508382283 4593
purple 0.49652395202020194 1188
green  0.5011425433212993  1108
orange 0.48935696323529426 272
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --c2p bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  u_mean              u_count
red    0.49934473217726466 4671
purple 0.4934976176735793  1109
blue   0.5097866573146287  1497
yellow 0.4987188126740959  1436
orange 0.4802164827586204  290
green  0.5129018241860459  1075
</pre>

Most useful for format conversions (see File Formats) and concatenating multiple same-schema CSV files to have the same header:
<pre class="pre-highlight-in-pair">
<b>mlr cat -h</b>
</pre>
<pre class="pre-non-highlight-in-pair">
Usage: mlr cat [options]
Passes input records directly to output. Most useful for format conversion.
Options:
-n         Prepend field "n" to each record with record-counter starting at 1.
-N {name}  Prepend field {name} to each record with record-counter starting at 1.
-g {a,b,c} Optional group-by-field names for counters, e.g. a,b,c
--filename Prepend current filename to each record.
--filenum  Prepend current filenum (1-up) to each record.
-h|--help  Show this message.
</pre>

<pre class="pre-highlight-in-pair">
<b>cat data/a.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,6
</pre>

<pre class="pre-highlight-in-pair">
<b>cat data/b.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
7,8,9
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --csv cat data/a.csv data/b.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
4,5,6
7,8,9
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --icsv --oxtab cat data/a.csv data/b.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a 1
b 2
c 3

a 4
b 5
c 6

a 7
b 8
c 9
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --csv cat -n data/a.csv data/b.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
n,a,b,c
1,1,2,3
2,4,5,6
3,7,8,9
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --opprint cat data/small</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a   b   i x        y
pan pan 1 0.346791 0.726802
eks pan 2 0.758679 0.522151
wye wye 3 0.204603 0.338318
eks wye 4 0.381399 0.134188
wye pan 5 0.573288 0.863624
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --opprint cat -n -g a data/small</b>
</pre>
<pre class="pre-non-highlight-in-pair">
n a   b   i x        y
1 pan pan 1 0.346791 0.726802
1 eks pan 2 0.758679 0.522151
1 wye wye 3 0.204603 0.338318
2 eks wye 4 0.381399 0.134188
2 wye pan 5 0.573288 0.863624
</pre>

Please see the DSL reference for more information about the expression language for mlr filter.
For example, suppose you have the following DKVP file:
<pre class="pre-non-highlight-non-pair">
u=female,v=red,n=2458
u=female,v=green,n=192
u=female,v=blue,n=337
u=female,v=purple,n=468
u=female,v=yellow,n=3
u=female,v=orange,n=17
u=male,v=red,n=143
u=male,v=green,n=227
u=male,v=blue,n=2034
u=male,v=purple,n=12
u=male,v=yellow,n=1192
u=male,v=orange,n=448
</pre>

Then we can see what each record's n contributes to the total n:
Using -g we can split those out by gender, or by color:
We can see, for example, that 70.7% of females have red (on the left) while 94.5% of reds are for females.
To convert fractions to percents, you may use -p:
Another often-used idiom is to convert from a point distribution to a cumulative distribution, also known as "running sums". Here, you can use -c:
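
The arithmetic behind these options can be sketched in Python using the twelve records shown above. This is a rough illustration rather than Miller's implementation: the function name `fraction`, its keyword arguments, and the output key `frac` are all hypothetical stand-ins for the behavior of -g, -p, and -c.

```python
from collections import defaultdict

# The twelve records from the example file, as (u, v, n) tuples.
ROWS = [
    ("female", "red", 2458), ("female", "green", 192), ("female", "blue", 337),
    ("female", "purple", 468), ("female", "yellow", 3), ("female", "orange", 17),
    ("male", "red", 143), ("male", "green", 227), ("male", "blue", 2034),
    ("male", "purple", 12), ("male", "yellow", 1192), ("male", "orange", 448),
]
records = [{"u": u, "v": v, "n": n} for u, v, n in ROWS]


def fraction(records, field, group_by=(), percent=False, cumulative=False):
    """Return copies of records with a hypothetical 'frac' key added."""
    # First pass: total of the field within each group.
    sums = defaultdict(float)
    for r in records:
        sums[tuple(r[g] for g in group_by)] += r[field]
    # Second pass: each record's share (or running share) of its group total.
    multiplier = 100.0 if percent else 1.0
    running = defaultdict(float)
    out = []
    for r in records:
        key = tuple(r[g] for g in group_by)
        running[key] += r[field]
        numerator = running[key] if cumulative else r[field]
        out.append({**r, "frac": multiplier * numerator / sums[key]})
    return out


by_gender = fraction(records, "n", group_by=("u",), percent=True)
by_color = fraction(records, "n", group_by=("v",), percent=True)
print(round(by_gender[0]["frac"], 1))  # 70.7: share of females having red
print(round(by_color[0]["frac"], 1))   # 94.5: share of reds that are female
print(fraction(records, "n", cumulative=True)[-1]["frac"])  # 1.0
```

Both passes are needed because a record's fraction depends on a total that is only known once all records have been read.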
This is similar to sort but with less work. Namely, Miller's sort has three steps: read through the data and append linked lists of records, one for each unique combination of the key-field values; after all records are read, sort the key-field values; then print each record-list. The group-by operation simply omits the middle sort. An example should make this more clear:
In this example, since the sort is on field a, the first step is to group together all records having the same value for field a; the second step is to sort the distinct a-field values pan, eks, and wye into eks, pan, and wye; the third step is to print out the record-list for a=eks, then the record-list for a=pan, then the record-list for a=wye. The group-by operation omits the middle sort and just puts like records together, for those times when a sort isn't desired. In particular, the ordering of group-by fields for group-by is the order in which they were encountered in the data stream, which in some cases may be more interesting to you.
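
The three steps described above can be sketched in Python. This is an illustration under simplifying assumptions, not Miller's code: records are reduced to hypothetical `(a, i)` tuples taken from data/small, and an ordinary dict (which preserves insertion order) stands in for the linked lists of records.

```python
# (a, i) pairs from data/small, in the order they appear in the file.
records = [("pan", 1), ("eks", 2), ("wye", 3), ("eks", 4), ("wye", 5)]


def bucket(records):
    """Step 1: append each record to the list for its key-field value."""
    groups = {}  # insertion-ordered: keys stay in first-encountered order
    for rec in records:
        groups.setdefault(rec[0], []).append(rec)
    return groups


def sort_by(records):
    """Steps 1-3: bucket, sort the distinct keys, then emit each record list."""
    groups = bucket(records)
    return [rec for key in sorted(groups) for rec in groups[key]]


def group_by(records):
    """Steps 1 and 3 only: bucket, then emit in first-encountered key order."""
    groups = bucket(records)
    return [rec for key in groups for rec in groups[key]]


print(sort_by(records))
# [('eks', 2), ('eks', 4), ('pan', 1), ('wye', 3), ('wye', 5)]
print(group_by(records))
# [('pan', 1), ('eks', 2), ('eks', 4), ('wye', 3), ('wye', 5)]
```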
This groups together records having the same schema (i.e. same ordered list of field names) which is useful for making sense of time-ordered output as described in Record Heterogeneity -- in particular, in preparation for CSV or pretty-print output.
<pre class="pre-highlight-in-pair"> <b>mlr cat data/het.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> resource=/path/to/file,loadsec=0.45,ok=true record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint group-like data/het.dkvp</b> </pre> <pre class="pre-non-highlight-in-pair"> resource loadsec ok /path/to/file 0.45 true /path/to/second/file 0.32 true /some/other/path 0.97 false record_count resource 100 /path/to/file 150 /path/to/second/file </pre>Similar to group-like, this retains records with specified schema.
<pre class="pre-highlight-in-pair">
<b>mlr cat data/het.dkvp</b>
</pre>
<pre class="pre-non-highlight-in-pair">
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr having-fields --at-least resource data/het.dkvp</b>
</pre>
<pre class="pre-non-highlight-in-pair">
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr having-fields --which-are resource,ok,loadsec data/het.dkvp</b>
</pre>
<pre class="pre-non-highlight-in-pair">
resource=/path/to/file,loadsec=0.45,ok=true
resource=/path/to/second/file,loadsec=0.32,ok=true
resource=/some/other/path,loadsec=0.97,ok=false
</pre>

Note that head is distinct from top -- head shows records which appear first in the data stream; top shows records whose values for a given field are numerically largest (or smallest).
This is just a histogram; there's not too much to say here. A note about binning, by example: Suppose you use --lo 0.0 --hi 1.0 --nbins 10 -f x. The input numbers less than 0 or greater than 1 aren't counted in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin 0 has 0.0 ≤ x < 0.1, bin 1 has 0.1 ≤ x < 0.2, etc., but bin 9 has 0.9 ≤ x ≤ 1.0.
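
The binning rule just described can be sketched in Python as follows. The function name `histogram` and its signature are hypothetical; the code only mirrors the rule stated above (out-of-range values dropped, a value equal to the upper limit placed in the last bin) and is not taken from Miller's source.

```python
def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins spanning [lo, hi]."""
    counts = [0] * nbins
    for x in values:
        if x < lo or x > hi:
            continue  # out-of-range values are not counted in any bin
        if x == hi:
            index = nbins - 1  # the upper limit itself goes in the last bin
        else:
            index = int((x - lo) * nbins / (hi - lo))
            index = min(index, nbins - 1)  # guard against float round-up
        counts[index] += 1
    return counts


print(histogram([-0.5, 0.0, 0.05, 0.15, 0.95, 1.0, 1.5], 0.0, 1.0, 10))
# [2, 1, 0, 0, 0, 0, 0, 0, 0, 2]
```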
Examples:
Join larger table with IDs with smaller ID-to-name lookup table, showing only paired records:
<pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint cat data/join-left-example.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id name 100 alice 200 bob 300 carol 400 david 500 edgar </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint cat data/join-right-example.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> status idcode present 400 present 100 missing 200 present 100 present 200 missing 100 missing 200 present 300 missing 600 present 400 present 400 present 300 present 100 missing 400 present 200 present 200 present 200 present 200 present 400 present 300 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint \</b> <b> join -u -j id -r idcode -f data/join-left-example.csv \</b> <b> data/join-right-example.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id name status 400 david present 100 alice present 200 bob missing 100 alice present 200 bob present 100 alice missing 200 bob missing 300 carol present 400 david present 400 david present 300 carol present 100 alice present 400 david missing 200 bob present 200 bob present 200 bob present 200 bob present 400 david present 300 carol present </pre>Same, but with sorting the input first:
<pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint sort -f idcode \</b> <b> then join -j id -r idcode -f data/join-left-example.csv \</b> <b> data/join-right-example.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> id name status 100 alice present 100 alice present 100 alice missing 100 alice present 200 bob missing 200 bob present 200 bob missing 200 bob present 200 bob present 200 bob present 200 bob present 300 carol present 300 carol present 300 carol present 400 david present 400 david present 400 david present 400 david missing 400 david present </pre>Same, but showing only unpaired records:
<pre class="pre-highlight-in-pair"> <b>mlr --icsvlite --opprint \</b> <b> join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv \</b> <b> data/join-right-example.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> status idcode missing 600 id name 500 edgar </pre>Use prefixing options to disambiguate between otherwise identical non-join field names:
<pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c 1 2 3 1 4 5 1 2 3 1 4 5 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a left_b left_c right_b right_c 1 2 3 2 3 1 4 5 2 3 1 2 3 4 5 1 4 5 4 5 </pre>Use zero join columns:
<pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> left_a left_b left_c right_a right_b right_c 1 2 3 1 2 3 1 4 5 1 2 3 1 2 3 1 4 5 1 4 5 1 4 5 </pre>See also rename.
Example: Files such as /etc/passwd, /etc/group, and so on have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows:
Likewise, if you have CSV/CSV-lite input data which is missing its header line, you can re-add a header line using --implicit-csv-header and label:
In this example, the English and German pangrams are convertible from UTF-8 to Latin-1, but the Russian one is not:
See also most-frequent.
This is like mlr stats1 but all accumulation is done across fields within each given record: horizontal rather than vertical statistics, if you will.
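
As a rough Python sketch of "horizontal" accumulation, the following collapses field names by removing a substring and then accumulates within one record, using the first row of data/inout.csv from this verb's examples. The function name `merge_fields_collapse` is hypothetical, and the substring-removal naming rule is an assumption inferred from the example output for the -c option, not a statement of how Miller implements it.

```python
def merge_fields_collapse(record, collapse_names, keep_inputs=False):
    """Accumulate min/max/sum across fields of one record, grouped by short name."""
    values_by_short_name = {}
    output = {}
    for key, value in record.items():
        # Find the first collapse substring contained in this field name, if any.
        matched = next((c for c in collapse_names if c in key), None)
        if matched is None:
            output[key] = value  # fields that do not match pass through
            continue
        # Assumption: the output name is the field name with the substring removed.
        short_name = key.replace(matched, "", 1)
        values_by_short_name.setdefault(short_name, []).append(value)
        if keep_inputs:
            output[key] = value  # analogous to the -k option
    for short_name, values in values_by_short_name.items():
        output[short_name + "_min"] = min(values)
        output[short_name + "_max"] = max(values)
        output[short_name + "_sum"] = sum(values)
    return output


row = {"a_in": 436, "a_out": 490, "b_in": 446, "b_out": 195}
print(merge_fields_collapse(row, ["_in", "_out"]))
# {'a_min': 436, 'a_max': 490, 'a_sum': 926, 'b_min': 195, 'b_max': 446, 'b_sum': 641}
```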
Examples:
<pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint cat data/inout.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a_in a_out b_in b_out 436 490 446 195 526 320 963 780 220 888 705 831 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint merge-fields -a min,max,sum -c _in,_out data/inout.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a_min a_max a_sum b_min b_max b_sum 436 490 926 195 446 641 320 526 846 780 963 1743 220 888 1108 705 831 1536 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --csvlite --opprint merge-fields -k -a sum -c _in,_out data/inout.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a_in a_out b_in b_out a_sum b_sum 436 490 446 195 926 641 526 320 963 780 846 1743 220 888 705 831 1108 1536 </pre>See also least-frequent.
Please see the DSL reference for more information about the expression language for mlr put.
This exists since hash-map software in various languages and tools encountered in the wild does not always print similar rows with fields in the same order: mlr regularize helps clean that up.
See also reorder.
Since this verb needs to read all records to see if any of them has a non-empty value for a given field name, it is non-streaming: it will ingest all records before writing any.
As discussed in Performance, sed is significantly faster than Miller at doing this. However, Miller is format-aware, so it knows to do renames only within specified field keys and not any others, nor in field values which may happen to contain the same pattern. Example:
See also label.
This pivots specified field names to the start or end of the record -- for example, when you have highly multi-column data and you want to bring a field or two to the front of the line, where you can scan them quickly.
<pre class="pre-highlight-in-pair"> <b>mlr --opprint cat data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y pan pan 1 0.346791 0.726802 eks pan 2 0.758679 0.522151 wye wye 3 0.204603 0.338318 eks wye 4 0.381399 0.134188 wye pan 5 0.573288 0.863624 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint reorder -f i,b data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> i b a x y 1 pan pan 0.346791 0.726802 2 pan eks 0.758679 0.522151 3 wye wye 0.204603 0.338318 4 wye eks 0.381399 0.134188 5 pan wye 0.573288 0.863624 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint reorder -e -f i,b data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a x y i b pan 0.346791 0.726802 1 pan eks 0.758679 0.522151 2 pan wye 0.204603 0.338318 3 wye eks 0.381399 0.134188 4 wye wye 0.573288 0.863624 5 pan </pre>This is useful in at least two ways: one, as a data-generator as in the
above example using urand(); two, for reconstructing individual
samples from data which has been count-aggregated:
After expansion with repeat, such data can then be sent on to
stats1 -a mode, or (if the data are numeric) to stats1 -a p10,p50,p90, etc.
This is reservoir-sampling: select k items from n with
uniform probability and no repeats in the sample. (If n is less than
k, then of course only n samples are produced.) With -g {field names}, produce a k-sample for each distinct value of the
specified field names.
Note that no output is produced until all inputs are in. Another way to do
sampling, which works in the streaming case, is mlr filter 'urand() < 0.001' where you tune the 0.001 to meet your needs.
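
For readers unfamiliar with the two approaches, here is a minimal Python sketch. The reservoir function is the textbook "Algorithm R", which may differ in detail from what Miller does internally; the names `reservoir_sample` and `streaming_sample` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Select k items from a stream of unknown length, uniformly, without repeats."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)  # uniform integer in 0..n-1
            if j < k:
                sample[j] = item  # keep the new item with probability k/n
    return sample


def streaming_sample(stream, p, rng=random):
    """Keep each item independently with probability p, like urand() < p."""
    return (item for item in stream if rng.random() < p)


rng = random.Random(0)
print(len(reservoir_sample(range(1000), 5, rng)))  # 5
print(reservoir_sample(range(3), 5, rng))          # [0, 1, 2]
```

The reservoir version always returns exactly k items (or n, if fewer are available) but cannot emit anything until the stream ends; the streaming version emits immediately but returns a variable number of items.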
Example:
<pre class="pre-highlight-in-pair"> <b>mlr --opprint sort -f a -nr x data/small</b> </pre> <pre class="pre-non-highlight-in-pair"> a b i x y eks pan 2 0.758679 0.522151 eks wye 4 0.381399 0.134188 pan pan 1 0.346791 0.726802 wye pan 5 0.573288 0.863624 wye wye 3 0.204603 0.338318 </pre>Here's an example filtering log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved:
<pre class="pre-highlight-in-pair"> <b>head -n 10 data/multicountdown.dat</b> </pre> <pre class="pre-non-highlight-in-pair"> upsec=0.002,color=green,count=1203 upsec=0.083,color=red,count=3817 upsec=0.188,color=red,count=3801 upsec=0.395,color=blue,count=2697 upsec=0.526,color=purple,count=953 upsec=0.671,color=blue,count=2684 upsec=0.899,color=purple,count=926 upsec=0.912,color=red,count=3798 upsec=1.093,color=blue,count=2662 upsec=1.327,color=purple,count=917 </pre>We can group these by thread by sorting on the thread ID (here,
color). Since Miller's sort is stable, this means that
timestamps within each thread's log data are still chronological:
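
The effect of a stable sort on this data can be illustrated in Python, whose built-in `sorted` is also stable. The tuples below are the ten log lines shown above; this only demonstrates the stability property and makes no claim about how Miller's sort is implemented.

```python
# (upsec, color, count) for the first ten lines of data/multicountdown.dat.
log = [
    (0.002, "green", 1203), (0.083, "red", 3817), (0.188, "red", 3801),
    (0.395, "blue", 2697), (0.526, "purple", 953), (0.671, "blue", 2684),
    (0.899, "purple", 926), (0.912, "red", 3798), (1.093, "blue", 2662),
    (1.327, "purple", 917),
]

# Sort on color only. A stable sort keeps records with equal keys in their
# original relative order, so each thread's lines stay chronological.
by_thread = sorted(log, key=lambda rec: rec[1])

for upsec, color, count in by_thread:
    print(f"upsec={upsec},color={color},count={count}")
```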
Any records not having all specified sort keys will appear at the end of the output, in the order they were encountered, regardless of the specified sort order:
<pre class="pre-highlight-in-pair">
<b>mlr sort -n x data/sort-missing.dkvp</b>
</pre>
<pre class="pre-non-highlight-in-pair">
x=1
x=2
x=4
a=3
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr sort -nr x data/sort-missing.dkvp</b>
</pre>
<pre class="pre-non-highlight-in-pair">
x=4
x=2
x=1
a=3
</pre>

These are simple univariate statistics on one or more number-valued fields
(count and mode apply to non-numeric fields as well),
optionally categorized by one or more other fields.
These are simple bivariate statistics on one or more pairs of number-valued fields, optionally categorized by one or more fields.
<pre class="pre-highlight-in-pair"> <b>mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' \</b> <b> then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 \</b> <b> data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> x_y_cov 0.00004257482082749404 x_y_corr 0.0005042001844473328 y_y_cov 0.08461122467974005 y_y_corr 1 x2_xy_cov 0.041883822817793716 x2_xy_corr 0.6301743420379936 x2_y2_cov -0.0003095372596253918 x2_y2_corr -0.003424908876111875 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' \</b> <b> then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a \</b> <b> data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> a x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2 y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2 xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2 pan 0.017025512736819345 0.500402892289764 2081 0.00028691820445815624 1 -0.00000000000000002890430283104539 2081 1 0.8781320866715664 0.11908230147563569 2081 0.4174982737731127 eks 0.04078049236855813 0.4814020796765104 1965 0.0016461239223448218 1 0.00000000000000017862676354313703 1965 1 0.897872861169018 0.1073405443361234 1965 0.4556322386425451 wye -0.03915349075204785 0.5255096523974457 1966 0.0015051268704373377 1 0.00000000000000004464425401127647 1966 1 0.8538317334220837 0.1267454301662969 1966 0.3899172181859931 zee 0.0027812364960401333 0.5043070448033061 2047 0.000007751652858787357 1 0.00000000000000004819404567023685 2047 1 0.8524439912011011 0.12401684308018947 2047 0.39356598090006495 hat -0.018620577041095272 0.5179005397264937 1941 0.00035200366460556604 1 -0.00000000000000003400445761787692 1941 1 0.8412305086345017 0.13557328318623207 1941 0.3687944261732266 </pre>Here's an example simple line-fit. The x and y
fields of the data/medium dataset are independent and uniformly
distributed on the unit interval. Here we remove half the data and fit a line to it.
I use pgr for plotting; here's a screenshot.
(Thanks Drew Kunas for a good conversation about PCA!)
Here's an example estimating time-to-completion for a set of jobs. Input data comes from a log file, with number of work units left to do in the count field and accumulated seconds in the upsec field, labeled by the color field:
We can do a linear regression on count remaining as a function of time: with c = m*u+b we want to find the time when the count goes to zero, i.e. u=-b/m.
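
The calculation can be sketched in Python with an ordinary least-squares fit followed by solving for the zero crossing. The data points below are made up for illustration (they are not from the log file), and the helper name `ols_fit` is hypothetical.

```python
def ols_fit(us, cs):
    """Ordinary least-squares fit of c = m*u + b; returns (m, b)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_c = sum(cs) / n
    sxx = sum((u - mean_u) ** 2 for u in us)
    sxy = sum((u - mean_u) * (c - mean_c) for u, c in zip(us, cs))
    m = sxy / sxx
    b = mean_c - m * mean_u
    return m, b


# Hypothetical progress data: seconds elapsed and work units remaining.
upsec = [0.0, 1.0, 2.0, 3.0]
count = [100.0, 75.0, 50.0, 25.0]

m, b = ols_fit(upsec, count)
eta = -b / m  # time at which the fitted count reaches zero
print(m, b, eta)  # -25.0 100.0 4.0
```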
Most Miller commands are record-at-a-time, with the exception of stats1, stats2, and histogram which compute aggregate output. The step command is intermediate: it allows the option of adding fields which are functions of fields from previous records. Rsum is short for running sum.
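
To make the "intermediate" idea concrete, here is a small Python sketch of two steppers, delta and running sum, each needing only a little state carried from earlier records. The generator name `step` is hypothetical, and the input values are the second load-average column from the uptime example in this section; the first delta is taken to be zero, as in that example's output.

```python
def step(values):
    """Yield (value, delta, rsum) for each value, streaming one at a time."""
    previous = None
    running_sum = 0.0
    for x in values:
        # delta: difference from the previous record; zero for the first record.
        delta = 0.0 if previous is None else x - previous
        # rsum: running sum of all values seen so far.
        running_sum += x
        yield x, delta, running_sum
        previous = x


loads = [1.62, 1.64, 1.65, 1.69, 1.76, 1.85, 1.92, 1.90]
for x, delta, rsum in step(loads):
    print(f"{x:.2f} {delta:.6f} {rsum:.2f}")
```

Unlike an aggregate such as a mean, each output row is available as soon as its input row has been read.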
Example deriving uptime-delta from system uptime:
<pre class="pre-non-highlight-non-pair"> $ each 10 uptime | mlr -p step -a delta -f 11 ... 20:08 up 36 days, 10:38, 5 users, load averages: 1.42 1.62 1.73 0.000000 20:08 up 36 days, 10:38, 5 users, load averages: 1.55 1.64 1.74 0.020000 20:08 up 36 days, 10:38, 7 users, load averages: 1.58 1.65 1.74 0.010000 20:08 up 36 days, 10:38, 9 users, load averages: 1.78 1.69 1.76 0.040000 20:08 up 36 days, 10:39, 9 users, load averages: 2.12 1.76 1.78 0.070000 20:08 up 36 days, 10:39, 9 users, load averages: 2.51 1.85 1.81 0.090000 20:08 up 36 days, 10:39, 8 users, load averages: 2.79 1.92 1.83 0.070000 20:08 up 36 days, 10:39, 4 users, load averages: 2.64 1.90 1.83 -0.020000 </pre>Prints the records in the input stream in reverse order. Note: this requires Miller to retain all input records in memory before any output records are produced.
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --opprint cat data/a.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c 1 2 3 4 5 6 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --opprint cat data/b.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c 7 8 9 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --opprint tac data/a.csv data/b.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c 7 8 9 4 5 6 1 2 3 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --opprint put '$filename=FILENAME' then tac data/a.csv data/b.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c filename 7 8 9 data/b.csv 4 5 6 data/a.csv 1 2 3 data/a.csv </pre>Prints the last n records in the input stream, optionally by category.
<pre class="pre-highlight-in-pair">
<b>mlr --c2p tail -n 4 data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape    flag i      u        v        w        x
blue   square   1    499872 0.618906 0.263796 0.531147 6.210738
blue   triangle 0    499880 0.008111 0.826727 0.473296 6.146957
yellow triangle 0    499955 0.383942 0.559529 0.511376 4.307974
yellow circle   1    499974 0.764951 0.252842 0.499699 5.013810
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --c2p tail -n 1 -g shape data/colored-shapes.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape    flag i      u        v        w        x
yellow triangle 0    499955 0.383942 0.559529 0.511376 4.307974
blue   square   1    499872 0.618906 0.263796 0.531147 6.210738
yellow circle   1    499974 0.764951 0.252842 0.499699 5.013810
</pre>

Note that top is distinct from head -- head shows records which appear first in the data stream; top shows records whose values for a given field are numerically largest (or smallest).
There are two main ways to use mlr uniq: the first way is with -g to specify group-by columns.
The second main way to use mlr uniq is without group-by columns, using -a instead:
The primary use-case is for PPRINT output, which is space-delimited. For example:
<pre class="pre-highlight-in-pair">
<b>cat data/spaces.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
column 1,column 2,column 3
apple,ball,cat
dale egg,fish,gale
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint cat data/spaces.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
column 1 column 2 column 3
apple    ball     cat
dale egg fish     gale
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint unspace data/spaces.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
column_1 column_2 column_3
apple    ball     cat
dale_egg fish     gale
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint unspace data/spaces.csv | mlr --ipprint --oxtab cat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
column_1 apple
column_2 ball
column_3 cat

column_1 dale_egg
column_2 fish
column_3 gale
</pre>

Examples:
<pre class="pre-highlight-in-pair"> <b>cat data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> {"a":1,"b":2,"v":3} {"u":1,"b":2} {"a":1,"v":2,"x":3} {"v":1,"w":2} </pre> <pre class="pre-highlight-in-pair"> <b>mlr --json unsparsify data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "a": 1, "b": 2, "v": 3, "u": "", "x": "", "w": "" }, { "a": "", "b": 2, "v": "", "u": 1, "x": "", "w": "" }, { "a": 1, "b": "", "v": 2, "u": "", "x": 3, "w": "" }, { "a": "", "b": "", "v": 1, "u": "", "x": "", "w": 2 } ] </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint unsparsify data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b v u x w 1 2 3 - - - - 2 - 1 - - 1 - 2 - 3 - - - 1 - - 2 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint unsparsify --fill-with missing data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b v u x w 1 2 3 missing missing missing missing 2 missing 1 missing missing 1 missing 2 missing 3 missing missing missing 1 missing missing 2 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint unsparsify -f a,b,u data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b v u 1 2 3 - u b a 1 2 - a v x b u 1 2 3 - - v w a b u 1 2 - - - </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint unsparsify -f a,b,u,v,w,x then regularize data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b v u w x 1 2 3 - - - - 2 - 1 - - 1 - 2 - - 3 - - 1 - 2 - </pre>