# Two-pass algorithms
Miller is a streaming record processor: most of its verbs are executed once per
record. This makes Miller particularly suitable for single-pass algorithms,
allowing many of its verbs to process files that are (much) larger than the
amount of RAM present in your system. (Of course, Miller verbs such as sort,
tac, etc. must ingest and retain all input records before emitting any output
records -- see the page on streaming processing and memory usage.) You can also
use out-of-stream variables to perform multi-pass computations, at the price of
retaining all input records in memory.
One of Miller's strengths is its compact notation: for example, given input of the form
<pre class="pre-highlight-in-pair"> <b>head -n 5 ./data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533 a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797 a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776 a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729 </pre>you can simply do
<pre class="pre-highlight-in-pair"> <b>mlr --oxtab stats1 -a sum -f x ./data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> x_sum 4986.019681679581 </pre>or
<pre class="pre-highlight-in-pair"> <b>mlr --opprint stats1 -a sum -f x -g b ./data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> b x_sum pan 965.7636699425815 wye 1023.5484702619565 zee 979.7420161495838 eks 1016.7728571314786 hat 1000.192668193983 </pre>rather than the more tedious
<pre class="pre-highlight-in-pair"> <b>mlr --oxtab put -q '</b> <b> @x_sum += $x;</b> <b> end {</b> <b> emit @x_sum</b> <b> }</b> <b>' data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> x_sum 4986.019681679581 </pre>or
<pre class="pre-highlight-in-pair"> <b>mlr --opprint put -q '</b> <b> @x_sum[$b] += $x;</b> <b> end {</b> <b> emit @x_sum, "b"</b> <b> }</b> <b>' data/medium</b> </pre> <pre class="pre-non-highlight-in-pair"> b x_sum pan 965.7636699425815 wye 1023.5484702619565 zee 979.7420161495838 eks 1016.7728571314786 hat 1000.192668193983 </pre>The former (mlr stats1 et al.) has the advantages of being easier to type, being less error-prone to type, and running faster.
Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic -- if you wish to do so -- which isn't present in other Miller verbs. (If you find yourself often using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in Go as a Miller verb of its own.)
The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.
For example, mapping numeric values down a column to the percentage between their min and max values is a two-pass algorithm: on the first pass you find the min and max values, then on the second you map each record's value to a percentage between them.
<pre class="pre-highlight-in-pair"> <b>mlr --from data/small --opprint put -q '</b> <b> # These are executed once per record, which is the first pass.</b> <b> # The key is to use NR to index an out-of-stream variable to</b> <b> # retain all the x-field values.</b> <b> @x_min = min($x, @x_min);</b> <b> @x_max = max($x, @x_max);</b> <b> @x[NR] = $x;</b> <b></b> <b> # The second pass is in a for-loop in an end-block.</b> <b> end {</b> <b> for (nr, x in @x) {</b> <b> @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);</b> <b> }</b> <b> emit (@x, @x_pct), "NR"</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> NR x x_pct 1 0.346791 25.66218352716956 2 0.758679 100 3 0.204603 0 4 0.381399 31.90825807289974 5 0.573288 66.54051068806446 </pre>Similarly, finding the total record count requires first reading through all the data:
<pre class="pre-highlight-in-pair"> <b>mlr --opprint --from data/small put -q '</b> <b> @records[NR] = $*;</b> <b> end {</b> <b> for((Istring,k),v in @records) {</b> <b> int I = int(Istring);</b> <b> @records[I]["I"] = I;</b> <b> @records[I]["N"] = NR;</b> <b> @records[I]["PCT"] = 100*I/NR</b> <b> }</b> <b> emit @records,"I"</b> <b> }</b> <b>' then reorder -f I,N,PCT</b> </pre> <pre class="pre-non-highlight-in-pair"> I N PCT a b i x y 1 5 20 pan pan 1 0.346791 0.726802 2 5 40 eks pan 2 0.758679 0.522151 3 5 60 wye wye 3 0.204603 0.338318 4 5 80 eks wye 4 0.381399 0.134188 5 5 100 wye pan 5 0.573288 0.863624 </pre>The idea is to retain records having the largest value of n in the following data:
Of course, the largest value of n isn't known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:
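Along the lines of the earlier examples, a sketch might look like the following (untested; the filename data/maxn.dat and the field name n are placeholders):

<pre class="pre-non-highlight-non-pair">
mlr --opprint put -q '
  # First pass: track the running max of n, and retain every record keyed by NR.
  @maxn = max(@maxn, $n);
  @records[NR] = $*;

  # Second pass: emit only those records whose n equals the max.
  end {
    for (nr in @records) {
      if (@records[nr]["n"] == @maxn) {
        emit @records[nr];
      }
    }
  }
' data/maxn.dat
</pre>

Here max(@maxn, $n) relies on Miller's absent-value semantics on the first record, just as the @x_min/@x_max example above does.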
Suppose you have some heterogeneous data like this:
<pre class="pre-non-highlight-non-pair"> { "qoh": 29874, "rate": 1.68, "latency": 0.02 } { "name": "alice", "uid": 572 } { "qoh": 1227, "rate": 1.01, "latency": 0.07 } { "qoh": 13458, "rate": 1.72, "latency": 0.04 } { "qoh": 56782, "rate": 1.64 } { "qoh": 23512, "rate": 1.71, "latency": 0.03 } { "qoh": 9876, "rate": 1.89, "latency": 0.08 } { "name": "bill", "uid": 684 } { "name": "chuck", "uid2": 908 } { "name": "dottie", "uid": 440 } { "qoh": 0, "rate": 0.40, "latency": 0.01 } { "qoh": 5438, "rate": 1.56, "latency": 0.17 } </pre>A reasonable question to ask is, how many occurrences of each field are there? And, what percentage of total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:
<pre class="pre-non-highlight-non-pair"> for (key in $*) { @key_counts[key] += 1; } @record_count += 1; end { for (key in @key_counts) { @key_fraction[key] = @key_counts[key] / @record_count } emit @record_count; emit @key_counts, "key"; emit @key_fraction,"key" } </pre>Then
<pre class="pre-highlight-in-pair"> <b>mlr --json put -q -f data/feature-count.mlr data/features.json</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "record_count": 12 }, { "key": "qoh", "key_counts": 8 }, { "key": "rate", "key_counts": 8 }, { "key": "latency", "key_counts": 7 }, { "key": "name", "key_counts": 4 }, { "key": "uid", "key_counts": 3 }, { "key": "uid2", "key_counts": 1 }, { "key": "qoh", "key_fraction": 0.6666666666666666 }, { "key": "rate", "key_fraction": 0.6666666666666666 }, { "key": "latency", "key_fraction": 0.5833333333333334 }, { "key": "name", "key_fraction": 0.3333333333333333 }, { "key": "uid", "key_fraction": 0.25 }, { "key": "uid2", "key_fraction": 0.08333333333333333 } ] </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json</b> </pre> <pre class="pre-non-highlight-in-pair"> record_count 12 key key_counts qoh 8 rate 8 latency 7 name 4 uid 3 uid2 1 key key_fraction qoh 0.6666666666666666 rate 0.6666666666666666 latency 0.5833333333333334 name 0.3333333333333333 uid 0.25 uid2 0.08333333333333333 </pre>The previous section discussed how to fill out missing data fields within CSV with full header line -- so the list of all field names is present within the header line. Next, let's look at a related problem: we have data where each record has various key names but we want to produce rectangular output having the union of all key names.
There is a keystroke-saving verb for this: unsparsify. Here, we look at how to implement that in the DSL.
For example, suppose you have JSON input like this:
<pre class="pre-highlight-in-pair"> <b>cat data/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> {"a":1,"b":2,"v":3} {"u":1,"b":2} {"a":1,"v":2,"x":3} {"v":1,"w":2} </pre>There are field names a, b, v, u, x, w in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then producing output. Use put -q since we don't want to produce per-record output, only emitting output in the end block:
You can also do this keyed, of course, imitating the keyed examples above.
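For comparison, the unsparsify verb encapsulates this whole pattern in a single command:

<pre class="pre-non-highlight-non-pair">
mlr --ijson --opprint unsparsify data/sparse.json
</pre>

It also accepts a --fill-with option to choose the placeholder used for absent values.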