docs/src/operating-on-all-records.md
As we saw in the DSL overview page, the Miller programming language has an implicit loop over records for main statements. Miller's streaming operation over records is implemented by invoking the main statements (everything outside begin/end/func/subr) once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and lets you specify what to do on each record: you write the body of the loop.
That's fine for most simple use cases, but sometimes you do want to loop over all records. Here we describe a few options.
The first option is to leverage the fact that main DSL statements are already invoked in a loop over records, and use out-of-stream variables to retain sums, counters, etc.
For example, let's look at our short data file data/short.csv:
<pre class="pre-highlight-in-pair"> <b>cat data/short.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> word,value apple,37 ball,28 cat,54 </pre>We can track count and sum using
out-of-stream variables -- the ones that start with the @ sigil -- then emit them as a new record after all the input is read.
And if all we want is the final output and not the input data, we can use put -q to not pass through the input records:
As discussed a bit more on the page on streaming processing and memory usage, this doesn't keep all records in memory, only the count and sum variables. You can use this on very large files without running out of memory.
The second option is to retain entire records in a map, then loop over them in an end block.
Let's use the same short data file data/short.csv:
<pre class="pre-highlight-in-pair"> <b>cat data/short.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> word,value apple,37 ball,28 cat,54 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> # map</b> <b> begin {</b> <b> @records = {};</b> <b> }</b> <b> @records[NR] = $*;</b> <b> end {</b> <b> count = length(@records);</b> <b> sum = 0;</b> <b> for (i = 1; i <= NR; i += 1) {</b> <b> sum += @records[i]["value"];</b> <b> }</b> <b> dump @records; # show the map</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": { "word": "apple", "value": 37 }, "2": { "word": "ball", "value": 28 }, "3": { "word": "cat", "value": 54 } } [ { "count": 3, "sum": 119 } ] </pre>The downside to this, of course, is that this retains all records (plus data-structure overhead) in memory, so you're limited to processing files that fit in your computer's memory. The upside, though, is that you can do random access over the records using things like
<pre class="pre-non-highlight-non-pair"> output = 0; for (i = 1; i <= NR; i += 1) { for (j = 1; j <= NR; j += 1) { for (k = 1; k <= NR; k += 1) { output += call_some_function_of(@records[i], @records[j], @records[k]) } } } # do something with the output </pre>The third option is to retain records in an array, then loop over them in an end block.
Just as with the retain-as-map approach, the downside is the overhead of retaining all records in memory, and the upside is that you get random access over records.
Retaining records as a map or as an array is a matter of taste. Some things to note:
If we initialize @records = {} in the begin block (or if we don't initialize it at all and just start writing to it in the main statements), then @records is a map. If we initialize @records = [] then it's an array.
Arrays are, of course, contiguously indexed. (And in Miller their indices start with 1, not 0, as discussed in the Arrays page.) This means that if you are retaining only a subset of records, your array will have null gaps in it:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @records = [];</b> <b> }</b> <b> if (NR != 2) {</b> <b> @records[NR] = $*</b> <b> }</b> <b> end {</b> <b> dump @records;</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "word": "apple", "value": 37 }, null, { "word": "cat", "value": 54 } ] </pre>You can index @records by @count rather than NR to get a contiguous array:
If you use a map to retain records, then this is a non-issue: maps can retain whatever values you like:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @records = {};</b> <b> }</b> <b> # main statement</b> <b> if (NR != 2) {</b> <b> @records[NR] = $*;</b> <b> }</b> <b> end {</b> <b> dump @records;</b> <b> count = length(@records);</b> <b> sum = 0;</b> <b> for (key in @records) {</b> <b> sum += @records[key]["value"];</b> <b> }</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": { "word": "apple", "value": 37 }, "3": { "word": "cat", "value": 54 } } [ { "count": 2, "sum": 91 } ] </pre>Do note that Miller maps preserve insertion order, so at the end you're guaranteed to loop over records in the same order you read them. Also note that when you index a Miller map with an integer key, this works, but the key is stringified.
If all you need is one or a few attributes out of a record, you don't need to retain full records. You can retain a map, or array, of just the fields you're interested in:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @values = {};</b> <b> }</b> <b> # main statement</b> <b> if (NR != 2) {</b> <b> @values[NR] = $value;</b> <b> }</b> <b> end {</b> <b> dump @values;</b> <b> count = length(@values);</b> <b> sum = 0;</b> <b> for (key in @values) {</b> <b> sum += @values[key];</b> <b> }</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": 37, "3": 54 } [ { "count": 2, "sum": 91 } ] </pre>Please see the sorting page.
Please see the page on two-pass algorithms; see also the page on higher-order functions.