docs/src/operating-on-all-records.md
As we saw in the DSL overview page, the Miller programming language has an implicit loop over records for main statements. Miller's streaming operation over records is implemented by invoking the main statements (everything outside begin/end/func/subr) once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and lets you specify what to do on each record: you write the body of the loop.
That's fine for most simple use cases, but sometimes you do want to loop over all records. Here we describe a few options.
The first option is to leverage the fact that main DSL statements are already invoked in a loop over records, and use out-of-stream variables to retain sums, counters, etc.
For example, let's look at our short data file data/short.csv:
<pre class="pre-highlight-in-pair"> <b>cat data/short.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> word,value apple,37 ball,28 cat,54 </pre>We can track count and sum using
out-of-stream variables -- the ones that start with the @ sigil -- then emit them as a new record after all the input is read.
And if all we want is the final output and not the input data, we can use put -q to not pass through the input records:
As discussed a bit more on the page on streaming processing and memory usage, this doesn't keep all records in memory, only the count and sum variables. You can use this on very large files without running out of memory.
The second option is to retain entire records in a map, then loop over them in an end block.
Let's use the same short data file data/short.csv:
<pre class="pre-highlight-in-pair"> <b>cat data/short.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> word,value apple,37 ball,28 cat,54 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> # map</b> <b> begin {</b> <b> @records = {};</b> <b> }</b> <b> @records[NR] = $*;</b> <b> end {</b> <b> count = length(@records);</b> <b> sum = 0;</b> <b> for (i = 1; i <= NR; i += 1) {</b> <b> sum += @records[i]["value"];</b> <b> }</b> <b> dump @records; # show the map</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": { "word": "apple", "value": 37 }, "2": { "word": "ball", "value": 28 }, "3": { "word": "cat", "value": 54 } } [ { "count": 3, "sum": 119 } ] </pre>The downside to this, of course, is that this retains all records (plus data-structure overhead) in memory, so you're limited to processing files that fit in your computer's memory. The upside, though, is that you can do random access over the records using things like
<pre class="pre-non-highlight-non-pair"> output = 0; for (i = 1; i <= NR; i += 1) { for (j = 1; j <= NR; j += 1) { for (k = 1; k <= NR; k += 1) { output += call_some_function_of(@records[i], @records[j], @records[k]) } } } # do something with the output </pre>The third option is to retain records in an array, then loop over them in an end block.
Just as with the retain-as-map approach, the downside is the overhead of retaining all records in memory, and the upside is that you get random access over records.
Retaining records as a map or as an array is a matter of taste. Some things to note:
If we initialize @records = {} in the begin block (or if we don't initialize it at all and just start writing to it in the main statements), then @records is a map. If we initialize @records = [] then it's an array.
Arrays are, of course, contiguously indexed. (And in Miller their indices start with 1, not 0, as discussed in the Arrays page.) This means that if you are retaining only a subset of records, your array will have null gaps in it:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @records = [];</b> <b> }</b> <b> if (NR != 2) {</b> <b> @records[NR] = $*</b> <b> }</b> <b> end {</b> <b> dump @records;</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "word": "apple", "value": 37 }, null, { "word": "cat", "value": 54 } ] </pre>You can index @records by @count rather than NR to get a contiguous array:
If you use a map to retain records, then this is a non-issue: maps can retain whatever values you like:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @records = {};</b> <b> }</b> <b> # main statement</b> <b> if (NR != 2) {</b> <b> @records[NR] = $*;</b> <b> }</b> <b> end {</b> <b> dump @records;</b> <b> count = length(@records);</b> <b> sum = 0;</b> <b> for (key in @records) {</b> <b> sum += @records[key]["value"];</b> <b> }</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": { "word": "apple", "value": 37 }, "3": { "word": "cat", "value": 54 } } [ { "count": 2, "sum": 91 } ] </pre>Do note that Miller maps preserve insertion order, so at the end you're guaranteed to loop over records in the same order you read them. Also note that when you index a Miller map with an integer key, this works, but the key is stringified.
If all you need is one or a few attributes out of a record, you don't need to retain full records. You can retain a map, or array, of just the fields you're interested in:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojson --from data/short.csv put -q '</b> <b> begin {</b> <b> @values = {};</b> <b> }</b> <b> # main statement</b> <b> if (NR != 2) {</b> <b> @values[NR] = $value;</b> <b> }</b> <b> end {</b> <b> dump @values;</b> <b> count = length(@values);</b> <b> sum = 0;</b> <b> for (key in @values) {</b> <b> sum += @values[key];</b> <b> }</b> <b> emit (count, sum);</b> <b> }</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> { "1": 37, "3": 54 } [ { "count": 2, "sum": 91 } ] </pre>Please see the sorting page.
Please see the page on two-pass algorithms; see also the page on higher-order functions.