DSL overview - Miller

# DSL overview

DSL stands for domain-specific language: it's a language particular to Miller which you can use to write expressions that specify custom transformations to your data. (See Miller programming language for an introduction.)

## Verbs compared to DSL

Although put and filter are verbs, they differ from the rest in that they let you use the DSL. So we often contrast the DSL (things you can do in the put and filter verbs) with verbs (things you can do using the verbs other than put and filter).

Here's a comparison of verbs and put/filter DSL expressions:

Example:

<pre class="pre-highlight-in-pair">
mlr stats1 -a sum -f x -g a data/small
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
</pre>

- Verbs are coded in Go
- They run a bit faster
- They take fewer keystrokes
- There is less to learn
- Their customization is limited to each verb's options

Example:

<pre class="pre-highlight-in-pair">
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
</pre>

- You get to write your own DSL expressions
- They run a bit slower
- They take more keystrokes
- There is more to learn
- They are highly customizable
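For readers more at home in a general-purpose language, here is a minimal Python sketch of what both commands above compute: a running sum of x keyed by a. It is a conceptual model only, not how Miller implements stats1 or put. The function name `sum_by_group` is made up for illustration, and the records are just the a and x fields of data/small.

```python
# Conceptual model only. It mimics the grouping logic of
#   mlr stats1 -a sum -f x -g a
# and of
#   mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}'
# The name sum_by_group is hypothetical, not part of Miller.

def sum_by_group(records, value_field, group_field):
    sums = {}  # plays the role of the out-of-stream variable @x_sum
    for record in records:  # Miller supplies this loop implicitly
        key = record[group_field]
        sums[key] = sums.get(key, 0) + record[value_field]
    return sums  # dicts keep first-seen key order

# The a and x fields from data/small.
small = [
    {"a": "pan", "x": 0.346791},
    {"a": "eks", "x": 0.758679},
    {"a": "wye", "x": 0.204603},
    {"a": "eks", "x": 0.381399},
    {"a": "wye", "x": 0.573288},
]

for a, x_sum in sum_by_group(small, "x", "a").items():
    print(f"a={a},x_sum={x_sum:.6f}")
```

The verb hides all of this behind three flags. The DSL version exposes the accumulator, so you could change the update rule without waiting for a new verb option.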

Please see Verbs Reference for information on verbs other than put and filter.

## Implicit loop over records for main statements

The most important point about the Miller DSL is that it is designed for streaming operation over records.

DSL statements include:

- func and subr for user-defined functions and subroutines, which we'll look at later in the separate page about them;
- begin and end blocks, for statements you want to run before the first record, or after the last one;
- everything else, which collectively are called main statements.

The feature of streaming operation over records is implemented by the main statements getting invoked once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and it lets you specify what to do on each record: you write the body of the loop, not the loop itself.

(But you can, if you like, use those per-record statements to grow a list of records, then loop over them all in an end block. This is described in the page on operating on all records.)

To see this in action, let's take a look at the data/short.csv file:

<pre class="pre-highlight-in-pair">
cat data/short.csv
</pre>
<pre class="pre-non-highlight-in-pair">
word,value
apple,37
ball,28
cat,54
</pre>

There are three records in this file, with word=apple, word=ball, and word=cat, respectively. Let's print something in a begin block, add a field in a main statement, and print something else in an end block:

<pre class="pre-highlight-in-pair">
mlr --csv --from data/short.csv put '
  begin {
    print "begin";
  }
  $nr = NR;
  end {
    print "end";
  }
'
</pre>
<pre class="pre-non-highlight-in-pair">
begin
word,value,nr
apple,37,1
ball,28,2
cat,54,3
end
</pre>

The print statements for begin and end went out before the first record was seen and after the last was seen; the field-creation statement $nr = NR was invoked three times, once for each record. We didn't explicitly loop over records, since Miller was already looping over records, and invoked our main statement on each loop iteration.
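To make the implicit loop concrete, the following Python sketch models the example above as an explicit driver. It is an illustration of the idea rather than Miller's actual implementation; `run_dsl` and the three callback names are hypothetical, and the output is modeled as one flat list instead of separate print and record streams.

```python
# Conceptual model of Miller's implicit record loop; not Miller's code.
# run_dsl and the callback names below are hypothetical.

def run_dsl(records, begin, main, end):
    output = []
    begin(output)                 # runs once, before the first record
    for nr, record in enumerate(records, start=1):
        main(record, nr)          # runs once per record
        output.append(record)     # the (possibly modified) record is emitted
    end(output)                   # runs once, after the last record
    return output

# The records of data/short.csv.
short = [
    {"word": "apple", "value": 37},
    {"word": "ball", "value": 28},
    {"word": "cat", "value": 54},
]

def begin_block(output):
    output.append("begin")

def main_statements(record, nr):
    record["nr"] = nr             # corresponds to $nr = NR

def end_block(output):
    output.append("end")

for item in run_dsl(short, begin_block, main_statements, end_block):
    print(item)
```

With Miller you write only the three callbacks; the driver loop is the part Miller provides.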

For almost all simple uses of the Miller programming language, this implicit looping over records is probably all you will need. (For more involved cases you can see the pages on operating on all records, out-of-stream variables, and two-pass algorithms.)

## Essential use: record-selection and record-updating

The essential usages of mlr filter and mlr put are for record-selection and record-updating expressions, respectively. For example, given the following input data:

<pre class="pre-highlight-in-pair">
cat data/small
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,b=pan,i=1,x=0.346791,y=0.726802
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=wye,b=wye,i=3,x=0.204603,y=0.338318
a=eks,b=wye,i=4,x=0.381399,y=0.134188
a=wye,b=pan,i=5,x=0.573288,y=0.863624
</pre>

you might retain only the records whose a field has value eks:

<pre class="pre-highlight-in-pair">
mlr filter '$a == "eks"' data/small
</pre>
<pre class="pre-non-highlight-in-pair">
a=eks,b=pan,i=2,x=0.758679,y=0.522151
a=eks,b=wye,i=4,x=0.381399,y=0.134188
</pre>

or you might add a new field which is a function of existing fields:

<pre class="pre-highlight-in-pair">
mlr put '$ab = $a . "_" . $b ' data/small
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,b=pan,i=1,x=0.346791,y=0.726802,ab=pan_pan
a=eks,b=pan,i=2,x=0.758679,y=0.522151,ab=eks_pan
a=wye,b=wye,i=3,x=0.204603,y=0.338318,ab=wye_wye
a=eks,b=wye,i=4,x=0.381399,y=0.134188,ab=eks_wye
a=wye,b=pan,i=5,x=0.573288,y=0.863624,ab=wye_pan
</pre>
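Seen from a general-purpose language, record selection is a predicate applied to each record and record updating is a mutation applied to each record. The Python sketch below models the two commands above under that reading. The names `select_records` and `update_records` are hypothetical, and only the a, b, and i fields of data/small are included to keep it short.

```python
# Conceptual model: filter keeps the records for which a predicate holds;
# put applies an update to every record. Names here are hypothetical.

def select_records(records, predicate):
    return [r for r in records if predicate(r)]

def update_records(records, update):
    out = []
    for r in records:
        r = dict(r)   # copy, so the input list is left untouched
        update(r)
        out.append(r)
    return out

# A subset of the fields in data/small.
small = [
    {"a": "pan", "b": "pan", "i": 1},
    {"a": "eks", "b": "pan", "i": 2},
    {"a": "wye", "b": "wye", "i": 3},
    {"a": "eks", "b": "wye", "i": 4},
    {"a": "wye", "b": "pan", "i": 5},
]

# Corresponds to: mlr filter '$a == "eks"'
eks_only = select_records(small, lambda r: r["a"] == "eks")

# Corresponds to: mlr put '$ab = $a . "_" . $b'
with_ab = update_records(small, lambda r: r.update(ab=r["a"] + "_" + r["b"]))

print([r["i"] for r in eks_only])
print([r["ab"] for r in with_ab])
```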

## Differences between put and filter

The two verbs mlr filter and mlr put are essentially the same. The only differences are:

- mlr filter expressions may not reference the filter keyword within them.
- Before Miller 6.17:
  - Expressions sent to mlr filter should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)
- As of Miller 6.17:
  - Expressions sent to mlr filter must contain a boolean expression, which is the filtering criterion.
  - If the expression evaluates to false, the record does not pass through.
  - If the expression evaluates to true or absent, the record passes through.
  - If the expression evaluates to anything other than boolean or absent, that is a fatal error.
  - Absent is accepted in order to honor Miller's record-heterogeneity guarantees: it's not an error to filter for $x > 10 if the current record has no $x field.
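The Miller 6.17 rules listed above amount to a small decision function. The Python sketch below encodes them as stated on this page. `ABSENT`, `keep_record`, and `x_greater_than_10` are hypothetical names, and the sketch assumes that a comparison against a missing field evaluates to absent.

```python
# Sketch of the filter rule for Miller 6.17 and later, as described above.
# ABSENT, keep_record, and x_greater_than_10 are hypothetical names,
# not Miller internals.

ABSENT = object()  # stands in for Miller's absent value

def keep_record(result):
    if result is True or result is ABSENT:
        return True        # true or absent: the record passes through
    if result is False:
        return False       # false: the record does not pass through
    # Anything else models the fatal error.
    raise TypeError(f"filter result must be boolean or absent, got {result!r}")

def x_greater_than_10(record):
    # Assumption: $x > 10 evaluates to absent when the record has no x field.
    if "x" not in record:
        return ABSENT
    return record["x"] > 10

# Heterogeneous records: the last one has no x field.
records = [{"x": 3}, {"x": 42}, {"y": 7}]
kept = [r for r in records if keep_record(x_greater_than_10(r))]
print(kept)
```

Under these rules the record without an x field is kept rather than rejected or treated as an error, which is the behavior the last bullet describes.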

## Location of boolean expression for filter

You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:

<pre class="pre-highlight-in-pair">
mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $quantity >= 70;
  # For records that do pass through, set these:
  if ($rate > 8) {
    $description = "high rate";
  } else {
    $description = "low rate";
  }
'
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape    flag  k  index quantity rate   description
red    square   true  2  15    79.2778  0.0130 low rate
red    square   false 4  48    77.5542  7.4670 low rate
purple triangle false 5  51    81.2290  8.5910 high rate
red    square   false 6  64    77.1991  9.5310 high rate
purple triangle false 7  65    80.1405  5.8240 low rate
purple square   false 10 91    72.3735  8.2430 high rate
</pre>

<pre class="pre-highlight-in-pair">
mlr --c2p --from example.csv filter '
  # Bare-boolean filter expression: only records matching this pass through:
  $shape =~ "^(...)(...)$";
  # For records that do pass through, capture the first "(...)" into $left and
  # the second "(...)" into $right
  $left = "\1";
  $right = "\2";
'
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape  flag  k  index quantity rate   left right
red    square true  2  15    79.2778  0.0130 squ  are
red    circle true  3  16    13.8103  2.9010 cir  cle
red    square false 4  48    77.5542  7.4670 squ  are
red    square false 6  64    77.1991  9.5310 squ  are
yellow circle true  8  73    63.9785  4.2370 cir  cle
yellow circle true  9  87    63.5058  8.3350 cir  cle
purple square false 10 91    72.3735  8.2430 squ  are
</pre>
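As a rough analogy for the second example, the Python sketch below treats the bare-boolean statement as the value that decides whether a record is kept, with field assignments happening alongside it. It is not Miller's implementation: `split_shape` is a hypothetical name, Python's `re.search` stands in for Miller's `=~` operator, and only the shape field is modeled.

```python
import re

# Analogy for:  $shape =~ "^(...)(...)$"; $left = "\1"; $right = "\2";
# split_shape is a hypothetical name; re.search stands in for Miller's =~.

def split_shape(record):
    match = re.search(r"^(...)(...)$", record["shape"])
    keep = match is not None      # the bare-boolean filter condition
    if keep:
        # Records that fail the match are dropped, so their left and right
        # values never appear in the output and are not modeled here.
        record["left"] = match.group(1)
        record["right"] = match.group(2)
    return keep

shapes = [{"shape": "square"}, {"shape": "circle"}, {"shape": "triangle"}]
kept = [r for r in shapes if split_shape(r)]
print(kept)
```

Only six-character shapes match the pattern, which is why triangle records are missing from the output shown above.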

There are more details and more choices, of course, as detailed in the following sections.