docs/src/reference-dsl.md
DSL stands for domain-specific language: it's a language particular to Miller which you can use to write expressions that specify custom transformations to your data. (See Miller programming language for an introduction.)
While put and filter are verbs, they're different
from the rest in that they let you use the DSL. So we often contrast the DSL
(things you can do in the put and filter verbs) with verbs (things you
can do using the verbs other than put and filter).
Here's a comparison of verbs and put/filter DSL expressions:
Example:
<pre class="pre-highlight-in-pair">
<b>mlr stats1 -a sum -f x -g a data/small</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
</pre>

Example:
<pre class="pre-highlight-in-pair">
<b>mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,x_sum=0.346791
a=eks,x_sum=1.140078
a=wye,x_sum=0.777891
</pre>

Please see Verbs Reference for information on verbs other than put and filter.
The most important point about the Miller DSL is that it is designed for streaming operation over records.
DSL statements include:
* func and subr for user-defined functions and subroutines, which we'll look at later in the separate page about them;
* begin and end blocks, for statements you want to run before the first record, or after the last one;
* and everything else, collectively called main statements.

The feature of streaming operation over records is implemented by the main statements getting invoked once per record. You don't explicitly loop over records, as you would in some dataframes contexts; rather, Miller loops over records for you, and it lets you specify what to do on each record: you write the body of the loop, not the loop itself.
(But you can, if you like, use those per-record statements to grow a list of
records, then loop over them all in an end block. This is described in the
page on operating on all records.)
To see this in action, let's take a look at the data/short.csv file:
<pre class="pre-highlight-in-pair">
<b>cat data/short.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
word,value
apple,37
ball,28
cat,54
</pre>

There are three records in this file, with word=apple, word=ball, and
word=cat, respectively. Let's print something in a begin statement, add a
field in a main statement, and print something else in an end statement:
The print statements for begin and end went out before the first record
was seen and after the last was seen; the field-creation statement $nr = NR
was invoked three times, once for each record. We didn't explicitly loop over
records, since Miller was already looping over records, and invoked our main
statement on each loop iteration.
For almost all simple uses of the Miller programming language, this implicit looping over records is probably all you will need. (For more involved cases you can see the pages on operating on all records, out-of-stream variables, and two-pass algorithms.)
The essential usages of mlr filter and mlr put are for record-selection and
record-updating expressions, respectively. For example, given the following
input data:
you might retain only the records whose a field has value eks:
or you might add a new field which is a function of existing fields:
<pre class="pre-highlight-in-pair">
<b>mlr put '$ab = $a . "_" . $b ' data/small</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a=pan,b=pan,i=1,x=0.346791,y=0.726802,ab=pan_pan
a=eks,b=pan,i=2,x=0.758679,y=0.522151,ab=eks_pan
a=wye,b=wye,i=3,x=0.204603,y=0.338318,ab=wye_wye
a=eks,b=wye,i=4,x=0.381399,y=0.134188,ab=eks_wye
a=wye,b=pan,i=5,x=0.573288,y=0.863624,ab=wye_pan
</pre>

The two verbs mlr filter and mlr put are essentially the same. The only differences are:
* mlr filter expressions may not reference the filter keyword within them.
* mlr filter should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)
    * If the expression evaluates to false, the record does not pass through.
    * If it evaluates to true or absent, the record passes through.
    * The absent case is for Miller's record-heterogeneity guarantees: it's not an error to filter for $x > 10 if the current record has no $x field.

You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:
<pre class="pre-highlight-in-pair">
<b>mlr --c2p --from example.csv filter '</b>
<b>  # Bare-boolean filter expression: only records matching this pass through:</b>
<b>  $quantity >= 70;</b>
<b>  # For records that do pass through, set these:</b>
<b>  if ($rate > 8) {</b>
<b>    $description = "high rate";</b>
<b>  } else {</b>
<b>    $description = "low rate";</b>
<b>  }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape    flag  k  index quantity rate   description
red    square   true  2  15    79.2778  0.0130 low rate
red    square   false 4  48    77.5542  7.4670 low rate
purple triangle false 5  51    81.2290  8.5910 high rate
red    square   false 6  64    77.1991  9.5310 high rate
purple triangle false 7  65    80.1405  5.8240 low rate
purple square   false 10 91    72.3735  8.2430 high rate
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr --c2p --from example.csv filter '</b>
<b>  # Bare-boolean filter expression: only records matching this pass through:</b>
<b>  $shape =~ "^(...)(...)$";</b>
<b>  # For records that do pass through, capture the first "(...)" into $left and</b>
<b>  # the second "(...)" into $right</b>
<b>  $left = "\1";</b>
<b>  $right = "\2";</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color  shape  flag  k  index quantity rate   left right
red    square true  2  15    79.2778  0.0130 squ  are
red    circle true  3  16    13.8103  2.9010 cir  cle
red    square false 4  48    77.5542  7.4670 squ  are
red    square false 6  64    77.1991  9.5310 squ  are
yellow circle true  8  73    63.9785  4.2370 cir  cle
yellow circle true  9  87    63.5058  8.3350 cir  cle
purple square false 10 91    72.3735  8.2430 squ  are
</pre>

There are more details and more choices, of course, as detailed in the following sections.