docs/src/reference-main-null-data.md
One of Miller's key features is its support for heterogeneous data. For example, take mlr sort: if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.
Miller has three kinds of null data:
Empty (key present, value empty): a field name is present in a record (or in an out-of-stream variable) with empty value: e.g. x=,y=2 in the data input stream, or assignment $x="" or @x="" in mlr put.
Absent (key not present): a field name is not present, e.g. input record is x=1,y=2 and a put or filter expression refers to $z. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g. mlr put -q '@sum += $x; end{emit @sum}' or mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'.
JSON null: The main purpose of this is to support reading the null type in JSON files. The Miller programming language has a null keyword as well, so you can also write the null type using $x = null. Additionally, though, when you write past the end of an array, leaving gaps -- e.g. writing a[12] when the array a has length 10 -- JSON-null is used to fill the gaps. See also the arrays page.
You can test these programmatically using the functions is_empty/is_not_empty, is_absent/is_present, and is_null/is_not_null. For the last pair, note that null means either empty or absent. Here is a full list of such functions:
with the exception that the min and max functions are special: if one argument is non-null, it wins:
Likewise, empty works like 0 for addition and subtraction, and like 1 for multiplication:
<pre class="pre-highlight-in-pair"> <b>echo 'x=,y=3' | mlr put '$a = $x + $y; $b = $x - $y; $c = $x * $y'</b> </pre> <pre class="pre-non-highlight-in-pair"> x=,y=3,a=3,b=-3,c=3 </pre>This is intended to follow the arithmetic rule for absent data (explained next). In particular:
$x. If a given record doesn't have $x, then $x will be absent for that record, and the sum should simply continue.$x, then $x will be empty for that record, and the sum should simply continue.mlr put '$y = log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, any expression which evaluates to absent is not stored in the left-hand side of an assignment statement:The reasoning is as follows:
Absent stream-record values should not break accumulations, since Miller by design handles heterogeneous data: the running @sum in mlr put '@sum += $x' should not be invalidated for records which have no x.
Absent out-of-stream-variable values are precisely what allow you to write mlr put '@sum += $x'. Otherwise you would have to write mlr put 'begin{@sum = 0}; @sum += $x' -- which is tolerable -- but for mlr put 'begin{...}; @sum[$a][$b] += $x' you'd have to pre-initialize @sum for all values of $a and $b in your input data stream, which is intolerable.
The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}' the accumulator is spelt @sumx in the begin-block but @sum in the end-block, where since it is absent, @sum*2 evaluates to 2. See also the section on DSL errors and transparency.
Since absent plus absent is absent (and likewise for other operators), accumulations such as @sum += $x work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with is_present:
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for +, &&, and ||`. Notes:
&& and || are similar to +.&& and || obey short-circuiting semantics. That is:
false && X is false and X is not evaluated even if it is a complex expression (maybe including function calls)true || X is true and X is not evaluated even if it is a complex expression (maybe including function calls)false && X is false even if X is an error, a non-boolean type, etc.true || X is true even if X is an error, a non-boolean type, etc.