docs/src/record-heterogeneity.md
We think of CSV tables as rectangular: if there are 17 columns in the header, then there are 17 columns for every row, else the data has a formatting error.
But heterogeneous data abound -- log-file entries, JSON documents, no-SQL databases such as MongoDB, etc. -- not to mention data-cleaning opportunities we'll look at on this page. Miller offers several ways to handle data heterogeneity.
Different kinds of heterogeneous data include ragged, irregular, and sparse.
A homogeneous list of records is one in which all records have the same keys, in the same order. For example, here is a well-formed CSV file:
<pre class="pre-highlight-in-pair"> <b>mlr --csv cat data/het/hom.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b,c 1,2,3 4,5,6 7,8,9 </pre>It has three records (written here using JSON Lines formatting):
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojsonl cat data/het/hom.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> {"a": 1, "b": 2, "c": 3} {"a": 4, "b": 5, "c": 6} {"a": 7, "b": 8, "c": 9} </pre>Here every row has the same keys, in the same order: a,b,c.
These are also sometimes called rectangular since if we pretty-print them, we get a nice rectangle:
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --opprint cat data/het/hom.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a b c 1 2 3 4 5 6 7 8 9 </pre>A second example has some empty cells which could be filled:
<pre class="pre-highlight-in-pair"> <b>mlr --csv cat data/het/fillable.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b,c 1,2,3 4,,6 ,8,9 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --icsv --ojsonl cat data/het/fillable.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> {"a": 1, "b": 2, "c": 3} {"a": 4, "b": "", "c": 6} {"a": "", "b": 8, "c": 9} </pre>This example is still homogeneous, though: every row has the same keys, in the same order: a,b,c.
Empty values don't make the data heterogeneous.
Note, however, that we can use the fill-empty verb to make these
values non-empty, if we like:
Next, let's look at non-well-formed CSV files. For a third example:
<pre class="pre-highlight-in-pair"> <b>cat data/het/ragged.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b,c 1,2,3 4,5 7,8,9,10 </pre>If you mlr --csv cat this, you'll get an error message:
There are two kinds of raggedness here. Since CSVs form records by zipping the keys from the header line, together with the values from each data line, the second record has a missing value for key c (which ought to be fillable), while the third record has a value 10 with no key for it.
Using the --allow-ragged-csv-input flag, we can fill values in too-short rows and provide a key (column number starting with 1) for too-long rows:
Here's another situation -- this file has, in some sense, the "same" data as
our ragged.csv example above:
For example, on the second record, a is 4, b is 5, c is 6. But this data
is heterogeneous because the keys a,b,c aren't in the same order in each
record.
This kind of data arises often in practice. One reason is that, while many
programming languages (including the Miller DSL) preserve insertion
order in maps; others do
not. So someone might have written {"a":4,"b":5,"c":6} in the source code,
but the data may not have been printed that way into a given data file.
We can use the regularize or
sort-within-records verb to order
the keys:
The regularize verb tries to re-order subsequent rows to look like the first
(whatever order that is); the sort-within-records verb simply uses
alphabetical order (which is the same in the above example, where the first
record has keys in the order a,b,c).
Here's another frequently occurring situation -- quite often, systems will log data for items that are present, but won't log data for items that aren't.
<pre class="pre-highlight-in-pair"> <b>mlr --json cat data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "host": "xy01.east", "status": "running", "volume": "/dev/sda1" }, { "host": "xy92.west", "status": "running" }, { "purpose": "failover", "host": "xy55.east", "volume": "/dev/sda1", "reimaged": true } ] </pre>This data is called sparse (from the data-storage term).
We can use the unsparsify verb to make sure every record has the same keys:
Since this data is now homogeneous (rectangular), it pretty-prints nicely:
<pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint unsparsify data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> host status volume purpose reimaged xy01.east running /dev/sda1 - - xy92.west running - - - xy55.east - /dev/sda1 failover true </pre>In the previous sections, we saw different kinds of data heterogeneity and ways to transform the data to make it homogeneous.
For these formats, record heterogeneity comes naturally:
<pre class="pre-highlight-in-pair"> <b>cat data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "host": "xy01.east", "status": "running", "volume": "/dev/sda1" } { "host": "xy92.west", "status": "running" } { "purpose": "failover", "host": "xy55.east", "volume": "/dev/sda1", "reimaged": true } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --onidx --ofs ' ' cat data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> xy01.east running /dev/sda1 xy92.west running failover xy55.east /dev/sda1 true </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --oxtab cat data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> host xy01.east status running volume /dev/sda1 host xy92.west status running purpose failover host xy55.east volume /dev/sda1 reimaged true </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --odkvp cat data/het/sparse.json</b> </pre> <pre class="pre-non-highlight-in-pair"> host=xy01.east,status=running,volume=/dev/sda1 host=xy92.west,status=running purpose=failover,host=xy55.east,volume=/dev/sda1,reimaged=true </pre>Even then, we may wish to put like with like, using the group-like verb:
CSV and pretty-print formats expect a rectangular structure. But Miller lets you process non-rectangular using CSV and pretty-print.
For CSV-lite and TSV-lite, Miller prints a newline and a new header when there is a schema change -- where by schema we mean the list of record keys in the order they are encountered. When there is no schema change, you get CSV per se as a special case. Likewise, Miller reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite is that the former is RFC-4180-compliant, while the latter readily handles heterogeneous data (which is non-compliant). For example:
<pre class="pre-highlight-in-pair"> <b>cat data/het.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "resource": "/path/to/file", "loadsec": 0.45, "ok": true } { "record_count": 100, "resource": "/path/to/file" } { "resource": "/path/to/second/file", "loadsec": 0.32, "ok": true } { "record_count": 150, "resource": "/path/to/second/file" } { "resource": "/some/other/path", "loadsec": 0.97, "ok": false } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint cat data/het.json</b> </pre> <pre class="pre-non-highlight-in-pair"> resource loadsec ok /path/to/file 0.45 true record_count resource 100 /path/to/file resource loadsec ok /path/to/second/file 0.32 true record_count resource 150 /path/to/second/file resource loadsec ok /some/other/path 0.97 false </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint group-like data/het.json</b> </pre> <pre class="pre-non-highlight-in-pair"> resource loadsec ok /path/to/file 0.45 true /path/to/second/file 0.32 true /some/other/path 0.97 false record_count resource 100 /path/to/file 150 /path/to/second/file </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsvlite group-like data/het.json</b> </pre> <pre class="pre-non-highlight-in-pair"> resource,loadsec,ok /path/to/file,0.45,true /path/to/second/file,0.32,true /some/other/path,0.97,false record_count,resource 100,/path/to/file 150,/path/to/second/file </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv group-like data/het.json</b> </pre> <pre class="pre-non-highlight-in-pair"> resource,loadsec,ok /path/to/file,0.45,true /path/to/second/file,0.32,true /some/other/path,0.97,false mlr: CSV schema change: first keys "resource,loadsec,ok"; current keys "record_count,resource" mlr: exiting due to data error </pre>Miller handles explicit header changes as shown. If your CSV input contains ragged data -- if there are implicit header changes (no intervening blank line and new header line) as seen above -- you can use --allow-ragged-csv-input (or keystroke-saver --ragged).
Above we saw how to make heterogeneous data homogeneous, and then how to print heterogeneous data. As for other processing, record heterogeneity is not a problem for Miller.
Miller operates on specified fields and takes the rest along: for example, if
you are sorting on the count field, then all records in the input stream must
have a count field, but the other fields can vary---and moreover the sorted-on
field name(s) don't need to be in the same position on each line: