docs/src/flatten-unflatten.md
Miller has long supported reading and writing multiple file formats including CSV and JSON, as well as converting back and forth between them. Two things new in Miller 6, though, are that arrays are now fully supported, and that record values are typed throughout Miller's processing chain from input through verbs to output -- which includes improved handling for maps and arrays as record values.
This raises the question, though, of how to handle maps and arrays as record values. For JSON files, this is easy -- JSON is a nested format where values can be maps or arrays, which can contain other maps or arrays, and so on, with the nesting happily indicated by curly braces:
<pre class="pre-highlight-in-pair"> <b>cat data/map-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": {"x": 2, "y": 3} } { "a": 4, "b": {"x": 5, "y": 6} } </pre> <pre class="pre-highlight-in-pair"> <b>cat data/map-values-nested.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": {"s": {"w": 2, "x": 3}, "t": {"y": 4, "z": 5}} } { "a": 6, "b": {"s": {"w": 7, "x": 8}, "t": {"y": 9, "z": 10}} } </pre>How can we represent these in CSV files?
Miller's non-JSON formats, such as CSV, are all non-nested -- a cell in a CSV row can't contain another entire row. As we'll see in this section, there are two main ways to flatten nested data structures down to individual CSV cells -- either by key-spreading (which is the default), or by JSON-stringifying:
b={"x": 2, "y": 3} spreads into multiple fields b.x=2,b.y=3;"b": {"x": 2, "y": 3} becomes the single string-valued field b="{\"x\":2,\"y\":3}".Miller intends to provide intuitive default behavior for these conversions, while also providing you with more control when you need it.
Let's first look at the default behavior with map-valued fields. Miller's
default behavior is to spread the map values into multiple keys -- using
Miller's flatsep separator, which defaults to . -- to join the original
record key with map keys:
Flattened to CSV format:
<pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv cat data/map-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b.x,b.y 1,2,3 4,5,6 </pre>Flattened to pretty-print format:
<pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint cat data/map-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b.x b.y 1 2 3 4 5 6 </pre>Using flatten-separator : instead of the default .:
If the maps are more deeply nested, each level of map keys is joined in:
<pre class="pre-highlight-in-pair"> <b>cat data/map-values-nested.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": {"s": {"w": 2, "x": 3}, "t": {"y": 4, "z": 5}} } { "a": 6, "b": {"s": {"w": 7, "x": 8}, "t": {"y": 9, "z": 10}} } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint cat data/map-values-nested.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b.s.w b.s.x b.t.y b.t.z 1 2 3 4 5 6 7 8 9 10 </pre>Unflattening is simply the reverse -- from non-JSON back to JSON:
<pre class="pre-highlight-in-pair"> <b>cat data/map-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": {"x": 2, "y": 3} } { "a": 4, "b": {"x": 5, "y": 6} } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv cat data/map-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b.x,b.y 1,2,3 4,5,6 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv cat data/map-values.json | mlr --icsv --ojson cat</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "a": 1, "b": { "x": 2, "y": 3 } }, { "a": 4, "b": { "x": 5, "y": 6 } } ] </pre>If the input data contains arrays, these are also flattened similarly: the
1-up array indices 1,2,3,... become string keys
"1","2","3",...:
If the arrays are more deeply nested, each level of arrays keys is joined in:
<pre class="pre-highlight-in-pair"> <b>cat data/array-values-nested.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": [[2, 3], [4, 5]] } { "a": 6, "b": [[7, 8], [9, 10]] } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --opprint cat data/array-values-nested.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a b.1.1 b.1.2 b.2.1 b.2.2 1 2 3 4 5 6 7 8 9 10 </pre>In the nested-data examples shown here, nested map values are shown containing maps, and nested array values are shown containing arrays -- of course (even though not shown here) nested map values can contain arrays, and vice versa.
Unflattening arrays is, again, simply the reverse -- from non-JSON back to JSON:
<pre class="pre-highlight-in-pair"> <b>cat data/array-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> { "a": 1, "b": [2, 3] } { "a": 4, "b": [5, 6] } </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv cat data/array-values.json</b> </pre> <pre class="pre-non-highlight-in-pair"> a,b.1,b.2 1,2,3 4,5,6 </pre> <pre class="pre-highlight-in-pair"> <b>mlr --ijson --ocsv cat data/array-values.json | mlr --icsv --ojson cat</b> </pre> <pre class="pre-non-highlight-in-pair"> [ { "a": 1, "b": [2, 3] }, { "a": 4, "b": [5, 6] } ] </pre>Note that the CSV field names b.x and b.y aren't too different from b.1
and b.2. Miller has the heuristic that if it's unflattening and gets a map
with keys "1", "2", etc. -- starting with "1", consecutively, and with
no gaps -- it turns that back into an array. This is precisely to undo the
flatten conversion. However, it may (or may not) be surprising:
An additional heuristic is that if a field name starts with a ., ends with
a ., or has two or more consecutive . characters, no attempt is made
to unflatten it on conversion from non-JSON to JSON.
To see what our options are for manually controlling flattening and unflattening (if the defaults aren't working for us in a particular situation), let's first look a little into how they're implemented.
mlr ... cat then sort ... (some chain of verbs) then Miller appends, in effect, then flatten to the end of the chain.
--no-auto-flatten flag.then unflatten to the end of the chain.
--no-auto-unflatten flag.Note in particular that auto-flatten happens even when the input format and the output format are both non-JSON, e.g. even for CSV-to-CSV processing. This is because map-valued/array-valued fields can be produced using DSL statements:
<pre class="pre-highlight-in-pair"> <b>cat data/hostnames.csv</b> </pre> <pre class="pre-non-highlight-in-pair"> host,status apoapsis.east.our.org,up nadir.west.our.org,down </pre>Using JSON output, we can see that splita has produced an array-valued field named components:
Using CSV output, with default auto-flatten, we get components.1 through components.4:
Using CSV output, without default auto-flatten, we get a JSON-stringified encoding of the components field:
Now suppose we ran this
<pre class="pre-highlight-in-pair"> <b>mlr --icsv --oxtab --from data/hostnames.csv --no-auto-flatten put '</b> <b> $a = splita($host, ".");</b> <b> $b = splita($host, ".");</b> <b>'</b> </pre> <pre class="pre-non-highlight-in-pair"> host apoapsis.east.our.org status up a ["apoapsis", "east", "our", "org"] b ["apoapsis", "east", "our", "org"] host nadir.west.our.org status down a ["nadir", "west", "our", "org"] b ["nadir", "west", "our", "org"] </pre>into a file data/hostnames.xtab:
<pre class="pre-highlight-in-pair"> <b>cat data/hostnames.xtab</b> </pre> <pre class="pre-non-highlight-in-pair"> host apoapsis.east.our.org status up a ["apoapsis", "east", "our", "org"] b ["apoapsis", "east", "our", "org"] host nadir.west.our.org status down a ["nadir", "west", "our", "org"] b ["nadir", "west", "our", "org"] </pre>This was written with --no-auto-unflatten so we need to manually revive the
array-valued fields, if we choose -- here, we can JSON-parse the a field and
leave b JSON-stringified:
See also the JSON parse and stringify section section for more on this -- for example, when Miller is producing SQL-query output from tables having one or more columns that contain JSON-encoded data.