# New in Miller 6
Features are driven largely by great feedback in the 2021 Miller User Survey results and GitHub Issues.
TL;DRs: installation, binaries, and compatibility with Miller 5.
Performance is on par with Miller 5 for simple processing, and is far better than Miller 5 for complex processing chains -- the latter due to improved multicore utilization. CSV I/O is notably improved. See the Performance benchmarks section at the bottom of this page for details.
Documentation (what you're reading here) and online help (mlr --help) have been completely reworked.
In the initial release, the focus was on convincing users already familiar with awk, grep, and cut that Miller was a viable alternative; however, over time, it has become clear that many Miller users aren't experts with those tools. The focus has shifted toward a higher quantity of more introductory/accessible material for command-line data processing.
Similarly, the FAQ/recipe material has been expanded to include more, and simpler, use-cases, including resolved questions from Miller Issues and Miller Discussions; more complex/niche material has been pushed farther down. The lengthy reference pages have been divided into separate pages. (See also Structure of these documents.)
One of the main feedback themes from the 2021 Miller User Survey was that some things should be easier to find. Namely, on each doc page, there's now a banner across the top with things that should be one click away from the landing page (or any page): command-line flags, verbs, functions, glossary/acronyms, and a finder for docs by release.
Since CSV is overwhelmingly the most popular data format for Miller, it is now discussed first, and more examples use CSV.
Stronger support for Windows (with or without MSYS2), with a couple of exceptions. See Miller on Windows for more information.
Binaries are reliably available using GitHub Actions: see also Installation.
Miller uses separate, customizable colors for keys and values whenever the output is to a terminal. See Output Colorization.
Miller 6 has getoptish command-line parsing (pull request 467):
* `-xyz` expands automatically to `-x -y -z`, so (for example) `mlr cut -of shape,flag` is the same as `mlr cut -o -f shape,flag`.
* `--foo=bar` expands automatically to `--foo bar`, so (for example) `mlr --ifs=comma` is the same as `mlr --ifs comma`.
* `--mfrom`, `--load`, `--mload` as described in the flags reference.
* A small but nice item: since `mlr --csv` and `mlr --json` are so common, you can now use the alternate shorthands `mlr -c` and `mlr -j`, respectively.
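These expansions can be sketched in Python. This is a simplified model for illustration, not Miller's actual Go implementation; a real getopt-style parser would also check that each clustered letter is a known flag before expanding it.

```python
# Sketch of the two getopt-style expansions described above:
# "-xyz" -> "-x -y -z" and "--foo=bar" -> "--foo bar".
def expand_args(args):
    out = []
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            # --foo=bar expands to --foo bar
            flag, value = arg.split("=", 1)
            out.extend([flag, value])
        elif arg.startswith("-") and not arg.startswith("--") and len(arg) > 2:
            # -xyz expands to -x -y -z
            out.extend("-" + ch for ch in arg[1:])
        else:
            out.append(arg)
    return out

print(expand_args(["cut", "-of", "shape,flag"]))  # ['cut', '-o', '-f', 'shape,flag']
print(expand_args(["--ifs=comma"]))               # ['--ifs', 'comma']
```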
For mlr put and mlr filter, parse-error messages now include location information:
Scripting is now easier: as always, you can use `#!` with `sh`; now you can also use `#!` with `mlr -s`, which works on Windows as well. These help reduce backslash clutter and let you do more while typing less. See the scripting page.
Miller now has a read-evaluate-print loop (REPL) where you can single-step through your data-file records, express arbitrary statements to converse with the data, etc.
<pre class="pre-highlight-in-pair">
<b>mlr repl</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[mlr] 1 + 2
3
[mlr] apply([1,2,3,4,5], func(e) {return e ** 3})
[1, 8, 27, 64, 125]
[mlr] :open example.csv
[mlr] :read
[mlr] $*
{
  "color": "yellow",
  "shape": "triangle",
  "flag": "true",
  "k": 1,
  "index": 11,
  "quantity": 43.6498,
  "rate": 9.8870
}
</pre>

You can now write field names, local variables, etc. all in UTF-8, e.g. `mlr --c2p filter '$σχήμα == "κύκλος"' παράδειγμα.csv`. See the internationalization page for examples.
Datetime/timezone functionality is improved, including support for specifying the timezone via function arguments, as an alternative to the `TZ` environment variable. Please see DSL datetime/timezone functions.
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that: Miller detects the file extension and uncompresses automatically, e.g. `mlr --csv cat foo.csv.gz`. The same holds for `.z`, `.bz2`, and `.zst` files. Please refer to the page on Compressed Data for more information.
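The extension-based dispatch can be sketched in Python. This is an illustration, not Miller's implementation; Python's standard library covers `.gz` and `.bz2` but not `.z` or `.zst`, so only the former appear here.

```python
import bz2
import gzip
import os
import tempfile

# Map file extensions to opener functions; fall back to plain open().
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

def open_maybe_compressed(path):
    ext = os.path.splitext(path)[1]
    return OPENERS.get(ext, open)(path, "rt")

# Round-trip demo: write a gzipped CSV, then read it back transparently.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.csv.gz")
    with gzip.open(path, "wt") as f:
        f.write("a,b\n1,2\n")
    with open_maybe_compressed(path) as f:
        print(f.read())  # prints the original CSV text
```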
You can read input from URLs with the prefixes `https://`, `http://`, and `file://`.
Arrays are now supported in Miller's put/filter programming language, as described in the Arrays reference. (Also, array is now a keyword, so this is no longer usable as a local variable or UDF name.)
JSON support is improved:
* [...], Miller now handles JSON / JSON Lines in `tail -f` contexts like it does for other file formats. (Note: use `--records-per-batch 1` for `tail -f` input.)
* `--jvquoteall` and `--jknquoteint` no longer have meaning, although they're accepted as no-ops at the command line for backward compatibility. (`--jvquoteall` was restored shortly after the 6.4.0 release.)
* JSON vs JSON Lines:
    * [...]; on output, by default it produced single-line records without outermost `[...]`. Miller 5 let you customize output formatting using `--jvstack` (multi-line records) and `--jlistwrap` (write outermost `[...]`). Thus, Miller 5's JSON output format, with default flags, was in fact JSON Lines all along.
    * [...], and regardless of line breaks, whether the specified input format is JSON or JSON Lines. (This is similar to `jq`.)
    * With `--ojson`, output records are written multiline (pretty-printed), with outermost `[...]`.
    * With `--ojsonl`, output records are written single-line, without outermost `[...]`.
    * This makes `--jvstack` and `--jlistwrap` unnecessary. However, if you want outermost `[...]` with single-line records, you can use `--ojson --no-jvstack`.
* See also the Arrays reference for more information.
The most central part of Miller 6 is a deep refactor of how data values are parsed from file contents, how types are inferred, and how they're converted back to text into output files.
This was all initiated by https://github.com/johnkerl/miller/issues/151.
In Miller 5 and below, all values were stored as strings, then only converted to int/float as needed, for example, when a particular field was referenced in the stats1 or put verbs. This led to awkwardnesses such as the -S and -F flags for put and filter.
In Miller 6, values parseable as integers or floating-point numbers are treated as such from the moment the input data is read, and these are passed along through the verb chain. All values are typed from when they're read, and their types are passed along. Meanwhile, the original string representation of each value is also retained. If a numeric field isn't modified during the processing chain, it's printed out the way it arrived. Additionally, quoted values in JSON strings are consistently flagged as strings throughout the processing chain.
For example (see https://github.com/johnkerl/miller/issues/178) you can now do
<pre class="pre-highlight-in-pair">
<b>echo '{ "a": "0123" }' | mlr --json cat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
  { "a": "0123" }
]
</pre>
<pre class="pre-highlight-in-pair">
<b>echo '{ "x": 1.230, "y": 1.230000000 }' | mlr --json cat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
  { "x": 1.230, "y": 1.230000000 }
]
</pre>

By default, field names are deduplicated for all file formats except JSON / JSON Lines. So if you have an input record with `x=8,x=9`, then the second field's key is renamed to `x_2` and so on: the record scans as `x=8,x_2=9`. Use `mlr --no-dedupe-field-names` to suppress this, and have the record be scanned as `x=9`.
For JSON and JSON Lines, the last duplicated key in an input record is always retained, regardless of mlr --no-dedupe-field-names: {"x":8,"x":9} scans as if it were {"x":9}.
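Python's `json` module happens to follow the same last-one-wins rule for duplicated keys, which makes for a quick way to see the behavior:

```python
import json

# For duplicated keys in a JSON object, the last occurrence wins,
# matching Miller's handling of JSON / JSON Lines input.
print(json.loads('{"x":8,"x":9}'))  # {'x': 9}
```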
You can now split fields on whitespace when whitespace is a mix of tabs and spaces. As well, you can use regular expressions for the input field-separator and the input pair-separator. Please see the section on multi-character and regular-expression separators.
In particular, for NIDX format, the default IFS now allows splitting on one or more of space or tab.
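A regex split over runs of spaces and tabs, shown here with Python's `re.split`, gives the same effect as the new default NIDX behavior:

```python
import re

# Split a line on one-or-more spaces/tabs, however they're mixed.
line = "yellow\t triangle   true"
print(re.split(r"[ \t]+", line))  # ['yellow', 'triangle', 'true']
```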
The sort verb now accepts -c and -cr options for case-folded ascending/descending sort, respectively.
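A case-folded sort, analogous to `sort -c` / `sort -cr`, can be illustrated in Python:

```python
# Case-folded ascending and descending sorts: "Apple" sorts before
# "banana" despite its uppercase first letter.
words = ["banana", "Apple", "cherry"]
print(sorted(words, key=str.casefold))                # ['Apple', 'banana', 'cherry']
print(sorted(words, key=str.casefold, reverse=True))  # ['cherry', 'banana', 'Apple']
```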
Higher-order functions select, apply, reduce, fold, and sort. See the sorting page and the higher-order-functions page for more information.
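Python's built-ins are rough analogues of these higher-order functions (`select` ~ `filter`, `apply` ~ `map`, `reduce`/`fold` ~ `functools.reduce`), shown here as a sketch rather than a statement about Miller's DSL semantics:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]
# select: keep elements matching a predicate
print(list(filter(lambda e: e % 2 == 1, nums)))  # [1, 3, 5]
# apply: transform each element
print(list(map(lambda e: e ** 3, nums)))         # [1, 8, 27, 64, 125]
# reduce/fold: combine elements into one value, with an initial accumulator
print(reduce(lambda acc, e: acc + e, nums, 0))   # 15
```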
Absent-coalesce operator ?? along with ??=;
absent-empty-coalesce operator ??? along with ???=.
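A simplified Python model of the coalescing rules, using a sentinel object for Miller's "absent"; this sketch ignores other Miller value types and is only meant to show the distinction between `??` and `???`:

```python
# Sentinel standing in for Miller's "absent" value.
ABSENT = object()

def absent_coalesce(a, b):
    """Model of a ?? b: fall back only when a is absent."""
    return b if a is ABSENT else a

def absent_empty_coalesce(a, b):
    """Model of a ??? b: fall back when a is absent OR empty."""
    return b if a is ABSENT or a == "" else a

print(absent_coalesce(ABSENT, "default"))    # default
print(absent_coalesce("", "default"))        # empty string is kept by ??
print(absent_empty_coalesce("", "default"))  # default
```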
The dot operator is not new, but it has a new role: in addition to its existing use for string-concatenation like "a"."b" = "ab", you can now also use it for keying maps. For example, $req.headers.host is the same as $req["headers"]["host"]. See the dot-operator reference for more information.
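A hypothetical helper in Python shows why the two spellings name the same value; `get_dotted` is illustrative only, not part of any Miller API:

```python
# Resolve a dotted path like "headers.host" over nested maps,
# mirroring how $req.headers.host names the same value as
# $req["headers"]["host"].
def get_dotted(record, path):
    value = record
    for key in path.split("."):
        value = value[key]
    return value

req = {"headers": {"host": "example.com"}}
print(get_dotted(req, "headers.host"))  # example.com
print(req["headers"]["host"])           # same value, bracket style
```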
Unsigned right-shift >>> along with >>>=.
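Python lacks a `>>>` operator, but the semantics can be modeled by masking to a fixed width first; the 64-bit width here is an assumption for illustration:

```python
# Unsigned right shift on a 64-bit value: mask to 64 bits so the
# sign bit is shifted in as zero rather than sign-extended.
def urshift64(x, n):
    return (x & 0xFFFFFFFFFFFFFFFF) >> n

print(urshift64(-1, 60))  # 15: zeros shifted in from the left
print(-1 >> 60)           # -1: Python's >> is arithmetic (sign-extending)
```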
The following differences are rather technical. If they don't sound familiar to you, not to worry! Most users won't be affected by the (relatively minor) changes between Miller 5 and Miller 6.
Please drop an issue at https://github.com/johnkerl/miller/issues with any/all backward-compatibility concerns between Miller 5 and Miller 6.
The --auto flag is now ignored. Previously, if a file had CR/LF (Windows-style) line endings on input (on any platform), it would have the same on output, and likewise for LF (Unix-style) line endings. Now, files with CR/LF or LF line endings are processed on any platform, but the output line ending is the platform's own. E.g., reading CR/LF files on Linux will now produce LF output.
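Python's universal-newline handling gives a quick model of the new behavior: line endings are normalized on read, and a single chosen ending is used on write:

```python
# CR/LF input is split the same way LF input would be...
text = "a\r\nb\r\n"
lines = text.splitlines()
print(lines)  # ['a', 'b']

# ...and output is re-joined with one fixed line ending (LF here,
# standing in for "the platform's ending").
print("\n".join(lines))
```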
IFS and IPS can now be regular expressions. Please see the section on multi-character and regular-expression separators.
* `--jknquoteint` and `--jvquoteall` are ignored; they were workarounds for the (now much-improved) type-inference and type-tracking in Miller 6.
* `--json-fatal-arrays-on-input`, `--json-map-arrays-on-input`, and `--json-skip-arrays-on-input` are ignored; Miller 6 now supports arrays fully.
* See `mlr help legacy-flags` or the legacy-flags reference.
* JSON vs JSON Lines:
    * [...]; on output, by default, it produced single-line records without outermost `[...]`. Miller 5 let you customize output formatting using `--jvstack` (multi-line records) and `--jlistwrap` (write outermost `[...]`). Thus, Miller 5's JSON output format, with default flags, was in fact JSON Lines all along.
    * [...], and regardless of line breaks, whether the specified input format is JSON or JSON Lines. (This is similar to `jq`.)
    * With `--ojson`, output records are written multiline (pretty-printed), with outermost `[...]`.
    * With `--ojsonl`, output records are written single-line, without outermost `[...]`.
    * This makes `--jvstack` and `--jlistwrap` unnecessary. However, if you want outermost `[...]` with single-line records, you can use `--ojson --no-jvstack`.
* Trailing commas, e.g. `{"x":1,"y":2,}`: Miller 6 uses a JSON parser that is compliant with the JSON specification and does not accept trailing commas.
* The `-S` and `-F` flags to `mlr put` and `mlr filter` are ignored, since type-inference is no longer done in `mlr put` and `mlr filter`, but rather when records are first read. You can use `mlr -S` and `mlr -A`, respectively, instead to control type-inference within the record-readers.
* `0123` and `07` are type-inferred as string. Use `mlr -O` to infer them as octal integers. Note that `08` and `09` will then infer as decimal integers.
* Integers with a leading `0o`, e.g. `0o377`, are already treated as octal, regardless of `mlr -O`; `mlr -O` only affects how leading-zero integers are handled.
* Variables must be non-indexed on `emit`. Emitting an indexed variable now requires the new `emit1` keyword.
This worked in Miller 5 but is no longer supported in Miller 6:
<pre class="pre-highlight-in-pair">
<b>mlr5 -n put 'end {@input={"a":1}; emit @input["a"]}'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
input=1
</pre>

This works in Miller 6 (and worked in Miller 5 as well) and is supported:
<pre class="pre-highlight-in-pair">
<b>mlr -n put 'end {@input={"a":1}; emit1 {"input":@input["a"]}}'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
input=1
</pre>

Please see the section on emit statements for more information.
See the performance page.