README-dev.md
`make`, `make check`, `make docs`, etc.: see the `Makefile` in the repo base directory.

godoc support is minimal: package-level synopses exist; most func/const/etc. content lacks godoc-style comments. To view doc material, you can:
```
go get golang.org/x/tools/cmd/godoc
cd go
godoc -http=:6060 -goroot .
```

then visit http://localhost:6060.

The Go implementation is auto-built using GitHub Actions: see `.github/workflows/go.yml`. This works splendidly on Linux, MacOS, and Windows.
As I wrote here back in 2015, I couldn't get Rust or Go (or any other language I tried) to do some test-case processing as quickly as C, so I stuck with C.
Either Go has improved since 2015, or I'm a better Go programmer than I used to be, or both -- but as of 2020 I can get Go-Miller to process data about as quickly as C-Miller.
Note: in some sense Go-Miller is less efficient but in a way that doesn't significantly affect wall time. Namely, doing mlr cat on a million-record data file on my bargain-value MacBook Pro, the C version takes about 2.5 seconds and the Go version takes about 3 seconds. So in terms of wall time -- which is what we care most about, how long we have to wait -- it's about the same.
A way to look a little deeper at resource usage is to run htop while processing a 10x larger file, so it'll take 25 or 30 seconds rather than 2.5 or 3. This way we can look at the steady-state resource consumption. I found that the C version -- which is purely single-threaded -- is taking 100% CPU. And the Go version, which uses concurrency and channels and GOMAXPROCS=4, with reader/transformer/writer each on their own CPU, is taking about 240% CPU. So Go-Miller is taking up not just a little more CPU, but a lot more -- yet it does more work in parallel, and finishes the job in about the same amount of time.
Even commodity hardware has multiple CPUs these days -- and the Go code is much easier to read, extend, and improve than the C code -- so I'll call this a net win for Miller.
Donald Knuth famously said: "Programs are meant to be read by humans and only incidentally for computers to execute."
During the coding of Miller, I've been guided by the following:
- README.md files throughout the directory tree are intended to give you a sense of what is where, what to read first and what doesn't need reading right away, and so on -- so you spend a minimum of time being confused or frustrated.
- Names are spelled out in full (e.g. `NewEvaluableLeafNode`), except for a small number of most-used names where a longer name would cause unnecessary line-wraps (e.g. `Mlrval` instead of `MillerValue`, since this appears very very often).
- The `-v` flag in `mlr -n put -v '$y = 3 + 0.1 * $x'` shows you the abstract syntax tree derived from the DSL expression.
- Regression testing is provided by the `mlr regtest` functionality.

Information here is for the benefit of anyone reading/using the Miller Go code. To use the Miller tool at the command line, you don't need to know any of this if you don't want to. :)
Miller is a multi-format record-stream processor, where a record is a sequence of key-value pairs. The basic stream operation is: read records from input files, pass them through a chain of transformers, and write them out.
So, in broad overview, the key packages are:
- The `put` and `filter` verbs.
- `go get golang.org/x/term`.
- `main()` functions are here, for ease of testing.
- The `Mlrval` datatype, which includes string/int/float/boolean/void/absent/error types. These are used for record values, as well as expression/variable values in the Miller put/filter DSL. See also below for more details.
- `Mlrmap`, which is the sequence of key-value pairs representing a Miller record. The key-lookup mechanism is optimized for Miller read/write usage patterns -- please see mlrmap.go for more details.
- The `context`, which supports AWK-like variables such as `FILENAME`, `NF`, `NR`, and so on.
- The CLI parser: given `mlr --icsv --ojson put '$sum = $a + $b' then filter '$sum > 1000' myfile.csv`, it's the CLI parser which makes it possible for Miller to construct a CSV record-reader, a transformer-chain of `put` then `filter`, and a JSON record-writer.
- `pkg/cli`, which was split out to avoid a Go package-import cycle.
- The transformers: `cat`, `tac`, `sort`, `put`, and so on.
- `mlr.bnf`, which is the lexical/semantic grammar file for the Miller put/filter DSL using the PGPG framework. All subdirectories of `pkg/parsing/` are autogen code created by PGPG's processing of `mlr.bnf`. If you need to edit `mlr.bnf`, please use `tools/build-dsl` to autogenerate Go code from it (using the PGPG tool). (This takes several minutes to run.) See also `tools/format-go-in-bnf` (which reads stdin and writes stdout) for automated formatting of the Go bits.
- `ast_types.go`, which is the abstract syntax tree datatype shared between PGPG and Miller. I didn't use a `pkg/dsl/ast` naming convention, although that would have been nice, in order to avoid a Go package-dependency cycle.
- The concrete-syntax-tree executor for DSL expressions such as `$z = $x * 0.3 * $y`. Please see pkg/dsl/cst/README.md for more information.

Throughout the code, records are passed by reference (as are most things, for that matter, to reduce unnecessary data copies). In particular, records can be nil through the reader/transformer/writer sequence.
- A nil record (within the `RecordAndContext` struct) signifies end of input stream.
- Streaming transformers such as `cat`, `cut`, `rename`, etc. produce one output record per input record.
- The `filter` transformer produces one or zero output records per input record depending on whether the record passed the filter.
- The `nothing` transformer produces zero output records.
- The `sort` and `tac` transformers are non-streaming -- they produce zero output records per input record, and instead retain each input record in a list. Then, when the end-of-stream marker is received, they sort/reverse the records and emit them, then they emit the end-of-stream marker.
- `stats1` and `count` also retain input records, then produce output once there is no more input to them.
- Records can be passed along unmodified as in `mlr cat` or modified as in `mlr cut`.
- If a record is not passed along (dropped by `mlr filter` in false cases, for example, or by `mlr nothing`), it will be GCed.
- An exception is `mlr repeat` -- this needs to explicitly copy records instead of producing multiple pointers to the same record.
- Lvalue expressions return `*types.Mlrmap` so they can be assigned to; rvalue expressions return non-pointed types.
- Rvalue expression-evaluators copy `Mlrval`s, but these are very shallow copies -- the int/string/etc. types are copied but maps/arrays are passed by reference.

`Mlrval` is the datatype of record values, as well as expression/variable values in the Miller put/filter DSL. It includes string/int/float/boolean/void/absent/error types, not unlike PHP's zval.
- The absent type is like JavaScript's `undefined` -- it's for times when there is no such key, as in a DSL expression `$out = $foo` when the input record is `$x=3,y=4` -- there is no `$foo`, so `$foo` has absent type. Nothing is written to the `$out` field in this case. See also here for more information.
- The void type is like JavaScript's `null` -- it's for times when there is a key with no value, as in `$out = $x` when the input record is `$x=,$y=4`. This is an overlap with string type, since a void value looks like an empty string. I've gone back and forth on this (including when I was writing the C implementation) -- whether to retain void as a distinct type from empty-string, or not. I ended up keeping it, as it made the Mlrval logic easier to understand.
- The error type is for things like doing type-uncoerced addition of strings. Data-dependent errors are intended to result in `(error)`-valued output, rather than crashing Miller. See also here for more information.
- ... (`BigInt`).
- ... which the `Mlrval` package implements.

Key performance-related PRs for the Go port include:
- A PR which reduced `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data-copies in the DSL.
- Buffered writes. In the C implementation, writes to stdout are buffered a line at a time if the output is to the terminal, or a few KB at a time if not (i.e. file or pipe). Note the cost is how often the process does a write system call, with the associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on, writes were all unbuffered -- several per record. Then buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal.
- Avoiding regexes where a simple `strings.Split` will do.
- The `mlrval.String()` method. Originally this method had a non-pointer receiver, to conform with the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement.
- Lighter type-inference. Given `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's int (like `123`), float (like `123.4` or `1.2e3`), or string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking of strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation.

See also ./README-profiling.md and https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks.
In summary: