docs/src/reference-main-regular-expressions.md
Miller lets you use regular expressions (of the types accepted by Go) in the following contexts:
* In `mlr filter` with `=~` or `!=~`, e.g. `mlr filter '$url =~ "http.*com"'`
* In `mlr put` with `regextract`, e.g. `mlr put '$output = regextract($input, "[a-z][a-z][0-9][0-9]")'`
* In `mlr put` with `sub` or `gsub`, e.g. `mlr put '$url = sub($url, "http.*com", "")'`
* In `mlr having-fields`, e.g. `mlr having-fields --any-matching '^sda[0-9]'`
* In `mlr cut`, e.g. `mlr cut -r -f '^status$,^sda[0-9]'`
* In `mlr rename`, e.g. `mlr rename -r '^(sda[0-9]).*$,dev/\1'`
* In `mlr grep`, e.g. `mlr --csv grep 00188555487 myfiles*.csv`
Points demonstrated by the above examples:
* There are no implicit start-of-string or end-of-string anchors; please use `^` and/or `$` explicitly.
* Miller regexes are wrapped with double quotes rather than slashes.
* An `i` after the ending double quote, as in `"http.*com"i`, indicates a case-insensitive regex.
* Capture groups are wrapped with `(...)` rather than `\(...\)`; use `\(` and `\)` to match against parentheses.
Example, with the regex taken from a field in the data:

<pre class="pre-highlight-in-pair">
<b>cat data/regex-in-data.dat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name=jane,regex=^j.*e$
name=bill,regex=^b[ou]ll$
name=bull,regex=^b[ou]ll$
</pre>

<pre class="pre-highlight-in-pair">
<b>mlr filter '$name =~ $regex' data/regex-in-data.dat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name=jane,regex=^j.*e$
name=bull,regex=^b[ou]ll$
</pre>

## The `=~` operator

Regex captures of the form `\0` through `\9` are supported as follows:
* For `sub` and `gsub`, captures are local to each call. For example, when one statement calls `sub` twice, the first `\1,\2` pair belongs to the first `sub` and the second `\1,\2` pair belongs to the second `sub`.
* Within a `put`, for the `=~` and `!=~` operators, captures persist across statements. For example, `\1,\2` set by the `=~` operator can be used by subsequent assignment statements.
* Captures are function-scoped, so `\1,\2` used in a different function won't be expanded from the regex capture.
* The available captures are `\1` through `\9`, while `\0` is the entire match string; `\15` is treated as `\1` followed by an unrelated `5`.

If you use `(...)` in your regular expression, then up to 9 matches are supported for the `=~`
operator, and an arbitrary number of matches are supported for the match DSL function.

* Before any match is done, `"\1"` etc. in a string evaluate to themselves.
* After a successful match, `"\1"` etc. in a string evaluate to the matched substring.
* After an unsuccessful match, `"\1"` etc. in a string evaluate to the empty string.
* You can match against `null` to reset to the original state.

## The `strmatch` and `strmatchx` DSL functions

The `=~` and `!=~` operators have been in Miller for a long time, and they will continue to be
supported. They do, however, have some deficiencies. As of Miller 6.11, the `strmatch`
and `strmatchx` functions provide more robust ways to do capturing.
First, some examples.
The strmatch function only returns a boolean result, and it doesn't set \0..\9:
The strmatchx function also doesn't set \0..\9, but returns a map-valued result:
Notes:
* When there is no match, the result from `strmatchx` only has the `"matched":false` key/value pair.
* When there is a match, the result from `strmatchx` has the `"matched":true` key/value pair, as well as `full_capture` (taking the place of `\0` set by `=~`), and `full_start` and `full_end`, which `=~` does not offer.
* When the regex has capture groups, the result from `strmatchx` also has the `captures` array, whose slots 1, 2, 3, ... are the same as would have been set by `=~` via `\1`, `\2`, `\3`, .... However, `strmatchx` offers an arbitrary number of captures, not just `\1..\9`. Additionally, the `starts` and `ends` arrays are indices into the input string.
* Since you hold the return value from `strmatchx`, you can operate on it as you wish, instead of relying on the (function-scoped) globals `\0..\9`.
* The price is that using `strmatchx` does indeed tend to take more keystrokes than `=~`.

Regular expressions are those supported by the Go regexp package, which in turn are of type RE2 except for `\C`.
One caveat: for strings in "regex position" (e.g. the second argument to
`sub` or `gsub`, or after `=~`), `"\t"` means a backslash and a t, which is
the right thing, whereas for strings in "non-regex position" (anywhere else),
`"\t"` becomes the tab character. This is to say (if you're familiar with
r-strings in Python) all strings in regex position are implicit r-strings.
Generally this is the right thing and should cause little confusion. Note,
however, that this means the concatenation `"\t" . "\t"` in the second
argument to `sub` isn't the same as `"\t\t"`.