regex/docs/README.md
This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via Truffle interop.
The expected syntax is options/regex/flags, where options is a comma-separated list of key-value pairs which affect
how the regex is interpreted, and /regex/flags is equivalent to the popular regular expression literal format found in
e.g. JavaScript or Ruby.
When parsing a regular expression, TRegex will return a Truffle CallTarget, which, when called, will yield one of the following results:
RegexObject) object, which can be used to match the given regex.PARSE_ERROR exception may be thrown to indicate a syntax error.An example of how to parse a regular expression:
Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build();
Object regex;
try {
regex = getContext().getEnv().parseInternal(source).call();
} catch (AbstractTruffleException e) {
if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
// handle parser error
} else {
// fatal error, this should never happen
}
}
if (InteropLibrary.getUncached().isNull(regex)) {
// regex is not supported by TRegex, fall back to a different regex engine
}
A RegexObject represents a compiled regular expression that can be used to match against input strings. It exposes the
following three properties:
pattern: the source string of the compiled regular expression.flags: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of
regular expressions used.groupCount: the number of capture groups present in the regular expression, including group 0.groups: a map of all named capture groups to their respective group number, or a null value if the expression does
not contain named capture groups.exec: an executable method that matches the compiled regular expression against a string. The method accepts two
parameters:
input: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves
like a char-array.fromIndex: the position to start searching from.RegexResult object.A RegexResult object represents the result of matching a regular expression against a string. It can be obtained as
the result of a RegexObject's
exec-method and has the following properties:
boolean isMatch: true if a match was found, false otherwise.int getStart(int groupNumber): returns the position where the beginning of the capture group with the given number
was found. If the result is no match, the returned value is undefined. Capture group number 0 denotes the boundaries
of the entire expression. If no match was found for a particular capture group, the returned value is -1.int getEnd(int groupNumber): returns the position where the end of the capture group with the given number was
found. If the result is no match, the returned value is undefined. Capture group number 0 denotes the boundaries of
the entire expression. If no match was found for a particular capture group, the returned value is -1.Compiled regex usage example in pseudocode:
regex = <matcher from previous example>
assert(regex.pattern == "(a|(b))c")
assert(regex.flags.ignoreCase == true)
assert(regex.groupCount == 3)
result = regex.exec("xacy", 0)
assert(result.isMatch == true)
assertEquals([result.getStart(0), result.getEnd(0)], [ 1, 3])
assertEquals([result.getStart(1), result.getEnd(1)], [ 1, 2])
assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1])
result2 = regex.exec("xxx", 0)
assert(result2.isMatch == false)
// result2.getStart(...) and result2.getEnd(...) are undefined
These options define how TRegex should interpret a given regular expression:
Flavor: specifies the regex dialect to use. Possible values:
ECMAScript: ECMAScript/JavaScript syntax (default).Python: Python 3 syntax.Ruby: Ruby syntax.Encoding: specifies the string encoding to match against. Possible values:
UTF-8UTF-16 (default)UTF-32LATIN-1BYTES (equivalent to LATIN-1)Validate: don't generate a regex matcher object, just check the regex for syntax errors.U180EWhitespace: treat 0x180E MONGOLIAN VOWEL SEPARATOR as part of \s. This is a legacy feature for languages
using a Unicode standard older than 6.3, such as ECMAScript 6 and older.UTF16ExplodeAstralSymbols: generate one DFA states per (16 bit) char instead of per-codepoint. This may
improve performance in certain scenarios, but increases the likelihood of DFA state explosion.AlwaysEager: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a
{@link RegexResult}).RegressionTestMode: exercise all supported regex matcher variants, and check if they produce the same results.DumpAutomata: dump all generated parser trees, NFA, and DFA to disk. This will generate debugging dumps of most
relevant data structures in JSON, GraphViz and LaTex format.StepExecution: dump tracing information about all DFA matcher runs.All options except Flavor and Encoding are boolean and false by default.