javalib/src/main/scala/java/util/regex/README.md
j.u.regex.*Java and JavaScript have different support for regular expressions. In addition to Java having many more features, they also differ in the specifics of most of the features they have in common.
For performance and code size reasons, we still want to use the native JavaScript RegExp class.
Modern JavaScript engines JIT-compile RegExps to native code, so it is impossible to compete with that using a user-space engine.
For example, see V8 talking about its Irregexp library and SpiderMonkey talking about their latest integration of Irregexp.
Therefore, our strategy for java.util.regex is to compile Java regexes down to JavaScript regexes that behave in the same way.
The compiler is in the file PatternCompiler.scala, and is invoked at the time of Pattern.compile().
We can deal with most features in a compliant way using that strategy, while retaining performance, and without sacrificing code size too much compared to directly passing regexes through without caring about the discrepancies. There are however a few features that are either never supported, or only supported when targeting a recent enough version of ECMAScript.
The set of supported features depends on the target ECMAScript version, specified in ESFeatures.esVersion.
The following features are never supported:
CANON_EQ flag,\X, \b{g} and \N{...} expressions,\p{In𝘯𝘢𝘮𝘦} character classes representing Unicode blocks,\G boundary matcher, except if it appears at the very beginning of the regex (e.g., \Gfoo),(?idmsuxU-idmsuxU:𝑋),(?idmsuxU-idmsuxU), except if they appear at the very beginning of the regex (e.g., (?i)abc is accepted, but ab(?i)c is not), andThe following features require esVersion >= ESVersion.ES2015:
UNICODE_CASE flag.The following features require esVersion >= ESVersion.ES2018:
MULTILINE and UNICODE_CHARACTER_CLASS flags,(?<=𝑋) and (?<!𝑋),\b and \B expressions used together with the UNICODE_CASE flag,\p{𝘯𝘢𝘮𝘦} expressions where 𝘯𝘢𝘮𝘦 is not one of the POSIX character classes.It is worth noting that, among others, the following features are supported in all cases, even when no equivalent feature exists in ECMAScript at all, or in the target version of ECMAScript:
\G boundary matcher when it is at the beginning of the pattern (corresponding to the 'y' flag, natively supported in ES 2015+),DOTALL flag (natively supported in ES 2018+),CASE_INSENSITIVE on but UNICODE_CASE off),COMMENTS flag,UNICODE_CHARACTER_CLASS (if the latter is itself supported, see above),[a-z&&[^g-p]]),(?>𝑋),𝑋*+, 𝑋++ and 𝑋?+,\A, \Z and \z boundary matchers,\R expression,\Q and \E, both outside and inside character classes.All the supported features have the correct semantics from Java. This is even true for features that exist in JavaScript but with different semantics, among which:
^ and $ boundary matchers with the MULTILINE flag (when the latter is supported),\h, \s, \v, \w and their negated variants, respecting the UNICODE_CHARACTER_CLASS flag,\b and \B boundary matchers, respecting the UNICODE_CHARACTER_CLASS flag,\p{𝘯𝘢𝘮𝘦} character classes, including the \p{java𝘔𝘦𝘵𝘩𝘰𝘥𝘕𝘢𝘮𝘦} classes,If a feature is not supported, a PatternSyntaxException is thrown at the time of Pattern.compile().
If Pattern.compile() succeeds, the regex is guaranteed to behave exactly like on the JVM, except for capturing groups within repeated segments (both for their back references and subsequent calls to group, start and end):
The behavior of JavaScript is more "functional", whereas that of the JVM is more "imperative".
This imperative nature is also reflected in the hitEnd() and requireEnd() methods of Matcher, which we do not support (they don't link).
The behavior of the JVM does not appear to be specified, and is questionable. There are several open issues that argue it is buggy:
Therefore, it seems wise to keep the JavaScript behavior, and not try to replicate the JVM behavior at great cost (if that is even possible within our constrains).
Java regexes are compiled to JS regexes in one pass, using a recursive descent approach.
There is a state variable pIndex which indicates the position inside the original pattern.
Compilation methods parse a subexpression at pIndex, advance pIndex past what they parsed, and return the result of the compilation.
Parsing is always done at the code point level, and not at the individual Char level, using the WTF-16 encoding.
See Handling surrogate pairs without support for the 'u' flag for details about the behavior of lone surrogates.
We first describe the compilation with the assumption that the underlying RegExps support the u flag.
This is always true in ES 2015+, and dynamically determined at run-time in ES 5.1.
We will cover later what happens when it is not supported.
Irrespective of the Java flags, we always use the following JS flags when they are supported (including through dynamic detection):
. matches any code point (introduced in ES2018).In addition, we use the 'i' JS flag when both CASE_INSENSITIVE and UNICODE_CASE are on.
Since UNICODE_CASE is only supported in ES 2015+, this implies that 'u' is supported, and hence that we can leave all the handling of case insensitivity to the native RegExp.
When CASE_INSENSITIVE is on but UNICODE_CASE is off, we must apply ASCII case insensitivity.
This is not supported by JS RegExps, so we implement it ourselves during compilation.
This is represented by the property asciiCaseInsensitive.
When it is true:
[Gg]), in subexpressions or in character classes, andA-Z and/or a-z is compiled with additional range(s) to cover the uppercase and lowercase variants.PatternCompiler never uses any other JS RegExp flag.
Pattern adds the 'g' flag for its general-purpose instance of RegExp (the one used for everything except Matcher.matches()), as well as the 'y' flag if the regex is sticky and it is supported.
Usually, there is a 1-to-1 relationship between original group numbers and compiled groups numbers. However, differences are introduced when compiling atomic groups and possessive quantifiers. Therefore, we maintain a mapping from original group numbers to the corresponding group numbers in the compiled pattern. We use it for the following purposes:
\𝑁, andMatcher API to retrieve the groups' contents, start and end positions.Named capturing groups are always compiled as numbered capturing groups, even in ES 2018+. We record an additional map of names to the corresponding original group numbers, and use it
\k<𝘯𝘢𝘮𝘦> (as numbered back references), andMatcher API with group names.The following constructs are translated as is:
𝑋*, 𝑋+ and 𝑋?,𝑋*?, 𝑋+? and 𝑋??,(?:𝑋),(?=𝑋) and (?!𝑋),(?<=𝑋) and (?<!𝑋), after validating that they are supported.The following constructs have special handling and will be discussed later:
(?>𝑋), and𝑋*+.Subexpressions that represent a single code point are parsed and normalized as the code point that they represent.
For example, both a and \x65 are compiled as a.
Code points that are metacharacters in JS regexes (i.e., ^ $ \ . * + ? ( ) [ ] { } |) are escaped with a \, for example \$.
This is implemented in def literal(cp: Int).
Note that a double escape of the form \uℎℎℎℎ\uℎℎℎℎ representing a high surrogate and a low surrogate is treated as a single escape for a single supplementary code point.
For example, \uD834\uDD1E is considered as a single escape for the code point 𝄞 (U+1D11E Musical Symbol G Clef).
This behavior only applies to \u escapes.
A would-be double-escape \x{D834}\x{DD1E} constitutes two separate code points.
In practice, such a sequence can never match anything in the input; if the input contained that sequence of code units, it would be considered as a single code point 𝄞, which is not matched by a pattern meant to match two separate code points U+D834 and U+DD1E.
A quote starts with \Q, and ends at the first occurrence of \E or the end of the string.
The full string in between is taken as a sequence of code points.
Each code point is compiled as described in "Single code points" for def literal(cp: Int), and the compiled patterns are concatenated in a sequence.
This is implemented in def literal(s: String).
Predefined character classes represent a set of code points that matches a single code point in the input string.
The set typically depends on the value of UNICODE_CHARACTER_CLASS.
Since virtually none of them has a corresponding predefined character class in JS RegExps, they are all compiled as custom [...] character classes, according to their definition.
Atomic groups are not well documented in the JavaDoc, but they are well covered in outside documentation such as on Regular-Expressions.info.
They have the form (?>𝑋).
An atomic group matches whatever 𝑋 matches, but once it has successfully matched a particular substring, it is considered as an atomic unit.
If backtracking is needed later on because the rest of the pattern failed to match, the atomic group is backtracked as a whole.
JS does not support atomic groups.
However, there exists a trick to implement atomic groups on top of look-ahead groups and back references, including with the correct performance characterics.
It is well documented in the article Mimicking Atomic Groups.
In a nutshell, we compile (?>𝑋) to (?:(?=(𝑋))\𝑁) where 𝑁 is the allocated group number for the capturing group (𝑋).
This introduces a discrepancy between the original group numbers and the compiled group numbers for any subsequent capturing group.
This is why we maintain groupNumberMap.
Note that the discrepancy applies within 𝑋 as well, so we record it before compiling the subexpression 𝑋.
Possessive quantifiers can be interpreted as sugar for atomic groups over greedy quantifiers.
For example, 𝑋*+ is equivalent to (?>𝑋*).
Since JS does not support possessive quantifiers any more than atomic groups, we compile them as that desugaring, followed by the compilation scheme of atomic groups.
However, there is an additional problem.
For atomic groups, we know before parsing 𝑋 that we need to record a discrepancy in the group numbering.
For possessive quantifiers, we only know that after having parsed 𝑋, but it should apply also within 𝑋.
We do that with postprocessing.
Before compiling any token 𝑋, we record the current compiledGroupCount, and when compiling a possessive quantifier, we increment the compiled group number of those greater than the recorded count.
We do this
groupNumberMap, and𝑋.The latter is pretty ugly, but is robust nevertheless.
Unlike JavaScript, Java regexes support intersections and unions of character classes. We compile them away using the following equivalences:
[𝐴&&𝐵] is equivalent to (?=[𝐴])[𝐵][^𝐴&&𝐵] is equivalent to [^𝐴]|[^𝐵][𝐴𝐵] is equivalent to [𝐴]|[𝐵][^𝐴𝐵] is equivalent to (?![𝐴])[^𝐵]For example, using the rule on positive intersection, we can compile the example from the JavaDoc [a-z&&[^m-p]] to (?=[a-z])[^m-p].
An alternative design would have been to resolve all the operations at compile-time to get to flat code point sets.
This would require to expand \p{} and \P{} Unicode property names into equivalent sets, which would need a significant chunk of the Unicode database to be available.
That strategy would have a huge cost in terms of code size, and likely in terms of execution time as well (for compilation and/or matching).
So far, we have assumed that the underlying RegExp supports the 'u' flag, which we test with supportsUnicode.
In this section, we cover how the compilation is affected when it is not supported.
This can only happen when we target ES 5.1.
The ECMAScript specification is very precise about how patterns and strings are interpreted when the 'u' flag is enabled. It boils down to:
The documentation for Java regexes does not really say anything about what it considers "characters" to be. However, experimentation and tests show that they behave exactly like ECMAScript with the 'u' flag.
Without support for the 'u' flag, the JavaScript RegExp engine will parse the pattern and process the input with individual Chars rather than code points. In other words, it will consider surrogate pairs as two separate (and therefore separable) code units. If we do nothing against it, it can jeopardize the semantics of regexes in several ways:
. will match only the high surrogate of a surrogate pair instead of the whole codepoint,[^a],Even patterns that contain only ASCII characters (escaped or not) and use no flags can behave incorrectly on inputs that contain surrogate characters (paired or unpaired).
A possible design would have been to restrict the inputs to strings without surrogate code units when targeting ES 5.1.
However, that could lead to patterns that fail at matching-time, rather than at compile-time (during Pattern.compile()), unlike all the other features that are conditioned on the ES version.
Therefore, we go to great lengths to implement the right semantics despite the lack of support for 'u'.
When supportsUnicode is false, we apply the following changes to the compilation scheme.
In general, we make sure that:
We do nothing special for low surrogates, as all possible patterns go from left to right (we don't have look-behinds in this context) and we otherwise make sure that all code points from the input are either fully matched or not at all.
Therefore, the "cursor" of the engine can never stop in the middle of a code point, and so low surrogates are only visible if they were unpaired to begin with.
The only exception to this is when the cursor is at the beginning of the pattern, when using find.
In that case we cannot a priori prevent the JS engine from trying to find a match starting in the middle of a code point.
To address that, we have special a posteriori handling in Pattern.execFind().
A single code point that is a high surrogate 𝑥 is compiled to (?:𝑥(?![ℒ])), where ℒ is \uDC00-\uDFFF, the range of all low surrogates.
The negative look-ahead group prevents a match from separating the high surrogate from a following low surrogate.
A dot-all (in codePointNotAmong("")) is compiled to (?:[^ℋ]|[ℋ](?:[ℒ]|(?![ℒ]))), where ℋ is \uD800-\uDBFF, the range of all high surrogates.
This means either
We restrict the internal contract of codePointNotAmong(𝑠) to only take BMP code points that are not high surrogates, and compile it to the same as the dot-all but with the characters in 𝑠 excluded like the high surrogates: (?:[^𝑠ℋ]|[ℋ](?:[ℒ]|(?![ℒ]))).
Since UNICODE_CHARACTER_CLASS is not supported, all but one call site of codePointNotAmong already respect that stricter contract.
The only one that does not is the call codePointNotAmong(thisSegment) inside CharacterClassBuilder.conjunctResult().
To make that one compliant, we make sure not to add illegal code points in thisSegment.
To do that, we exploit the equivalences [𝐴𝐵] = [𝐴]|[𝐵] and [^𝐴𝐵] = (?![𝐴])[𝐵] where 𝐴 is an illegal code point to isolate it in a separate alternative, that we can compile as a single code point above.
For example, the character class [k\uD834f], containing a high surrogate code point, is equivalent to [\uD834]|[kf], which can be compiled as (?:\uD834(?![ℒ]))|[kf]).
That logic is implemented in CharacterClassBuilder.addSingleCodePoint().
Code point ranges that contain illegal code points are decomposed into the union of 4 (possibly empty) ranges:
𝑥-𝑦, compiled to (?:[𝑥-𝑦](?![ℒ]))𝑥-𝑦, where 𝑥 is the surrogate pair 𝑝𝑞 and 𝑦 is the pair 𝑠𝑡, which is further decomposed into:
𝑝𝑞-𝑝\uDFFF, compiled as (?:𝑝[𝑞-\uDFFF])𝑝′\uDC00-𝑠′\uDFFF where 𝑝′ = 𝑝+1 and 𝑠′ = 𝑠−1, compiled to (?:[𝑝′-𝑠′][\uDC00-\uDFFF])𝑠\uDC00-𝑠𝑡, compiled to (?:𝑠[\uDC00-𝑡])That logic is implemented in CharacterClassBuilder.addCodePointRange().
For historical reasons, code size is critical in this class.
Before Scala.js 1.7.0, java.util.regex.Pattern was just a wrapper over native RegExps.
The patterns were passed through with minimal preprocessing, without caring about the proper semantics.
This created an expectation of small code size for this class.
When we fixed the semantics, we had to introduce this compiler, which is non-trivial.
In order not to regress too much on code size, we went to great lengths to minimize the code size impact of this class, in particular in the default ES 2015 configuration.
When modifying this code, make sure to preserve as small a code size as possible.