packages/intl-segmenter/NOTES.md
The official specification for Unicode text segmentation (UAX #29): http://unicode.org/reports/tr29/
The Unicode Common Locale Data Repository (CLDR) is where locale-specific (and root) segmentation rules are specified
CLDR segmentation rules: https://github.com/unicode-org/cldr/tree/main/common/segments
Information about the rules: https://unicode.org/reports/tr35/tr35-general.html#Segmentations
JSON CLDR: https://github.com/unicode-org/cldr-json/tree/main/cldr-json/cldr-segments-full (und is the root ruleset)
CLDR rules use Unicode-flavoured regexes (https://unicode.org/reports/tr18/), which are not directly compatible with JS regexes.
Modern JS implements the /u flag (ES2015), which enables some Unicode regex features: property escapes \p{..} (added in ES2018) and correct handling of code points outside the BMP, which take 4 bytes (two UTF-16 code units). regexpu can transpile these to ES5.
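A minimal sketch of what the /u flag changes, using only built-in RegExp features. The variable names are illustrative and not taken from the package.

```javascript
// Property escapes are only available when the /u (or /v) flag is set.
const greekLetter = /\p{Script=Greek}/u;

// U+1F600 lies outside the BMP, so it occupies two UTF-16 code units.
const emoji = "\u{1F600}";

// Without /u, "." matches a single code unit; with /u, a whole code point.
const singleCodeUnit = /^.$/;
const singleCodePoint = /^.$/u;

const results = {
  greekMatchesAlpha: greekLetter.test("\u03B1"), // Greek small letter alpha
  greekMatchesLatinA: greekLetter.test("a"),
  emojiLength: emoji.length,
  emojiWithoutU: singleCodeUnit.test(emoji),
  emojiWithU: singleCodePoint.test(emoji),
};
console.log(results);
```

Only the last check differs between the two patterns, which is the "4 bytes characters" behaviour mentioned above.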
There is also the /v flag proposal (https://github.com/tc39/proposal-regexp-v-flag, now part of ES2024), which adds set operations (-- and &&) and nested character classes (both used by the CLDR rules); regexpu can transpile those too.
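A short sketch of the /v set operations, assuming a runtime that may or may not support the flag. The helper compileWithVFlag is hypothetical and only guards against engines that reject "v"; the patterns are illustrative and are not CLDR rules.

```javascript
// Returns a RegExp compiled with the /v flag, or null when the engine
// does not support it (older runtimes throw a SyntaxError).
function compileWithVFlag(source) {
  try {
    return new RegExp(source, "v");
  } catch (error) {
    return null;
  }
}

// Set difference: any letter that is not in the Greek script.
const lettersExceptGreek = compileWithVFlag("[\\p{L}--\\p{Script=Greek}]");

// Set intersection: Greek script AND lowercase letter.
const greekLowercase = compileWithVFlag("[\\p{Script=Greek}&&\\p{Ll}]");

// Nested character classes combined with difference: ASCII consonants.
const consonants = compileWithVFlag("[[a-z]--[aeiou]]");

if (lettersExceptGreek && greekLowercase && consonants) {
  console.log(lettersExceptGreek.test("a"), lettersExceptGreek.test("\u03B1"));
  console.log(greekLowercase.test("\u03B1"), greekLowercase.test("\u0391"));
  console.log(consonants.test("b"), consonants.test("a"));
} else {
  console.log("The /v flag is not supported by this runtime");
}
```

Note that /v spells these operators -- and &&, whereas the CLDR rules use single-character forms, so the rules still need rewriting before they can be compiled this way.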
CLDR rules use some character classes that are not implemented by regexpu, but the code point lists are available in the UCD files (see UCD Notes)
More details about syntax incompatibilities in CLDR Rules RegExes
Unicode Character Database: https://unicode.org/ucd/
UCD files: https://www.unicode.org/Public/15.0.0/ucd/
Segmentation properties and tests: https://www.unicode.org/Public/15.0.0/ucd/auxiliary/
Character sets used by the CLDR rules that are missing from ES5 regex and regexpu-core:
- Grapheme_Cluster_Break: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt (more information at https://unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values)
- Sentence_Break: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt
- Word_Break: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt
- Indic_Syllabic_Category: https://unicode.org/Public/UCD/latest/ucd/IndicSyllabicCategory.txt
- ea => East_Asian_Width: https://unicode.org/Public/UCD/latest/ucd/extracted/DerivedEastAsianWidth.txt
- ccc => Canonical_Combining_Class: https://unicode.org/Public/UCD/latest/ucd/extracted/DerivedCombiningClass.txt

Properties can use aliases: PropertyAliases.txt
Property values can use aliases: PropertyValueAliases.txt
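The property files listed above share a simple line layout: a code point or a start..end range, a semicolon, the property value, and an optional # comment. A minimal parsing sketch under that assumption follows. The function name parsePropertyFile and the sample text are illustrative, and this is not necessarily how the package reads these files.

```javascript
// Parses UCD property file text into a Map from property value to a list
// of [start, end] code point ranges (inclusive).
function parsePropertyFile(text) {
  const rangesByValue = new Map();
  for (const rawLine of text.split(/\r?\n/)) {
    // Everything after "#" is a comment; blank lines carry no data.
    const line = rawLine.split("#")[0].trim();
    if (line === "") continue;
    const [codePoints, value] = line.split(";").map((field) => field.trim());
    // A single code point is treated as a range of length one.
    const [start, end = start] = codePoints.split("..");
    const range = [parseInt(start, 16), parseInt(end, 16)];
    if (!rangesByValue.has(value)) rangesByValue.set(value, []);
    rangesByValue.get(value).push(range);
  }
  return rangesByValue;
}

// Illustrative excerpt in the GraphemeBreakProperty.txt layout.
const sample = [
  "# comment lines are skipped",
  "0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE",
  "000D          ; CR # Cc       <control-000D>",
  "000A          ; LF # Cc       <control-000A>",
].join("\n");

const parsed = parsePropertyFile(sample);
console.log(parsed.get("Prepend")); // [ [ 1536, 1541 ] ]
```

Alias resolution (PropertyAliases.txt, PropertyValueAliases.txt) is deliberately left out of this sketch.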
https://unicode.org/reports/tr41/tr41-26.html#Tests29
About test files: https://www.unicode.org/reports/tr44/#Segmentation_Test_Files
Grapheme segmentation tests: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
Sentence segmentation tests: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakTest.txt
Word break segmentation tests: https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt
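Each data line in these test files lists hexadecimal code points separated by ÷ (a break is allowed) or × (no break), followed by a # comment. The sketch below turns one such line into the expected segments. The function names are illustrative, and the optional comparison helper assumes the runtime provides Intl.Segmenter (it returns null otherwise).

```javascript
// Converts one break test line into the list of expected segments,
// or null for comment-only and blank lines.
function parseBreakTestLine(line) {
  const data = line.split("#")[0].trim();
  if (data === "") return null;
  const segments = [];
  let current = "";
  for (const token of data.split(/\s+/)) {
    if (token === "÷") {
      // Break opportunity: close the segment collected so far.
      if (current !== "") {
        segments.push(current);
        current = "";
      }
    } else if (token !== "×") {
      // Anything that is not a break marker is a hex code point.
      current += String.fromCodePoint(parseInt(token, 16));
    }
  }
  if (current !== "") segments.push(current);
  return segments;
}

// Segments text with the runtime's own grapheme segmenter, if available.
function segmentGraphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const testLine =
  "÷ 0020 × 0308 ÷ 0020 ÷\t#  ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS ÷ [999.0] SPACE (Other) ÷ [0.3]";
const expected = parseBreakTestLine(testLine);
console.log(expected.length); // 2
console.log(segmentGraphemes(expected.join("")));
```

Joining the expected segments gives the input string, so the same line supplies both the input and the expected output for a segmenter under test.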
JS polyfill: https://www.npmjs.com/package/intl-segmenter-polyfill (uses the ICU C implementation compiled to wasm)
Java (unicodetools):
- Segmenter.java: https://github.com/unicode-org/unicodetools/blob/70dce2c89f185c65b436c28404ae5b7bdb32c2d1/unicodetools/src/main/java/org/unicode/tools/Segmenter.java#L485
- CompareBoundaries.java: https://github.com/unicode-org/unicodetools/blob/70dce2c89f185c65b436c28404ae5b7bdb32c2d1/unicodetools/src/test/java/org/unicode/test/CompareBoundaries.java#L473
CLDR rules use Unicode regex syntax (unsupported by JS).
The Unicode repos contain a set of utilities that can transform Unicode regexes into Java-compatible regexes:
https://github.com/unicode-org/unicodetools
and UnicodeJsps, hosted on unicode.org: https://util.unicode.org/UnicodeJsps
https://github.com/mathiasbynens/regenerate can generate ES5 regexes from a list of Unicode symbols (code points)
https://github.com/mathiasbynens/regexpu-core can transpile ES2015 (/u) regexes to ES5, and /v regexes to /u regexes (with some limitations)
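To illustrate the idea of building a character class from code point lists, here is a dependency-free sketch. It is not regenerate's API, and unlike regenerate it relies on the /u flag instead of producing ES5-compatible output. The function name rangesToCharacterClass and the sample ranges are made up for the example.

```javascript
// Builds a /u-compatible character class source from inclusive
// [start, end] code point ranges.
function rangesToCharacterClass(ranges) {
  const escape = (codePoint) => `\\u{${codePoint.toString(16).toUpperCase()}}`;
  const body = ranges
    .map(([start, end]) =>
      start === end ? escape(start) : `${escape(start)}-${escape(end)}`
    )
    .join("");
  return `[${body}]`;
}

// Illustrative ranges, e.g. as read from a UCD property file.
const classSource = rangesToCharacterClass([
  [0x600, 0x605],
  [0x6dd, 0x6dd],
]);
const matcher = new RegExp(classSource, "u");

console.log(classSource); // [\u{600}-\u{605}\u{6DD}]
console.log(matcher.test("\u0601"), matcher.test("a")); // true false
```

A class built this way can stand in for a property that \p{..} does not cover, at the cost of embedding the code point list in the pattern.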
The Unicode regexes have other syntax incompatibilities:
- Set difference is written as - instead of --, for example [[$Extend-\\\\p{ccc=0}] $ZWJ]
- \p{Gujr} should be \p{sc=Gujr}
- \p{..}\p{...}&[..] in Unicode is treated as [...]&[...], but regexpu /v treats it as [...][...]&[...]
- [^(?:...)] [(?:..)]

Even after accounting for all of those, there are issues left:
cldr-segments-full/segments/el/suppressions.json has a rule "$STerm": "[[$STerm] [\\u003B \\u037E]]", which references the variable it defines and which I currently do not know how to process correctly.
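One possible reading of that rule, stated here as an assumption and not confirmed by the sources above, is that a locale variable referencing itself extends the inherited root definition. Under that assumption, variables can be resolved in order, so that a self-reference picks up the value already in scope. The function names and the root value used below are illustrative.

```javascript
// Replaces every $Name reference in a pattern with its value from scope.
function substituteVariables(pattern, scope) {
  return pattern.replace(/\$[A-Za-z_][A-Za-z0-9_]*/g, (name) => {
    if (!scope.has(name)) throw new Error(`Unknown variable ${name}`);
    return scope.get(name);
  });
}

// ASSUMPTION: root variables are resolved first, then locale variables in
// order, so "$STerm" inside the locale's "$STerm" means the root value.
function resolveVariables(rootVariables, localeVariables) {
  const scope = new Map();
  for (const variables of [rootVariables, localeVariables]) {
    for (const [name, pattern] of Object.entries(variables)) {
      scope.set(name, substituteVariables(pattern, scope));
    }
  }
  return scope;
}

// Illustrative root value; the locale rule is the one quoted above.
const rootVariables = { $STerm: "\\p{Sentence_Break=STerm}" };
const localeVariables = { $STerm: "[[$STerm] [\\u003B \\u037E]]" };

const resolved = resolveVariables(rootVariables, localeVariables);
console.log(resolved.get("$STerm"));
```

The resolved value is still in Unicode set syntax, so the syntax rewrites listed earlier would still have to be applied before it can be compiled as a JS regex.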