extra/protobuf/protobuf-24.4/third_party/utf8_range/README.md
This is a brand new algorithm to leverage SIMD for fast UTF-8 string validation. Both NEON(armv8a) and SSE4 versions are implemented. AVX2 implementation contributed by ioioioio.
Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark result shows range base algorithm is the best solution on Arm, and achieves same performance as Lemire's approach on x86.
| Test case | naive | lookup | lemire | range | range2 |
|---|---|---|---|---|---|
| UTF-demo.txt | 562.25 | 412.84 | 1198.50 | 1411.72 | 1579.85 |
| 32 bytes | 651.55 | 441.70 | 891.38 | 1003.95 | 1043.58 |
| 33 bytes | 660.00 | 446.78 | 588.77 | 1009.31 | 1048.12 |
| 129 bytes | 771.89 | 402.55 | 938.07 | 1283.77 | 1401.76 |
| 1K bytes | 811.92 | 411.58 | 1188.96 | 1398.15 | 1560.23 |
| 8K bytes | 812.25 | 412.74 | 1198.90 | 1412.18 | 1580.65 |
| 64K bytes | 817.35 | 412.24 | 1200.20 | 1415.11 | 1583.86 |
| 1M bytes | 815.70 | 411.93 | 1200.93 | 1415.65 | 1585.40 |
| Test case | naive | lookup | lemire | range | range2 |
|---|---|---|---|---|---|
| UTF-demo.txt | 753.70 | 310.41 | 3954.74 | 3945.60 | 3986.13 |
| 32 bytes | 1135.76 | 364.07 | 2890.52 | 2351.81 | 2173.02 |
| 33 bytes | 1161.85 | 376.29 | 1352.95 | 2239.55 | 2041.43 |
| 129 bytes | 1161.22 | 322.47 | 2742.49 | 3315.33 | 3249.35 |
| 1K bytes | 1310.95 | 310.72 | 3755.88 | 3781.23 | 3874.17 |
| 8K bytes | 1348.32 | 307.93 | 3860.71 | 3922.81 | 3968.93 |
| 64K bytes | 1301.34 | 308.39 | 3935.15 | 3973.50 | 3983.44 |
| 1M bytes | 1279.78 | 309.06 | 3923.51 | 3953.00 | 3960.49 |
Basic idea:
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf, page 94
Table 3-7. Well-Formed UTF-8 Byte Sequences
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | |||
| U+0080..U+07FF | C2..DF | 80..BF | ||
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
To summarise UTF-8 encoding:
Range table maps range index 0 ~ 15 to minimal and maximum values allowed. Our task is to observe input string, find the pattern and set correct range index for each byte, then validate input string.
| Index | Min | Max | Byte type |
|---|---|---|---|
| 0 | 00 | 7F | First Byte, ASCII |
| 1,2,3 | 80 | BF | Second, Third, Fourth Bytes |
| 4 | A0 | BF | Second Byte after E0 |
| 5 | 80 | 9F | Second Byte after ED |
| 6 | 90 | BF | Second Byte after F0 |
| 7 | 80 | 8F | Second Byte after F4 |
| 8 | C2 | F4 | First Byte, non-ASCII |
| 9..15(NEON) | FF | 00 | Illegal: unsigned char >= 255 && unsigned char <= 0 |
| 9..15(SSE) | 7F | 80 | Illegal: signed char >= 127 && signed char <= -128 |
Ignoring the four special cases(E0,ED,F0,F4), how should we set range index for each byte?
To implement above operations efficiently with SIMD:
Example(assume no previous data)
| Input | F1 | 80 | 80 | 80 | 80 | C2 | 80 | 80 | ... |
|---|---|---|---|---|---|---|---|---|---|
| first_len | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
| First Byte | 8 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | ... |
| Second Byte | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | ... |
| Third Byte | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... |
| Fourth Byte | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... |
| Range index | 8 | 3 | 2 | 1 | 0 | 8 | 1 | 0 | ... |
Range_index = First_Byte | Second_Byte | Third_Byte | Fourth_Byte
Overlapped non-ASCII First Byte
| Input | F1 | 80 | C2 | 90 |
|---|---|---|---|---|
| first_len | 3 | 0 | 1 | 0 |
| First Byte | 8 | 0 | 8 | 0 |
| Second Byte | 0 | 3 | 0 | 1 |
| Third Byte | 0 | 0 | 2 | 0 |
| Fourth Byte | 0 | 0 | 0 | 1 |
| Range index | 8 | 3 | 10 | 1 |
Range index adjustment for four special cases
| First Byte | Second Byte | Before adjustment | Correct index | Adjustment |
|---|---|---|---|---|
| E0 | A0..BF | 2 | 4 | 2 |
| ED | 80..9F | 2 | 5 | 3 |
| F0 | 90..BF | 3 | 6 | 3 |
| F4 | 80..8F | 3 | 7 | 4 |
Range index adjustment can be reduced to below problem:
Given 16 bytes, replace E0 with 2, ED with 3, F0 with 3, F4 with 4, others with 0.
A naive SIMD approach:
At least eight operations are required for naive approach.
Observing special bytes(E0,ED,F0,F4) are close to each other, we can do much better using lookup table.
NEON tbl instruction is very convenient for table lookup:
Leverage these features, we can solve the problem with as few as two operations:
tbl behaviourSSE pshufb instruction is not as friendly as NEON tbl in this case:
We can still leverage these features to solve the problem in five operations:
pshufb behaviour)For remaining input less than 16 bytes, we will fallback to naive byte by byte approach to validate them, which is actually faster than SIMD processing.
It's necessary to design test cases to cover corner cases as more as possible.
Below table shows how 16 bytes input are processed step by step. See range-neon.c for according code.