src/algo/SIMD.md
# indexByteTwo / lastIndexByteTwo

- indexByteTwo(s []byte, b1, b2 byte) int -- returns the index of the
  first occurrence of b1 or b2 in s, or -1.
- lastIndexByteTwo(s []byte, b1, b2 byte) int -- returns the index of the
  last occurrence of b1 or b2 in s, or -1.
They are used by the fuzzy matching algorithm (algo.go) to skip ahead
during case-insensitive search. Instead of calling bytes.IndexByte twice
(once for lowercase, once for uppercase), a single SIMD pass finds both at
once.
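As a sketch of that usage (not the actual algo.go code, and the helper name is hypothetical), a case-insensitive skip-ahead might look like this, assuming c is an ASCII letter:

```go
// nextCaseInsensitive is a hypothetical helper: find the next position in
// text that can match pattern character c regardless of case.
func nextCaseInsensitive(text []byte, c byte) int {
	lower := c | 0x20  // ASCII lower-case form, 'a'..'z'
	upper := c &^ 0x20 // ASCII upper-case form, 'A'..'Z'
	// One SIMD pass instead of two bytes.IndexByte calls.
	return indexByteTwo(text, lower, upper)
}
```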
| File | Purpose |
|---|---|
| indexbyte2_arm64.go | Go declarations (//go:noescape) for ARM64 |
| indexbyte2_arm64.s | ARM64 NEON assembly (32-byte aligned blocks, syndrome extraction) |
| indexbyte2_amd64.go | Go declarations + AVX2 runtime detection for AMD64 |
| indexbyte2_amd64.s | AMD64 AVX2/SSE2 assembly with CPUID dispatch |
| indexbyte2_other.go | Pure Go fallback for all other architectures |
| indexbyte2_test.go | Unit tests, exhaustive tests, fuzz tests, and benchmarks |
ARM64 (NEON):

- Broadcasts both needle bytes into vector registers (VMOV).
- Compares each 32-byte block against both needles (VCMEQ), ORs the results (VORR),
  and builds a 64-bit syndrome with 2 bits per byte.
- indexByteTwo uses RBIT + CLZ to find the lowest set bit (first match).
- lastIndexByteTwo scans backward and uses CLZ on the raw syndrome to
  find the highest set bit (last match).
- Based on Go's internal/bytealg/indexbyte_arm64.s.
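To make the syndrome layout concrete, here is a scalar Go model of one 32-byte block scan. It only illustrates the bit layout and the two bit-scan directions; it is not the assembly, and the example data is arbitrary:

```go
package main

import (
	"fmt"
	"math/bits"
)

// syndrome32 models one NEON block: each byte of the 32-byte block
// contributes 2 bits to a 64-bit syndrome, and both bits are set when the
// byte equals either needle (the assembly does this with VCMEQ + VORR).
func syndrome32(block []byte, b1, b2 byte) uint64 {
	var syn uint64
	for i, c := range block {
		if c == b1 || c == b2 {
			syn |= 0x3 << (2 * uint(i))
		}
	}
	return syn
}

func main() {
	block := []byte("the quick brown fox jumps over t") // exactly 32 bytes
	syn := syndrome32(block, 'q', 'Q')

	// First match: lowest set bit (RBIT + CLZ in the assembly).
	first := bits.TrailingZeros64(syn) / 2
	// Last match: highest set bit (CLZ on the raw syndrome).
	last := (63 - bits.LeadingZeros64(syn)) / 2

	fmt.Println(first, last) // both 4: 'q' is at offset 4
}
```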
AMD64 (AVX2 with SSE2 fallback):

- cpuHasAVX2() checks CPUID + XGETBV for AVX2 and OS YMM
  support. The result is cached in _useAVX2.
- The AVX2 path broadcasts both needle bytes with VPBROADCASTB.
- Each 32-byte block is compared with VPCMPEQB against both needles, VPOR, then
  VPMOVMSKB to get a 32-bit mask.
- VZEROUPPER before every return to avoid SSE/AVX transition penalties.
- The SSE2 path broadcasts the needles with PUNPCKLBW + PSHUFL,
  then compares 16-byte blocks with PCMPEQB, POR, PMOVMSKB.
- BSFL (forward) / BSRL (reverse) for bit scanning.
- Based on Go's internal/bytealg/indexbyte_amd64.s.
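The actual detection is done from assembly (CPUID + XGETBV) and cached in _useAVX2. As a rough pure-Go sketch of the same runtime dispatch decision, one could use golang.org/x/sys/cpu, which performs a similar feature and OS-support query; this is an illustration, not the code in indexbyte2_amd64.go:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// Sketch only: the real code issues CPUID + XGETBV from assembly and caches
// the answer in _useAVX2; x/sys/cpu exposes a comparable check.
var useAVX2 = cpu.X86.HasAVX2

func main() {
	if useAVX2 {
		fmt.Println("dispatching to the AVX2 path")
	} else {
		fmt.Println("falling back to SSE2")
	}
}
```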
Fallback (other platforms):

- indexByteTwo uses two bytes.IndexByte calls with scope-limiting
  (search b1 first, then limit the b2 search to s[:i1]).
- lastIndexByteTwo uses a simple backward for loop.
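A sketch matching that description (not necessarily the exact code in indexbyte2_other.go):

```go
package algo

import "bytes"

// indexByteTwo: first occurrence of b1 or b2. Searching for b1 first lets
// the b2 search stop at s[:i1], so it never scans past the first hit.
func indexByteTwo(s []byte, b1, b2 byte) int {
	i1 := bytes.IndexByte(s, b1)
	if i1 >= 0 {
		s = s[:i1] // b2 only matters if it appears before b1
	}
	if i2 := bytes.IndexByte(s, b2); i2 >= 0 {
		return i2
	}
	return i1
}

// lastIndexByteTwo: simple backward scan for the last occurrence.
func lastIndexByteTwo(s []byte, b1, b2 byte) int {
	for i := len(s) - 1; i >= 0; i-- {
		if s[i] == b1 || s[i] == b2 {
			return i
		}
	}
	return -1
}
```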
Testing:

```sh
# Unit + exhaustive tests
go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v

# Fuzz tests (run for 10 seconds each)
go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s

# Cross-architecture: test amd64 on an arm64 Mac (via Rosetta)
GOARCH=amd64 go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v
GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s
```
Benchmarks:

```sh
# All indexByteTwo / lastIndexByteTwo benchmarks
go test ./src/algo/ -bench 'IndexByteTwo' -benchmem

# Specific size
go test ./src/algo/ -bench 'IndexByteTwo_1000'
```
Each benchmark compares the SIMD asm implementation against reference
implementations (2xIndexByte using bytes.IndexByte, and a simple loop).
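The comparison has roughly this shape (a sketch; function names and data are illustrative, not copied from indexbyte2_test.go):

```go
package algo

import (
	"bytes"
	"testing"
)

// SIMD routine under test: one pass over the haystack.
func BenchmarkIndexByteTwo_1000(b *testing.B) {
	data := bytes.Repeat([]byte("x"), 1000)
	data[999] = 'Q' // single match at the very end
	for i := 0; i < b.N; i++ {
		_ = indexByteTwo(data, 'q', 'Q')
	}
}

// Reference: two bytes.IndexByte scans, one per case, merged afterwards.
func BenchmarkIndexByteTwoRef_1000(b *testing.B) {
	data := bytes.Repeat([]byte("x"), 1000)
	data[999] = 'Q'
	for i := 0; i < b.N; i++ {
		i1 := bytes.IndexByte(data, 'q')
		i2 := bytes.IndexByte(data, 'Q')
		if i1 < 0 || (i2 >= 0 && i2 < i1) {
			i1 = i2
		}
		_ = i1
	}
}
```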
The assembly is verified by three layers of testing:

- unit tests with fixed cases,
- exhaustive tests checked against a simple loop reference, and
- fuzz tests (testing.F), compared against the same loop reference.
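The fuzz layer follows the usual testing.F + reference-oracle pattern. A minimal sketch (refIndexByteTwo is an illustrative name, and this is not the exact test in indexbyte2_test.go):

```go
package algo

import "testing"

// refIndexByteTwo is the simple loop oracle the SIMD result is checked against.
func refIndexByteTwo(s []byte, b1, b2 byte) int {
	for i, c := range s {
		if c == b1 || c == b2 {
			return i
		}
	}
	return -1
}

func FuzzIndexByteTwo(f *testing.F) {
	f.Add([]byte("hello world"), byte('o'), byte('O'))
	f.Fuzz(func(t *testing.T, s []byte, b1, b2 byte) {
		got := indexByteTwo(s, b1, b2)
		want := refIndexByteTwo(s, b1, b2)
		if got != want {
			t.Errorf("indexByteTwo(%q, %q, %q) = %d, want %d", s, b1, b2, got, want)
		}
	})
}
```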