This file contains explanation of how SIMD implementation of Base64 functions

Our solution is based on 0x80 blog: encoding, decoding.

This is a different version of the same explanation + we have some differences.

NOTE: We are using 1 digit for two bits when writing data

Understanding Base64 encoding.

If you find it easier, Wikipedia.

The goal of base64 is to encode arbitrary byte data as text that can be passed around. It works as follows:

we select a 64 character alphabet.
we insert two 0 bits after each 6 bits. 2^6 is 64.
because we now only need to encode 64 values, our alphabet is sufficient.

Encoding

So, we have 2 steps needed to do the conversion: 1. get indexes 0-64, 2. lookup the corresponding char. There is also a little bit of tail handling, see the appropriate section.

If one letter represents two bits, it looks smth like this (remember - our machines are little endian):

`cddd`bbcc`aaab => 0ddd`0ccc`0bbb`0aaa

After this we just have to convert the bytes into the selected alphabet.

This means we have two steps to the algorithm: encodeToIndexes and lookupByIndex.

encodeToIndexes

NOTE: assume 16 byte registers.

We will only be converting 12 bytes to output 16 bytes. The last 4 loaded bytes are ignored.

We need to shuffle bytes in a way that each dword contains all the bits to construct output for that dword. Specifically:

PONM,LKJI,HGFE,DCBA => KLJK,GIGH,EFDE,BCAB

STEP BY STEP:

We had bytes in memory ABCD.

Loaded on a LE machine they become DCBA.

The first output dword will only fit ABC, D goes into the second dword.

We will also reshuffle CBA as BCAB because this will make further operations simpler.

After this we will mutate each dword (letter+digit indicate 2 bits):

shuffled b2b3c1c2'c3d1d2d3'a1a2a3b1'b2b3c1c2
expected 00d1d2d3'00c1c2c3'00b1b2b3'00a1a2a3

At the moment of writing this we only have an SSE4.2 version. We can do it fairly straightforwardly with shifts and blends. However the 0x80 blogs suggests that using a tricky multiplication scheme is superior, and that's what we do.

For neon there are by element shift instructions which would help. On avx2 there are srlv family of instructions that can come in handy as well. The 0x80 blog also talks about a BMI implementation (pdep/pext instructions). They had issues on AMD but apparently not on the AMDs in Meta's fleet, so in the future we can consider it as a possibility as well.

lookupByIndex

The idea is to use table lookup/shuffle instructions (pshuvb on sse4.2). They allow you to shuffle a register by index (0..16).

We have values 0..64. However, we can utilize that the output values are not random. The are split into a few contiguous groups. We don't have to lookup the complete encoding, we just need to lookup the offset to add.

Index	Maps to	Add to encode
0-25	'A'-'Z'	'A' - 0
26-51	'a'-'z'	'a' - 26
52-61	'0'-'9'	'0' - 52
62	'+'	'+' - 62
63	'/'	'/' - 63

NOTE: the URL version of Base64 encoding uses -_ instead of +/.

Since we have 16 values to map to, we are not going to bother with consolidating 52-61 to a single base and instead will them up individually.

This is our final lookup table:

0: 'A' - 0
1: 'a' - 26
2: '0' - 52,
3: '1' - 53
...
11: '9' - 61
12: '+' - 62
13: '/' - 63
14: -
15: -

We even have two extra indexes 14 and 15 that are unused.

Tail handling

What if we have less than register size to load? What if we have less than 3 elements to encode?

Base64 for incomplete 3 bytes works as if there were extra 0s at the end, except the garbage elements are replaced with either a padding character or just removed.

Example:

input bytes (be)	Basic	Url
(1)	AQ==	AQ
(1, 0)	AQA=	AQA
(1, 0, 0)	AQAA	AQAA

After measuring different options, using scalar loop peeling proved to be the fastest.

Decoding

Overall the decoding process is similar to inverse of the encoding: 1. from chars get numbers 0..63 2. pack them into bytes.

However, unlike encoding, not all the sequences of bytes are valid. We have to decide which ones to accept and how to handle errors.

NOTE: error detection is fairly expensive, if in the future there is a need for a decode that can sometimes swallow invalid values, it can be done faster.

Decisions: base64Decode requires = padding and accepts only [+, /] as 62/63 character. If there are extra bits on the end, those are not allowed: iZ== is not a valid input because no input sequence will produce it. base64URLDecode: allows everything encoded with plain base64 and base64URL. It will successfully decode some combinations that are not either. The extra bits on the end are currently allowed: iZ== will decode successfully (this decision is not motivated by anything).

NOTE: details on baseURLDecode rules:

Padding bytes follow regular base64: if they are present they are allowed to be only in positions that regular base64 allows them to be in: 4th or 3rd and 4th in the 4 byte chunk. "aa==", "aaa=" - OK, "aa=", "==" - not OK. This decision follows the previous most common implementation of base64 in the codebase.
Both ['+', '/'] and ['-', '_'] are allowed in any combinations. "+-" is legal for example.

NOTE: this follows the previous design and not necessarily how other places define base64URLDecode. The difference comes to decoding invalid sequence of padding characters. Example: "ba=" we consider invalid while some other implementations might just strip '='. We suspect it won't be an issue.

This behaviour matches what the previous implementation did.

We only have a SIMD version of this algorithm for base64Decode because the URL decode has very little use (so far at least).

The SIMD algorithm has 2 steps:

decodeCharToIndex: convert to numbers from 0..63
packIndexesToBytes: packs 0..63 into the originally encoded bytes.

If we encounter an error, we are not going to stop processing data. We are just going to mark it and then check at the end.

decodeCharToIndex

Conversions we want

Char	Char's Hex code	Res	Res Hex
'+'	0x2B	62	0x3E
'/'	0x2F	63	0x3F
'0' - '9'	0x30 - 0x39	52 - 61	0x34 - 0x3D
'A' - 'Z'	0x41 - 0x5A	0 - 25	0x00 - 0x19
'a' - 'z'	0x61 - 0x7A	26 - 51	0x1A - 0x33

Idea

We have unique higher 4 bits (nibble) per consecutive group, except for '+' and '/'. We can find a unique offset with pshuvb by it.

Deal with '+' and '/' nibbles

We need to move '+' and '/' into their own nibbles. It's important that all previously invalid chars stay invalid.

The best way I see to do it is to subtract 0x0f from everything less than or equal to +. The reason to use 0x0f is that it's a constant we will need anyways ot get 4 higher nibbles.

However on intel there is no unsigned comparison, so all the negative values will be considered as <= '+'. To deal with it, we can use a signed comparison but subtract with saturation.

<= '+' gives -1 where '+'
and 0x0f
subs

Everything bigger than '+' will stay the same. Everything smaller than '+' is invalid and will become even smaller. It will saturate to -127 which is still invalid, so we are OK.

NOTE: I was also thinking if this is possible to do with just one subtraction and no mask. Unfortunately we also need to keep a separate nibble for '0' and '/' and '0' == '/' + 1, so that subtraction will guarantee put '/' and '0' on the same higher nibble.

NOTE: If we drop the requirement on invalid elements, again - this is much easier.

New conversion table:

Char	New Hex code	Res	Res Hex
'+'	0x1C	62	0x3E
'/'	0x2F	63	0x3F
'0' - '9'	0x30 - 0x39	52 - 61	0x34 - 0x3D
'A' - 'Z'	0x41 - 0x5A	0 - 25	0x00 - 0x19
'a' - 'z'	0x61 - 0x7A	26 - 51	0x1A - 0x33

Detect invalid values.

The idea is that we have valid higher nibbles in range from 1..7. We can convert higher nibbles into a bitmask 0000'0010 ... 1000'0000 (1 << nibble) Due to a lack of proper shift instruction on x86, we will use a table lookup (pshuvb). All other higher nibbles can become 0, indicating invalid input.

Now for all possible lower nibbles we have a set of valid higher nibbles.

Lower Nibble	Valid higher nibble
0	3, 5, 7
1 - 9	3, 4, 5, 6, 7
A	4, 5, 6, 7
B	4, 6
C	1, 4, 6
D, E	4, 6
F	2, 4, 6

So: by higher nibble lookup a bit that indicates that higher nibble by lower nibble lookup a set of valid bits and

This will give us a non-zero when it's OK. And 0 when it's not.

Between iterations we can accumulate error with unsigned min. 0 - is the smallest value - so we will get 0 if there was one.

Offset to correct values.

Similar idea to encode - we just lookup the proper offset and add it. We'll need to lookup by higher nibble.

Char	New Hex code	Res	Offset (- means minus)
'+'	0x1C	62	62 - 0x1C
'/'	0x2F	63	63 - '/'
'0' - '9'	0x30 - 0x39	52 - 61	52 - '0'
'A' - 'O'	0x41 - 0x4F	0 - 14	0 - 'A'
'P' - 'Z'	0x50 - 0x5A	15 - 25	0 - 'A'
'a' - 'o'	0x61 - 0x6F	26 - 40	26 - 'a'
'p' - 'z'	0x70 - 0x7A	26 - 51	26 - 'a'

packIndexesToBytes

NOTE: assume 16 byte registers.

From a register of 16 bytes we need to get 12 bytes. I think the only way to do it is to convert each 4 bytes to appropriate 3 bytes and then put them all together. After we doing conversion 4 to 3 we'll remove the leftover byte with one shuffle.

Goal (One letter - 2 bits):

Input: 0ddd'0ccc'0bbb'0aaa
Needed bytes: 'cddd'bbcc'aaab

We can obviously shift and or to get the desired result (and maybe on ARM we should) but again, shifting story for x86 is poor.

The 0x80 blog suggests the following trick.

Multiply add (MADD) solution

On SSE there are madd instructions for adjacent bytes (_mm_maddubs_epi16) and words (_mm_madd_epi16). Multiply is a strictly more powerful operation than shift, so this will work.

NOTE: all of our numbers are positive so the sign of the operations doesn't matter

Here is the scalar equivalent of the code we'll do. (I omit indicating leading 0s and ignore casts).

std::uint16_t aaabbb = aaa << 6 + bbb;
std::uint16_t cccddd = ccc << 6 + ddd;
std::uint32_t aaabbbcccddd = aaabbb << 12 + cccddd;

Multiplication equivalent of << 6 is * 0x40 and of << 12 is * 0x1000.

Now that we have 0000'aaab'bbcc'cddd. The correct order cddd'bccc'aaab which we can mix into the final shuffle.

looping

We can convert only in registers. There is a question: when can we write the whole register to the output? For example, for 16 bytes of input (no padding) we only have 12 bytes of output space.

A simple way to answer if we have enough space is to compute how much output space we'll need overall. However, this operation is somewhat expensive and is likely already done for the allocation. So computing output size ideally should be avoided.

We can prove that as long as the Register is bigger than 16 bytes, 1.5 Register of input space is enough to guarantee at least one register of the output space.

Proof.

Out of 1.5 Register, 1 Register converts to 0.75 Register in the output. So we are left with proving that 0.5 Register will produce at least 0.25 Register of the output.

If there is no padding - 0.5 converts to 3/8, which is > 0.25. If there is padding, the last dword of input might produce just 1 output byte.

So, we need to prove that:

convertedSize(RegisterSize / 2 - 4) >= RegisterSize / 4 - 1

For 16 bytes it works convertedSize(4) == 3. For Register > 16 bytes, we can split:

convertedSize(RegisterSize / 4) +
convertedSize(RegisterSize / 4 - 4) >=
RegisterSize / 8 +
RegisterSize / 8 - 1

Since convertedSize(RegisterSize / 4) > RegisterSize / 8 we just reduced the problem to 16 bytes.

Proven.

Decoding SWAR

SWAR stands for SIMD Within A Register i.e. using bit tricks to process multiple bytes in a bigger integer register. Previous version of Base64Decode was using some bit tricks and, as a result, was beating our performance for tails.

On top of that, if the platform lacks the necessary SIMD capabilities, SWAR is beneficial, so we decided to implement it.

How does it work?

The simple version of the decoding algorithm matches from the input char to the index: let's say char a is matched to the number 26. So we would first match chars to the corresponding indexes and then blend those indexes together.

Let's say (one letter still 2 bits):

0AAA'0BBB'0CCC'0DDD  // input chars
0aaa'0bbb'0ccc'0ddd  // corresponding bytes
cddd'bbcc'aaab'____  // after a bunch of bit manipulations
                     // (last byte is w/e)

However, what if we incorporated both position and character in the lookup table?

char:                 0AAA
corresponds to index: 0aaa

[0][0AAA] = 0000'0000'aaa0'0000
[1][0AAA] = 0000'aa00'000a'0000
[2][0AAA] = a000'00aa'0000'0000
[4][0AAA] = 0aaa'0000'0000'0000

This way we can just or results for all input chars. Since we have one byte, which is always 0000, we can use that byte for marking errors. So for a busted input we can just store 0xffffffff. We will or everything we output - and then check for 0xffffffff in the end.

This is the formula for each position (d - decoded index for char):

d << 2;                                   // 0: 0000'0000'0000'aaa0
d >> 4 | (d << 12 & 0xff00);              // 1: 0000'0000'bb00'000b
(d << 6 & 0xff00) | (d << 22 & 0xff0000); // 2: 0000'c000'00cc'0000
d << 16;                                  // 3: 0000'0ddd'0000'0000

This is a memory/speed trade off. For a simple table lookup decoding we store 256 bytes. For this decoding we store 256 * 4 * 4 = 4kb. On big enough data this is a win of about a third over naive code. As far as SIMD clean up code is concerned, this is more questionable but for now we decided to go with it given that it shows an improvement in microbenchmarks.