utf8lut: Vectorized UTF-8 converter. Decoding UTF-8

This article is part 2 of the utf8lut: Vectorized UTF-8 converter series:

  1. Introduction
  2. Decoding UTF-8
  3. Encoding UTF-8
  4. Technical details
  5. Test results

Since we only support the basic plane (BMP) in the fast path, all code points must not exceed 0xFFFF, so each one fits into 16 bits. For simplicity, we will convert from UTF-8 into UTF-16 here. If one needs conversion to UTF-32, it is trivially obtained from the UTF-16 output by interleaving its code units with zeros.
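As an illustration, such widening can be done with a pair of SSE2 unpack instructions. This snippet is not part of utf8lut, and the pointer names pSrc16 and pDst32 are made up for the example:

__m128i utf16 = _mm_loadu_si128((const __m128i*)pSrc16);                   //eight UTF-16 code units
__m128i zero  = _mm_setzero_si128();
_mm_storeu_si128((__m128i*)pDst32 + 0, _mm_unpacklo_epi16(utf16, zero));   //first four code points, zero-extended
_mm_storeu_si128((__m128i*)pDst32 + 1, _mm_unpackhi_epi16(utf16, zero));   //last four code points, zero-extended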

All the pictures in this article are drawn using the following endianness convention. The bytes are ordered in little-endian fashion: the lowest-order byte goes first (on the left side), and the highest-order byte goes last (on the right side). The bits within every single byte are shown in big-endian fashion: the highest-order bit is on the left, and the lowest-order bit is on the right. The same big-endian notation applies to hexadecimal numbers. If a word larger than a byte is drawn as a single chunk without any splitting lines inside, then the whole word is written in big-endian.

Idea

The core algorithm of utf8lut follows the same generic approach as described here:

  1. Obtain enough information about the input, so that the needed shuffle control mask can be uniquely determined from this information.

  2. Convert this information into a compact bitmask in a general-purpose register (usually done by some shuffling and a movemask intrinsic).

  3. (optional) Apply a perfect hash function (e.g. some multiplicative hash) to the bitmask to obtain a number of smaller size.

  4. Use the obtained number as an index into a lookup table. The corresponding element of the table must contain the desired control mask for shuffling (precomputed beforehand).

In the case of UTF-8 decoding, knowing which of the 16 input bytes are continuation bytes is enough to fully determine the structure of the UTF-8 data: we can determine how the bytes are grouped into code points, except perhaps for the last code point. So the first two steps are rather straightforward. The last step is also relatively clear, except that utf8lut puts quite a lot of data into the LUT (four __m128i values per entry). A perfect hash function is not used.

Input data

Consider the picture just below. It shows that bytes in UTF-8 data can be classified into five types: either the first byte of a 1-byte, 2-byte, 3-byte or 4-byte code point, or a continuation byte. Each class has a fixed bit representation: a header of 1-5 bits followed by 3-7 bits of payload.
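In textual form, the five classes look like this (x marks payload bits):

  type 1:  0xxxxxxx   (only byte of a 1-byte code point, 7 payload bits)
  type 2:  110xxxxx   (first byte of a 2-byte code point, 5 payload bits)
  type 3:  1110xxxx   (first byte of a 3-byte code point, 4 payload bits)
  type 4:  11110xxx   (first byte of a 4-byte code point, 3 payload bits)
  type c:  10xxxxxx   (continuation byte, 6 payload bits)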

The picture also shows the classification of bytes in a sample text. The sample text is the same as in this article: it is x≤(α+β)²γ², represented in UTF-8 as x\e2\89\a4(\ce\b1+\ce\b2)\c2\b2\ce\b3\c2\b2.

Get LUT index

We do not need full information about the classes: knowing which bytes are continuation bytes is perfectly enough to deduce how the bytes are grouped into code points. So we start the algorithm by taking a 16-bit mask of continuation bytes:

Note that one nasty limitation of SSE2 is that it supports only signed comparisons on integers. In the current case this works out fine: the bytes that compare less than 0xC0 in signed interpretation are exactly the bytes in range [0x80..0xC0), which are exactly the continuation bytes:

__m128i reg = _mm_loadu_si128((__m128i*)pSource);
uint32_t mask = _mm_movemask_epi8(_mm_cmplt_epi8(reg, _mm_set1_epi8(0xC0U)));
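For the first 16 bytes of the sample text, the continuation bytes sit at positions 2, 3, 6, 9, 12 and 14 (counting from zero), so the mask comes out as 0b0101001001001100 = 0x524C.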

Fetch from LUT

Now we can fetch the precomputed data from the lookup table. We do not know exactly what we need yet, so let's load a single entry struct with contents to be determined:

const DecoderLutEntry &lookup = lutTable[mask];

Shuffle

To get a sequence of code points, first we need to shuffle the input bytes. We can perform any shuffle, as long as it stays fully within the 16 bytes. If we treat the shuffled data as eight 16-bit words, the most convenient layout is the one which has the last byte of each code point in the low half of its word, and the second-to-last byte in the high half. If a code point is three bytes long, then we additionally put its starting byte into a separate __m128i register by doing one more shuffle. Here is how this shuffle applies to the sample input:

Note that these pictures are drawn in little-endian convention, so low-order bytes are on the left, and high-order ones are on the right. The byte-granular shuffle is performed using _mm_shuffle_epi8 from SSSE3, and the shuffle masks shufAB and shufC must be precomputed for each LUT entry. The bytes depicted as white cells are filled with zeros by setting the corresponding values in the shuffle mask to -1.

__m128i Rab = _mm_shuffle_epi8(reg, lookup.shufAB);
__m128i Rc = _mm_shuffle_epi8(reg, lookup.shufC);

Extract bits

From now on, each 16-bit word of the __m128i values is processed independently: information will not move horizontally any more. For this reason, only one 16-bit word from each __m128i value is shown in the remaining illustrations.

Now we need to extract all useful bits from Rab and Rc and combine them into a single 16-bit value.

Let's start with Rab. Due to the way the shuffling works, the low byte of Rab is either of type c or of type 1, and the high byte is either of type c or of type 2 (or zero). The types are defined here. So the low byte has either 6 or 7 bits of payload, and the high byte has either 5 or 6 bits of payload. Recall that the bit just before the payload is always zero. It means that we can mask away the headers by doing a bitwise AND with a constant:

Rab = _mm_and_si128(Rab, _mm_set1_epi16(0x3F7F));

In order to compress the bits of Rab, it is enough to shift its high byte right by two bits, so that two of its bits are carried into the low byte. Instead of doing the shift in a straightforward way, we implement it with a single _mm_maddubs_epi16 instruction. This instruction computes dot products of byte pairs, so we can use it to compute (low + high * 64):

Rab = _mm_maddubs_epi16(Rab, _mm_set1_epi16(0x4001));
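For example, take the 16-bit word of Rab holding the two-byte character α (UTF-8 bytes \ce\b1): the low byte is 0xB1 and the high byte is 0xCE. After the AND they become 0x31 and 0x0E, and the multiply-add produces 0x31 + 0x0E * 64 = 0x03B1, which is exactly the code point of α.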

The last step is to take Rc into account. Its low byte is always of type 3 (or zero), with exactly 4 bits of payload. We simply shift it left by 12 bits, so that the payload gets into the high-order 4 bits (and the header disappears), then add it to Rab:

__m128i sum = _mm_add_epi16(Rab, _mm_slli_epi16(Rc, 12));
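Similarly, for the three-byte character ≤ (UTF-8 bytes \e2\89\a4) the word of Rab holds 0xA4 and 0x89, which after masking and the multiply-add give 0x24 + 0x09 * 64 = 0x0264, while the corresponding word of Rc holds 0xE2 in its low byte; shifting it left by 12 leaves 0x2000, and the sum is 0x0264 + 0x2000 = 0x2264, the code point of ≤.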

Advance pointers

Now sum actually contains 16-bit code points, and we can write them into the destination buffer. The non-trivial question is: how many code points are present in sum? The answer depends on the structure of the 16 input bytes:

  • If the input contains only three-byte characters, then only five complete characters fit into the 16-byte chunk, thus only five UTF-16 code points will be obtained.

  • If the input contains only ASCII characters, then eight UTF-16 code points will be obtained, since the shuffle is limited to 16 bytes of data (both on the input and the output side). Note that the remaining eight characters read from the input will be processed on the following iteration.

  • If the input contains only two-byte characters, then seven UTF-16 code points will be obtained. The last character will not be processed, because it is not yet clear whether it is two or three bytes long: that depends on whether the next byte is a continuation byte or not.

As a result, it is not obvious how many bytes of input have been consumed and how many bytes of output have been produced. Computing these numbers on the fly is not going to be fast, so the simplest approach is to take them from the lookup table. Compared to two __m128i shuffle masks, two small integers won't waste much memory anyway.

_mm_storeu_si128((__m128i*)pDest, sum);
pSource += lookup.srcStep;
pDest += lookup.dstStep;
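For instance (assuming dstStep stores the doubled symbol count, i.e. bytes, as the struct comment in the Lookup table section below suggests), the entry for an all-ASCII chunk would have srcStep = 8 and dstStep = 16, the all-two-byte entry srcStep = 14 and dstStep = 14, and the all-three-byte entry srcStep = 15 and dstStep = 10.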

That's it! The code for converting valid input with any BMP characters from UTF-8 into UTF-16 is complete. In about 15 instructions, it processes at least 14 bytes of input or 16 bytes of output. There are no branches involved.

Validation

Of course, a standard-compliant UTF-8 decoder must support four-byte characters and must check for invalid input, so we are not done yet. Four-byte characters will be processed in the scalar slow path, which will also perform full input validation. So in the fast path we only need to detect that something went wrong and return an error code in that case. For the sake of simplicity, we will treat four-byte UTF-8 characters as wrong input in this section.

Wrong header. While we extracted the payload bits from the input data, we never checked that the byte headers are correct. We need to check the condition (value & mask) == header for each consumed input byte. Since the header has all of its bits set except the last one, it can be computed as header = mask*2 (with 8-bit wraparound). The mask is different for bytes of different types, so it is hard to compute directly. That's why we take a __m128i value with all the masks from the lookup table.

__m128i byteMask = lookup.headerMask;
__m128i header = _mm_and_si128(reg, byteMask);
__m128i hdrRef = _mm_add_epi8(byteMask, byteMask);
__m128i hdrCorrect = _mm_cmpeq_epi8(header, hdrRef);

In the lookup table, each byte of headerMask has 1-4 high bits set, depending on the type of the byte (1, c, 2, or 3 respectively). For the last few bytes, which do not constitute a complete code point yet, headerMask is set to zero, which effectively disables the check.
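Concretely, the per-byte values work out as follows (the required header is just byteMask doubled, with the overflowing bit dropped):

  type 1:  byteMask = 0x80,  required header = 0x00   (0xxxxxxx)
  type c:  byteMask = 0xC0,  required header = 0x80   (10xxxxxx)
  type 2:  byteMask = 0xE0,  required header = 0xC0   (110xxxxx)
  type 3:  byteMask = 0xF0,  required header = 0xE0   (1110xxxx)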

Overlong encoding. Although it is technically possible to represent an ASCII character with two or three bytes in UTF-8, the standard explicitly forbids such overlong encodings for the sake of uniqueness of representation. Each code point must be encoded with the minimum possible number of bytes.

There are two cases of overlong encoding worth checking: when a code point decoded from a two-byte sequence is less than 0x80, or when a code point decoded from a three-byte sequence is less than 0x800. Since at this point it is hard to see which code points came from which byte lengths, we simply take a __m128i value with the lower bounds from the lookup table:

__m128i overlongSymbol = _mm_cmplt_epi16(_mm_xor_si128(sum, _mm_set1_epi16(0x8000U)), lookup.minValues);

Note that we need an unsigned comparison here, which is absent in SSE2. The standard workaround (as described here) is to subtract 0x8000 from both values and compare them in a signed way; the minValues stored in the LUT are already shifted this way.
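For example, for a word decoded from a three-byte sequence the stored bound would presumably be 0x0800 ^ 0x8000 = 0x8800. An overlong result such as sum = 0x0041 maps to 0x8041, which is less than 0x8800 in the signed sense, so it gets flagged; a legal result such as 0x4100 maps to 0xC100, which is not.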

Surrogate code point. The code points in range [0xD800..0xE000) are used for surrogates in UTF-16, and they are forbidden in valid UTF-8 text. We can simply check whether a code point lies within this range. In order to do it in one comparison, a signed variation of the well-known trick can be used: we subtract a constant value such that one of the range bounds coincides with the overflow boundary. In our case, subtracting 0x6000 from the right bound 0xE000 turns it into 0x8000, so after the subtraction any signed value greater than or equal to 0x7800 is within the range.

__m128i surrogate = _mm_cmpgt_epi16(_mm_sub_epi16(sum, _mm_set1_epi16(0x6000)), _mm_set1_epi16(0x77FF));
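For example, sum = 0xD801 gives 0xD801 - 0x6000 = 0x7801 > 0x77FF, so it is flagged; sum = 0xE000 gives 0x8000, which is negative as a signed 16-bit value and therefore passes the check, as it should, since 0xE000 is a valid code point.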

Combine. If any of the three checks fails, then the input is wrong. Recall that the header check produces 0x00 bytes on failure, while the other two checks produce 0xFF bytes on failure. Since all bits within a byte are either all set or all clear, using _mm_movemask_epi8 at the end suffices.

__m128i allCorr = _mm_andnot_si128(_mm_or_si128(overlongSymbol, surrogate), hdrCorrect);
if (_mm_movemask_epi8(allCorr) != 0xFFFF)
    return false;

Wrong LUT index. Finally, not every 16-bit mask used as a LUT index corresponds to a valid UTF-8 sequence. For example, having three continuation bytes in a row is wrong, yet such an index can be obtained from wrong input.

In order to make sure that a wrong index produces an error, we fill every such LUT entry with special values, so that validation fails regardless of the input data. One way to achieve it is:

  1. Set shufAB and shufC to -1 (i.e. all bits set). It is easy to see that sum will surely be zero in this case.

  2. Set minValues to 0x7FFF. Then the check for overlong encoding will flag any value of sum except 0xFFFF as wrong.

Clearly, sum cannot be zero and 0xFFFF at once, so the validation will surely fail.
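A sketch of how such a poisoned entry could be filled (field names as in the usage code above; the exact construction in the library may differ):

DecoderLutEntry bad;
bad.shufAB     = _mm_set1_epi8(-1);       //every shuffled byte becomes zero, so sum == 0
bad.shufC      = _mm_set1_epi8(-1);
bad.headerMask = _mm_setzero_si128();     //irrelevant here: zero disables the header check
bad.minValues  = _mm_set1_epi16(0x7FFF);  //flags every sum except 0xFFFF as overlong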

Full code

Here is the full listing once again, with validation included:

__m128i reg = _mm_loadu_si128((__m128i*)pSource);

//decode input
uint32_t mask = _mm_movemask_epi8(_mm_cmplt_epi8(reg, _mm_set1_epi8(0xC0U)));
const DecoderLutEntry &lookup = lutTable[mask];
__m128i Rab = _mm_shuffle_epi8(reg, lookup.shufAB);
__m128i Rc = _mm_shuffle_epi8(reg, lookup.shufC);
Rab = _mm_and_si128(Rab, _mm_set1_epi16(0x3F7F));
Rab = _mm_maddubs_epi16(Rab, _mm_set1_epi16(0x4001));
__m128i sum = _mm_add_epi16(Rab, _mm_slli_epi16(Rc, 12));

//validate input
__m128i byteMask = lookup.headerMask;
__m128i header = _mm_and_si128(reg, byteMask);
__m128i hdrRef = _mm_add_epi8(byteMask, byteMask);
__m128i hdrCorrect = _mm_cmpeq_epi8(header, hdrRef);
__m128i overlongSymbol = _mm_cmplt_epi16(_mm_xor_si128(sum, _mm_set1_epi16(0x8000U)), lookup.minValues);
__m128i surrogate = _mm_cmpgt_epi16(_mm_sub_epi16(sum, _mm_set1_epi16(0x6000)), _mm_set1_epi16(0x77FF));
__m128i allCorr = _mm_andnot_si128(_mm_or_si128(overlongSymbol, surrogate), hdrCorrect);
if (_mm_movemask_epi8(allCorr) != 0xFFFF)
    return false;   //stop and report failure

//advance
_mm_storeu_si128((__m128i*)pDest, sum);
pSource += lookup.srcStep;
pDest += lookup.dstStep;

Lookup table

The lookup table eats quite a lot of space, which is not good. In the naive version described above, there are 65536 entries, each containing four __m128i values and two small integers. Some space optimization would be handy.

Entry size optimization. First, we can reduce the size of a single entry a bit. Notice that half of the bytes in the shufC control mask are always -1, because only the lower halves of the 16-bit words of Rc are actually used. If we strip these bytes away, shufC fits into 8 bytes instead of 16. The freed 8 bytes can be used to store the srcStep and dstStep integers. Here is the resulting layout:

struct DecoderCoreInfo {
    __m128i shufAB;                     //shuffling mask to get lower two bytes of symbols
    union {
        __m128i shufC;                  //shuffling mask to get third bytes of symbols
        struct {
            uint32_t _shufC_part0;
            uint32_t _shufC_part1;
            uint32_t srcStep;           //number of bytes processed in input buffer
            uint32_t dstStep;           //number of symbols produced in output buffer (doubled)
        };
    };
    __m128i headerMask;                 //mask of "111..10" bits required in each byte
    __m128i minValues;                  //minimal value allowed for not being overlong (sign-shifted, 16-bit)
};

Of course, when we read the shufC value from the LUT, we have to unpack it. It does not matter what we shuffle into the higher halves of the 16-bit words of Rc: the higher halves will later be removed by the shift anyway. That's why the following code can be used for unpacking:

__m128i shufC = _mm_unpacklo_epi8(lookup.shufC, lookup.shufC);
__m128i Rc = _mm_shuffle_epi8(reg, shufC);      //line 7 of original algorithm
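For example, if the stored 8 bytes of shufC are s0 s1 ... s7, the unpack produces s0 s0 s1 s1 ... s7 s7, so each 16-bit word of the control mask contains the intended byte index in both of its halves; the extra copy shuffled into the high half is later discarded by the shift by 12.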

Entry count optimization. Given that the input data is valid, the first byte on each iteration must be the starting byte of some character. Therefore it cannot be a continuation byte, so the LUT index (i.e. mask) must be even. Hence all odd entries of the lookup table are invalid, and they can easily be removed from the table.

When accessing the LUT without odd entries, we can use some rough type casting to avoid an unnecessary division by two. Also, we should not forget that for validation purposes the algorithm must honestly check that the mask is even and report an error otherwise.

if (mask & 1)
    return false;
const DecoderLutEntry &lookup = *(DecoderLutEntry*)(    //replacing line 5 of original algorithm
    ((char*)lutTable) + mask * (sizeof(lutTable[0]) / 2)
);

This leaves us with 32768 entries in the table, out of which 10609 are valid. While this looks like an opportunity for further optimization, it is very hard to exploit. The layout of invalid cells is quite complicated, so a direct transformation from the mask to an index of smaller size is not feasible. It is possible to use perfect hashing, but it won't help much: I tried to find a 32-bit Knuth multiplicative hash for a table of size 16384, and it does not exist. Another idea is to use a two-level lookup (similar to how virtual memory addresses are translated). It could probably save 50% of the space, but it would add one more indirection.

Final memory usage. After the two optimizations presented above, the lookup table has 32768 entries of 64 bytes each. This is a whopping 2 MB of lookup data (or half of that if validation is disabled). It fits well into the L3 cache of modern CPUs, but unfortunately does not fit into the L2 cache. Is this a big problem? Well, yes it is. At the very least, it is clearly the weakest point of the utf8lut algorithm.

In fact, decoding performance depends on the distribution of character lengths in the input data. If one-byte, two-byte and three-byte characters are mixed randomly, then all the valid LUT entries will be hit in random order. This is the worst-case scenario, but it is very unlikely to happen in the real world. The distribution of character lengths in real data shows much less variation. If only one language is used, then the text will be an almost pure mixture of either 1-byte and 2-byte (European) or 1-byte and 3-byte (Asian) characters. This greatly reduces the number of hot entries. In addition to that, not all combinations are equally likely due to how words are composed, which helps CPU caching even further. All of this allows us to remain optimistic about the issue.

Time/space tradeoff. One really effective way of reducing the lookup table size is to cap the number of input bytes processed per iteration. Imagine that the input register is not 16 bytes wide, but only (16-k) bytes wide. Generate the lookup table under this assumption. In the decoding algorithm, remove the highest k bits of the mask by ANDing it with a constant.

This approach makes the lookup table 2^k times smaller, but slows down the algorithm by roughly k/16 * 100%. For instance, if we drop the highest 4 bits, then the lookup table will have 2048 entries and a total size of 128 KB, but processing a given text will need about 33% more iterations.
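A sketch of what the change could look like in the decoding loop (the table name lutTable12 and the choice k = 4 are illustrative, not the library's actual identifiers):

uint32_t mask = _mm_movemask_epi8(_mm_cmplt_epi8(reg, _mm_set1_epi8(0xC0U)));
mask &= 0x0FFF;                                     //ignore bytes 12..15 on this iteration (k = 4)
const DecoderLutEntry &lookup = lutTable12[mask];   //smaller LUT generated for 12-byte inputs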

utf8lut code

The algorithm described above is used as the "core" of the utf8lut library. The relevant files are:

The code samples provided in this article describe only the most powerful version of the fast path: decoding up to three-byte characters with full validation. The library code also handles simpler cases: it is possible to disable validation and/or to reduce the maximum supported byte length of characters.
