utf8lut: Vectorized UTF-8 converter. Encoding UTF-8

This article is part 3 of the utf8lut: Vectorized UTF-8 converter series:

Conversion from UTF-16 to UTF-8 will be covered in this article. Since the fast path supports only basic plane, we don't need to convert surrogate pairs. Converting from UTF-32 works the same way, but two times more data is read per iteration and it is compressed into UTF-16 format first.

The images in this article represent binary or hexadecimal numbers in big-endian convention as usual in math, while all sequences of bytes or words are shown in little-endian as they are laid in memory.

Idea

The encoding algorithm will follow the same general idea (as described in the previous part): we extract information into a bitmask, then shuffle using precomputed control register. One notable difference is that since all supported UTF-16 characters are two bytes long, it is possible to process all 16 bytes of input at each iteration. However, UTF-8 output may be larger than UTF-16 input if three-byte characters are present, in which case output from one iteration may not fit into 16 bytes. For this reason, we will start with a code path supporting only one- and two- byte UTF-8 characters, and then will extend it to three-byte characters.

Input data (up to U+07FF)

Consider sample text: α+β-q=ελ (1)

Here Greek letters take two bytes in UTF-8, and other characters are ASCII. We are more interested in UTF-16 representation, which is \03B1\002B\03B2\002D\0071\003D\03B5\03BB\0020\0028\0031\0029. The image below shows classification of bytes/words, assuming little-endian version of UTF-16.

Get LUT index

First of all, we shift every 16-bit word right by 6 bits. It can probably be avoided right now, but this data will be helpful later for composing the second byte in UTF-8 characters.

__m128i reg = _mm_loadu_si128((const __m128i *)pSource);
__m128i levelA = reg;
__m128i levelB = _mm_srli_epi16(reg, 6);

The two-byte UTF-8 characters can be easily detected by looking which words are larger than 0x007F. This is equivalent to levelB having any bit set, except for the lowest one:

__m128i lenGe2 = _mm_cmpgt_epi16(levelB, _mm_set1_epi16(0x0001U));

This results in a bitmask with 8 words, each spanning two bytes. Ideally, we should use_mm_movemask_epi16 intrinsic to extract one bit from every 16-bit word, but such instruction does not exist in SSE. So we compact lower bytes using a byte shuffle and use _mm_movemask_epi8 instead:

__m128i lensAll = _mm_shuffle_epi8(lenGe2, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1));
uint32_t mask = _mm_movemask_epi8(lensAll);

Fetch from LUT

Now mask is a bitmask ranging from 0 to 255, and we use it for indexing a lookup table:

const EncoderLutEntry &lookup = lutTable[mask];

If EncoderLutEntry is large, multiplication by size is inserted on assembly level. It can be spared by tweaking shuffle mask for lensAll.

Shuffle

If you recall how UTF-8 is encoded (see table from previous part), you will notice that the lower bits of UTF-8 output can be taken from the lower bits of levelA and levelB variables. The higher bits can then be overwritten with proper UTF-8 byte headers.

SSE shuffle only operates on 16-byte registers, so both input and output must fit into 16 bytes. Since we consider every output character to take 1 or 2 bytes in UTF-8, the output surely fits into 16-byte register. As for levelA and levelB values, each of them contains eight 16-bit words, but we only need the lower half from every word. Hence, we can mix all of these lower bytes together into one register:

__m128i levBA = _mm_xor_si128(levelB, _mm_slli_epi16(levelA, 8));

Now these bytes can be shuffled using the mask fetched from LUT:

The mask only depends on which words are ASCII and which are not, and this information is present in the bitmask mask used as index, so proper lookup table can indeed be constructed.

__m128i res = _mm_shuffle_epi8(levBA, lookup.shuf);

Fix headers

It remains only to overwrite the higher bits with UTF-8 headers, and UTF-16 output will be obtained. This is achieved in the same way as we validated headers in decoding. LUT entry contains a 16-byte bitmask, showing which of the bits must be overwritten (depending on which bytes are 1-byte start, 2-byte start, or continuation byte). Only the last bit of the header is zero, so multiplying the bitmask by two produces the header itself.

__m128i header = lookup.headerMask;
res = _mm_andnot_si128(header, res);
res = _mm_xor_si128(res, _mm_add_epi8(header, header));

Advance pointers

Since we always process the whole 16-bytes of input, advancing input pointer is trivial. As for output pointer, we can put the number of bytes encoded into the LUT entry, since it varies depending on the type of characters.

_mm_storeu_si128((__m128i *)pDest, res);
pDest += lookup.dstStep;
pSource += 16;

This is all for encoding characters up to U+07FF. Now let's see how supporting three-byte characters changes the algorithm.

Three-byte characters

The main problem with three-byte characters is that the output from 16 bytes of input can take up to 24 bytes of output, i.e. the data is expanding due to these characters. Since we cannot do our fancy shuffles across 24 bytes, we have to split input data into two halves: each half contains 8 bytes, being four UTF-16 words. These halves are processed independently most of the time, although for better performance we still try to process them together where possible.

Another thing to note is that third level is added. Previously, we had levelA and levelB variables, aligned to 0-th and 6-th bits respectively. Now we also have levelC aligned to 12-th bit, which should go into shuffling to produce first bytes of the three-byte UTF-8 characters.

Input data (up to U+FFFF)

Consider a string: ∫∂α◎∂t≤β∓λ

Its UTF-16 representation is \222B\2202\03B1\25CE\2202\0074\2264\03B2\2213\03BB.

Here is how it looks in bytes, with every byte colored by its type:

LUT indices

We need full information about byte-lengths of all input characters. Every character will take 2 bits in the LUT index with three possible values. It does not matter how the these possible values look exactly, as long as they are different for every byte length. I tried several ways and chose the one which looked fastest to me.

First, let's check which characters won't fit into 1 byte, and which characters won't fit into 2 bytes. This is done easily by comparing either the original word or the levelB shifted value against a threshold:

__m128i reg = _mm_loadu_si128((const __m128i *)pSource);
__m128i levelB = _mm_srli_epi16(reg, 6);
__m128i lenGe2 = _mm_cmpgt_epi16(levelB, _mm_set1_epi16(0x0001U));
__m128i lenGe3 = _mm_cmpgt_epi16(levelB, _mm_set1_epi16(0x001FU));

These two bitmasks together contain all the information we need, but they take 32 bytes. Obviously, they contains 16 words, each being 0x0000 or 0xFFFF. So we can compress every word into a byte and combine these bitmasks together:

__m128i lensMix = _mm_xor_si128(_mm_srli_epi16(lenGe3, 8), lenGe2);
uint32_t allMask = _mm_movemask_epi8(lensMix);

The lower 8 bits of allMask correspond to the first four characters, and the higher 8 bits correspond to the last four characters. Since shuffling would be performed separately for these halves, we should split them now:

uint32_t mask0 = (allMask & 255U);
uint32_t mask1 = (allMask >> 8U);

Shuffles

The values to be shuffled are not ready yet. Since we shuffle each half separately, we want to obtain two 16-byte values: one should contain all the needed bytes for the first four characters, and another one should contain same data for the last four characters. For every character we need three bytes:

(A) byte with word's 7 lowest bits
(B) byte with bits 6-11 of the word
(C) byte with word's 4 highest bits

As usual, there are many ways to achieve this, and I'll show the one I found to be the fastest. First, let's take the original data and shift right by 4 bits the higher byte in every word. There is no SSE instruction to shift only odd bytes, but we can emulate it using _mm_maddubs_epi16 intrinsic:

__m128i levAC = _mm_maddubs_epi16(_mm_and_si128(reg, _mm_set1_epi16(0xF0FFU)), _mm_set1_epi16(0x1001U));

Now levAC contains 8 words, and every word has (A) in its lower byte and (C) in its higher byte. Recall that we have already computed (B) in levelB variable when producing LUT index. Now we combine lower halves of these variables into level0, and higher halves into level1:

__m128i levels0 = _mm_unpacklo_epi64(levAC, levelB);
__m128i levels1 = _mm_unpackhi_epi64(levAC, levelB);

It does not matter how exactly the (A), (B), (C) bytes are packed in these values, as long as you properly account for their order when precomputing the shuffle mask. From now on, the processing is done separately for two halves. Shuffling allows us to move every byte into its final place in the UTF-8 output:

const EncoderLutEntry &lookup0 = lutTable[mask0];
const EncoderLutEntry &lookup1 = lutTable[mask1];
__m128i res0 = _mm_shuffle_epi8(levels0, lookup0.shuf);
__m128i res1 = _mm_shuffle_epi8(levels1, lookup1.shuf);

Headers and pointers

For every byte, we overwrite 1-4 highest bits with UTF-8 header:

__m128i header0 = lookup0.headerMask;
__m128i header1 = lookup1.headerMask;
res0 = _mm_andnot_si128(header0, res0);
res1 = _mm_andnot_si128(header1, res1);
res0 = _mm_xor_si128(res0, _mm_add_epi8(header0, header0));
res1 = _mm_xor_si128(res1, _mm_add_epi8(header1, header1));

Lastly, we store every half separately and advance pointers. Of course, these instructions must be done in proper sequence:

_mm_storeu_si128((__m128i *)pDest, res0);
pDest += lookup0.dstStep;
_mm_storeu_si128((__m128i *)pDest, res1);
pDest += lookup1.dstStep;
pSource += 16;

Here finishes UTF-8 encoding algorithm with support for the whole BMP.

Validation

Unlike UTF-8, validation of UTF-16 input is very simple. Surrogate words is the only possible problem in the input. They are the code points in range [0xD800..U+E000), and we can check for them easily at the very beginning:

__m128i diff = _mm_sub_epi16(reg, _mm_set1_epi16(0x5800U));
if (_mm_movemask_epi8(_mm_cmplt_epi16(diff, _mm_set1_epi16(0x8800U))) != 0x0000)
    return false;

Here we use the same trick as before for checking range inclusion with one comparison.

Full code

Here is the full listing again, with whole BMP support and validation:

__m128i reg = _mm_loadu_si128((const __m128i *)pSource);

//validate input
__m128i diff = _mm_sub_epi16(reg, _mm_set1_epi16(0x5800U));
if (_mm_movemask_epi8(_mm_cmplt_epi16(diff, _mm_set1_epi16(0x8800U))) != 0x0000)
    return false;

//encode input
__m128i levelB = _mm_srli_epi16(reg, 6);
__m128i lenGe2 = _mm_cmpgt_epi16(levelB, _mm_set1_epi16(0x0001U));
__m128i lenGe3 = _mm_cmpgt_epi16(levelB, _mm_set1_epi16(0x001FU));
__m128i lensMix = _mm_xor_si128(_mm_srli_epi16(lenGe3, 8), lenGe2);
uint32_t allMask = _mm_movemask_epi8(lensMix);
uint32_t mask0 = (allMask & 255U);
uint32_t mask1 = (allMask >> 8U);
__m128i levAC = _mm_maddubs_epi16(_mm_and_si128(reg, _mm_set1_epi16(0xF0FFU)), _mm_set1_epi16(0x1001U));
__m128i levels0 = _mm_unpacklo_epi64(levAC, levelB);
__m128i levels1 = _mm_unpackhi_epi64(levAC, levelB);
const EncoderLutEntry &lookup0 = lutTable[mask0];
const EncoderLutEntry &lookup1 = lutTable[mask1];
__m128i res0 = _mm_shuffle_epi8(levels0, lookup0.shuf);
__m128i res1 = _mm_shuffle_epi8(levels1, lookup1.shuf);
__m128i header0 = lookup0.headerMask;
__m128i header1 = lookup1.headerMask;
res0 = _mm_andnot_si128(header0, res0);
res1 = _mm_andnot_si128(header1, res1);
res0 = _mm_xor_si128(res0, _mm_add_epi8(header0, header0));
res1 = _mm_xor_si128(res1, _mm_add_epi8(header1, header1));

//advance
_mm_storeu_si128((__m128i *)pDest, res0);
pDest += lookup0.dstStep;
_mm_storeu_si128((__m128i *)pDest, res1);
pDest += lookup1.dstStep;
pSource += 16;

Note that this time lookup table contains only 256 entries. So even with 64 bytes per entry, it takes only 16 KB of space.

utf8lut code

The algorithm described above is used at the "core" of the utf8lut library. The relevant files are:

core/EncoderProcess.h: the algorithm for encoding UTF-8
core/EncoderLut.h: lookup table data types
core/EncoderLut.cpp: lookup table generation code

While the article concentrates on the most powerful full-BMP validating code path, the library allows to disable some features for faster processing.

Part 2: Decoding UTF-8

Part 4: Technical details

social