This article is part 1 of the utf8lut: Vectorized UTF-8 converter series:
Some time ago I wrote a surprising answer to stackoverflow question "Fastest way to get IPv4 address from string". At that very moment I discovered that _mm_shuffle_epi8 instruction combined with sufficiently large lookup-table can do wonders. Actually, this exact methodology was used to implement vectorized merging algorithm in one of my previous blog posts.
Inspired by the new discovery, I tried to apply this LUT + shuffle idea to some well-known and moderately generic problem. I tried to apply the idea to accelerate conversion from UTF-8 to UTF-16 and was successful. Initially, I had to postpone all efforts on the problem due to PhD-related activities. After that I thought that I would write a scientific article on the topic. When it became clear that I don't want to lose time on anything like that anymore, I decided to write about it in a blog. As a result, almost three years after the initial work (update: it is four years already), I finally managed to write this report, describing the algorithm and the utf8lut library.
Converting UTF-8 to UTF-16 is a well-known problem. The most elegant and also quite efficient solution to UTF-8 decoding was presented by Bjoern Hoehrmann in his article. While there are plenty of approaches and libraries to do this simple conversion, there are not so many attempts at vectorizing this process. This is not surprising, because UTF-8 is essentially a variable length encoding, and dealing with dynamic data layout in SSE/AVX is a huge problem. I'll briefly describe here all the attempts at UTF-8 vectorization that I've heard of. In fact, utf8lut is based on many ideas which were already used in these solutions.
This library used to have a website (currently offline). It is best described in a technical article u8u16 - A High-Speed UTF-8 to UTF-16 Transcoder Using Parallel Bit Streams.
The whole idea of vectorization is based on the so-called "parallel bit streams". Imagine that each byte is a struct with eight fields, each field being a single bit. When programs receives an array of bytes, it efficiently receives array of eight-bit structs, e.g. data in AoS layout. It is well-known that this layout is not convenient for vectorization, the SoA layout is much more convenient. If we convert the data to SoA, we will get eight separate arrays of bits: one will have exactly all the 0-th bits of bytes, one will have exactly all the 1-st bits of bytes, ..., the last one will have all the 7-th (last) bits of the bytes. Each such array of bits is called a bit stream, and there are eight bit streams in total.
In each processing step, u8u16 first loads next 1024 bits of input data as eight
__m128i values (AoS layout). Then it converts these values into eight bit streams (SoA layout), each bit stream has 128 bits exactly fitting into a
__m128i value. So there is one
__m128i containing 0-th bit of all the 128 bytes read, one
__m128i containing 1-st bits of all the 128 bytes read, etc. The actual UTF-8 decoding operates on these bit streams by performing trivial bitwise operations on them, like: and, andnot, or, xor. For any arbitrary condition, you can obtain a bitmask showing which bytes satisfy it. Using this methodology, 16 bit streams (note: the number has doubled) are computed for the UTF-16 code units. In UTF-16 streams, not all the 128 positions have meaningful data: usually the 16-bit code unit is attached to the position of the last byte in its UTF-8 encoding. The unused positions must be removed from the bit streams, so stream compaction algorithm is run. Then the 16 compacted bit streams are converted back into sequence of byte pairs (AoS layout), the code units are written to output and stream pointers are advanced forward.
The good point of this approach is that the middle part of the algorithm is perfectly vectorized: 128 bits are processed in one instruction, using very cheap bitwise logical operations. Also, most of the algorithm is branchless, including even the stream compaction. However, converting into bit stream representation and back takes time. Also, the bit streams methodology does not solve the problem of data moving dynamically between positions. This problem is localized to the stream compaction of bit streams, which looks rather heavy.
This algorithm was described in article UTF-8 processing using SIMD, which was posted on Woboq's website. This code is also available on GitHub. It was born due to an attempt to accelerate
QString::fromUtf8 function in Qt5, but it turned out that in the real world negligible portion of input strings are long/hard enough to benefit from acceleration.
The algorithm uses direct approach, and processes 14-16 bytes simultaneously in one step. To avoid complexity of surrogate pairs, non-BMP characters are handled outside of vectorized code path (just like in utf8lut). Immediately after reading 16 bytes of input, the algorithm classifies all the bytes into the leading bytes and the continuation bytes, and for each leading byte, the length of the corresponding code point is determined (4 classes in total). After that, 16-bit unicode values are calculated. Similar to u8u16, each obtained unicode value is attached to position where the last byte of its UTF-8 encoding is located. One
__m128i contains high bytes of these 16-bit values, and the other contains low bytes of them. Just like in u8u16, empty positions must be removed using stream compaction algorithm. But the stream compaction algorithm itself is closer to one used in utf8lut: single instruction
_mm_shuffle_epi8 is used to do the compaction itself. But unlike utf8lut, considerable number of instructions is spent on computing the shuffle mask.
Since only 16 bytes of data are processed at once (not 128 bytes like in u8u16), the whole algorithm is much more compact. It is branchless, except for a bit of conditional at the very end for advancing the source pointer. However, this algorithm uses quite a lot of instructions. Often it has to compute some value 3-4 times (once per each class of bytes) and blend the results together based on the mask of classes. Also, it recklessly uses a few string instructions from SSE 4.2 in validation, which are known to be slow. On the bonus side, it does not use CPU cache for anything, unlike utf8lut.
This algorithm was probably presented in a conference according to slides Accelerating UTF-8 Decoding using SIMD Instructions. Since this work was done in IBM research, it was applied only to PowerPC with its AltiVec instructions. IBM probably explains why the author follows the shitty practice of filing US patents "like hotcakes", including the US7394411 patent on the UTF-8 decoding algorithm.
The single step of the algorithm works as follows. First, information about the next 8 code points is gathered from the input byte array: the lengths of these eight code points are determined and combined into a single integer number. The information is gathered by scalar traversal of corresponding leading bytes without any vectorization, but the traversal is done in branchless fashion. The obtained integer number ranges from 0 to 6560, and it serves as index in 6561-entry lookup table. For each configuration of code points, the LUT contains some bitmasks, and most importantly, the shuffling mask. This data is enough to shuffle data into desired position and to produce the UTF-16 data for exactly the next 16 bytes.
The algorithm is very similar to utf8lut. It uses shuffling by mask generated during runtime in order to cope with the problem of dynamic movement of bytes in the vectors. The shuffling mask and the auxiliary data are also taken from lookup table. The main difference is how the LUT index is computed: utf8lut uses suitable SSE2 operations to get the index, while the algorithm by Inoue uses scalar instructions (quite a lot of them actually). Aside from that, the presentation is less detailed: it omits validation, encoding problem (UTF-16 to UTF-8). Finally, it uses AltiVec, while utf8lut is implemented exclusively for SSSE3. One positive side of the Inoue's algorithm is that its LUT is more compact than in utf8lut.