utf8lut: Vectorized UTF-8 converter. Technical details

This article is part 4 of the utf8lut: Vectorized UTF-8 converter series:

  1. Introduction
  2. Decoding UTF-8
  3. Encoding UTF-8
  4. Technical details
  5. Test results

Having described how both decoding and encoding of UTF-8 can be vectorized with lookup tables, now it is time to look into various technical difficulties and implementation details.

On C++ Templates

While I lean heavily towards C-style programming in performance-critical code, I still prefer the C++ language over pure C. One of the features that makes C++ so much better for a high-performance project is templates. Used properly, this feature achieves wonderful things in a simple and efficient way. It is a pity that templates are so often misused (or rather, overused), leading to build-time headaches.

For the utf8lut library, I wanted to allow the user to configure the converter. Here is the list of parameters that can be tweaked:

  • Maximum UTF-8 length of the characters supported by the fast path: 3, 2, 1, or even 0. One might want to lower this value for faster processing, depending on the input data.
  • Error-checking mode, being one of:
    1. Fully validate input data, switch to slow path when necessary.
    2. Assume input to be valid UTF-8 and don't validate it, but check for character lengths and switch to slow path when necessary.
    3. No checks and no slow path. Input data is assumed to be valid UTF-8 with characters no longer than the specified maximum length (in UTF-8). The fastest mode, but it cannot handle Unicode supplementary planes at all.
  • Decoded width: UTF-16 or UTF-32. Conversion is done from UTF-8 to this encoding or vice versa.
  • Multistreaming: increase core utilization using software emulation of hyperthreading (see below).

To achieve this level of configurability, the core function is made a template with all the parameters being template arguments. So the user may specify whatever settings they want, and the compiler generates and optimizes that specific version of the algorithm. All compile-time constant checks are removed, along with any dead code they guard. The user can also instantiate this function with different parameters at different places in a simple way.
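To illustrate the idea, here is a minimal sketch (with hypothetical names, not utf8lut code) of how a compile-time flag passed as a template argument lets the compiler fold the branch away entirely:

```cpp
// Minimal sketch: a boolean template argument acts as a compile-time switch.
// In each instantiation the `if (CheckExceed)` condition is a constant, so
// the untaken branch is dead code and is removed by the compiler.
template<bool CheckExceed>
int ProcessByte(int b) {
    if (CheckExceed) {          // compile-time constant condition
        if (b > 0x7F)
            return -1;          // reject non-ASCII byte
    }
    return b + 1;               // dummy "processing"
}
// ProcessByte<true> and ProcessByte<false> are two separate, fully
// specialized functions; the <false> version contains no check at all.
```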

As a result, the core of the decoder looks like this:

/* Template params:
 *   MaxBytes = 1, 2, 3
 *   CheckExceed = false, true
 *   Validate = false, true
 *   OutputType = 2, 4        //UTF16/32
 */
template<int MaxBytes, bool CheckExceed, bool Validate, int OutputType>
struct DecoderCore {
    FORCEINLINE bool operator()(const char *&ptrSource, char *&ptrDest, const DecoderLutEntry<Validate> *RESTRICT lutTable) {
        static_assert(!Validate || CheckExceed, "Validate core mode requires CheckExceed enabled");
        const char *RESTRICT pSource = ptrSource;
        char *RESTRICT pDest = ptrDest;

        if (MaxBytes == 1) {
            ... //simple code for ASCII to UTF-16 or UTF-32 conversion
            return true;
        }
        else {  //MaxBytes = 2 or 3
            __m128i reg = _mm_loadu_si128((__m128i*)pSource);
            if (CheckExceed && !Validate) {
                __m128i pl = _mm_xor_si128(reg, _mm_set1_epi8(0x80U));  //_mm_sub_epi8
                __m128i cmpRes = _mm_cmpgt_epi8(pl, _mm_set1_epi8(MaxBytes == 3 ? 0x6F : 0x5F));
                if (!_mm_cmp_allzero(cmpRes))
                    return false;
            }

            uint32_t mask = _mm_movemask_epi8(_mm_cmplt_epi8(reg, _mm_set1_epi8(0xC0U)));
            if (Validate && (mask & 1))
                return false;
            //note: optimized half-index access
            const DecoderLutEntry<Validate> *RESTRICT lookup = TPNT(lutTable, DecoderLutEntry<Validate>, mask * (sizeof(lutTable[0]) / 2));

            __m128i Rab = _mm_shuffle_epi8(reg, lookup->shufAB);
            Rab = _mm_and_si128(Rab, _mm_set1_epi16(0x3F7F));
            Rab = _mm_maddubs_epi16(Rab, _mm_set1_epi16(0x4001));
            __m128i sum = Rab;

            if (MaxBytes == 3) {
                __m128i shufC = _mm_unpacklo_epi8(lookup->shufC, lookup->shufC);
                __m128i Rc = _mm_shuffle_epi8(reg, shufC);
                Rc = _mm_slli_epi16(Rc, 12);
                sum = _mm_add_epi16(sum, Rc);
            }

            if (Validate) {
                const DecoderLutEntry<true> *RESTRICT lookupX = (const DecoderLutEntry<true> *)lookup;
                __m128i byteMask = lookupX->headerMask;
                __m128i header = _mm_and_si128(reg, byteMask);
                __m128i hdrRef = _mm_add_epi8(byteMask, byteMask);
                __m128i hdrCorrect = _mm_cmpeq_epi8(header, hdrRef);
                __m128i overlongSymbol = _mm_cmplt_epi16(_mm_xor_si128(sum, _mm_set1_epi16(0x8000U)), lookupX->minValues);
                __m128i surrogate = _mm_cmpgt_epi16(_mm_sub_epi16(sum, _mm_set1_epi16(0x6000)), _mm_set1_epi16(0x77FF));
                if (MaxBytes == 2) {
                    __m128i shufC = _mm_unpacklo_epi8(lookupX->shufC, lookupX->shufC);
                    hdrCorrect = _mm_and_si128(hdrCorrect, shufC); //forbid 3-byte symbols
                }
                __m128i allCorr = _mm_andnot_si128(_mm_or_si128(overlongSymbol, surrogate), hdrCorrect);
                if (!_mm_cmp_allone(allCorr))
                    return false;
            }

            if (OutputType == 2)
                _mm_storeu_si128((__m128i*)pDest, sum);
            else {
                __m128i zero = _mm_setzero_si128();
                _mm_storeu_si128((__m128i*)pDest + 0, _mm_unpacklo_epi16(sum, zero));
                _mm_storeu_si128((__m128i*)pDest + 1, _mm_unpackhi_epi16(sum, zero));
            }
            ptrSource += lookup->srcStep;
            ptrDest += lookup->dstStep * (OutputType/2);

            return true;
        }
    }
};
There are many if-s, but most of them are compile-time constant and are wiped out by the compiler. Writing all of the cases separately would obviously be a maintainability nightmare. One could port this to pseudo-templates in pure C, but it would be much clumsier.

You have probably noticed that the template arguments of the DecoderCore struct differ from the parameters listed above. This is because many things, such as the slow path and multistreaming, are handled at the upper level: the level that processes a "buffer" of reasonable size. There the exact parameters can be seen:

enum DecoderMode {
    dmFast,     //decode only byte lengths under limit, no checks
    dmFull,     //decode any UTF-8 chars (with fallback to slow version)
    dmValidate, //decode any UTF-8 chars, validate input
    dmAllCount, //helper
};

/* Params:
 *   MaxBytes = 0, 1, 2, 3
 *   StreamsNum = 1, 4
 *   Mode = fast, full, validate
 *   OutputType = 2, 4
 */
template<int MaxBytes, int OutputType, int Mode, int StreamsNum>
class BufferDecoder : public BaseBufferProcessor {
    ...
};

Slow path

The fast path supports characters of limited byte length, at most covering the whole basic multilingual plane. However, a text may contain some characters from supplementary planes, although it is unlikely that there are many of them. To make the converters universally applicable, a fully scalar slow path is plugged into both the decoder and the encoder. The innermost loop can dynamically switch between the fast and the slow path, so a few supplementary characters won't noticeably affect overall performance.

bool ProcessSimple(const char *&inputPtr, const char *inputEnd, char *&outputPtr, bool isLastBlock) {
    bool ok = true;
    while (inputPtr <= inputEnd - 16) {
        ok = DecoderCore<...>()(inputPtr, outputPtr, ptrTable);
        if (!ok) {
            ok = DecodeTrivial(inputPtr, inputPtr + 16, outputPtr);
            if (!ok) break;
        }
    }
    if (isLastBlock)
        ok = DecodeTrivial(inputPtr, inputEnd, outputPtr);
    return ok;
}

The decoder's slow path consists of the excellent DFA-based algorithm by Bjoern Hoehrmann plus a bit of post-processing: checking when a new character is ready and writing it as one word or as two surrogates. Ironically, the DFA-based algorithm also relies heavily on a lookup table, just like the vectorized fast path. The encoder's slow path is custom-made: first it converts a surrogate pair into a full code point, then determines the byte length and writes out the bytes. The corresponding code can be found in ProcessTrivial.h.
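The surrogate post-processing step is small enough to show in full. Here is a hedged sketch (not utf8lut's exact code, the helper name is invented): a decoded code point becomes one UTF-16 code unit if it lies in the BMP, or a surrogate pair otherwise.

```cpp
#include <cstdint>

// Sketch: write one decoded code point as UTF-16.
// BMP code points (below 0x10000) take a single code unit; supplementary
// code points are split into a high/low surrogate pair.
inline void WriteUtf16(uint32_t cp, uint16_t *&dst) {
    if (cp < 0x10000) {
        *dst++ = (uint16_t)cp;                        // single code unit
    } else {
        cp -= 0x10000;                                // 20-bit offset
        *dst++ = (uint16_t)(0xD800 | (cp >> 10));     // high surrogate
        *dst++ = (uint16_t)(0xDC00 | (cp & 0x3FF));   // low surrogate
    }
}
```

For example, U+1F600 comes out as the pair 0xD83D 0xDE00, while any BMP character is written as-is.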

Critical path and Hyperthreading

One major difference between encoding and decoding algorithms is how they advance pointers. Here are excerpts from the corresponding sections (decode, encode):

//decoding: dynamic input step
_mm_storeu_si128((__m128i*)pDest, sum);
pSource += lookup.srcStep;
pDest += lookup.dstStep;

//encoding: fixed input step
_mm_storeu_si128((__m128i *)pDest, res);
pDest += lookup.dstStep;
pSource += 16;

Why did we prefer a fixed input step in encoding? We could use a dynamic step just as in decoding; then we could avoid processing the two halves separately, probably using fewer instructions on average.

Maybe it does not immediately catch the eye, but a fixed input step makes a huge difference in the performance characteristics of a loop. The difference stems from the fact that modern CPUs are superscalar and execute instructions out of order. The more instructions the CPU can run in parallel, the faster our loop is on average, so it makes perfect sense for the CPU to execute several consecutive iterations of the whole loop in parallel. With a fixed input step, the next iteration can start as soon as the input pointer is advanced, which can happen before anything else inside the loop. The critical path of the loop body consists merely of moving the input pointer by 16 (actually, moving the output pointer has to be done sequentially too). So the CPU can easily execute many iterations in parallel, without having to wait for anything: whenever it stalls on some high-latency instruction, it can simply pick up more instructions from future iterations and start working on them.

Now let's return to the decoding algorithm, where the input step is dynamic. In order to execute anything in the current iteration, 16 bytes of input data have to be loaded first. In order to load them, the address must be known beforehand, and it is computed by adding lookup.srcStep to the address from the previous iteration. To obtain the step value, a LUT entry has to be loaded first, for which the index mask must already be computed. It turns out that loading the input data, computing the LUT index, loading the LUT entry, and advancing pSource comprise a critical path which must be executed sequentially across iterations. The other processing may overlap with this path and run in parallel. The worst part of this loop is loading the LUT entry, which is a random access into a 2 MB table. Given that the L2 cache is still 512 KB per core, one may expect 10-40 cycles wasted on it, which means low utilization of the CPU core.
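The contrast between the two loop shapes can be sketched with a toy example (hypothetical helpers, not utf8lut code). In the fixed-step loop every iteration's position is known up front; in the dynamic-step loop the next position depends on a value loaded in the current iteration, forming a serial load-compute-advance chain:

```cpp
#include <cstddef>
#include <cstdint>

// Fixed step: the position advances by a constant, so the address of every
// future load is known before any work of the current iteration completes.
size_t CountFixedStep(const uint8_t *src, size_t n) {
    (void)src;  // the step never depends on the data
    size_t iters = 0;
    for (size_t pos = 0; pos + 16 <= n; pos += 16)
        ++iters;
    return iters;
}

// Dynamic step: the next position is derived from a value loaded in the
// current iteration (in the real decoder, from the LUT entry), so
// consecutive iterations cannot start until the previous load retires.
size_t CountDynamicStep(const uint8_t *src, size_t n) {
    size_t iters = 0;
    size_t pos = 0;
    while (pos + 16 <= n) {
        size_t step = 16 - (src[pos] & 3);  // toy data-dependent step formula
        pos += step;                        // load -> compute -> advance chain
        ++iters;
    }
    return iters;
}
```

Both functions do trivial work per iteration, yet an out-of-order CPU can overlap many iterations of the first loop while the second is bound by its loop-carried dependency.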

The only way to hide the latency of a memory load is to do more work in parallel. This approach is taken to the extreme by GPUs, which easily hide the huge latency of RAM accesses by switching between an insane number of lightweight threads. Unfortunately, things are more complicated on a CPU. One possible way to improve core utilization is to execute two parallel threads on one core, known as "Hyperthreading" or "simultaneous multithreading", which is supported on many modern x86 CPUs. Unfortunately, Intel is known for removing this feature from some of its processors, and it is still limited to two threads per core. Another way to improve utilization is to emulate hyperthreading in software: interleave the code of several independent loops. This is what I'll call "multistreaming", to differentiate it from "multithreading", since no actual threads are created.

Here is how 4x multistreaming works. When the user asks to convert a memory buffer (e.g. 64 KB in size), we first split this buffer into four equal parts. The splits are adjusted slightly so that no UTF-8 character spans several parts (this is easy thanks to the self-synchronizing property of UTF-8). Then four input and four output pointers are initialized: one pair per stream. The main loop performs one conversion step for every stream per iteration. Since the streams are completely independent, the CPU is free to reorder their processing relative to each other as it sees fit, thus hiding memory latency.
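The split adjustment relies on a property visible directly in the byte patterns: continuation bytes always look like 10xxxxxx, so a tentative split point can simply be moved backward until it lands on a leading byte. A hedged sketch (SplitRange's real logic may differ):

```cpp
// Sketch: move a tentative split point backward past UTF-8 continuation
// bytes (10xxxxxx) so it lands on the first byte of a character.
// This works because UTF-8 is self-synchronizing: a leading byte can be
// recognized without any context.
const char *AdjustSplit(const char *begin, const char *tentative) {
    const char *p = tentative;
    while (p > begin && ((unsigned char)*p & 0xC0) == 0x80)
        --p;  // continuation byte: keep stepping back
    return p;
}
```

For example, a split falling on the second byte of a two-byte character is moved back by one; a split on an ASCII byte stays where it is.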

//split input buffer into four parts
const char *splits[5];
SplitRange(inputBuffer, inputSize, splits);
//init input/output pointers
const char *inputPtr0 = splits[0], *inputEnd0 = splits[1];
const char *inputPtr1 = splits[1], *inputEnd1 = splits[2];
const char *inputPtr2 = splits[2], *inputEnd2 = splits[3];
const char *inputPtr3 = splits[3], *inputEnd3 = splits[4];
char *outputPtr0 = outputBuffer[0];
char *outputPtr1 = outputBuffer[1];
char *outputPtr2 = outputBuffer[2];
char *outputPtr3 = outputBuffer[3];
while (1) {
    //check if any pointer has reached the end of its part
    if (inputPtr0 > inputEnd0 - 16) break;
    if (inputPtr1 > inputEnd1 - 16) break;
    if (inputPtr2 > inputEnd2 - 16) break;
    if (inputPtr3 > inputEnd3 - 16) break;
    //perform one iteration of decoding on every stream (the code is inlined here)
    bool ok0 = DecoderCore<...>()(inputPtr0, outputPtr0, ptrTable);
    bool ok1 = DecoderCore<...>()(inputPtr1, outputPtr1, ptrTable);
    bool ok2 = DecoderCore<...>()(inputPtr2, outputPtr2, ptrTable);
    bool ok3 = DecoderCore<...>()(inputPtr3, outputPtr3, ptrTable);
    //if some stream failed with fast path, try slow path instead
    if (!ok0) ok0 = DecodeTrivial(inputPtr0, inputPtr0 + 16, outputPtr0);
    if (!ok1) ok1 = DecodeTrivial(inputPtr1, inputPtr1 + 16, outputPtr1);
    if (!ok2) ok2 = DecodeTrivial(inputPtr2, inputPtr2 + 16, outputPtr2);
    if (!ok3) ok3 = DecodeTrivial(inputPtr3, inputPtr3 + 16, outputPtr3);
    //if even slow path cannot process data, then input data is invalid
    if (!ok0 || !ok1 || !ok2 || !ok3) break;
}
//finish converting the bytes assigned to every stream
bool ok0 = ProcessSimple(inputPtr0, inputEnd0, outputPtr0, true);
if (!ok0) return false;
bool ok1 = ProcessSimple(inputPtr1, inputEnd1, outputPtr1, true);
if (!ok1) return false;
bool ok2 = ProcessSimple(inputPtr2, inputEnd2, outputPtr2, true);
if (!ok2) return false;
bool ok3 = ProcessSimple(inputPtr3, inputEnd3, outputPtr3, true);
if (!ok3) return false;

The loop terminates when at least one of the pointers hits its end. After that, the remaining work of every stream is finished sequentially. Here we hope that the text is relatively uniform in terms of byte lengths, so that all parts are processed at approximately the same average speed (measured in bytes per iteration); then the portion of data processed without multistreaming will be rather small.

The code sample above is somewhat simplified; the actual multistreaming code can be seen at the beginning of the _Process method in BufferDecoder.h. Performance measurements with and without multistreaming are available at the end of this article. Of course, such a multistreaming improvement is useless for our encoding algorithm, whose input step is already fixed.

Upper levels

The utf8lut library is organized in three levels:

  • Core: template functions to perform one iteration of conversion, LUT precomputation.
  • Buffer: classes for converting a chunk/buffer of data, helpers for passing input and getting output, helper for setting template arguments.
  • Message: functions for performing end-user conversion: either convert a buffer in memory or a file.

The Buffer level is the largest and probably the most important one. It has several base classes:

  • BaseBufferProcessor (in BaseBufferProcessor.h): represents a data processor which converts a chunk of data according to certain conventions. It unifies the interface of the template classes BufferDecoder<...> and BufferEncoder<...>, allowing all higher-level code to be built without templates. Such higher-level code performs virtual calls into the actual implementation, but only a few times per chunk (usually 64 KB), so performance does not suffer.
  • InputPlugin and OutputPlugin (in ProcessorPlugins.h): represent helpers for automatically pushing input data into a buffer processor and pulling output data out of it. Plugins are attached to buffer processor at runtime. For both input and output, two plugins are provided:
    • The contiguous plugin assumes that the whole input/output data is stored in one long contiguous array, and installs consecutive chunks of this array into the processor as input/output buffers. Useful for memory-to-memory conversion: the user no longer has to track all the pointers.
    • The interactive plugin creates a buffer for input/output and expects the user to fill it with data (or take the data out of it) after every chunk is processed. This is useful for streaming conversion, for instance file-to-file: there is no need to read the whole file into memory. The user reads a chunk into the input buffer, converts it, writes the converted chunk out of the output buffer, and repeats.

It is worth noting that multistreaming had a large impact on this architecture. In ordinary single-stream mode, one can perform memory-to-memory conversion completely in place, without any additional buffers; this is how the iconv interface works. With multiple streams, however, there are several independent output buffers, so 1) the output data must be copied out of these buffers, and 2) the input has to be split into chunks to keep the additional memory requirements low. To avoid forcing this mess onto the user, the plugins were created, wrapping these differences in a relatively uniform manner. By the way, an iconv-style interface is also provided, but with some quirks.
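For comparison, here is what the standard POSIX iconv interface looks like (a minimal sketch of plain iconv usage wrapped in a hypothetical helper, not utf8lut's own wrapper):

```cpp
#include <iconv.h>
#include <cstddef>

// Sketch: one-shot conversion through POSIX iconv. Note the in-place style:
// iconv() advances the in/out pointers and decrements the byte counters,
// which is exactly the interface shape that single-stream, in-place
// conversion supports naturally.
size_t ConvertWithIconv(const char *fromEnc, const char *toEnc,
                        char *in, size_t inLen, char *out, size_t outLen) {
    iconv_t cd = iconv_open(toEnc, fromEnc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;                    // encoding pair unsupported
    char *inp = in, *outp = out;
    size_t inleft = inLen, outleft = outLen;
    size_t res = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (res == (size_t)-1)
        return (size_t)-1;                    // conversion error
    return outLen - outleft;                  // bytes written to out
}
```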

Finally, there is the ProcessorSelector template class on the Buffer level (in ProcessorSelector.h), which simplifies creation of a buffer processor. It also supports some error correction (unless multistreaming is enabled). Here is what creating a processor with the selector looks like:

//type of UTF-8 to UTF-16 processor, with full validation and BMP plane support in fast path
typedef ProcessorSelector<dfUtf8, dfUtf16>::WithOptions<cmValidate, 3>::Processor MyProcessor;
//create new processor with error correction enabled
int errorsCount = 0;
BaseBufferProcessor *processor = MyProcessor::Create(&errorsCount);

This processor can later be passed into the various Message-level conversion functions (memory-to-memory or file-to-file conversion), or the user can attach the desired plugins and drive the conversion process manually.
