utf8lut: Vectorized UTF-8 converter. Test results

This article is part 5 of the utf8lut: Vectorized UTF-8 converter series:

  1. Introduction
  2. Decoding UTF-8
  3. Encoding UTF-8
  4. Technical details
  5. Test results

This final part of the article is dedicated to the evaluation of the utf8lut library. It explains how the library was tested for correctness and shows detailed performance data.

Fuzz-testing

Unlike typical scalar conversion, utf8lut converts data in 16-byte pieces and uses a lookup table with thousands of entries. Moreover, some corner cases may happen rarely at the Buffer level (e.g. near chunk boundaries). The important question is: how can all of this be tested to make sure there are no bugs?

Fuzz-testing (also called "stress-testing") is the best answer. It is a technique in which test input data is generated automatically (usually randomly) and the output is also checked for correctness automatically, which allows running thousands or millions of randomized tests per second. In fact, fuzz-testing has already been applied to UTF-8 decoders; see e.g. the article by Bryan O'Sullivan or the utf8fuzz project by Andrew Francis.

Checking the output of a UTF conversion is easy, provided that a separate, supposedly bug-free implementation is available. The hardest part is generating random input data. The quality of testing heavily depends on how well the input generator works, especially for vectorized code. A plain random sequence of bytes is detected as invalid too early, with no chance to uncover bugs that happen after the first 16 bytes of input. While simple cases like random bytes or randomly shuffled characters are also used in utf8lut fuzzing, the testing mainly relies on the following procedure (a sketch of the generation steps is shown after the list):

  1. Choose which UTF-8 byte lengths are allowed: any of the 16 subsets of {1, 2, 3, 4}.
  2. Choose how many characters to generate: from 0 to 1000000 (higher counts are less likely to be chosen).
  3. Generate random Unicode characters with the properties chosen in the first two steps.
  4. Choose one of the three encodings (UTF-8, UTF-16, or UTF-32) and encode the generated characters into it.
  5. Apply 1-10 random mutations to this sequence of bytes. There are 10 types of mutations, most of which affect only a few bytes (up to 10). Multiple mutations are very likely to be applied near each other.
  6. Run every type of processor with every set of settings on this input, excluding processors with incomplete validation if the input is known to be invalid. Compare the produced output to the result of an independent reference converter.
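
To make the generation steps concrete, here is a minimal self-contained sketch of the character generation and UTF-8 encoding steps in C++. It is my own illustration, not the library's actual generator; the names randomCodePoint and appendUtf8 are made up for this example.

    #include <cstdint>
    #include <random>
    #include <string>

    // Draw a code point whose UTF-8 encoding has exactly 'len' (1..4) bytes,
    // rejecting the UTF-16 surrogate range, which is not a valid scalar value.
    uint32_t randomCodePoint(int len, std::mt19937 &rng) {
        static const uint32_t lo[] = {0x00, 0x80, 0x800, 0x10000};
        static const uint32_t hi[] = {0x7F, 0x7FF, 0xFFFF, 0x10FFFF};
        std::uniform_int_distribution<uint32_t> dist(lo[len - 1], hi[len - 1]);
        uint32_t c;
        do { c = dist(rng); } while (c >= 0xD800 && c <= 0xDFFF);
        return c;
    }

    // Append the UTF-8 encoding of code point 'c' to the generated test input.
    void appendUtf8(std::string &out, uint32_t c) {
        if (c < 0x80) {
            out += char(c);
        } else if (c < 0x800) {
            out += char(0xC0 | (c >> 6));
            out += char(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += char(0xE0 | (c >> 12));
            out += char(0x80 | ((c >> 6) & 0x3F));
            out += char(0x80 | (c & 0x3F));
        } else {
            out += char(0xF0 | (c >> 18));
            out += char(0x80 | ((c >> 12) & 0x3F));
            out += char(0x80 | ((c >> 6) & 0x3F));
            out += char(0x80 | (c & 0x3F));
        }
    }

The generated byte sequence is then damaged by the mutation step and fed to every processor configuration, and the output is compared against the reference converter.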

Using fuzz-testing, I found 8 different bugs in the library itself (including 1 bug in the slow path and 1 bug in the memory-to-memory converter), and even 1 bug in the reference implementation that I checked against =) The bug that took the most fuzzing time (a few seconds) to find occurred at the split point in the multistreaming UTF-8 decoder: the case when a part of the buffer ends with an incomplete character was handled incorrectly.

I also tried to plug the utf8lut decoder into the utf8fuzz testing framework. Unfortunately, it does not validate the converted output, but only looks at the final verdict: whether the input data is valid or invalid. Not surprisingly, utf8lut passes all of its tests, both the simple and the random ones. As for the ability to find the previous bugs of utf8lut, utf8fuzz is not very powerful. Among the three decoder bugs affecting validation, it finds only the simplest one (an almost non-working check for overlong encodings). More complicated cases, including the aforementioned split bug, were not found even after 5 minutes of random testing (10000 tests). I have no idea why it is so weak.

Performance evaluation

All the data samples used for testing are described below. For each sample, a name is given, along with the number of bytes in its UTF-8 / UTF-16 representations and how many characters of each byte length it contains.

  • [rnd1111:400000] is 999198 bytes in UTF-8 and 999302 bytes in UTF-16, byte length counts are 100124 / 100205 / 100020 / 99651. This is a random sequence of characters. Each character may have length 1, 2, 3, or 4 bytes in UTF-8 representation, and each of the four cases is equiprobable. Within any particular byte length, every character is also chosen with equal probability. This is the hardest sample for any converter, and the fast path of utf8lut does not work on this sample at all.

  • [rnd1110:500000] is 999553 bytes in UTF-8 and 1000000 bytes in UTF-16, byte length counts are 166877 / 166693 / 166430 / 0. It is also a random sequence, but each character may have length 1, 2, or 3 bytes in UTF-8 representation (all three cases equiprobable). This is the hardest test limited to the BMP, and it is supported by the fast path of utf8lut as well as by the other implementations.

  • chinese is 493413 bytes in UTF-8 and 349442 bytes in UTF-16, byte length counts are 15375 / 0 / 159346 / 0. This is A Dream Of Red Mansions by Xueqin Cao in plain-text format, in Chinese. 90% of the characters are 3 bytes long in UTF-8; all the rest is ASCII.

  • russian is 3448147 bytes in UTF-8 and 3918582 bytes in UTF-16, byte length counts are 470435 / 1488856 / 0 / 0. This is War and Peace by Leo Tolstoy in FB2 format, in Russian. While the byte length never exceeds 2, the ratio of 1-byte to 2-byte characters is about 1 : 3.

  • english is 204888 bytes in UTF-8 and 404760 bytes in UTF-16, byte length counts are 201125 / 2 / 1253 / 0. This is Hamlet by William Shakespeare in plain-text format, in English. It is almost entirely ASCII, with less than 1% of characters of other byte lengths.

  • unicode is 171494 bytes in UTF-8 and 340712 bytes in UTF-16, byte length counts are 169575 / 412 / 357 / 6. This is a web page containing a table of Unicode characters, stored in HTML format. It again consists mostly of ASCII, but contains a few characters from the supplementary planes, which allows checking how well the slow path works.

The following implementations were tested:

  • trivial is this library without the fast path: only simple scalar conversion is used. Compiled with Visual C++ 2017 x64.

  • utf8lut:1S is this library with the following settings: maximum byte length 3, full validation mode, single stream. Compiled with Visual C++ 2017 x64.

  • utf8lut:4S is this library with the following settings: maximum byte length 3, full validation mode, 4x multistreaming. In the case of the UTF-8 encoder, it is 4x manual unrolling instead of multistreaming. Compiled with Visual C++ 2017 x64.

  • u8u16 is the library by Robert Cameron based on Parallel Bit Streams. It works only as a decoder. Compiled with MinGW-W64 GCC 8.1.0 x64 in SSSE3 mode.

  • utf8sse4 is the code by Olivier Goffart. It works only as a decoder. Compiled with Visual C++ 2017 x64 with SSE4.1 and SSE4.2 enabled.

The benchmarking code of both u8u16 and utf8sse4 was modified to run the same memory-to-memory conversion a specified number of times; their own benchmarks operate differently.

Finally, there are different machines to test on. A lot of performance testing and tuning was done when I started working on utf8lut, which was four years ago. At that time the Intel Core 2 (Allendale) was the main target CPU, but I no longer have it at my disposal. All the benchmarking was done on a single core of one of the following:

  • Ryzen5 is an AMD Ryzen 5 1600 at a base frequency of 3.2 GHz. It might be worth mentioning that dual-channel DDR4 memory clocked at 3066 MHz is used.

  • i3-5005U is an Intel Core i3-5005U at 2.0 GHz (Broadwell architecture).

Repeated runs

Measuring the performance of code like utf8lut is not an easy task. We will start with a typical benchmark, which in some sense measures the best-case scenario. In this benchmark, conversion is performed from one memory buffer into another. In order to reduce the influence of overhead and measure steady performance, the same conversion is run many times. The memory buffers stay the same across runs; no reallocation is done. Given that the decoder uses a lookup table with 10609/32768 entries, the data sample being converted should be large enough to hit most of the necessary entries.

Performance is reported in CPU cycles spent per input byte. There are two ways to measure this quantity (a sketch of the first one is shown after the list):

  1. Use rdtsc repeatedly to measure only the cycles spent inside the main conversion loop. Accumulate them over the whole run, then divide by the total number of bytes consumed. This excludes various overhead like memory copying and file I/O. Given that utf8lut makes heavy use of the CPU cache, such measurements are not entirely fair.

  2. Measure with rdtsc how many cycles passed between the start and the end of the program, then divide by the total number of input bytes. This includes all the overhead, memory copying, and buffer-related activities. It even includes file I/O and LUT creation (although they are done only once).
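
For the first method, the measurement harness looks roughly like this (a simplified sketch with hypothetical names, not the exact benchmarking code of utf8lut):

    #include <cstddef>
    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>     // __rdtsc on MSVC
    #else
    #include <x86intrin.h>  // __rdtsc on GCC/Clang
    #endif

    static uint64_t g_cyclesInLoop = 0;  // accumulated over all repetitions

    // 'convert' stands for any memory-to-memory converter under test.
    void timedConvert(void (*convert)(const char *, size_t, char *),
                      const char *src, size_t srcSize, char *dst) {
        uint64_t t0 = __rdtsc();
        convert(src, srcSize, dst);        // only the conversion itself is timed
        g_cyclesInLoop += __rdtsc() - t0;
    }

    // After the benchmark: cycles per input byte, excluding I/O and setup.
    double cyclesPerByte(uint64_t totalInputBytes) {
        return double(g_cyclesInLoop) / double(totalInputBytes);
    }

The second method applies the same idea to the whole program run instead of the inner call.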

In the tables below, every cell contains the result measured both ways, unless the benchmarked solution does not implement measurement for one of them. Every data sample was normalized by truncation or repetition to a size of 1 MB. The conversion was performed 10000 times in a row during each program run.

UTF-8 decoding on Ryzen5 (lower is better)

sample            trivial         utf8lut:1S      utf8lut:4S      u8u16         utf8sse4
[rnd1111:400000]  14.33 / 14.33   14.75 / 14.76   16.14 / 16.25   4.31 / 4.55   ??? / 14.70
[rnd1110:500000]  14.82 / 14.83   3.19 / 3.20     1.56 / 1.67     4.06 / 4.30   ??? / 15.78
chinese           6.28 / 6.28     1.28 / 1.28     0.78 / 0.86     4.13 / 4.29   ??? / 2.94
russian           8.64 / 8.65     1.78 / 1.78     0.99 / 1.11     3.59 / 3.86   ??? / 3.80
english           6.01 / 6.02     2.46 / 2.46     1.38 / 1.59     2.00 / 2.26   ??? / 1.76
unicode           5.99 / 5.99     2.46 / 2.47     1.38 / 1.59     1.63 / 1.96   ??? / 12.39

UTF-8 decoding on i3-5005U (lower is better)

sample            trivial         utf8lut:1S      utf8lut:4S      u8u16         utf8sse4
[rnd1111:400000]  15.92 / 15.93   16.89 / 16.89   18.11 / 18.37   4.40 / 4.69   ??? / 16.07
[rnd1110:500000]  16.43 / 16.44   4.02 / 4.03     1.99 / 2.23     3.94 / 4.40   ??? / 16.71
chinese           7.55 / 7.56     1.61 / 1.61     1.10 / 1.23     3.98 / 4.22   ??? / 4.28
russian           10.16 / 10.16   2.23 / 2.23     1.42 / 1.64     3.52 / 3.85   ??? / 4.77
english           7.25 / 7.25     2.93 / 2.93     2.01 / 2.42     1.90 / 2.40   ??? / 1.18
unicode           7.21 / 7.21     2.95 / 2.96     2.02 / 2.43     1.62 / 2.10   ??? / 6.04

A few things to note from the timings of decoding:

  1. u8u16 is the only solution that supports characters from the supplementary planes on the fast path. All the rest fall back to trivial conversion, which is obvious from the results on the [rnd1111:400000] data sample.

  2. The two timings differ for utf8lut:4S and u8u16, because these solutions perform a memcpy on the data. In the case of multistreaming utf8lut, this is necessary because the output for every input chunk goes into four separate buffers. The trivial and utf8lut:1S solutions perform all the work in place without any temporary storage or copies.

  3. utf8sse4 falls back to the slow path (scalar conversion) as soon as it meets the first character from a supplementary plane, and never returns to the fast path. This is clearly seen from its poor results on the unicode data sample.

  4. utf8sse4 has a special code path for ASCII input, which is enabled whenever 16 input characters in a row are ASCII. The algorithm returns to the general code later if needed, so this is quite robust, unlike its handling of non-BMP characters. Thanks to this special path, the solution works much faster on the english and unicode data samples. The u8u16 solution also has such an optimization, while utf8lut does not.

  5. Quite unexpectedly, utf8sse4 does not work well on the [rnd1110:500000] sample, which consists only of BMP characters. After debugging the case, it turned out that utf8sse4 treats the noncharacters U+FFFE-U+FFFF and U+FDD0-U+FDEF as invalid input, so it switches to scalar processing (never switching back) and converts them into replacement characters.

  6. The multistreaming version of the decoder (utf8lut:4S) is faster than the simple version (utf8lut:1S), up to 2x in the best cases. This confirms the idea that the decoding algorithm needs additional latency-hiding techniques.

  7. Under these benchmarking conditions, utf8lut is significantly faster than the other solutions, unless the input data is mostly ASCII or contains too many supplementary-plane characters. Looking at the [rnd1110:500000], chinese, and russian tests, it is at least 2-2.5 times faster. I guess adding all-ASCII acceleration to utf8lut would allow it to catch up on the mostly-ASCII tests too.

  8. As for the trivial solution, branch mispredictions waste about half of its runtime, which is clearly visible from the variance in timing across tests. Data samples like chinese, english, and unicode mostly contain characters of the same byte length, making it possible for the branch predictor to handle the branches. The russian data sample has a 1:3 ratio of 1-byte to 2-byte characters, and performance degrades slightly. Random data samples are much worse: they make the trivial solution 2-2.5 times slower than it is in the easily predicted cases. Vectorized solutions have no such problem. (The mispredicted branch of a typical scalar decoder is sketched right after this list.)
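
A simplified scalar decoding step, just to show where the problematic branch sits (validation omitted; this is not the exact code of the trivial path):

    #include <cstdint>

    // Decode one code point. The chain of comparisons on the leading byte is
    // taken once per character: it is cheap to predict when the text is uniform
    // and hard to predict when byte lengths are mixed.
    const uint8_t *decodeOne(const uint8_t *s, uint32_t &cp) {
        uint8_t b = s[0];
        if (b < 0x80) {                       // 1 byte (ASCII)
            cp = b;
            return s + 1;
        } else if (b < 0xE0) {                // 2 bytes
            cp = (uint32_t(b & 0x1F) << 6) | (s[1] & 0x3F);
            return s + 2;
        } else if (b < 0xF0) {                // 3 bytes
            cp = (uint32_t(b & 0x0F) << 12) | (uint32_t(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return s + 3;
        } else {                              // 4 bytes
            cp = (uint32_t(b & 0x07) << 18) | (uint32_t(s[1] & 0x3F) << 12) |
                 (uint32_t(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return s + 4;
        }
    }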

UTF-8 encoding on Ryzen5 (lower is better)

sample            trivial         utf8lut:1S      utf8lut:4S
[rnd1111:400000]  9.73 / 9.73     10.43 / 10.43   10.31 / 10.31
[rnd1110:500000]  10.40 / 10.40   ??1.50??        ??1.50??
chinese           4.18 / 4.18     ??1.50??        ??1.50??
russian           4.86 / 4.86     ??1.50??        ??1.50??
english           1.95 / 1.95     ??1.50??        ??1.50??
unicode           1.93 / 1.94     ??1.50??        ??1.50??

UTF-8 encoding on i3-5005U (lower is better)

sample            trivial         utf8lut:1S      utf8lut:4S
[rnd1111:400000]  9.36 / 9.36     10.03 / 10.03   9.98 / 9.99
[rnd1110:500000]  9.80 / 9.81     0.86 / 0.86     0.85 / 0.85
chinese           4.22 / 4.23     0.85 / 0.86     0.83 / 0.84
russian           4.78 / 4.79     0.86 / 0.86     0.84 / 0.84
english           1.64 / 1.64     0.86 / 0.86     0.84 / 0.84
unicode           1.59 / 1.60     0.86 / 0.86     0.85 / 0.85

It is important to note:

  1. The utf8lut fast-path encoder has unstable performance on Ryzen5, jumping randomly between 0.6 and 1.5 cycles per byte. This is the reason for putting ??1.50?? in the corresponding cells. I tried to isolate the problem: it happens both with VC++ and GCC, both on Windows and in a Linux VM. I set real-time priority, disabled ASLR, and preallocated all memory chunks as global arrays; nothing helped. The problem does not happen on the Intel CPU. It seems that some random condition heavily harms the performance of the encoder on Ryzen and sticks to the process until it terminates.

  2. Obviously, 4x unrolling does not change anything: the timings are the same for utf8lut:1S and utf8lut:4S. Most likely, the CPU already manages to run several consecutive iterations of the main loop in parallel on its own.

  3. It seems that the performance of the trivial implementation does not depend on branches in the same way as in decoding. Fully random tests are still very slow, but this time the russian sample is no harder than the chinese one. It is also worth noting how fast the trivial solution is on mostly-ASCII samples, reducing the vectorization speedup to a mere +85%.

Large files

The above benchmark has one major problem: it is not clear how well it reflects real-world behavior. The same conversion is run thousands of times, meaning that most of the time is spent on useless work.

In order to make time measurements convincing, there must be no repetition in the converted data. On the other hand, there is no point in measuring conversions that take microseconds, so the amount of data should be large. Given that the converters work at a speed of several gigabytes per second, this means that the data sample itself should be gigabyte-sized. Suppose we have a gigabyte input file, run the conversion on it once, and get a gigabyte output file. In such a case, even a human can see which solution does the job faster. However, with this approach the speed of I/O becomes an issue.

A typical HDD has a sequential read/write speed of about 125 MB/s, which is pretty laughable. A good SSD hits 500 MB/s, which is still much slower than the conversion itself. In the end, I decided to store the files on a RAM disk, which achieves a speed of 4 GB/s in one thread. Although it is still RAM in the physical sense, it is accessed through the same interface as normal disk drives.

As if that was not enough, I implemented a faster type of file I/O based on memory-mapped files under Windows, which accelerates large-file conversion almost twofold. Setting a larger read/write buffer improves speed with traditional fread/fwrite, but it is still noticeably slower than memory-mapped I/O, which seems to avoid some copying of the data. The same can be done on other OSes of course; I was just too lazy to mess with it. Brief experiments with write-combined memory and non-temporal loads/stores did not yield significant benefits, so I stopped here. The read side of the memory-mapped path is sketched below.
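
A minimal sketch of the memory-mapped read path (not utf8lut's exact code; error handling mostly omitted, and the writer side is analogous with PAGE_READWRITE / FILE_MAP_WRITE and a precomputed output size):

    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>

    // Map an input file read-only into the address space. The converter then
    // reads straight from the returned pointer, skipping the extra copy that
    // fread makes into a user-space buffer.
    const char *mapInputFile(const wchar_t *path, size_t &size) {
        HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE)
            return nullptr;

        LARGE_INTEGER fileSize;
        GetFileSizeEx(file, &fileSize);
        size = size_t(fileSize.QuadPart);

        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        const void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
        return static_cast<const char *>(view);
    }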

Given that the buffer size and I/O routines have a huge impact on large-file performance, I did not try to benchmark other libraries. They usually use fread/fwrite I/O with a very small buffer size (probably optimized for HDDs in ancient times), so the competition would hardly be fair. Only a comparison between different versions of utf8lut is provided.

The following data samples were considered:

  • [rnd1110:500000000] is about 10^9 bytes both in UTF-8 and in UTF-16; there are about 166 million characters of each byte length: 1, 2, and 3. This is an artificial test that stresses all unpredictable branches in a converter.

  • zhwikipedia is about 710 million bytes in UTF-8 and 928 million bytes in UTF-16, byte length counts are 340692184 / 848881 / 122696553 / 4649. This is a random XML file from the Chinese Wikipedia dumps.

The benchmarking routine works as follows. It maps the whole input file into the virtual address space, and similarly creates and maps the output file (with a preliminary size surely large enough to hold the output). Then it performs memory-to-memory conversion over the mapped regions. Only a bit of additional memory (less than 1 MB) may be used for temporary buffers. Two timings are measured in the same way as described in the previous section. The conversion is run only once per OS process. The process is started twice in a row, and the timing of the second run is taken into account.

Large file UTF-8 decoding on Ryzen5 (lower is better)

sample               trivial       utf8lut:1S   utf8lut:4S
[rnd1110:500000000]  16.1 / 17.0   4.9 / 6.0    2.5 / 4.5
zhwikipedia          9.0 / 10.1    4.1 / 5.2    2.2 / 4.7

Large file UTF-8 encoding on Ryzen5 (lower is better)

sample               trivial       utf8lut:1S   utf8lut:4S
[rnd1110:500000000]  11.6 / 12.6   3.3 / 4.3    2.8 / 3.8
zhwikipedia          4.8 / 5.6     3.1 / 3.9    3.0 / 3.8

Interesting points:

  1. Everything becomes much slower compared to the repeated benchmark, even by the internal timings. This can be seen on the [rnd1110:500000000] data sample, which has a direct equivalent in the previous benchmarks. The trivial solution is less affected in relative terms (at least because it was much slower to begin with).

  2. The zhwikipedia data sample is rather easy for the trivial solution. I suppose it has many mostly-ASCII and mostly-Chinese runs, so the branch predictor can adapt well to the incoming data. Unfortunately, it is quite hard to find a large data sample with a challenging byte-length distribution, since XML and HTML dumps contain a lot of ASCII junk. This also poses another question: is it worth trying to vectorize conversion at all, given that most real-world data is either quite friendly to the trivial solution or too small to even bother with?

  3. The acceleration from multistreaming is clearly visible in the internal timings, but has very little impact on the overall timings. This is not only because the overall timings are higher; it also seems that some of the acceleration is offset by I/O overhead. Interestingly, UTF-8 encoding exhibits a small but noticeable performance boost from 4x unrolling on random data, an effect that was not present in the previous benchmark.

Conclusion

Some questions still remain. For instance, how would multithreading affect performance, given that RAM is shared between cores? How would LUT-based conversion affect the surrounding code if it is used in the real world? How would hardware hyperthreading compete with multistreaming? If I started looking into all of this, the article would never end =)

The main drawback of the proposed decoding algorithm is, of course, its large lookup table. A possible direction for future work would be trying to eliminate or minimize the use of the LUT. The utf8lut code currently requires only SSSE3; maybe newer instruction sets like BMI2 could substitute for the LUT, as they already do for simpler problems.
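
As a toy illustration of that idea (my own example, not part of utf8lut): a single PEXT can gather the payload bits of a 3-byte sequence 1110xxxx 10yyyyyy 10zzzzzz without any table. A real replacement would still have to handle variable lengths and validation, ideally across a whole vector.

    #include <cstdint>
    #include <immintrin.h>  // _pext_u32 requires BMI2

    uint32_t decode3WithPext(const uint8_t *p) {
        // Pack the bytes so the leading byte lands in the high bits, then keep
        // only the payload bits xxxx, yyyyyy, zzzzzz -- no lookup table needed.
        uint32_t word = (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
        return _pext_u32(word, 0x000F3F3F);
    }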

The code of the whole utf8lut library is available on Bitbucket under a permissive license. It includes all the algorithms and cases described in the article.
