Dirty hands coding

utf8lut: Vectorized UTF-8 converter
04 August 2019 | in High performance | tags: unicode simd lut

Some time ago I wrote a surprising answer to stackoverflow question "Fastest way to get IPv4 address from string". At that very moment I discovered that _mm_shuffle_epi8 instruction combined with sufficiently large lookup-table can do wonders. Actually, this exact methodology was used to implement vectorized merging algorithm in one of my previous blog posts.

Inspired by the new discovery, I tried to apply this LUT + shuffle idea to some well-known and moderately generic problem. I tried to apply the idea to accelerate conversion from UTF-8 to UTF-16 and was successful. Initially, I had to postpone all efforts on the problem due to PhD-related activities. After that I thought that I would write a scientific article on the topic. When it became clear that I don't want to lose time on anything like that anymore, I decided to write about it in a blog. As a result, almost three years after the initial work (update: it is four years already), I finally managed to write this report, describing the algorithm and the utf8lut library.
This article consists of many parts:
read more
Vectorizing small fixed-size sort

05 January 2018 | in High performance | tags: cmov simd sort

After a long break, I can finally return to the topic which was started in the previous blog post.

Imagine that we have a small array of compile-time constant size with integers. For instance, N = 32. And we want to sort it as fast as possible. What is the best solution for this problem?

The wisest of you would suggest simply using std::sort, because every modern implementation of it is well-optimized and contains special handling of small subarrays to accelerate the generic quicksort algorithm. The ones who don't trust libraries would suggest using insertion sort: this is the algorithm usually used to handle small cases in std::sort. The performance geeks and regular stackoverflow visitors would definitely point to sorting networks: every question like "the fastest way to sort K integers" ends up with a solution based on them (N=6, N=10, what, why). I'm going to present a much less known way to sort small arrays of 32-bit keys, with performance comparable to sorting networks.
read more
Performance comparison: linear search vs binary search
25 August 2017 | in High performance | tags: binary search cmov iaca simd

While working on an implementation of merge sort promised in the previous article, I realized that I'd like to use one neat little thing, which is worth its own post. It is a simple strategy for sorting or doing comparison-based tasks, which works wonderfully when input data is small enough.

Suppose that we have a very small array and we want to sort it as fast as possible. Indeed, applying some fancy O(N log N) algorithm is not a good idea: although it has optimal asymptotic performance, its logic is too complicated to outperform simple bubble-sort-like algorithms which take O(N^2) time instead. That's why every well-optimized sorting algorithm based on quicksort (e.g. std::sort) or mergesort includes some simple quadratic algorithm which is run for sufficiently small subarrays like N <= 32.

What exactly should we strive for to get an algorithm efficient for small N? Here is the list of things to look for:
1. Avoid branches whenever possible: unpredictable ones are very slow.
2. Reduce data dependency: this allows to fully utilize processing units in CPU pipeline.
3. Prefer simple data access and manipulation patterns: this allows to vectorize the algorithm.
4. Avoid complicated algorithms: they almost always fail on one of the previous points, and they sometimes do too much work for small inputs.
I decided to start investigating a simpler problem first, which is solved by std::lower_bound: given a sorted array of elements and a key, find index of the first array element greater or equal than the key. And this investigation soon developed into a full-length standalone article.
read more

social