While working on the implementation of merge sort promised in the previous article, I realized that I'd like to use one neat little thing that is worth its own post: a simple strategy for sorting and other comparison-based tasks, which works wonderfully when the input data is small enough.

Suppose that we have a very small array and we want to sort it as fast as possible. Applying some fancy *O(N log N)* algorithm is not a good idea here: although it has optimal asymptotic performance, its logic is too complicated to outperform simple bubble-sort-like algorithms, which take *O(N^2)* time instead. That's why every well-optimized sorting algorithm based on quicksort (e.g. std::sort) or mergesort falls back to a simple quadratic algorithm, typically insertion sort, for sufficiently small subarrays, e.g. N <= 32.

What exactly should we strive for to get an algorithm that is efficient for small N? Here is a list of properties to aim for:

- Avoid branches whenever possible: unpredictable ones are very slow.
- Reduce data dependencies: this allows the processing units in the CPU pipeline to be fully utilized.
- Prefer simple data access and manipulation patterns: this makes it possible to vectorize the algorithm.
- Avoid complicated algorithms: they almost always fail on one of the previous points, and they sometimes do too much work for small inputs.
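As a small illustration of the first two points (my own sketch, not code from this series), here is a branch-free compare-exchange, the building block of sorting networks. Written via std::min/std::max, it typically compiles to conditional moves rather than a data-dependent branch, and a fixed network of such operations has a simple, vectorization-friendly structure:

```cpp
#include <algorithm>

// Compare-exchange: after the call, a <= b.
// std::min/std::max on ints usually compile to conditional moves,
// so there is no unpredictable branch on the data.
inline void cmp_swap(int &a, int &b) {
    int lo = std::min(a, b);
    int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// A 3-element sorting network: a fixed sequence of compare-exchanges,
// identical for every input, with no data-dependent control flow.
inline void sort3(int &a, int &b, int &c) {
    cmp_swap(a, b);
    cmp_swap(b, c);
    cmp_swap(a, b);
}
```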

I decided to start by investigating a simpler problem first, the one solved by std::lower_bound: __given a sorted array of elements and a key, find the index of the first array element greater than or equal to the key__. This investigation soon developed into a full-length standalone article.