1. utf8lut: Vectorized UTF-8 converter

    Some time ago I wrote a surprising answer to stackoverflow question "Fastest way to get IPv4 address from string". At that very moment I discovered that _mm_shuffle_epi8 instruction combined with sufficiently large lookup-table can do wonders. Actually, this exact methodology was used to implement vectorized merging algorithm in one of my previous blog posts.

    Inspired by the new discovery, I tried to apply this LUT + shuffle idea to some well-known and moderately generic problem. I tried to apply the idea to accelerate conversion from UTF-8 to UTF-16 and was successful. Initially, I had to postpone all efforts on the problem due to PhD-related activities. After that I thought that I would write a scientific article on the topic. When it became clear that I don't want to lose time on anything like that anymore, I decided to write about it in a blog. As a result, almost three years after the initial work (update: it is four years already), I finally managed to write this report, describing the algorithm and the utf8lut library.

    read more
  2. Vectorizing small fixed-size sort

    After a long break, I can finally return to the topic which was started in the previous blog post.

    Imagine that we have a small array of compile-time constant size with integers. For instance, N = 32. And we want to sort it as fast as possible. What is the best solution for this problem?

    The wisest of you would suggest simply using std::sort, because every modern implementation of it is well-optimized and contains special handling of small subarrays to accelerate the generic quicksort algorithm. The ones who don't trust libraries would suggest using insertion sort: this is the algorithm usually used to handle small cases in std::sort. The performance geeks and regular stackoverflow visitors would definitely point to sorting networks: every question like "the fastest way to sort K integers" ends up with a solution based on them (N=6, N=10, what, why). I'm going to present a much less known way to sort small arrays of 32-bit keys, with performance comparable to sorting networks.

    read more

social