Some time ago I wrote a surprising answer to stackoverflow question "Fastest way to get IPv4 address from string". At that very moment I discovered that _mm_shuffle_epi8 instruction combined with sufficiently large lookup-table can do wonders. Actually, this exact methodology was used to implement vectorized merging algorithm in one of my previous blog posts.
Inspired by the new discovery, I tried to apply this LUT + shuffle idea to some well-known and moderately generic problem. I tried to apply the idea to accelerate conversion from UTF-8 to UTF-16 and was successful. Initially, I had to postpone all efforts on the problem due to PhD-related activities. After that I thought that I would write a scientific article on the topic. When it became clear that I don't want to lose time on anything like that anymore, I decided to write about it in a blog. As a result, almost three years after the initial work (update: it is four years already), I finally managed to write this report, describing the algorithm and the utf8lut library.
This article consists of many parts: