I did some further research, as Sequali now calculates sequence identity using Smith-Waterman, and unfortunately that was bottlenecking report creation. The paper below highlights ways of parallelizing the Smith-Waterman algorithm:
I went for the reverse diagonal approach, since it was immediately clear to me that it only requires keeping 3 diagonals in memory at any time. The striped approach looked to me like it needs the whole matrix in memory. (I haven't properly checked this, though. The diagonals were much more obvious, so I simply started hastily implementing.)
Since queries in Sequali are limited to 31 bp, I could take massive shortcuts using AVX2 vectors. So that is what I did. The result is here: rhpvorderman/sequali#164.
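To make the "only 3 diagonals in memory" idea concrete, here is a minimal scalar sketch. It is not the AVX2 code from the PR above. It assumes a linear gap penalty and made-up scoring values, and the function name sw_score_antidiagonal and the MAX_QUERY constant are hypothetical. Every cell on one anti-diagonal depends only on the two previous anti-diagonals, so the iterations of the inner loop are independent of each other, which is what makes this layout a candidate for SIMD.

```c
#include <stddef.h>
#include <string.h>

#define MAX_QUERY 31 /* assumed limit, mirroring the 31 bp queries mentioned above */

/* Hypothetical scoring values, chosen only for illustration. */
enum { MATCH = 1, MISMATCH = -1, GAP = -1 };

static int max2(int a, int b) { return a > b ? a : b; }

/* Best local alignment score of query (length m) against target (length n),
 * or -1 if the query is longer than MAX_QUERY.
 * Anti-diagonal d holds all cells (i, j) with i + j == d; the buffers are
 * indexed by the query position i. */
int sw_score_antidiagonal(const char *query, size_t m,
                          const char *target, size_t n)
{
    int buf[3][MAX_QUERY + 1];
    int *prev2 = buf[0]; /* diagonal d - 2 */
    int *prev1 = buf[1]; /* diagonal d - 1 */
    int *cur = buf[2];   /* diagonal d */
    int best = 0;

    if (m > MAX_QUERY) {
        return -1;
    }
    memset(buf, 0, sizeof buf); /* diagonals 0 and 1 are all border cells */

    for (size_t d = 2; d <= m + n; d++) {
        for (size_t i = 0; i <= m; i++) {
            /* Border cells and cells outside the matrix are zero. */
            if (i == 0 || i >= d || d - i > n) {
                cur[i] = 0;
                continue;
            }
            size_t j = d - i;
            int s = (query[i - 1] == target[j - 1]) ? MATCH : MISMATCH;
            int h = max2(0, prev2[i - 1] + s);  /* from (i-1, j-1) */
            h = max2(h, prev1[i - 1] + GAP);    /* from (i-1, j)   */
            h = max2(h, prev1[i] + GAP);        /* from (i, j-1)   */
            cur[i] = h;
            best = max2(best, h);
        }
        /* Rotate the three buffers instead of copying them. */
        int *tmp = prev2;
        prev2 = prev1;
        prev1 = cur;
        cur = tmp;
    }
    return best;
}
```

Memory use is three small arrays no matter how long the target is. A vectorized version would process one whole diagonal per step instead of looping over i.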
In theory the same could be done for cutadapt, but it would be more work. No shortcuts can be taken, and 16-bit integer vectors probably need to be used instead of 8-bit integer vectors. Unfortunately it is not possible to write it in a way that the compiler auto-vectorizes properly, so separate instructions need to be written for each architecture. In practice this means a fallback plus an SSE4.1 or AVX2 implementation (ARM64 with NEON instructions in production is still quite a rarity, I guess?).
So: a lot of work and much more code, with a lot of potential speed benefits. That it can be done does not mean it has to be done, of course. But I figured the least I could do is dump some helpful resources in case you like to fiddle with these things.
Ideally, in this case you would want to use
__m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask)
where the mask could be created with any of the _mm_cmp**_epi8 intrinsics.
But blendv is an SSE4.1 instruction, which leads to compile headaches. However, this can be done using SSE2 instructions only:
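A minimal sketch of one common SSE2-only emulation, using and/andnot/or. The helper name blendv_epi8_sse2 is hypothetical. It assumes every mask byte is either all ones or all zeros, which is what the SSE2 compare intrinsics produce. The real _mm_blendv_epi8 only looks at the top bit of each mask byte, so the two differ for other masks.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: per byte, pick b where mask is 0xFF and a where
 * mask is 0x00. Assumes each mask byte is all ones or all zeros. */
static inline __m128i blendv_epi8_sse2(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b),     /* mask & b  */
                        _mm_andnot_si128(mask, a)); /* ~mask & a */
}
```

The argument order follows _mm_blendv_epi8, so the helper could stand in for it wherever the mask comes straight from a compare.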
So this might open up opportunities for vectorization, using only #ifdef __SSE2__ compile guards.
EDIT: This would work for data types other than epi8 as well, of course.
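As a rough illustration of both points (the compile guard and a non-epi8 type), here is a sketch that does a 16-bit select with a scalar fallback. The function select_gt_i16 and its signature are made up for this example and are not taken from any existing code base.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper: dst[i] = key[i] > threshold ? a[i] : b[i]. */
void select_gt_i16(int16_t *dst, const int16_t *key, const int16_t *a,
                   const int16_t *b, size_t n, int16_t threshold)
{
    size_t i = 0;
#ifdef __SSE2__
    /* Vector path: 8 int16 lanes per iteration. */
    const __m128i thr = _mm_set1_epi16(threshold);
    for (; i + 8 <= n; i += 8) {
        __m128i k = _mm_loadu_si128((const __m128i *)(key + i));
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i mask = _mm_cmpgt_epi16(k, thr); /* 0xFFFF where key > thr */
        __m128i r = _mm_or_si128(_mm_and_si128(mask, va),
                                 _mm_andnot_si128(mask, vb));
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }
#endif
    /* Scalar path: handles the tail, or everything when SSE2 is absent. */
    for (; i < n; i++) {
        dst[i] = key[i] > threshold ? a[i] : b[i];
    }
}
```

Because the scalar loop doubles as the tail handler, the guarded block is purely an optimization and the function behaves the same with or without it.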