I have a __mask64 as a result of a few AVX512 operations:
__mmask64 mboth = _kand_mask64(lres, hres);
I would like to count the number of nibbles in this that have all bits set (0xF).
The simple solution is to do this:
uint64 imask = (uint64)mboth;
while (imask) {
if (imask & 0xf == 0xf)
ret++;
imask = imask >> 4;
}
I wanted something better, but what I came up with doesn’t feel elegant:
//outside the loop
__m512i b512_1s = _mm512_set1_epi32(0xffffffff);
__m512i b512_0s = _mm512_set1_epi32(0x00000000);
//then...
__m512i vboth = _mm512_mask_set1_epi8(b512_0s, mboth, 0xff);
__mmask16 bits = _mm512_cmpeq_epi32_mask(b512_1s, vboth);
ret += __builtin_popcount((unsigned int)fres);
The above puts a 0xff byte into a vector where a 1 bit exists in the mask, then gets a 1-bit in the bits mask when the blown-up 0xf nibbles now are now found as 0xffffffff int32‘s.
I feel that two 512-bit operations are way overkill when the original data lives in a 64-bit number. This alternate is probably much worse; it’s too many instructions and still operates on 128 bits:
//outside the loop
__m128i b128_1s = _mm_set1_epi32(0xffffffff);
//then...
uint64 maskl = mboth & 0x0f0f0f0f0f0f0f0f;
uint64 maskh = mboth & 0xf0f0f0f0f0f0f0f0;
uint64 mask128[2] = { (maskl << 4) | maskl, (maskh >> 4) | maskh };
__m128i bytes = _mm_cmpeq_epi8(b128_1s, *(__m128i*)mask128);
uint bits = _mm_movemask_epi8(bytes);
ret += __builtin_popcount(bits);
>Solution :
With just some scalar operations you can do this:
imask &= imask >> 2;
imask &= imask >> 1;
ret += std::popcount(imask & 0x1111111111111111);
The first two steps put, for every nibble, the horizontal AND of the bits of that nibble in the least significant bit of that nibble. The other bits of the nibble become something that we don’t want here so we just mask them out. Then popcount the result.