I want to add 2 unsigned vectors using AVX2
__m256i i1 = _mm256_loadu_si256((__m256i *) si1); __m256i i2 = _mm256_loadu_si256((__m256i *) si2); __m256i result = _mm256_adds_epu16(i2, i1);
however I need to have overflow instead of saturation that
_mm256_adds_epu16 does to be identical with the non-vectorized code, is there any solution for that?
Use normal binary wrapping
_mm256_add_epi16 instead of saturating
Two’s complement and unsigned addition/subtraction are the same binary operation, that’s one of the reasons modern computers use two’s complement. As the asm manual entry for
vpaddw mentions, the instructions can be used on signed or unsigned integers. (The intrinsics guide entry doesn’t mention signedness at all, so is less helpful at clearing up this confusion.)
_mm_cmpgt_epi32 are sensitive to signedness, but math operations (and
The intrinsics names Intel chose might look like they’re for signed integers specifically, but they always use
si for things that work equally on signed and unsigned elements. But no,
epu implies a specifically unsigned thing, while
epi can be specifically signed operations or can be things that work equally on signed or unsigned. Or things where signedness is irrelevant.
_mm_and_si128 is pure bitwise.
_mm_srli_epi32 is a logical right shift, shifting in zeros, like an unsigned C shift. Not copies of the sign bit, that’s
_mm_srai_epi32 (shift right arithmetic by immediate). Shuffles like
_mm_shuffle_epi32 just move data around in chunks.
Non-widening multiplication like
_mm_mullo_epi32 are also the same for signed or unsigned. Only the high-half
_mm_mulhi_epu16 or widening multiplies
_mm_mul_epu32 have unsigned forms as counterparts to their specifically signed
That’s also why 386 only added a scalar integer
imul ecx, esi form, not also a
mul ecx, esi, because only the FLAGS setting would differ, not the integer result. And SIMD operations don’t even have FLAGS outputs.
The intrinsics guide unhelpfully describes
_mm_mullo_epi16 as sign-extending and producing a 32-bit product, then truncating to the low 32-bit. The asm manual for
pmullw also describes it as signed that way, it seems talking about it as the companion to signed
pmulhw. (And has some bugs, like describing the AVX1
VPMULLW xmm1, xmm2, xmm3/m128 form as multiplying 32-bit dword elements, probably a copy/paste error from
And sometimes Intel’s naming scheme is limited, like
_mm_maddubs_epi16 is a u8 x i8 => 16-bit widening multiply, adding pairs horizontally (with signed saturation). I usually have to look up the intrinsic for
pmaddubsw to remind myself that they named it after the output element width, not the inputs. The inputs have different signedness so if they have to pick one, side, I guess it makes sense to name it for the output, with the signed saturation that can happen with some inputs, like for