SSE4 packed sum between int32_t and int16_t (sign-extended to int32_t)

I have the following code snippet, where I am trying to add 4 negative int32_t values to 4 int16_t values (which are sign-extended to int32_t first).

    extern  exit

    global _start

    section .data

a:     dd -76, -84, -84, -132
b:     dw 406, 406, 406, 406
    section .text

_start:
    movdqa   xmm0, [a]      ; load 4 x int32_t
    pmovsxwd xmm2, [b]      ; sign-extend 4 x int16_t to 4 x int32_t
    paddq    xmm0, xmm2
    ; Expected: 330, 322, 322, 274
    ; Results:  330, 323, 322, 275
    call exit

However, when stepping through the code in my debugger, I couldn’t understand why the results differ from the expected ones. Any idea?

>Solution :

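paddq adds 64-bit qword chunks, so the carry out of each low dword propagates across the 32-bit boundary into the high dword of the same qword. That is what produces the off-by-one in the high half of each qword (323 instead of 322, and 275 instead of 274).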

paddd adds 32-bit dword chunks, matching the dword destination element size of pmovsxwd. It is a SIMD operation that performs 4 separate adds, independent of each other.
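
To make the carry visible, here is a minimal scalar sketch in C that models both instructions on the question's data. The function names `model_paddd` and `model_paddq` are made up for illustration, and the code only imitates the lane arithmetic; it does not execute the real instructions.

```c
#include <stdint.h>

/* Model of paddd: four independent 32-bit adds, each wrapping modulo 2^32. */
void model_paddd(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}

/* Model of paddq: the same 128 bits treated as two 64-bit lanes. */
void model_paddq(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    for (int q = 0; q < 2; q++) {
        /* Little-endian layout: the even dword is the low half of the qword. */
        uint64_t qa = ((uint64_t)(uint32_t)a[2 * q + 1] << 32) | (uint32_t)a[2 * q];
        uint64_t qb = ((uint64_t)(uint32_t)b[2 * q + 1] << 32) | (uint32_t)b[2 * q];
        uint64_t sum = qa + qb;   /* carry out of bit 31 flows into bit 32 */
        out[2 * q]     = (int32_t)(uint32_t)sum;
        out[2 * q + 1] = (int32_t)(uint32_t)(sum >> 32);
    }
}
```

With a = {-76, -84, -84, -132} and b = {406, 406, 406, 406}, `model_paddq` gives 330, 323, 322, 275 and `model_paddd` gives 330, 322, 322, 274. Each low dword is negative (for example 0xFFFFFFB4 for -76), so adding 406 wraps past 2^32 and the carry lands in the neighbouring high dword.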

BTW, you could have made this more efficient by folding the 16-byte aligned load into a memory operand for paddd, but for debugging it can help to see both inputs in registers with a separate load.

    default rel           ; use RIP-relative addressing modes when possible

    pmovsxwd xmm0, [b]    ; sign-extend 4 x int16_t to 4 x int32_t
    paddd    xmm0, [a]    ; 4 independent 32-bit adds; [a] must be 16-byte aligned

Also you’d normally put read-only arrays in section .rodata.
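
If you ever write this in C instead of asm, the pmovsxwd + paddd pair can be sketched with SSE4.1 intrinsics roughly as follows. The function name `add_i32_i16` is hypothetical, the per-function `target` attribute assumes GCC or Clang, and the scalar branch is only there so the sketch also builds on non-x86 machines. It uses unaligned loads so it works with any caller buffer, which means the compiler will not necessarily fold the load into paddd.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>                     /* SSE4.1 intrinsics */

/* GCC/Clang: enable SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
void add_i32_i16(const int32_t a[4], const int16_t b[4], int32_t out[4])
{
    /* 8-byte load plus sign extension, like pmovsxwd xmm0, [b]. */
    __m128i vb = _mm_cvtepi16_epi32(_mm_loadl_epi64((const __m128i *)b));
    /* Unaligned 16-byte load of the four int32_t values. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    /* paddd: four independent 32-bit adds. */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(vb, va));
}
#else
/* Portable fallback for non-x86 targets: same result, scalar code. */
void add_i32_i16(const int32_t a[4], const int16_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)(int32_t)b[i]);
}
#endif
```

Calling it with the arrays from the question fills `out` with 330, 322, 322, 274.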
