I have the following code snippet (a gist can be found here) where I am trying to add 4 negative int32_t values to 4 int16_t values (which are sign-extended to int32_t).
extern exit
global _start

section .data
a: dd -76, -84, -84, -132
b: dw 406, 406, 406, 406

section .text
_start:
    movdqa xmm0, [a]
    pmovsxwd xmm2, [b]
    paddq xmm0, xmm2
    ; Expected: 330, 322, 322, 274
    ; Results:  330, 323, 322, 275
    call exit
However, when stepping through this in my debugger, I couldn’t understand why the results differ from what I expected. Any ideas?
paddq does 64-bit qword chunks, so there’s carry across two of the 32-bit boundaries, leading to an off-by-one in the high half of each qword.
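To make the carry concrete, here is a minimal scalar sketch in C of what one 64-bit lane of paddq does to the two dwords that share it. The function name `paddq_lane` is hypothetical and only models the arithmetic; it is not SIMD code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of ONE 64-bit lane of paddq.
   Each array holds the two 32-bit elements that share the qword:
   index 0 is the low dword (bits 0..31), index 1 the high dword (bits 32..63). */
void paddq_lane(const int32_t a[2], const int32_t b[2], int32_t out[2])
{
    assert(a && b && out);

    /* Pack each pair of dwords into one qword, the way the register holds them. */
    uint64_t qa = ((uint64_t)(uint32_t)a[1] << 32) | (uint32_t)a[0];
    uint64_t qb = ((uint64_t)(uint32_t)b[1] << 32) | (uint32_t)b[0];

    /* One 64-bit add: a carry out of bit 31 propagates into bit 32,
       i.e. into the high dword. */
    uint64_t sum = qa + qb;

    /* Split the qword back into the two dwords a debugger would show. */
    out[0] = (int32_t)(uint32_t)sum;
    out[1] = (int32_t)(uint32_t)(sum >> 32);
}
```

With the question's low pair, a = {-76, -84} and b = {406, 406}, the low dword comes out as 330 and produces a carry, so the high dword is 323 instead of 322. The second pair, {-84, -132}, gives 322 and 275 the same way.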
paddd uses 32-bit dword chunks, matching the dword destination element size of pmovsxwd. It's a SIMD operation doing 4 separate adds, independent of each other.
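As a rough scalar equivalent, the pmovsxwd + paddd pair behaves like the loop below. This is a sketch under the assumption that a plain C model is useful here; `widen_and_add` is a made-up name, not anything from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of pmovsxwd followed by paddd:
   four lanes, each handled on its own with no carry between them. */
void widen_and_add(const int32_t a[4], const int16_t b[4], int32_t out[4])
{
    assert(a && b && out);

    for (int i = 0; i < 4; i++) {
        /* pmovsxwd: sign-extend each 16-bit word to a 32-bit dword. */
        int32_t widened = (int32_t)b[i];

        /* paddd: wrapping 32-bit add per lane. The add is done on unsigned
           values to model wraparound without signed-overflow UB in C. */
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)widened);
    }
}
```

Feeding it the question's data, a = {-76, -84, -84, -132} and b = {406, 406, 406, 406}, gives 330, 322, 322, 274, which are the expected results.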
BTW, you could have made this more efficient by folding the 16-byte aligned load into a memory operand for paddd, but yeah, for debugging it can help to see both inputs in registers with a separate load.
default rel            ; use RIP-relative addressing modes when possible

_start:
    pmovsxwd xmm0, [b]     ; sign-extend 4 words to 4 dwords
    paddd    xmm0, [a]     ; 4 independent 32-bit adds; [a] must be 16-byte aligned
Also you’d normally put read-only arrays in