Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

arithmetic intensity of zgemv versus dgemv/sgemv?

The arithmetic intensity of sgemv (or dgemv) is derived in this set of exercises (https://florian.world/wp-content/uploads/FM-High-Performance-Computing-I-Assignment-1.pdf) to be:
0.5 / (1+c), where c is a constant.

I was wondering whether zgemv, the complex-counterpart of these operations, has the same arithmetic intensity as the purely real ones?

I think as follows:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

A complex-complex multiplication of the type (a+b*i) * (c+d*i), where i^2=-1, is equal to (ac - cd) + i*(bc + ad), so there are 4 multiplications and two additions, for a total of 6 ops. In contrast, to multiply two purely real numbers, only one multiplication is needed.

To load a complex number, two double precision numbers have to be loaded, so this increases the memory traffic by 2 as well.

In total, the arithmetic intensity shall be roughly 3 times larger?

Many thanks!

>Solution :

gemv is a matrix-vector product, so each element of the result is a dot-product of the vector and a row or column of the matrix. That’s a multiply and add, not just multiply.

On modern hardware, that’s done with one FMA (Fused Multiply-Add) for about the same cost as one multiply, effectively getting the add part for free.

An optimized ZGEMV would also use some FMAs for the complex multiply and add, like 2 multiplies, 2 FMAs for just a complex multiply.

Or four FMAs if adding to existing accumulators of real and imaginary parts. (Creating two long dependency chains, so you’d need to unroll more to hide it, although you do already have instruction-level parallelism between the FMAs that add into the real vs. imaginary parts.)

So with FMA, it should be about 2x the computational intensity: 4x as many FMAs for 2x as many loads per "element".


This is assuming the complex numbers are stored with separate arrays of real and imaginary parts, so SIMD loads can get vectors of [r0, r1, r2, r3] and [i0, i1, i2, i3], keeping the problem pure vertical with no shuffling like you’d need with a [r0,i0, r1,i1, ...] interleaved layout (Array of Structs aka AoS). If you need shuffles, then computational intensity is higher, but it’s not "useful" work, and it’s potentially on different execution units than the ones that can run FMA.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading