Why is the simple loop not vectorized and slower than a SIMD calculation?

Why does the compiler not use SIMD instructions for a simple loop computing a sum, even when compiling with -O3 and -march=native?

Consider the following two functions:

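#include <immintrin.h>
#include <cstddef>
#include <vector>

// Assumes vec.size() is a multiple of 8; otherwise the last load reads past the end.
float sum_simd(const std::vector<float>& vec) {
    __m256 a = _mm256_setzero_ps();
    for (std::size_t i = 0; i < vec.size(); i += 8) {
        __m256 tmp = _mm256_loadu_ps(&vec[i]);
        a = _mm256_add_ps(tmp, a);
    }
    float res{0.0};
    // Indexing a __m256 with [] is a GCC/Clang vector extension.
    for (std::size_t i = 0; i < 8; ++i) {
        res += a[i];
    }
    return res;
}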

float normal_sum(const std::vector<float>& vec) {
    float sum{0};
    for (size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}

The compiler seems to turn the summations into:

vaddps  ymm0, ymm0, ymmword ptr [rax + 4*rdx]

and

vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 4]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 8]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 12]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 16]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 20]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 24]
vaddss  xmm0, xmm0, dword ptr [rcx + 4*rsi + 28]

When I run this on my machine, I get a substantial speedup (~factor 10) from the SIMD sum. The same is true on Godbolt. See here for the code.

I compiled the program with GCC 13 and Clang 17 and used the options -O3 -march=native.

Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?

Solution:

Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?

Yes. -ffast-math solves this issue (see on Godbolt). Here is the main loop with this additional flag:

.L10:
        vaddps  ymm1, ymm1, YMMWORD PTR [rax]     ;     <---------- vectorized
        add     rax, 32
        cmp     rcx, rax
        jne     .L10

However, note that -ffast-math is a combination of several more specific flags, some of which can be quite dangerous. For example, -funsafe-math-optimizations and -ffinite-math-only can break existing code that relies on infinities or NaNs, or reduce its precision. In fact, some algorithms, such as Kahan summation, require the compiler not to assume that floating-point operations are associative (which -ffast-math does).
For more information about this, please read the post What does gcc’s ffast-math actually do?.
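To make the Kahan remark concrete, here is a minimal sketch of compensated summation. The function name kahan_sum is hypothetical and not part of the original question. The line computing c is algebraically zero, so a compiler that is allowed to treat floating-point addition as associative may simplify it away, which silently turns the function back into a plain sum.

```cpp
#include <vector>

// Kahan (compensated) summation: carries the rounding error of each
// addition forward in `c` so it can be corrected in the next step.
// The function name is illustrative only.
float kahan_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    float c = 0.0f;  // running compensation for lost low-order bits
    for (float x : vec) {
        float y = x - c;    // apply the correction from the previous step
        float t = sum + y;  // low-order bits of y may be lost here
        // Algebraically (t - sum) - y == 0. With IEEE rounding it recovers
        // the bits lost above. Under -ffast-math the compiler may fold it to 0.
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```

For example, with the input {1e8f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f} and strict IEEE single-precision arithmetic, a plain left-to-right loop stays at 1e8f because each 1.0f is below half the spacing between neighbouring floats at that magnitude, while kahan_sum returns 100000008.0f.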

The main reason why the code is not automatically vectorized without -ffast-math is that floating-point addition is not associative (i.e. (a+b)+c is not always equal to a+(b+c) once rounding is taken into account). Because of that, the compiler cannot reorder the long chain of floating-point additions: each vaddss in the scalar loop depends on the result of the previous one. Note that there is a flag specifically meant to change this behaviour (-fassociative-math), but it is often not sufficient on its own to auto-vectorize the code (this is the case here). One needs to use a combination of flags (a subset of -ffast-math) to enable auto-vectorization, and the exact combination depends on the target compiler (and possibly its version).
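The effect of reordering can be seen without any intrinsics. The sketch below uses two hypothetical helpers: sum_in_order adds left to right like normal_sum, and sum_in_lanes imitates the order an 8-wide vectorized loop would use (eight independent partial sums, combined at the end). It assumes strict IEEE single-precision arithmetic, i.e. it is compiled without -ffast-math.

```cpp
#include <cstddef>
#include <vector>

// Adds the elements strictly left to right, as the source of normal_sum specifies.
float sum_in_order(const std::vector<float>& vec) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}

// Adds the elements the way an 8-lane vectorized loop would: element i goes
// into accumulator i % 8, and the 8 partial sums are combined at the end.
float sum_in_lanes(const std::vector<float>& vec) {
    float lane[8] = {};
    for (std::size_t i = 0; i < vec.size(); ++i) {
        lane[i % 8] += vec[i];
    }
    float sum = 0.0f;
    for (float partial : lane) {
        sum += partial;
    }
    return sum;
}
```

Take a 16-element input whose element 0 is 1e8f, element 1 is 1.0f, element 8 is -1e8f, and all others are 0.0f. sum_in_order returns 0.0f, because 1e8f + 1.0f rounds back to 1e8f before the -1e8f arrives. sum_in_lanes returns 1.0f, because 1e8f and -1e8f land in the same lane and cancel exactly. The two orders give different answers, so the compiler may not switch from one to the other unless a flag tells it that such differences are acceptable.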

Note that on Clang, a simple architecture-independent way to vectorize the code is to use OpenMP. To do that, add the line #pragma omp simd before the loop and compile with -fopenmp-simd. See on Godbolt. Unfortunately, this solution does not work yet on GCC (AFAIK this is because the OpenMP SIMD pragmas are ignored by the GCC backend optimization steps so far).
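A minimal sketch of the OpenMP variant could look like the following. The function name omp_simd_sum is hypothetical, and the reduction(+:sum) clause is an assumption on my part: the text above only mentions #pragma omp simd, but in OpenMP a loop that accumulates into a shared scalar normally needs the reduction clause so that each SIMD lane may keep its own partial sum. Whether the loop is actually vectorized depends on the compiler and its version, so it is worth checking the generated assembly.

```cpp
#include <cstddef>
#include <vector>

// Same loop as normal_sum, annotated for OpenMP SIMD.
// Compile with -fopenmp-simd (in addition to -O3 -march=native);
// without that flag the pragma is ignored and the loop stays scalar.
float omp_simd_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    // reduction(+:sum) permits per-lane partial sums that are combined
    // after the loop, i.e. it explicitly allows the additions to be reordered.
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}
```

Because the pragma allows reordering, the result may differ in the last bits from normal_sum on inputs where rounding matters, in the same way as with -ffast-math, but the relaxation is limited to this one loop instead of the whole translation unit.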
