Why does the compiler not use SIMD instructions for a simple loop computing a sum, even when compiling with -O3 and -march=native?
Consider the following two functions:
#include <immintrin.h>
#include <vector>

float sum_simd(const std::vector<float>& vec) {
    __m256 a{0.0};
    // Note: assumes vec.size() is a multiple of 8
    for (std::size_t i = 0; i < vec.size(); i += 8) {
        __m256 tmp = _mm256_loadu_ps(&vec[i]);
        a = _mm256_add_ps(tmp, a);
    }
    float res{0.0};
    for (std::size_t i = 0; i < 8; ++i) {
        res += a[i];
    }
    return res;
}
float normal_sum(const std::vector<float>& vec) {
    float sum{0};
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}
The compiler seems to turn the summations into:
vaddps ymm0, ymm0, ymmword ptr [rax + 4*rdx]
and
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 4]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 8]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 12]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 16]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 20]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 24]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 28]
When I run this on my machine, I get a substantial speedup (roughly a factor of 10) from the SIMD sum. The same is true on Godbolt. See here for the code.
I compiled the program with GCC 13 and Clang 17 and used the options -O3 -march=native.
Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?
> Solution:

> Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?
Yes. -ffast-math solves this issue (see on Godbolt). Here is the main loop with this additional flag:
.L10:
vaddps ymm1, ymm1, YMMWORD PTR [rax] ; <---------- vectorized
add rax, 32
cmp rcx, rax
jne .L10
However, note that -ffast-math is a combination of several more specific flags, some of which can be quite dangerous. For example, -funsafe-math-optimizations and -ffinite-math-only can break existing code that uses infinities, or reduce its precision. In fact, some algorithms, such as Kahan summation, require the compiler not to assume floating-point operations are associative (which -ffast-math does).
For more information, please read the post What does gcc's ffast-math actually do?.
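To see why Kahan summation is fragile under -ffast-math, here is a sketch of the algorithm (a hypothetical helper, not from the question): its correctness depends on the compensation term `(t - sum) - y` *not* being algebraically simplified to zero, which is exactly the kind of rewrite associativity-based optimizations are allowed to do.

```cpp
#include <vector>

// Kahan (compensated) summation: tracks the rounding error of each
// addition in `c` and feeds it back into the next step.
float kahan_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (float x : vec) {
        float y = x - c;         // apply the previous correction
        float t = sum + y;       // low-order bits of y may be lost here
        c = (t - sum) - y;       // algebraically zero; numerically the lost part
        sum = t;                 // -ffast-math may fold c to 0 and defeat this
    }
    return sum;
}
```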
The main reason the code is not automatically vectorized without -ffast-math is simply that floating-point operations such as addition are not associative (i.e. (a+b)+c != a+(b+c) in general). Because of that, the compiler cannot reorder the long chain of floating-point additions. Note that there is a flag specifically meant to change this behaviour (-fassociative-math), but on its own it is often not sufficient to auto-vectorize the code (as is the case here). One needs to use a combination of flags (a subset of -ffast-math) to enable auto-vectorization, depending on the target compiler (and possibly its version).
Note that on Clang, a simple architecture-independent way to vectorize the code is to use OpenMP. To do that, you need to add the line #pragma omp simd before the loop and the compilation flag -fopenmp-simd. See on Godbolt. Unfortunately, this solution does not work on GCC yet (AFAIK this is because the OpenMP SIMD pragmas are ignored by the GCC backend optimization steps so far).
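Applied to the loop from the question, this looks roughly as follows (compile with clang++ -O3 -march=native -fopenmp-simd; the reduction clause makes the cross-iteration dependence on sum explicit to the pragma):

```cpp
#include <vector>
#include <cstddef>

float omp_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    // Declare the loop a SIMD reduction over `sum`; without -fopenmp-simd
    // the pragma is simply ignored and the code remains correct scalar code.
    #pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < vec.size(); ++i)
        sum += vec[i];
    return sum;
}
```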