Why does the compiler not use SIMD instructions for a simple loop computing a sum, even when compiling with -O3 and -march=native?
Consider the following two functions:
#include <immintrin.h>
#include <vector>

float sum_simd(const std::vector<float>& vec) {
    __m256 a{0.0};
    // Note: assumes vec.size() is a multiple of 8
    for (std::size_t i = 0; i < vec.size(); i += 8) {
        __m256 tmp = _mm256_loadu_ps(&vec[i]);
        a = _mm256_add_ps(tmp, a);
    }
    float res{0.0};
    for (std::size_t i = 0; i < 8; ++i) {
        res += a[i];
    }
    return res;
}
float normal_sum(const std::vector<float>& vec) {
    float sum{0};
    for (std::size_t i = 0; i < vec.size(); ++i) {
        sum += vec[i];
    }
    return sum;
}
The compiler seems to turn the summations into:
vaddps ymm0, ymm0, ymmword ptr [rax + 4*rdx]
and
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 4]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 8]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 12]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 16]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 20]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 24]
vaddss xmm0, xmm0, dword ptr [rcx + 4*rsi + 28]
When I run this on my machine, I get a substantial speedup (roughly a factor of 10) from the SIMD sum. The same is true on Godbolt. See here for the code.
I compiled the program with GCC 13 and Clang 17 and used the options -O3 -march=native.
Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?
> Solution:

> Why is the function normal_sum slower and not fully vectorized? Do I need to specify additional compiler options?
Yes. -ffast-math solves this issue (see on Godbolt). Here is the main loop with this additional flag:
.L10:
vaddps ymm1, ymm1, YMMWORD PTR [rax] ; <---------- vectorized
add rax, 32
cmp rcx, rax
jne .L10
However, note that -ffast-math is a combination of several more specific flags, some of which can be quite dangerous. For example, -funsafe-math-optimizations and -ffinite-math-only can break existing code that uses infinities, or reduce its precision. In fact, some algorithms, such as Kahan summation, require the compiler not to assume floating-point operations are associative (which -ffast-math does).
For more information, please read the post What does gcc's ffast-math actually do?.
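To see why Kahan summation is fragile under -ffast-math, here is a sketch of the algorithm (a hypothetical helper, not from the question): its correctness depends on the compensation term `(t - sum) - y` *not* being algebraically simplified to zero, which is exactly the kind of rewrite associativity-based optimizations are allowed to do.

```cpp
#include <vector>

// Kahan (compensated) summation: tracks the rounding error of each
// addition in `c` and feeds it back into the next step.
float kahan_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (float x : vec) {
        float y = x - c;         // apply the previous correction
        float t = sum + y;       // low-order bits of y may be lost here
        c = (t - sum) - y;       // algebraically zero; numerically the lost part
        sum = t;                 // -ffast-math may fold c to 0 and defeat this
    }
    return sum;
}
```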
The main reason the code is not automatically vectorized without -ffast-math is simply that floating-point operations such as addition are not associative (i.e. (a+b)+c != a+(b+c) in general). Because of that, the compiler cannot reorder the long chain of floating-point additions. Note that there is a flag specifically meant to change this behaviour (-fassociative-math), but on its own it is often not sufficient to auto-vectorize the code (as is the case here). One needs to use a combination of flags (a subset of -ffast-math) to enable auto-vectorization, depending on the target compiler (and possibly its version).
Note that on Clang, a simple architecture-independent way to vectorize the code is to use OpenMP. To do that, you need to add the line #pragma omp simd before the loop and the compilation flag -fopenmp-simd. See on Godbolt. Unfortunately, this solution does not work on GCC yet (AFAIK this is because the OpenMP SIMD pragmas are ignored by the GCC backend optimization steps so far).
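Applied to the loop from the question, this looks roughly as follows (compile with clang++ -O3 -march=native -fopenmp-simd; the reduction clause makes the cross-iteration dependence on sum explicit to the pragma):

```cpp
#include <vector>
#include <cstddef>

float omp_sum(const std::vector<float>& vec) {
    float sum = 0.0f;
    // Declare the loop a SIMD reduction over `sum`; without -fopenmp-simd
    // the pragma is simply ignored and the code remains correct scalar code.
    #pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < vec.size(); ++i)
        sum += vec[i];
    return sum;
}
```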