Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Why is computing the histogram of a sorted array slower?

Consider this code on Godbolt

The goal is to benchmark a simple histogram function:

[[gnu::noinline]] static void histogram(int const *a, int n, int *h) {
  for (int i = 0; i < n; ++i)
    h[a[i]]++;
}

The benchmark measures the average time elapsed during the histogram invocation when the input is random data, and when the input is sorted.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

This is the output of this program on my machine:

g++ test.cpp -std=c++20 -O3 -march=native -o test.out && ./test.out 1000000 512 100 17
N (number of elements) = 1000000.
K (histogram size) = 512.
T (number of samples) = 100.
R (random seed) = 17.
time per element (random) : 0.751250 ns.
time per element (sorted) : 2.032259 ns.

I run this experiment on a 4-core sandy bridge intel processor i5-2320.
Why is the histogram algorithm more than 2 times slower when the input is sorted?

>Solution :

Storing/reloading the same element repeatedly creates a serial dependency chain that’s too long for out-of-order exec to fully overlap.

If the value-range isn’t too big, the random order probably still hits in cache for most increments and is only limited by throughput, not store-forwarding latency bottlenecks.

Some modern CPUs (Zen 2, Zen 4, and Ice Lake at least) have zero-latency store-forwarding in some cases, which may include this case where the index is reloaded with the same value. This might be why Godbolt runs show the sorted test being faster.

The benchmark warms up the CPU to turbo frequency with very slow rand() calls, and pre-zeros the arrays because they’re std::vector<int>, so the problems described in Idiomatic way of performance evaluation? aren’t happening here, I don’t think.


See Methods to vectorise histogram in SIMD? for a trick that helps: use multiple arrays of counts, and sum them at the end. (That part can use SIMD, the actual counting can’t unless you have AVX-512 for scatter/gather, and even then having multiple elements that map to the same bucket in the same vector is slower.)

This wouldn’t be necessary on CPUs where memory-renaming works to get zero-latency store-forwarding in this case, except on very wide CPUs that can do more than one load + memory-destination increment per clock.


If you know your data is sorted, you could count runs of the same element and do one += count, although that can easily lead to branch mispredicts unless you do something like _mm_cmpeq_epi32(load(ptr), set1(*ptr)) ; _mm_movemask_ps for a couple vectors and check that popcnt is < 8 elements, so runs shorter than that always branch the same way. (The actual compare could be on mask < 0xff, but you’d add the popcount. Or 31-std::countl_zero, or std::countr_zero(mask+1) so you can use tzcnt / bsf.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading