Does anyone have an example where _mm256_stream_load_si256 (non-temporal load to bypass the cache) actually improves performance?

Consider massively SIMD-vectorized loops over very large amounts of floating-point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming", i.e. cache-bypassing) loads and stores. Using a non-temporal store (_mm256_stream_ps) does significantly improve throughput, by about 25% over a plain store (_mm256_store_ps). However, I could not measure any difference when using _mm256_stream_load instead…
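
To make the comparison concrete, here is a minimal sketch of the kind of loop in question. It is not the code from the post: the kernel (scale_stream), the driver (run_demo), the buffer size, and the scale factor are all made up for illustration. It assumes GCC or Clang on x86-64, 32-byte-aligned buffers, and a length that is a multiple of 8 floats.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel: dst[i] = k * src[i], read with a non-temporal load
 * and written with a non-temporal store.
 * Assumptions: src and dst are 32-byte aligned, n is a multiple of 8. */
__attribute__((target("avx2")))
void scale_stream(const float *src, float *dst, size_t n, float k)
{
    assert((uintptr_t)src % 32 == 0 && (uintptr_t)dst % 32 == 0);
    assert(n % 8 == 0);

    const __m256 vk = _mm256_set1_ps(k);
    for (size_t i = 0; i < n; i += 8) {
        /* Non-temporal load (VMOVNTDQA, AVX2). It only exists for integer
         * vectors, so the result is reinterpreted as 8 floats. As far as I
         * know, the hint is only documented to matter for write-combining
         * (WC) memory; on ordinary write-back memory it may behave like a
         * plain aligned load. */
        __m256i raw = _mm256_stream_load_si256((const __m256i *)(src + i));
        __m256 v = _mm256_mul_ps(_mm256_castsi256_ps(raw), vk);

        /* Non-temporal store (VMOVNTPS): writes around the cache. */
        _mm256_stream_ps(dst + i, v);
    }
    /* Order the streaming stores before any later reads of dst. */
    _mm_sfence();
}

/* Hypothetical driver: returns 0 on success (or if AVX2 is unavailable),
 * non-zero on allocation failure or a wrong result. */
int run_demo(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return 0; /* nothing to demonstrate on this CPU */

    enum { N = 1024 };
    float *src = aligned_alloc(32, N * sizeof(float));
    float *dst = aligned_alloc(32, N * sizeof(float));
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }

    for (size_t i = 0; i < N; ++i)
        src[i] = (float)i;

    scale_stream(src, dst, N, 2.0f);

    int bad = 0;
    for (size_t i = 0; i < N; ++i)
        if (dst[i] != 2.0f * (float)i)
            bad = 1;

    free(src);
    free(dst);
    return bad;
}
```

Swapping _mm256_stream_load_si256 for _mm256_load_si256, or _mm256_stream_ps for _mm256_store_ps, gives the plain variants to time against. A buffer this small only checks correctness; any measurable difference would need data far larger than the last-level cache.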