Rust compiler not optimising lzcnt? (and similar functions)

What was done:

This came out of experimenting on Compiler Explorer to ascertain how the compiler (rustc) handles log2()/leading_zeros() and similar functions. I came across this result, which seems both exceedingly bizarre and concerning:

Compiler Explorer link

Code:

pub fn lzcnt0(val: u64) -> u64 {
    val.leading_zeros() as u64
}

pub unsafe fn lzcnt1(val: u64) -> u64 {
    core::arch::x86_64::_lzcnt_u64(val)
}

pub unsafe fn lzcnt2(val: u64) -> u64 {
    asm_lzcnt(val)
}

#[inline]
pub unsafe fn asm_lzcnt(val: u64) -> u64 {
    let lzcnt: u64;
    core::arch::asm!("lzcnt {}, {}", in(reg) val, lateout(reg) lzcnt, options(nomem, nostack));
    lzcnt
}

Output:

example::lzcnt0:
        test    rdi, rdi
        je      .LBB0_2
        bsr     rax, rdi
        xor     rax, 63
        ret
.LBB0_2:
        mov     eax, 64
        ret

example::lzcnt1:
        jmp     core::core_arch::x86_64::abm::_lzcnt_u64

core::core_arch::x86_64::abm::_lzcnt_u64:
        lzcnt   rax, rdi
        ret

example::lzcnt2:
        lzcnt   rdi, rax
        ret
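
One thing worth noting about the lzcnt2 output: on x86, Rust's asm! uses Intel syntax by default, where the destination operand comes first. With the positional template as posted, the input ends up in the destination slot. That matches the `lzcnt rdi, rax` in the listing, where rax is read before anything has written it. Below is a minimal sketch with named operands, assuming the CPU supports lzcnt. The names asm_lzcnt_fixed and lzcnt_checked are illustrative and not from the original code.

```rust
/// Counts leading zeros with a hand-written `lzcnt` (hypothetical helper).
///
/// # Safety
/// The caller must ensure the CPU supports `lzcnt`; on older CPUs the same
/// encoding executes as `bsr` and produces a different result.
#[cfg(target_arch = "x86_64")]
#[inline]
pub unsafe fn asm_lzcnt_fixed(val: u64) -> u64 {
    let count: u64;
    // Intel syntax: destination first, source second.
    unsafe {
        core::arch::asm!(
            "lzcnt {dst}, {src}",
            dst = lateout(reg) count,
            src = in(reg) val,
            options(pure, nomem, nostack),
        );
    }
    count
}

/// Safe wrapper (also hypothetical): takes the asm path only when `lzcnt`
/// is detected at runtime, otherwise falls back to `leading_zeros`.
pub fn lzcnt_checked(val: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("lzcnt") {
            // SAFETY: the runtime check above confirms `lzcnt` is available.
            return unsafe { asm_lzcnt_fixed(val) };
        }
    }
    val.leading_zeros() as u64
}
```

Named operands make the direction explicit in the template itself, so a swap like this is easier to spot than with two bare `{}` placeholders.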

The compiler options are chosen to best emulate cargo’s ‘release’ configuration (with opt-level=3 for good measure) and otherwise to push the compiler to optimise the functions as far as possible. The specific target shouldn’t matter as long as it is x86-64; I’ve tried x86_64-{pc-windows-{msvc,gnu},unknown-linux-gnu}.

What was expected:

All of these outputs should be identical to lzcnt2. According to instruction performance tables, lzcnt is evidently a fast instruction across the board and should be used, and having an unnecessary branch in such a low-level function is dismal. What’s weirder, the function _lzcnt_u64() calls leading_zeros() under the hood, which the compiler is happy to magic away there (there are no checks or asserts either), but it won’t seem to do so for the underlying function. What’s more, the compiler won’t inline the lzcnt instruction even in that case (the implementation marks the function as #[inline] too). Sure, a jmp isn’t as bad, but it’s entirely unnecessary and should be avoided.

What it could be:

  • Compiler bug?
  • Purposeful choice I don’t understand?
  • I don’t understand how to use Compiler Explorer properly?
  • Other?

I’m seeing similar results in functions like log2 and (I presume) others that rely on the Rust compiler’s ctlz intrinsic in their implementation.

If you understand compilers sufficiently, any clarification would be greatly appreciated. I don’t fancy writing loads of utility functions for little reason, but I’ll do so if there’s no better alternative.

P.S. If your answer is that the performance gain is negligible in most situations, or that I shouldn’t care for reasons of code quality or similar: I understand the sentiment, but that’s not the point of this question. I’m writing bare-metal, hot code in a personal project.

Solution:

Old x86-64 CPUs don’t support lzcnt, so rustc/LLVM won’t emit it by default. (lzcnt is encoded as bsr with a rep prefix, so those CPUs would execute it as bsr, but the behavior is not identical.)
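
To illustrate why the two instructions are not interchangeable, here is a plain-Rust model of their results. The function names are hypothetical, and None stands in for bsr's undefined destination on a zero input. It is a sketch of the semantics only, not of what rustc emits.

```rust
/// Models `bsr`: index of the highest set bit.
/// `None` stands in for the undefined destination when the input is zero.
pub fn bsr_model(val: u64) -> Option<u64> {
    if val == 0 {
        None
    } else {
        Some(val.ilog2() as u64)
    }
}

/// Models `lzcnt`: number of leading zero bits, defined as 64 for zero.
pub fn lzcnt_model(val: u64) -> u64 {
    val.leading_zeros() as u64
}

/// Mirrors the branchy fallback in the `lzcnt0` listing:
/// test for zero, otherwise `bsr` followed by `xor 63`.
pub fn lzcnt_via_bsr(val: u64) -> u64 {
    match bsr_model(val) {
        // For an index in 0..=63, `index ^ 63` equals `63 - index`.
        Some(index) => index ^ 63,
        None => 64,
    }
}
```

For a nonzero input the two results are related by `lzcnt = 63 - bsr`, so a CPU that silently runs bsr instead returns a different number. The zero case has no defined bsr result, which is why the fallback needs the extra branch.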

Use -C target-feature=+lzcnt to enable it.
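
If enabling the feature for the whole build is not an option, a per-function alternative is the #[target_feature] attribute combined with runtime detection. The following is a sketch under that assumption, with illustrative function names. It also suggests why lzcnt1 compiled to a jmp: _lzcnt_u64 is itself compiled with lzcnt enabled, and a function with an extra target feature generally cannot be inlined into a caller that lacks it.

```rust
/// Compiled with `lzcnt` enabled, so `leading_zeros` can lower to a single
/// instruction inside this function (hypothetical helper).
///
/// # Safety
/// Call only on CPUs that support `lzcnt`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "lzcnt")]
pub unsafe fn lzcnt_enabled(val: u64) -> u64 {
    val.leading_zeros() as u64
}

/// Safe entry point: checks for `lzcnt` at runtime and otherwise uses the
/// portable fallback, which is also the path taken on non-x86-64 targets.
pub fn leading_zeros_dispatch(val: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("lzcnt") {
            // SAFETY: the runtime check above confirms `lzcnt` is available.
            return unsafe { lzcnt_enabled(val) };
        }
    }
    val.leading_zeros() as u64
}
```

In hot code the detection is best hoisted out of the inner loop, for example by marking the whole loop function with the same attribute. For a cargo build, the flag from the answer can be passed through the RUSTFLAGS environment variable.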
