
cuobjdump emits no PTX arithmetic instruction

Why doesn’t cuobjdump emit the PTX mul instruction for the code below? Has nvcc optimized the cubin output itself? Is the result calculated at compile time? If so, for this simplest case nvcc could reasonably optimize further and emit no device-side instructions at all.

mul.cu

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void mul(float *res) {
    float x = 11.1, y = 22.2;
    *res = x * y;
}

int main() {
    float *res;
    cudaMallocManaged(&res, sizeof(float));
    mul<<<1, 1>>>(res);
    cudaDeviceSynchronize();
    printf("11.1 * 22.2 = %f\n", *res);
}

Problem

$ nvcc mul.cu -o mul

$ ./mul

11.1 * 22.2 = 246.420013

$ cuobjdump -fun mul -ptx ./mul

...

.visible .entry _Z3mulPf(
.param .u64 _Z3mulPf_param_0
)
{
.reg .b32 %r<2>;
.reg .b64 %rd<3>;


ld.param.u64 %rd1, [_Z3mulPf_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r1, 1131834246;
st.global.u32 [%rd2], %r1;
ret;

}

Solution:

Is the result calculated at compile-time?

Yes.

The compiler can observe that the result will always be 11.1 × 22.2, so it simply stores that precomputed value into the result location (when the float bit pattern is viewed as a decimal integer, it is 1131834246, the immediate in the mov.u32 above).

If you want to see the mul instruction, make the multiplication input values be kernel arguments:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void mul(float *res, float x, float y) {
    *res = x * y;
}

int main() {
    float *res;
    cudaMallocManaged(&res, sizeof(float));
    mul<<<1, 1>>>(res, 11.1, 22.2);
    cudaDeviceSynchronize();
    printf("11.1 * 22.2 = %f\n", *res);
}
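With the operands passed as kernel parameters, the multiplication can no longer be folded at compile time, so cuobjdump should now show a mul.f32 in the kernel body. The PTX would look roughly like this (an illustrative sketch; register numbering and surrounding instructions vary by nvcc version and architecture):

```
.visible .entry _Z3mulPfff(
.param .u64 _Z3mulPfff_param_0,
.param .f32 _Z3mulPfff_param_1,
.param .f32 _Z3mulPfff_param_2
)
{
ld.param.u64 %rd1, [_Z3mulPfff_param_0];
ld.param.f32 %f1, [_Z3mulPfff_param_1];
ld.param.f32 %f2, [_Z3mulPfff_param_2];
cvta.to.global.u64 %rd2, %rd1;
mul.f32 %f3, %f1, %f2;
st.global.f32 [%rd2], %f3;
ret;
}
```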

If so, for this simplest case nvcc could reasonably optimize further and emit no device-side instructions at all.

Any valid optimization must preserve the same effect on global state. In this case it would probably be acceptable to replace the kernel launch with a cudaMemcpy-style operation (or, since this is managed memory, perhaps simply a host-side memcpy or store), but I don’t think the compiler ever attempts that sort of optimization.
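Written by hand, that hypothetical optimization would amount to the following (a sketch of the equivalent program, not something nvcc actually produces): because the allocation is managed, the host can store the precomputed value directly, with no kernel launch at all.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    float *res;
    cudaMallocManaged(&res, sizeof(float));
    // Host writes the constant-folded value straight into managed memory;
    // no kernel launch, no device instructions.
    *res = 11.1f * 22.2f;
    printf("11.1 * 22.2 = %f\n", *res);
    cudaFree(res);
}
```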
