Why does casting to uint64_t change the results in this code?

In the following code, why do the sum and cast_sum results diverge at 2^53? The results of pow(2, 53) and (uint64_t) pow(2, 53) appear the same, but when I sum them I get different results. My overall goal was to sum 2^0 through 2^63. I just don’t understand why using pow(2, i) fails and (uint64_t) pow(2, i) works, or why the results of the two differ starting at 2^53.

```c
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t sum = 0;
  uint64_t cast_sum = 0;
  /* i stops at 63: 2^64 does not fit in a uint64_t. */
  for (int i = 0; i < 64; ++i) {
    sum += pow(2, i);                  /* addition is done in double */
    cast_sum += (uint64_t) pow(2, i);  /* addition is done in uint64_t */
    printf("i: %d, 2^%d = %f, cast of 2^%d: %" PRIu64 ", sum: %" PRIu64 ", cast_sum:%" PRIu64 "\n",
           i, i, pow(2, i), i, (uint64_t) pow(2, i), sum, cast_sum);
  }
}
```

```
i: 52, 2^52 = 4503599627370496.000000, cast of 2^52: 4503599627370496, sum: 9007199254740991, cast_sum:9007199254740991
i: 53, 2^53 = 9007199254740992.000000, cast of 2^53: 9007199254740992, sum: 18014398509481984, cast_sum:18014398509481983
i: 54, 2^54 = 18014398509481984.000000, cast of 2^54: 18014398509481984, sum: 36028797018963968, cast_sum:36028797018963967
```

>Solution :


See C’s implicit conversions:

Otherwise, if one operand is double, double complex, or double imaginary (since C99), the other operand is implicitly converted as follows:
integer or real floating type to double

pow(2, i) is a double. sum += pow(2,i); thus converts sum to a double before adding; it is roughly equivalent to sum = (uint64_t) (((double) sum) + pow(2, i));.

The 2^53 is no coincidence. A 64-bit double (assuming the usual IEEE 754 format) has 52 explicit mantissa bits plus one implicit bit, so it can represent every integer exactly only up to 2^53. Once a value needs more than 53 significant bits, the least significant bits are rounded away. At i = 53, sum is 2^53 - 1 and the exact result of the addition, 2^54 - 1, needs 54 bits, so the floating-point addition rounds it to 2^54.

If instead you cast before adding, all is fine since you’re working in the realm of 64-bit integer addition. Note that floats, using an exponent-mantissa representation, can exactly represent even rather large powers of two. So pow(2, i) can be exact for your small i. This can and will then be exactly converted into the appropriate 64-bit unsigned integer.

You should use the bit shift 1ULL << i instead of pow(2, i) though. Depending on how pow is implemented, its result may not always be exact. Bit shifts stay entirely in integer arithmetic, so they are exact for every i from 0 to 63, and they are also cheaper than a call to pow.

If you want the bit pattern where all bits are ones, simply use ~0ULL.
