Why does casting to uint64_t change the results in this code?

July 26, 2023

In the following code, why do the sum and cast_sum results diverge at 2^53? The results of pow(2, 53) and (uint64_t) pow(2, 53) appear the same, but when I sum them I get different results. The overall goal was to sum 2^0 through 2^63. I just don’t understand why using pow(2, i) fails while (uint64_t) pow(2, i) works, or why the two results differ starting at 2^53.

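#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t sum = 0;
  uint64_t cast_sum = 0;
  /* 2^0 through 2^63 */
  for (int i = 0; i < 64; ++i) {
    sum += pow(2, i);                  /* pow result added without a cast */
    cast_sum += (uint64_t) pow(2, i);  /* pow result cast before adding */
    printf("i: %d, 2^%d = %lf, cast of 2^%d: %" PRIu64 ", sum: %" PRIu64 ", cast_sum:%" PRIu64 "\n",
           i, i, pow(2, i), i, (uint64_t) pow(2, i), sum, cast_sum);
  }
}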

i: 52, 2^52 = 4503599627370496.000000, cast of 2^52: 4503599627370496, sum: 9007199254740991, cast_sum:9007199254740991
i: 53, 2^53 = 9007199254740992.000000, cast of 2^53: 9007199254740992, sum: 18014398509481984, cast_sum:18014398509481983
i: 54, 2^54 = 18014398509481984.000000, cast of 2^54: 18014398509481984, sum: 36028797018963968, cast_sum:36028797018963967

>Solution :

See C’s implicit conversions:

Otherwise, if one operand is double, double complex, or double imaginary (since C99), the other operand is implicitly converted as follows:
integer or real floating type to double

pow(2, i) returns a double. sum += pow(2, i); therefore converts sum to a double before adding, and converts the result back on assignment; it is roughly equivalent to sum = (uint64_t) ((double) sum + pow(2, i));.

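The 2^53 is no coincidence. A 64-bit double has a 53-bit significand (52 stored mantissa bits plus one implicit bit), so 2^53 is the limit up to which every integer can be represented exactly. Once a value needs more than 53 significant bits, its least significant bits are lost to rounding, whether that happens when sum is converted to a double or when the result of the addition is rounded. At i = 53, sum is 2^53 - 1, which is still exact as a double, but the true result 2^54 - 1 needs 54 bits and is rounded to 2^54, which is the 18014398509481984 in your output.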

If you cast before adding instead, all is fine, because the addition is then done in 64-bit integer arithmetic. Floating-point numbers, with their exponent-mantissa representation, can exactly represent even rather large powers of two, so pow(2, i) can be exact for your small i. An exact, in-range value then converts to the corresponding 64-bit unsigned integer without loss.

You should use bit shifts 1ULL << i instead of pow(2, i) though. Depending on how pow is implemented, it may not always be exact. Bit shifts stay in integer arithmetic, so for i from 0 to 63 they are always exact, and they are typically cheaper than a call to pow.

If you want the bit pattern where all bits are ones, simply use ~0ULL.