CUDA: number of elements is larger than the number of assigned threads

November 12, 2021

I am new to CUDA programming.
What happens if the number of elements is larger than the number of threads?

Take this simple vector_add example:

__global__
void add(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) 
        y[i] = x[i] + y[i];
}

Say the number of array elements is 10,000,000, and we call this function using 64 blocks and 256 threads per block:

int n = 10000000;      // 10,000,000 elements
int grid_size = 64;    // number of blocks
int block_size = 256;  // threads per block

Then only 64*256 = 16384 threads are launched. What would happen to the rest of the array elements?

Solution:

"What would happen to the rest of the array elements?"

Nothing at all. They wouldn’t be touched and would remain unchanged. The kernel only writes to y, so x never changes in any case. Each of the 64*256 = 16384 threads handles exactly one index, so y[0..16383] would hold the result of the vector add, and y[16384..9999999] would keep their original values.
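One way to check this is to run the kernel from the question and count how many elements of y actually changed. The following is only a minimal sketch: the helper name count_updated and the fill values (1.0f for x, 2.0f for y) are assumptions made for illustration, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Kernel from the question: each thread handles at most one element.
__global__ void add(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}

// Hypothetical helper (not part of the question): fills x with 1.0f and
// y with 2.0f, launches add(), and returns how many elements of y ended
// up equal to 3.0f, i.e. how many were actually updated.
int count_updated(int n, int grid_size, int block_size)
{
    assert(n > 0 && grid_size > 0 && block_size > 0);

    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    add<<<grid_size, block_size>>>(n, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    int updated = 0;
    for (int i = 0; i < n; ++i)
        if (hy[i] == 3.0f)
            ++updated;
    return updated;
}
```

Under these assumptions, count_updated(10000000, 64, 256) should return 16384, one element per launched thread, with the remaining elements still holding 2.0f.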

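For this reason (to conveniently handle arbitrary data set sizes independent of the chosen grid size), people sometimes suggest a grid-stride-loop kernel design, in which each thread loops over the array and advances by the total number of threads in the grid on each iteration.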
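A grid-stride version of the kernel from the question might look like the sketch below. The names add_grid_stride and add_on_device are hypothetical, chosen for illustration, and error checking of the CUDA calls is omitted to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid until it runs past n.
__global__ void add_grid_stride(int n, const float *x, float *y)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical host helper: computes y = x + y on the device for host
// vectors of equal length, using whatever launch configuration is given.
void add_on_device(const std::vector<float> &x, std::vector<float> &y,
                   int grid_size, int block_size)
{
    assert(x.size() == y.size());
    int n = static_cast<int>(y.size());
    size_t bytes = y.size() * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    add_grid_stride<<<grid_size, block_size>>>(n, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```

With 64 blocks of 256 threads and 10,000,000 elements, the stride is 16384, so each thread handles roughly 610 elements and every element of y should be updated. The same kernel also works unchanged if the grid is larger than the array, because the loop condition i < n then acts like the bounds check in the original kernel.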