How do I copy memory from CPU to GPU using CUDA C++?

I want to run my work on the GPU instead of using CPU threads, but I'm not really sure how to do that. I tried doing something like this:

int data_array = readfile();
int array_size = data_array.size();
int iterations = 25;
vector<person> result_array;
run_on_GPU<<<8, 32>>>(data_array, result_array, array_size, iterations);
cudaDeviceSynchronize();

for (int i = 0; i < result_array.size(); i++) {
    if (result_array[i] == condition) break;

    output_file << result_array[i].encoded << endl;
}

This is roughly what I want. I tried using ChatGPT, but the code still didn't run.

The program did not work and I got something like this:

CUDA Error: invalid argument at launch.
Error in file <secret :)> at line 48: cudaDeviceSynchronize() returned error 11 (cudaErrorInvalidConfiguration)

Solution:

It seems you forgot to allocate memory on the GPU before launching the kernel. A kernel can only work with device memory, so host containers such as a vector cannot be passed to it directly.
You should first do something like this.
DataClass and ResultClass are just example types; use your own instead.

DataClass* device_entries = NULL;
ResultClass* device_results = NULL;

cudaMalloc(&device_entries, entry_count * sizeof(DataClass));
cudaMalloc(&device_results, entry_count * sizeof(ResultClass));

Here entry_count is the number of elements in your data array.
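
Every CUDA runtime call returns a cudaError_t, so it can help to check it right at the allocation instead of finding out at the kernel launch. Below is a minimal sketch of one way to do that. CUDA_CHECK, buffer_bytes and alloc_device are hypothetical names made up for this example; they are not part of CUDA or of the code above.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): prints the error string
// and exits if a runtime call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: bytes needed for `count` elements of type T.
template <typename T>
std::size_t buffer_bytes(std::size_t count) {
    return count * sizeof(T);
}

// Hypothetical helper: allocates room for `count` elements of T on the
// device and returns the device pointer. Release it with cudaFree.
template <typename T>
T* alloc_device(std::size_t count) {
    assert(count > 0);
    T* ptr = nullptr;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&ptr), buffer_bytes<T>(count)));
    return ptr;
}
```

With this, the allocation above could be written as DataClass* device_entries = alloc_device<DataClass>(entry_count); and a failed allocation would be reported with its file and line.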

After that you need to copy the actual array to the GPU using these lines:

cudaMemcpy(device_entries, &entries[0], entry_count * sizeof(DataClass), cudaMemcpyHostToDevice);
cudaMemset(device_results, 0, entry_count * sizeof(ResultClass));
cudaDeviceSynchronize();

As the name suggests, cudaMemcpyHostToDevice copies memory from the host (the CPU) to the device (the GPU). The cudaMemset call zeroes the result buffer. We will use cudaMemcpy again later, but in the other direction.

For the kernel launch, block_count and block_size are your own choice, but block_size should be a multiple of 32 (the warp size).
The iteration_count variable is the number of entries each thread will process. How you compute it is up to you, but you can use something like this:

int iteration_count = entry_count / (block_count * block_size) + 1;
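
To make the arithmetic concrete: with 1000 entries, 8 blocks and 32 threads per block there are 256 threads, so the formula gives 1000 / 256 + 1 = 4 entries per thread. The sketch below puts that formula next to a plain ceiling division for comparison. Both function names are hypothetical, and the ceiling variant is an alternative I am adding, not something from the answer.

```cuda
#include <cassert>
#include <cstddef>

// The formula as written in the answer: integer division plus one.
// It always covers every entry, but yields one more iteration than
// needed when entry_count is an exact multiple of the thread count.
inline std::size_t iterations_plus_one(std::size_t entry_count,
                                       std::size_t block_count,
                                       std::size_t block_size) {
    assert(block_count > 0 && block_size > 0);
    return entry_count / (block_count * block_size) + 1;
}

// Alternative (not from the answer): exact ceiling division.
inline std::size_t iterations_ceil(std::size_t entry_count,
                                   std::size_t block_count,
                                   std::size_t block_size) {
    assert(block_count > 0 && block_size > 0);
    const std::size_t threads = block_count * block_size;
    return (entry_count + threads - 1) / threads;
}
```

Either version works as long as the kernel checks its index against entry_count, because the last threads may be handed indices past the end of the array.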

The kernel launch will then look something like this:

run_on_GPU<<<block_count, block_size>>>(device_entries, device_results, entry_count, iteration_count);
cudaDeviceSynchronize();
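
Since the error in the question showed up at the launch, it may be worth checking the launch explicitly. A kernel launch returns nothing itself, so a common pattern is to call cudaGetLastError() right after it and then look at the return value of cudaDeviceSynchronize(). The following is only a sketch: noop_kernel, cuda_ok and launch_checked are hypothetical names used for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, here only so there is something to launch.
__global__ void noop_kernel() {}

// Hypothetical helper: returns true on cudaSuccess, otherwise prints the
// CUDA error string and returns false.
inline bool cuda_ok(cudaError_t err, const char* what) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Launches the kernel and reports both kinds of failure.
inline bool launch_checked(unsigned block_count, unsigned block_size) {
    assert(block_count > 0 && block_size > 0);
    noop_kernel<<<block_count, block_size>>>();
    // Problems with the launch itself (for example an invalid grid or
    // block size) are reported by cudaGetLastError().
    if (!cuda_ok(cudaGetLastError(), "kernel launch")) return false;
    // Problems that occur while the kernel runs surface at the next
    // synchronizing call.
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

Checking the two calls separately tells you whether the launch configuration was rejected or the kernel failed while running.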

The run_on_GPU kernel should have a signature like this:

__global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count, int count)
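
The answer only gives the signature, so here is one possible body, assuming each thread handles a contiguous chunk of count entries. The struct layouts, process and chunk_begin are hypothetical stand-ins for illustration; replace them with your own types and per-entry work.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical placeholder types; substitute your own.
struct DataClass   { int value; };
struct ResultClass { int doubled; };

// Hypothetical per-entry work, callable from both host and device.
__host__ __device__ inline ResultClass process(const DataClass& entry) {
    return ResultClass{ entry.value * 2 };
}

// Index of the first entry owned by a given thread.
__host__ __device__ inline std::size_t chunk_begin(std::size_t thread_id, int count) {
    return thread_id * static_cast<std::size_t>(count);
}

__global__ void run_on_GPU(DataClass* entries, ResultClass* results,
                           std::size_t entry_count, int count) {
    // Global thread index across all blocks.
    const std::size_t thread_id =
        static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    const std::size_t begin = chunk_begin(thread_id, count);
    const std::size_t end = begin + static_cast<std::size_t>(count);
    // The i < entry_count test keeps the last threads from reading or
    // writing past the end of the arrays.
    for (std::size_t i = begin; i < end && i < entry_count; ++i) {
        results[i] = process(entries[i]);
    }
}
```

A grid-stride loop is a common alternative to contiguous chunks and usually gives better memory access patterns, but the chunked form matches the count parameter used here.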

To get the results from the GPU, you have to copy the memory back from device to host:

ResultClass* results = (ResultClass*)malloc(entry_count * sizeof(ResultClass));
cudaMemcpy(results, device_results, entry_count * sizeof(ResultClass), cudaMemcpyDeviceToHost);

Before ending the program, also don't forget to free the memory you allocated:

free(results);
cudaFree(device_entries);
cudaFree(device_results);