Why does a Torch error "Assertion `srcIndex < srcSelectDimSize` failed" only appear while training on GPU but not on CPU?

I’m trying to follow this tutorial to build a seq2seq translation model with PyTorch:
Pytorch-seq2seq

Everything works fine when I train my model on the CPU: training completes, evaluation runs, and I get good results.

However, the moment I switch to the GPU, I get this error while evaluating the first batch:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [179,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  File "train.py", line 496, in <module>
    valid_loss = evaluate(model, valid_iterator, criterion)
  File "train.py", line 459, in evaluate
    for i, batch in enumerate(iterator):
  File "/home/miniconda3/envs/torch_env/lib/python3.6/site-packages/torchtext/legacy/data/iterator.py", line 160, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/home/miniconda3/envs/torch_env/lib/python3.6/site-packages/torchtext/legacy/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/home/miniconda3/envs/torch_env/lib/python3.6/site-packages/torchtext/legacy/data/field.py", line 231, in process
    tensor = self.numericalize(padded, device=device)
  File "/home/miniconda3/envs/torch_env/lib/python3.6/site-packages/torchtext/legacy/data/field.py", line 353, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
RuntimeError: CUDA error: device-side assert triggered

I searched through Stack Overflow and googled around, but the only answers I found say that the embedding dimensions must be wrong, or that I can rerun on the CPU to find the line where the error occurs. However, as I mentioned, training on the CPU finishes without any errors and the model trains and evaluates, so I don’t think there is anything wrong with the code itself.

Does anyone have any pointers as to what I can do?

>Solution:

The assertion `srcIndex < srcSelectDimSize` means an index-out-of-bounds error occurred on the GPU: some index passed to a lookup kernel (most commonly a token ID fed to an nn.Embedding layer) is greater than or equal to the size of the dimension being indexed. In a torchtext pipeline this typically means the numericalized data contains token IDs that do not fit the vocabulary size the embedding layer was built with, for example because the vocabulary and the model's input/output dimensions were constructed from different fields or splits. Note that the traceback points at numericalize only because CUDA reports errors asynchronously; the operation that actually failed is usually somewhere inside the model's forward pass.
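You can verify that hypothesis directly on the CPU. The following is a minimal sketch, not code from the tutorial: it assumes your batches expose src and trg attributes as plain tensors (if your field uses include_lengths=True, take the first element of the tuple) and that SRC and TRG are your torchtext fields; adjust the names to your setup.

def check_indices(iterator, src_vocab_size, trg_vocab_size):
    # Scan every batch and make sure no token ID falls outside the
    # embedding tables; an ID >= vocab size is exactly what trips the
    # `srcIndex < srcSelectDimSize` assertion on the GPU.
    for i, batch in enumerate(iterator):
        src_max = batch.src.max().item()
        trg_max = batch.trg.max().item()
        assert src_max < src_vocab_size, \
            "batch %d: src ID %d >= vocab size %d" % (i, src_max, src_vocab_size)
        assert trg_max < trg_vocab_size, \
            "batch %d: trg ID %d >= vocab size %d" % (i, trg_max, trg_vocab_size)

# Run it on CPU-built iterators, e.g.:
# check_indices(valid_iterator, len(SRC.vocab), len(TRG.vocab))

If an assertion fires here, the fix is to rebuild the model with the vocabulary sizes of the exact fields used to numericalize the data.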

You could also try lowering the batch size when training on the GPU, and check whether the model or the embeddings are too large to fit in GPU memory, shrinking them if necessary. Keep in mind, though, that exhausting GPU memory normally raises an explicit "CUDA out of memory" error rather than this assertion, so treat this as a secondary check.
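If you do want to experiment with a smaller batch size, it is just a constructor argument on the iterators. A sketch, assuming the legacy BucketIterator API that appears in your traceback and dataset variables named as in the tutorial:

from torchtext.legacy.data import BucketIterator

BATCH_SIZE = 32  # try something smaller than your current value

# Rebuild the iterators with the reduced batch size; `device` is
# assumed to be your torch.device("cuda") from the tutorial.
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)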

You can also try setting the CUDA_LAUNCH_BLOCKING environment variable to 1 before executing your script. This forces CUDA to wait for each kernel to complete before launching the next one, so the Python stack trace points at the operation that actually failed instead of a later, unrelated call.
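For example, set it at the very top of train.py, before CUDA is initialized:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous
import torch  # import torch only after the variable is set

# Equivalently, set it in the shell when launching the script:
#   CUDA_LAUNCH_BLOCKING=1 python train.py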

Finally, you can try upgrading PyTorch to the latest version to see if that addresses the problem; indexing bugs in older releases are occasionally fixed in newer ones.
