
CUDA error: device-side assert triggered on torchrun DDP #10101


Open
Sayan-m90 opened this issue Mar 5, 2025 · 2 comments

@Sayan-m90

🐛 Describe the bug

Hello,
I am getting a "CUDA error: device-side assert triggered" from global_mean_pool, to the point that I cannot:

  1. print the variable
  2. detach it and save it as a tensor to inspect
  3. wrap it in a try/except and just skip the batch; the whole program still crashes

Some additional observations:

  4. this only happens on larger datasets (>15k samples), and only after running for a good 3-4 epochs
  5. there does not seem to be a NaN in the dataset
  6. once triggered, the error carries over to subsequent batches, so I cannot skip past it

The follow-up error to this is:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [51,0,0], thread: [45,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
(the same assertion repeats for threads [45..56,0,0] of block [51,0,0] and threads [14..26,0,0] of block [50,0,0])

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/script_helper.py", line 303, in forward
    apool = global_mean_pool(ah, abinput.x_s_batch)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/nn/pool/glob.py", line 63, in global_mean_pool
    return scatter(x, batch, dim=dim, dim_size=size, reduce='mean')
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/utils/_scatter.py", line 53, in scatter
    dim_size = int(index.max()) + 1 if index.numel() > 0 else 0
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/script_helper.py", line 313, in forward
    tt = torch.max(abinput.x_s_batch).to(device='cpu')
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/multi_gpu_train_regressor_module_script.py", line 109, in batch_train
    batch = batch.to(device)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 360, in to
    return self.apply(
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 340, in apply
    store.apply(func, *args)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/storage.py", line 201, in apply
    self[key] = recursive_apply(value, func)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/storage.py", line 895, in recursive_apply
    return func(data)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 361, in <lambda>
    lambda x: x.to(device=device, non_blocking=non_blocking), *args)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing
I checked the bounds of the batch index and they seem to be in range, so I am not sure where the index-out-of-bounds comes from.
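For reference, this is roughly the bounds check I ran before the pooling call (a simplified sketch; the variable names come from the stack trace above, and expected_num_graphs is just a stand-in for the number of graphs in the batch):

# move the pooling index to CPU first so the check itself cannot hit the sticky CUDA error
idx = abinput.x_s_batch.detach().cpu()
print(tuple(idx.shape), int(idx.min()), int(idx.max()))
assert idx.min() >= 0, 'negative batch index'
assert idx.max() < expected_num_graphs, 'batch index out of range'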
Also note that this DOES NOT happen if I do not pass a DistributedSampler, or if I turn shuffling off for the data. Here is how I set that up:

sampler_tr = DistributedSampler(train_pair_graph, num_replicas=world_size,
                                shuffle=ncclAttributes.shuffle,
                                drop_last=True)
sampler_vl = DistributedSampler(val_pair_graph, num_replicas=world_size,
                                shuffle=ncclAttributes.shuffle,
                                drop_last=True)
if verbose:
    print('samplers loaded')
    time.sleep(1)
ptr = GDL(train_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'], sampler=sampler_tr)
pvl = GDL(val_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'], sampler=sampler_vl)
pts = GDL(test_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'])
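For reference, the standard pattern I compared this against (a sketch based on the PyTorch DistributedSampler docs, not my actual code; it assumes GDL above is the PyG torch_geometric.loader.DataLoader, and rank / num_epochs are placeholders):

from torch.utils.data.distributed import DistributedSampler
from torch_geometric.loader import DataLoader

sampler_tr = DistributedSampler(train_pair_graph, num_replicas=world_size,
                                rank=rank, shuffle=True, drop_last=True)
loader_tr = DataLoader(train_pair_graph, batch_size=ncclAttributes.batch_size,
                       num_workers=ncclAttributes.num_workers,
                       shuffle=False,                 # the sampler handles shuffling
                       follow_batch=['x_s'], sampler=sampler_tr)

for epoch in range(num_epochs):
    sampler_tr.set_epoch(epoch)                       # reshuffle deterministically every epoch
    for batch in loader_tr:
        ...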

I can try to compile a dataset that reproduces this, but that is difficult. I was wondering if you have noticed this bug before, especially on larger (>10k) datasets.

Versions

NA

Sayan-m90 added the bug label Mar 5, 2025
@akihironitta
Member

Have you had a chance to run it with CUDA_LAUNCH_BLOCKING=1 as the error suggests?
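For what it's worth, the variable has to be in place before any CUDA work happens, e.g. exported in the environment you launch torchrun from, or set at the very top of the entry script, roughly like:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # must be set before any CUDA kernels are launched

import torch                               # import torch only after the variable is set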

@Sayan-m90
Author

Sayan-m90 commented Mar 10, 2025

OK, I think I may have found the cause. CUDA_LAUNCH_BLOCKING=1 pinned the error to global_mean_pool, but was not more insightful than that.

The tensors going into the pooling function were shaped (n, 1) and (n,). With most PyTorch functions, a (n, 1) tensor is either reshaped to (n,) or an error is raised. With global_mean_pool, however, the (n, 1) batch vector ran fine when the sampler shuffling was off, but threw the CUDA error with shuffling on.
In my opinion, it should either fail with a clear shape-mismatch message or not fail at all; I was not expecting a CUDA error for this.
I applied torch.squeeze() to reshape both tensors to (n,) and that seems to have fixed the issue.
It would be great if you could add a safeguard to this function in the future.
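Roughly what the fix looks like (a simplified, self-contained sketch with made-up shapes, not my actual code):

import torch
from torch_geometric.nn import global_mean_pool

x = torch.randn(6, 8)                                  # node features for 6 nodes
batch = torch.tensor([[0], [0], [0], [1], [1], [1]])   # shape (n, 1): what I was passing in

batch = batch.squeeze(-1)                              # the fix: make the index 1-D, shape (n,)
out = global_mean_pool(x, batch)                       # pools cleanly into shape (2, 8)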
