
CUDA error: device-side assert triggered on torchrun DDP #10101


Open
Sayan-m90 opened this issue Mar 5, 2025 · 2 comments

@Sayan-m90

🐛 Describe the bug

Hello,
I am getting a "CUDA error: device-side assert triggered" from global_mean_pool, to the point that I cannot:

  1. print the variable
  2. detach it and save it as a tensor to inspect
  3. wrap it in a try/except and just skip the batch; the whole program still crashes

Some additional observations:

  4. this only happens on larger datasets (>15k samples), and only after running for a good 3-4 epochs
  5. there does not seem to be a NaN in the dataset
  6. once triggered, the error carries over to subsequent batches, so I cannot skip past it

The follow-up error to this is:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [51,0,0], thread: [45,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
(the same assertion repeats for threads [45..56,0,0] of block [51,0,0] and threads [14..26,0,0] of block [50,0,0])

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/script_helper.py", line 303, in forward
    apool = global_mean_pool(ah, abinput.x_s_batch)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/nn/pool/glob.py", line 63, in global_mean_pool
    return scatter(x, batch, dim=dim, dim_size=size, reduce='mean')
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/utils/_scatter.py", line 53, in scatter
    dim_size = int(index.max()) + 1 if index.numel() > 0 else 0
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/script_helper.py", line 313, in forward
    tt = torch.max(abinput.x_s_batch).to(device='cpu')
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
  File "/package/molclass-0.1.1.dev618/molclass/cubes/scripts/multi_gpu_train_regressor_module_script.py", line 109, in batch_train
    batch = batch.to(device)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 360, in to
    return self.apply(
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 340, in apply
    store.apply(func, *args)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/storage.py", line 201, in apply
    self[key] = recursive_apply(value, func)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/storage.py", line 895, in recursive_apply
    return func(data)
  File "/home/floeuser/miniconda/envs/user_env/lib/python3.9/site-packages/torch_geometric/data/data.py", line 361, in <lambda>
    lambda x: x.to(device=device, non_blocking=non_blocking), *args)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing
I checked the bounds of the batch index and they seem to be in range, so I am not sure where the index-out-of-bounds comes from.
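For reference, this is roughly the bounds check I ran before the pooling call (a simplified sketch; the variable names come from the stack trace above, and expected_num_graphs is just a stand-in for the number of graphs in the batch):

# move the pooling index to CPU first so the check itself cannot hit the sticky CUDA error
idx = abinput.x_s_batch.detach().cpu()
print(tuple(idx.shape), int(idx.min()), int(idx.max()))
assert idx.min() >= 0, 'negative batch index'
assert idx.max() < expected_num_graphs, 'batch index out of range'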
Also note that this DOES NOT happen if I do not pass a DistributedSampler, or if I turn shuffling off for the data. Here is how I set that up:

sampler_tr = DistributedSampler(train_pair_graph, num_replicas=world_size,
                                shuffle=ncclAttributes.shuffle,
                                drop_last=True)
sampler_vl = DistributedSampler(val_pair_graph, num_replicas=world_size,
                                shuffle=ncclAttributes.shuffle,
                                drop_last=True)
if verbose:
    print('samplers loaded')
    time.sleep(1)
ptr = GDL(train_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'], sampler=sampler_tr)
pvl = GDL(val_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'], sampler=sampler_vl)
pts = GDL(test_pair_graph, batch_size=ncclAttributes.batch_size,
          num_workers=ncclAttributes.num_workers, shuffle=not ncclAttributes.shuffle,
          pin_memory=False, follow_batch=['x_s'])
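For reference, the standard pattern I compared this against (a sketch based on the PyTorch DistributedSampler docs, not my actual code; it assumes GDL above is the PyG torch_geometric.loader.DataLoader, and rank / num_epochs are placeholders):

from torch.utils.data.distributed import DistributedSampler
from torch_geometric.loader import DataLoader

sampler_tr = DistributedSampler(train_pair_graph, num_replicas=world_size,
                                rank=rank, shuffle=True, drop_last=True)
loader_tr = DataLoader(train_pair_graph, batch_size=ncclAttributes.batch_size,
                       num_workers=ncclAttributes.num_workers,
                       shuffle=False,                 # the sampler handles shuffling
                       follow_batch=['x_s'], sampler=sampler_tr)

for epoch in range(num_epochs):
    sampler_tr.set_epoch(epoch)                       # reshuffle deterministically every epoch
    for batch in loader_tr:
        ...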

I can try to compile a dataset that reproduces this, but that is difficult. I was wondering if you have noticed this bug before, especially on larger (>10k) datasets.

Versions

NA

Sayan-m90 added the bug label Mar 5, 2025
@akihironitta
Member

Have you had a chance to run it with CUDA_LAUNCH_BLOCKING=1 as the error suggests?
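For what it's worth, the variable has to be in place before any CUDA work happens, e.g. exported in the environment you launch torchrun from, or set at the very top of the entry script, roughly like:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # must be set before any CUDA kernels are launched

import torch                               # import torch only after the variable is set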

@Sayan-m90
Author

Sayan-m90 commented Mar 10, 2025

OK, I think I may have found the cause. CUDA_LAUNCH_BLOCKING=1 pinned the error to global_mean_pool, but was not more insightful than that.

The tensors going into the pooling function were shaped (n, 1) and (n,). With most PyTorch functions, a (n, 1) tensor is either reshaped to (n,) or an error is raised. With global_mean_pool, however, the (n, 1) batch vector ran fine when the sampler shuffling was off, but threw the CUDA error with shuffling on.
In my opinion, it should either fail with a clear shape-mismatch message or not fail at all; I was not expecting a CUDA error for this.
I applied torch.squeeze() to reshape both tensors to (n,) and that seems to have fixed the issue.
It would be great if you could add a safeguard to this function in the future.
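Roughly what the fix looks like (a simplified, self-contained sketch with made-up shapes, not my actual code):

import torch
from torch_geometric.nn import global_mean_pool

x = torch.randn(6, 8)                                  # node features for 6 nodes
batch = torch.tensor([[0], [0], [0], [1], [1], [1]])   # shape (n, 1): what I was passing in

batch = batch.squeeze(-1)                              # the fix: make the index 1-D, shape (n,)
out = global_mean_pool(x, batch)                       # pools cleanly into shape (2, 8)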
