
_check_ctypes_error #72448


Open
YU-FAITH opened this issue Apr 24, 2025 · 3 comments

@YU-FAITH

Please ask your question

I get this error, but I cannot find where it occurs and do not know how to trace it. Paddle version is 2.6.2.

LAUNCH INFO 2025-04-24 13:14:14,333 ------------------------- ERROR LOG DETAIL -------------------------
info.func(*info.args, **(info.kwargs or {}))
File "/usr/local/lib/python3.9/dist-packages/numba/cuda/cudadrv/driver.py", line 1698, in core
dealloc.add_item(module_unload, handle)
File "/usr/local/lib/python3.9/dist-packages/numba/cuda/cudadrv/driver.py", line 1180, in add_item
self.clear()
File "/usr/local/lib/python3.9/dist-packages/numba/cuda/cudadrv/driver.py", line 1191, in clear
dtor(handle)
File "/usr/local/lib/python3.9/dist-packages/numba/cuda/cudadrv/driver.py", line 327, in safe_cuda_api_call
self._check_ctypes_error(fname, retcode)
File "/usr/local/lib/python3.9/dist-packages/numba/cuda/cudadrv/driver.py", line 395, in _check_ctypes_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuStreamDestroy results in CUDA_ERROR_LAUNCH_FAILED
[2025-04-24 13:14:00,143] [ WARNING] dataloader_iter.py:707 - DataLoader 5 workers exit unexpectedly, pids: 179206, 179222, 179238, 179288, 179336
I0424 13:14:03.881071 176447 process_group_nccl.cc:132] ProcessGroupNCCL destruct
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
what(): (External) CUDA error(719), unspecified launch failure.
[Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases can be found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.] (at /data/Eager/Paddle3/paddle/phi/backends/gpu/cuda/cuda_info.cc:296)


C++ Traceback (most recent call last):

0 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
1 paddle::distributed::ProcessGroupNCCL::ProcessGroupNCCL()
2 std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<phi::GPUContext, std::default_delete<phi::GPUContext> > >, std::allocator<std::pair<std::string const, std::unique_ptr<phi::GPUContext, std::default_delete<phi::GPUContext> > > >, std::__detail::_Select1st, std::equal_to<std::string >, std::hash<std::string >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()
3 phi::GPUContext::~GPUContext()
4 phi::GPUContext::~GPUContext()
5 phi::GPUContext::Impl::~Impl()


Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1745471643 (unix time) try "date -d @1745471643" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x2b13f) received by PID 176447 (TID 0x7fd85a095c00) from PID 176447 ***]

@liuruyan
Contributor

Hello, does the error occur with a custom model, or when using a suite library such as PaddleClas?

@YU-FAITH
Author

> Hello, does the error occur with a custom model, or when using a suite library such as PaddleClas?

It is a custom model. The main problem is that the error message does not say where the error happened, which is a real headache.
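
One common way to localize this kind of failure, sketched below with a placeholder training entry point: CUDA kernel launches are asynchronous, so the CUDA_ERROR_LAUNCH_FAILED above is only reported later, during cuStreamDestroy, far from the kernel that actually failed. Setting the standard CUDA environment variable CUDA_LAUNCH_BLOCKING=1 forces synchronous launches, so the error is raised at the call site of the operator that triggered it.

```python
# Sketch only: the model/training code is a placeholder. CUDA_LAUNCH_BLOCKING
# is a standard CUDA driver environment variable that makes kernel launches
# synchronous, so the launch that actually fails raises at its own call site
# instead of at a later stream teardown.
import os

# Must be set before the CUDA context is created, i.e. before importing paddle
# or touching any GPU API.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import paddle


def main():
    paddle.device.set_device("gpu")
    # ... build the model and run the training loop as usual; with blocking
    # launches, the CUDA_ERROR_LAUNCH_FAILED should now point at the specific
    # operator that triggered it rather than at cuStreamDestroy.


if __name__ == "__main__":
    main()
```

Running the script single-process first (without paddle.distributed.launch) also keeps the traceback from being interleaved across data loader workers.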

@liuruyan
Contributor

liuruyan commented Apr 29, 2025

Hello, could you provide information to reproduce the issue? We will arrange for colleagues to try to reproduce it:

  1. The machine environment where the error occurs (Windows/Linux/Mac), and whether Docker is used
  2. Python version and CUDA version (a collection sketch follows this list)
  3. A reproduction script
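
For item 2, a small sketch of collecting the requested versions from the affected environment, assuming the standard Paddle version helpers are available in this install:

```python
# Sketch for collecting the environment details requested above.
import sys

import paddle

print("Python :", sys.version)
print("Paddle :", paddle.__version__)     # 2.6.2 according to the report
print("CUDA   :", paddle.version.cuda())  # CUDA version Paddle was built with
print("cuDNN  :", paddle.version.cudnn())

# Quick sanity check of the local GPU installation.
paddle.utils.run_check()
```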
