I am attempting to assign individual layers to separate GPUs in order to conserve memory. However, the Model.to_gpu function takes an all-or-nothing approach, which prevents this from working.
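For concreteness, here is a minimal sketch of the kind of per-layer placement I'm trying to achieve (the two-layer chain and the layer sizes are illustrative, not from my actual model):

```python
from thinc.api import chain, Relu, Softmax

model = chain(Relu(nO=512, nI=512), Softmax(nO=10, nI=512))
model.initialize()

# The goal: put each sublayer on its own device to spread memory usage.
# In practice, because CupyOps never sets or uses device_id (see below),
# everything still ends up running through GPU 0 regardless of these calls.
model.layers[0].to_gpu(0)
model.layers[1].to_gpu(1)
```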
While diagnosing the origin of a memory access error during training (cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered), I noticed that CupyOps.device_id is never used or set.
Ideally, all the CupyOps operations would run inside a cp.cuda.Device(device_id) context, but that is not the case. Instead, the xp attribute is (ab)used in many places. That tries to run everything through GPU 0, so errors won't appear until something is moved to another GPU.
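A minimal sketch of the device-guard pattern I mean, assuming a hypothetical CupyOps subclass (the wrapper class, and the choice of alloc as the example method, are mine and not part of thinc):

```python
import cupy as cp
from thinc.api import CupyOps

class DeviceGuardedCupyOps(CupyOps):
    """Hypothetical ops subclass that pins allocations to self.device_id."""

    def alloc(self, shape, *, dtype="float32"):
        # Without an explicit device context, cupy allocates on whichever
        # device is current (GPU 0 by default), regardless of which device
        # this ops instance is supposed to manage.
        with cp.cuda.Device(self.device_id):
            return super().alloc(shape, dtype=dtype)

# Hypothetical usage: ops = DeviceGuardedCupyOps(device_id=1)
```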
Two other difficulties are the initialization step, which doesn't allocate memory on the right devices, and the finish_update step, where the optimizer does arithmetic on parameters outside of a device context.
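To illustrate the finish_update concern, here is a sketch assuming a hypothetical helper that runs the optimizer step inside the device context holding the parameters (thinc does not provide this helper):

```python
import cupy as cp

def finish_update_on_device(model, optimizer, device_id):
    # Run the optimizer's parameter arithmetic on the device that actually
    # holds the parameters, rather than whichever device happens to be current.
    with cp.cuda.Device(device_id):
        model.finish_update(optimizer)
```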
Thanks for reporting this issue! We currently only support using a single GPU with require_gpu(gpu_id=N), but multi-GPU support is on our todo list. Of course PRs to improve multi-GPU support are welcome!
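For reference, the currently supported single-GPU setup looks like this (the gpu_id value is illustrative):

```python
from thinc.api import require_gpu

require_gpu(gpu_id=1)  # route all ops and allocations to this single device
```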