Hi, could you please share the use case for exclusive mode with a Spark cluster? It seems quite difficult to work around if only a single process is allowed to access the GPU.
No real use case, actually. I had not realized that it only works in default mode.
Another strange behavior is that processes 3981172, 3981171, and 3981167 (these should be the Spark executor processes) first ran on GPUs 1, 2, and 3, and then all three processes were accessing GPU 0 instead of GPUs 1, 2, and 3. I am not sure whether this is expected behavior. You can see it in the processes section of the screenshot.
I tried setting GPU 1 to default mode, and the process still tried to access different GPUs.
I don't think Spark or XGBoost takes GPU "modes" into consideration when allocating or accessing GPUs, and it's unlikely we will try to check the admin settings of the GPUs.
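Since the compute mode is not checked during allocation, one option is to check it yourself before submitting a job. Below is a minimal sketch, not part of XGBoost or Spark, that reads the compute mode of each GPU through `nvidia-smi --query-gpu=index,compute_mode`. The function names are hypothetical, and the sketch assumes `nvidia-smi` is on the `PATH` of the node being checked.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' CSV rows into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of every GPU on this node."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_modes(result.stdout)


def non_default_gpus(modes):
    """Return the sorted indices of GPUs whose mode is not 'Default'."""
    return sorted(index for index, mode in modes.items() if mode != "Default")
```

Calling `non_default_gpus(query_compute_modes())` on a worker node returns the GPUs that are in `Exclusive_Process` or `Prohibited` mode, which can serve as a pre-flight warning before launching the training job.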
xgboost4j-spark-gpu training fails on a multi-GPU node with EXCLUSIVE_PROCESS mode
Environment
Failure logs
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
Observed: processes on GPUs 1, 2, and 3 were also accessing GPU 0
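The observation above was read from the processes section of an `nvidia-smi` screenshot. As a rough sketch of how to confirm it without a screenshot, the helper below groups the rows of `nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader` by PID and reports any process that holds a context on more than one GPU. The function name is hypothetical, and the UUID strings in the usage note are placeholders rather than values from this report.

```python
from collections import defaultdict


def pids_on_multiple_gpus(csv_text):
    """Map each PID seen on more than one GPU to its sorted GPU UUIDs.

    csv_text is the output of:
      nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader
    """
    gpus_by_pid = defaultdict(set)
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid, gpu_uuid = (field.strip() for field in line.split(",", 1))
        gpus_by_pid[int(pid)].add(gpu_uuid)
    # Keep only the processes that appear on two or more devices.
    return {
        pid: sorted(uuids)
        for pid, uuids in gpus_by_pid.items()
        if len(uuids) > 1
    }
```

For example, given the rows `3981172, GPU-bbbb` and `3981172, GPU-aaaa`, the helper returns `{3981172: ["GPU-aaaa", "GPU-bbbb"]}`. In `EXCLUSIVE_PROCESS` mode, an attempt to open a second device that another process already owns would be expected to fail with an error such as the `cudaErrorDevicesUnavailable` shown in the failure logs.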