Vulkan: Destroy Vulkan instance on exit #10989

Open: 0cc4m wants to merge 1 commit into master

0cc4m
Collaborator

0cc4m commented Dec 26, 2024

This seems to fix the Nvidia driver segfault on exit (#10528). If someone can reproduce it reliably, please test whether this resolves it.

It's a little more hacky than I had hoped, but I couldn't think of a better way to check when the instance can be destroyed.

Basically, the backend now counts backend contexts and backend (device) buffer allocations. Whenever a backend or buffer is destroyed, it checks whether any are left; if not, it destroys the instance. This seems to happen early enough to avoid the issue with the Nvidia driver. Usually it should happen on a backend unload or a model unload, whichever happens last.
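
The counting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `InstanceLifetime`, `acquire`, `release`, and the `destroy_instance` callback are hypothetical names, and the callback merely stands in for whatever actually destroys the devices and the Vulkan instance.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical helper (not part of ggml-vulkan): counts live backend
// contexts and device buffers, and runs a teardown callback when the
// last one is released.
class InstanceLifetime {
public:
    // destroy_instance is an assumed stand-in for the real teardown,
    // i.e. destroying the devices and then the Vulkan instance.
    explicit InstanceLifetime(std::function<void()> destroy_instance)
        : destroy_instance_(std::move(destroy_instance)) {}

    // Call when a backend context or a device buffer is created.
    void acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++live_objects_;
        instance_alive_ = true;
    }

    // Call when a backend context or a device buffer is destroyed.
    // Returns true if this call tore the instance down.
    bool release() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(live_objects_ > 0);
        if (--live_objects_ != 0) {
            return false;  // something still references the instance
        }
        instance_alive_ = false;
        destroy_instance_();  // runs under the lock; must not re-enter
        return true;
    }

    bool instance_alive() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return instance_alive_;
    }

private:
    mutable std::mutex mutex_;
    std::size_t live_objects_ = 0;
    bool instance_alive_ = false;
    std::function<void()> destroy_instance_;
};
```

With this shape, the teardown runs on whichever release comes last (backend unload or model unload), rather than in a static destructor at process exit.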

Let me know if you see any issues. I had to touch a lot of code to remove references to the devices that would otherwise prevent them, and thus the instance, from being destroyed.

github-actions bot added the Vulkan and ggml labels Dec 26, 2024

@jeffbolznv
Collaborator

Have you tried this on Windows? What I was seeing when I looked into this recently is that the static destructors in ggml-vulkan are run after the Vulkan driver is unloaded, so things crash if you try to call into the driver at all then. I was specifically seeing it crash when vkDestroyDevice is called.

@0cc4m
Collaborator Author

0cc4m commented Dec 26, 2024

> Have you tried this on Windows? What I was seeing when I looked into this recently is that the static destructors in ggml-vulkan are run after the Vulkan driver is unloaded, so things crash if you try to call into the driver at all then. I was specifically seeing it crash when vkDestroyDevice is called.

No, I don't have a Windows setup. But this does not rely on the static destructors at exit: it should destroy the devices and the instance once the controlling program has unloaded the backend and freed the device buffers (unloaded the model). At least for the examples I tested, this seems to fix the segfault on Linux.

@jeffbolznv
Collaborator

Cool! I fetched your change and I'm not seeing the crash - the backend is indeed freed before the process is terminated.

@jeffbolznv
Collaborator

I've had this running overnight on a Linux system and no crashes so far. But the previous repro rate was maybe once a day, so not definitive. I can keep running it for a couple more days.

@LostRuins
Collaborator

LostRuins commented Dec 28, 2024

Unfortunately, this did not solve the BSOD issue for me. Loading the models was fine, but upon unload I got a VIDEO_MEMORY_MANAGEMENT_INTERNAL BSOD again. As before, it only happens when I offload enough layers that available VRAM is nearly depleted (> 90% VRAM utilization); otherwise it does not BSOD.

(sorry for poor quality)

image

and here's BlueScreenView's info for this minidump

image

@jeffbolznv
Collaborator

IMO the VIDEO_MEMORY_MANAGEMENT_INTERNAL error is likely unrelated to the crash in driver threads.

@LostRuins
Collaborator

Do you think it's something that can be fixed on the llama.cpp side? Or is it a bug in the graphics driver? I'm running it in user mode without admin permissions, so it shouldn't be able to trigger a BSOD under normal circumstances, correct?

@jeffbolznv
Collaborator

> Do you think it's something that can be fixed on the llama.cpp side? Or is it a bug in the graphics driver? I'm running it in user mode without admin permissions, so it shouldn't be able to trigger a BSOD under normal circumstances, correct?

I'd guess it's one of: OS bug, kernel driver bug, or hardware failure (e.g. memory corruption).
