Misc. bug: Inconsistent Vulkan segfault #10528

Open · RobbyCBennett opened this issue Nov 26, 2024 · 26 comments

@RobbyCBennett commented Nov 26, 2024

Name and Version

library 531cb1c (gguf-v0.4.0-2819-g531cb1c2)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Problem description & steps to reproduce

  1. Compile the program below
  2. Run it a thousand times; it will probably hit a segmentation fault at least once. I used the gdb debugger to capture backtraces.

Simple program:

#include "llama.h"

static void handleLog(enum ggml_log_level level, const char *text, void *user_data) {}

int main(int argc, char **argv)
{
  llama_log_set(handleLog, 0);

  char path[] = "/your-path-to/llama.cpp/models/ggml-vocab-llama-bpe.gguf";
  struct llama_model_params params = llama_model_default_params();
  struct llama_model *model = llama_load_model_from_file(path, params);
  llama_free_model(model);

  return 0;
}

Shell script to run the program several times:

#! /bin/sh

PROGRAM=llama-bug
LOG=debug.log
COUNT=1000

rm -f "$LOG"

for i in `seq 1 $COUNT`; do
	gdb -batch -ex run -ex bt "$PROGRAM" >> "$LOG" 2>> "$LOG"
done

First Bad Commit

No response

Relevant log output

GDB output from crash caused by /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ggml_vulkan: Compiling shaders..............................Done!

Thread 3 "[vkrt] Analysis" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe35a8640 (LWP 1789333)]
0x00007fffeff1cb00 in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01
#0  0x00007fffeff1cb00 in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01
#1  0x00007ffff0246f1d in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01
#2  0x00007fffeff1fcfa in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01
#3  0x00007ffff7a1dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff7aaf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

GDB output from crash with unknown cause

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ggml_vulkan: Compiling shaders..............................Done!

Thread 3 "[vkrt] Analysis" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe35a8640 (LWP 1750868)]
0x00007fffeff1cb00 in ?? ()
#0  0x00007fffeff1cb00 in ?? ()
#1  0x000000006746139a in ?? ()
#2  0x0000000002a1b0d8 in ?? ()
#3  0x0000000067461399 in ?? ()
#4  0x00000000000e6817 in ?? ()
#5  0x00005555561076c0 in ?? ()
#6  0x00007fffeff1ef10 in ?? ()
#7  0x0000000000000000 in ?? ()
@jeffbolznv
Collaborator

This might be a driver bug. Can you try the latest drivers?

I think there's also a chance this could be caused by the ggml-vulkan backend not destroying the VkDevice/VkInstance before the process is terminated. That's something we should look into fixing.

@slaren
Collaborator

slaren commented Nov 27, 2024

I think there's also a chance this could be caused by the ggml-vulkan backend not destroying the VkDevice/VkInstance before the process is terminated.

We may need to add a function to destroy a backend and release all its resources; otherwise, calling ggml_backend_unload to unload a dynamically loaded backend may result in a leak.

@jeffbolznv
Collaborator

IMO issue #10420 is also a question about the object model for ggml backends, i.e. should it be possible for each thread to have its own VkInstance/VkDevice and what ggml/llama object should their lifetime be tied to.

@0cc4m
Collaborator

0cc4m commented Nov 27, 2024

This might be a driver bug. Can you try the latest drivers?

I think there's also a chance this could be caused by the ggml-vulkan backend not destroying the VkDevice/VkInstance before the process is terminated. That's something we should look into fixing.

I think this is a bug I often observe on Linux, but only on Nvidia. It happens when exiting the application, so it looks like a clean-up issue. I haven't looked into it yet.

IMO issue #10420 is also a question about the object model for ggml backends, i.e. should it be possible for each thread to have its own VkInstance/VkDevice and what ggml/llama object should their lifetime be tied to.

I've tried restructuring the global state in the way I think CUDA handles it in the background, but it's not done yet. It should be possible to keep the device and instance global as long as all temporary variables stay attached to the backend instance. Command buffers are probably not the only thing where that hasn't been implemented yet.

@0cc4m
Collaborator

0cc4m commented Nov 29, 2024

This is how the crash looks for me. It happens in some Nvidia driver thread after all the ggml code has already exited:

Thread 7 "[vkps] Update" received signal SIGSEGV, Segmentation fault.

This is the thread:

* 7    Thread 0x7fffd56006c0 (LWP 683442) "[vkps] Update"   0x00007fffe5401960 in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.565.57.01

This is the stack trace:

#0  0x00007fffe5401960 in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.565.57.01
#1  0x00007fffe57392b4 in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.565.57.01
#2  0x00007fffe5404dfa in ?? () from /lib/x86_64-linux-gnu/libnvidia-eglcore.so.565.57.01
#3  0x00007ffff729ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#4  0x00007ffff7329c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

@jeffbolznv Do you happen to know what the [vkps] Update thread is? I don't know what is/isn't getting cleaned up in a way to cause the Nvidia driver to segfault. No other driver shows this issue.

Interestingly @RobbyCBennett saw it in one of the other Nvidia threads ([vkrt] Analysis).

Resolving this has gotten a little more important since @LostRuins reported that a crash on exit on Windows in certain cases causes a system crash (BSOD). Might be the same cause.

@0cc4m
Collaborator

0cc4m commented Nov 29, 2024

I hacked together a commit (335f48a) where the devices and Vulkan instance get cleaned up properly (at least I think so; the validation layers didn't print anything), but the Nvidia driver still segfaults.

@jeffbolznv
Collaborator

They're both threads created by the driver. I'm pretty sure they should get shut down when the VkDevice is destroyed.

I hacked together a commit (335f48a) where the devices and Vulkan instance get cleaned up properly

Is this relying on the static destructor for the vk_instance pointer? I think that may happen too late. Is there a hook where we can destroy the objects before the process is terminated?

@slaren
Collaborator

slaren commented Nov 29, 2024

Is there a hook where we can destroy the objects before the process is terminated?

Not at the moment. I may add destructors for the backend_device and backend_reg objects in the future, but these would still rely on a static destructor to be called normally when exiting the application. I understand that static destructors can be risky due to the order of destruction, but I am not sure why that should be a problem for the Vulkan driver. I would very much prefer to avoid adding a function to shut down ggml unless it is absolutely necessary.
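
As a generic C++ illustration (not ggml code) of why the order can matter: statics that live in different shared libraries have no defined destruction order relative to each other, so a static destructor that calls into another library may run after that library's state is already gone.

// Generic C++ sketch, not ggml code. "Library" stands in for a driver/loader
// with global state; "Backend" stands in for code whose static destructor
// calls into it.
#include <cstdio>

struct Library {
    bool alive = true;
    ~Library() { alive = false; }
};

struct Backend {
    Library *lib;
    ~Backend() {
        // Fine within one translation unit (reverse construction order), but
        // across shared libraries nothing guarantees the library is still
        // alive, or even still mapped, at this point.
        std::puts(lib->alive ? "clean shutdown" : "too late: library already gone");
    }
};

static Library g_lib;
static Backend g_backend{&g_lib};

int main() { return 0; }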

@RobbyCBennett
Author

I had some system problems when updating the driver, but I finally got some results. I still see segmentation faults with the new driver. I haven't tried the commit 335f48a yet.

The different types of seg faults I got:

  • Thread 5 "[vkps] Update" received signal SIGSEGV, Segmentation fault.
  • Thread 3 "[vkrt] Analysis" received signal SIGSEGV, Segmentation fault.
  • Thread 5 received signal SIGSEGV, Segmentation fault.

Here's some more system information if that helps at all:

  • Ubuntu 22 before and Ubuntu 24 now
  • NVIDIA RTX 4090

@0cc4m
Collaborator

0cc4m commented Nov 29, 2024

I would very much prefer to avoid adding a function to shut down ggml unless it is absolutely necessary.

It's probably not absolutely necessary, since this issue only appears on Nvidia. Their driver should handle this gracefully.

@jeffbolznv
Collaborator

@0cc4m or @RobbyCBennett, can you try adding a call into ggml-vulkan to destroy the VkDevice and VkInstance right before dlclose is called in unload_backend?

I may add destructors for the backend_device and backend_reg objects in the future, but these would still rely on a static destructor to be called normally when exiting the application.

I suspect (not sure) it's OK to invoke the cleanup from a static destructor in the main executable (or ggml.so?), as long as it runs before ggml-vulkan or the Vulkan driver libraries have been unloaded.

Linux doesn't give a good way to have this entirely self-contained in the Vulkan driver or in ggml-vulkan. I think some kind of call from ggml is needed.
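
As a rough sketch of the idea (the ggml_vk_cleanup name is hypothetical; no such public hook exists today), the cleanup could be driven by a guard object whose destructor runs while ggml-vulkan and the driver libraries are still loaded:

// Hypothetical sketch only: ggml_vk_cleanup() is a made-up name standing in for
// whatever hook ggml-vulkan would expose to destroy its VkDevice/VkInstance.
// The key point is that it has to run before ggml-vulkan (and the driver
// libraries it pulls in) is unloaded, e.g. right before dlclose in unload_backend.
extern "C" void ggml_vk_cleanup(void); // stubbed below so the sketch compiles

struct vk_cleanup_guard {
    // A static guard in the main executable (or in ggml.so) is destroyed before
    // the dynamic loader unloads the libraries it depends on, which is the
    // ordering this approach relies on.
    ~vk_cleanup_guard() { ggml_vk_cleanup(); }
};

static vk_cleanup_guard g_vk_cleanup_guard;

// Stub for illustration; a real implementation would live in ggml-vulkan.
extern "C" void ggml_vk_cleanup(void) { /* vkDestroyDevice / vkDestroyInstance */ }

int main() { return 0; }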

@RobbyCBennett
Author

I don't know this code and I haven't made any commits to this project. I don't see any VkDevice or VkInstance types. Maybe @0cc4m can make this change.

@RobbyCBennett
Author

Here's an idea for a potential workaround: provide a way to specify a preferred backend. For my example, if CUDA is available, use CUDA; otherwise use Vulkan. I don't currently see a way to specify a preferred backend. Maybe it could look like the following.

enum llama_specific_backend_type {
    LLAMA_SPECIFIC_BACKEND_TYPE_CUDA,
    LLAMA_SPECIFIC_BACKEND_TYPE_VULKAN,
    // others...
};

const llama_specific_backend_type PREFERRED_BACKENDS[] = {
    LLAMA_SPECIFIC_BACKEND_TYPE_CUDA,
    LLAMA_SPECIFIC_BACKEND_TYPE_VULKAN,
};

int main()
{
  llama_set_backend(PREFERRED_BACKENDS, sizeof(PREFERRED_BACKENDS) / sizeof(PREFERRED_BACKENDS[0]));
}

@slaren
Collaborator

slaren commented Dec 2, 2024

You can set the devices that you want to use in llama_model_params::devices, but I don't see how that's related to Vulkan crashing.

@RobbyCBennett
Author

I don't have any crashes on CUDA, so selecting CUDA instead of Vulkan at runtime would prevent crashing in Vulkan with Nvidia. It wouldn't actually fix Vulkan crashing. It would just be a workaround.

@slaren
Collaborator

slaren commented Dec 2, 2024

If you build with GGML_BACKEND_DL enabled, then you can also use ggml_backend_load to load only the backend that you want to use.
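
For example, a minimal sketch (assuming a build with GGML_BACKEND_DL enabled; the shared library names and paths below are assumptions that depend on how the backends were built):

#include <stdio.h>

#include "ggml-backend.h"
#include "llama.h"

int main(void)
{
  // Try the CUDA backend first; fall back to Vulkan only if CUDA is unavailable.
  // The library paths are illustrative and depend on your build layout.
  ggml_backend_reg_t reg = ggml_backend_load("./libggml-cuda.so");
  if (reg == NULL) {
    reg = ggml_backend_load("./libggml-vulkan.so");
  }
  if (reg == NULL) {
    fputs("no GPU backend could be loaded\n", stderr);
  }

  // In a GGML_BACKEND_DL build the CPU backend is typically a separate library
  // as well and may need to be loaded the same way.

  // ... continue as before: llama_model_default_params(), llama_load_model_from_file(), ...
  return 0;
}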

@RobbyCBennett
Author

RobbyCBennett commented Dec 2, 2024

I looked into both options and llama_model_params::devices seems to be a good solution for me. Thanks for the help!

Here's a snippet of my workaround:

  // ... create the params
  #ifdef __linux__
    static ggml_backend_device *const sDevice = ggml_backend_dev_by_name("CUDA0");
    if (sDevice != nullptr) {
      static ggml_backend_dev_t sDevices[] = {sDevice, nullptr};
      params.devices = sDevices;
    }
  #endif
  // ... use the params

@slaren
Collaborator

slaren commented Dec 2, 2024

That should work, but if you don't intend to use the Vulkan backend at all, you can avoid loading it entirely by using GGML_BACKEND_DL and loading the backends dynamically. That should give you better compatibility and use fewer resources. Keep in mind that without it, the CUDA backend will fail to load if the driver is not installed, and that will stop your application from starting entirely.

Eventually this will become the standard in all the llama.cpp binary distributions.

@RobbyCBennett
Author

I still intend to use the Vulkan backend to support non-CUDA hardware like AMD. I'll keep that in mind. Thank you.

@0cc4m
Collaborator

0cc4m commented Dec 3, 2024

I'll look into it soon; I've been busy with #10597.

@jeffbolznv
Collaborator

I've borrowed a Linux system and reproduced this locally; I'll try to put together a fix.

@jeffbolznv
Collaborator

Unfortunately, I've been unable to reproduce this again after running for the rest of the day; I only ever saw it the one time. So I'm not sure this system will be very helpful for testing.

In the meantime, I looked at the destruction order on Windows. It looks like the Vulkan driver gets unloaded before any static destructors run in ggml, so by then it's too late to do any cleanup. So I don't think we can handle this automatically from, say, ~ggml_backend_registry.

@0cc4m
Collaborator

0cc4m commented Dec 29, 2024

@RobbyCBennett Can you try #10989? For me that fixed the segfault.

@RobbyCBennett
Author

With aa014d7 I have a consistent crash if the Vulkan backend is available in the test program on that same Linux system. This even happens if I only use the CUDA device.

Stack trace with Vulkan (caused by the destructor ~vk_instance_t):

Thread 1 "ai_test" received signal SIGSEGV, Segmentation fault.
0x00007fffb6a33de0 in ?? ()
#0  0x00007fffb6a33de0 in ?? ()
#1  0x00007ffff7e0a123 in ?? () from /lib/x86_64-linux-gnu/libvulkan.so.1
#2  0x00007fffee4e95bb in ?? () from /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
#3  0x00007ffff7e1dde5 in vkDestroyInstance () from /lib/x86_64-linux-gnu/libvulkan.so.1
#4  0x000055555594d850 in vk::DispatchLoaderStatic::vkDestroyInstance (pAllocator=0x0, instance=<optimized out>, this=<optimized out>) at /usr/include/vulkan/vulkan.hpp:995
#5  vk::Instance::destroy<vk::DispatchLoaderStatic> (d=..., allocator=..., this=<optimized out>) at /usr/include/vulkan/vulkan_funcs.hpp:94
#6  vk_instance_t::~vk_instance_t (this=<optimized out>, __in_chrg=<optimized out>) at /home/robby/sti/src/lib/llama/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:764
#7  std::default_delete<vk_instance_t>::operator() (this=<optimized out>, __ptr=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:99
#8  std::default_delete<vk_instance_t>::operator() (__ptr=<optimized out>, this=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:93
#9  std::unique_ptr<vk_instance_t, std::default_delete<vk_instance_t> >::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:404
#10 0x00007fffee847a66 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:108
#11 0x00007fffee847bae in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:138
#12 0x00007fffee82a1d1 in __libc_start_call_main (main=main@entry=0x5555555f9f20 <main(int, char**)>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe448) at ../sysdeps/nptl/libc_start_call_main.h:74
#13 0x00007fffee82a28b in __libc_start_main_impl (main=0x5555555f9f20 <main(int, char**)>, argc=1, argv=0x7fffffffe448, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe438) at ../csu/libc-start.c:360
#14 0x000055555561fa15 in _start ()

@0cc4m
Collaborator

0cc4m commented Dec 30, 2024

With aa014d7 I have a consistent crash if the Vulkan backend is available in the test program on that same Linux system. This even happens if I only use the CUDA device.

That's concerning. Do you have example code that triggers this crash?

@RobbyCBennett
Author

Yes. Here's my original example with the addition of changing params.devices to use only CUDA.

#include <stdio.h>

#include "llama.h"

static void handleLog(enum ggml_log_level level, const char *text, void *user_data) {}

int main(int argc, char **argv)
{
  llama_log_set(handleLog, 0);

  struct llama_model_params params = llama_model_default_params();

  // Only use CUDA if it's available
  static ggml_backend_device *const sDevice = ggml_backend_dev_by_name("CUDA0");
  if (sDevice != nullptr) {
    puts("Switching to CUDA");
    static ggml_backend_dev_t sDevices[] = {sDevice, nullptr};
    params.devices = sDevices;
  }
  else {
    puts("Not using CUDA");
  }

  char path[] = "/your-path-to/llama.cpp/models/ggml-vocab-llama-bpe.gguf";
  struct llama_model *model = llama_load_model_from_file(path, params);
  llama_free_model(model);

  return 0;
}
