Allows you to run llama.cpp with ROCm acceleration on most Radeon RX Vega, 5000, 6000, and 7000 series GPUs, even those not on AMD's official ROCm supported GPU list.
This is a Linux container which builds llama.cpp with ROCm support and uses llama-swap to serve models.
I just put this together from other people's work listed below.
Reddit comment noting that the Debian Bookworm Backports kernel contains the ROCm kernel interface and that Debian Trixie contains the userspace:
Instructions on the Debian AI mailing list for compiling llama.cpp with ROCm:
llama.cpp - efficient CPU and GPU LLM inference server:
llama-swap - OpenAI-compatible server to serve models and swap/proxy inference servers:
Linux with the amdgpu driver's ROCm interface enabled. Debian Bookworm Backports, Debian Trixie/Sid, and Ubuntu 24.04 ship with this already done. For other distros you might need to use the amdgpu-install script from the AMD website.
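A quick way to check that the kernel side is in place is to look for the device nodes the amdgpu driver creates; /dev/kfd only appears when the ROCm (KFD) compute interface is enabled:

# /dev/kfd is the ROCm compute interface; /dev/dri/renderD* are the GPU render nodes
ls -l /dev/kfd /dev/dri/renderD*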
If using Debian or Ubuntu, make sure your GPU is on the Debian ROCm supported GPU list in Trixie/Sid. The Bookworm Backports kernel has the same support level as Trixie.
Add your user to the video and render groups on your system: usermod -aG video,render "$USER". Log out and log in again. Confirm with the groups command.
Look up your GPU in the LLVM amdgpu targets and replace my gfx1010 in the Containerfile with your GPU's architecture name.
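If you happen to have the ROCm userspace tools installed on the host, rocminfo reports the gfx target directly; otherwise look your card up in the LLVM AMDGPU documentation. As an illustration only (gfx1031 below is the RX 6700 XT target, not necessarily yours):

# print the first gfx target rocminfo reports (requires ROCm userspace on the host)
rocminfo | grep -m1 gfx
# then substitute your target for gfx1010 in the Containerfile, e.g. for an RX 6700 XT:
sed -i 's/gfx1010/gfx1031/' Containerfile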
Build the container:
podman build . -t rocswap
Deploy the container:
podman run -dit -p 8080:8080 --name rocswap \
-v ./models:/models \
-v ./config.yaml:/config.yaml \
--device /dev/dri --device /dev/kfd \
--group-add keep-groups \
--user 1000:1000 \
rocswap
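The config.yaml mounted above is the llama-swap configuration that maps model names to llama-server commands. Here is a minimal sketch only: the model names, file names, and port are placeholders, the llama-server invocation depends on how the Containerfile installs it, and the exact schema depends on your llama-swap version, so check the llama-swap README.

cat > config.yaml <<'EOF'
models:
  "gemma-2-2b":
    proxy: "http://127.0.0.1:9090"
    cmd: >
      llama-server --port 9090
      -m /models/gemma-2-2b-it-Q4_K_M.gguf
      -ngl 99
  "llama-3.1-8b":
    proxy: "http://127.0.0.1:9090"
    cmd: >
      llama-server --port 9090
      -m /models/Llama-3.1-8B-Instruct-Q6_K_L.gguf
      -ngl 24
EOF

The -ngl values here are only examples; how to choose them is covered below.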
If your models are smaller than your VRAM (minus about 1 GiB for other allocations), you can keep -ngl 99 in the server config to load all layers on the GPU.
If you are running a model larger than your GPU's VRAM, use the llama-swap llama.cpp log output (http://localhost:8080/logs) and the radeontop command-line program to find how many layers you can load with the llama.cpp -ngl option without overflowing VRAM. The remaining layers will run on the CPU.
For example, I have a Radeon RX 5600 XT with 6 GB of VRAM. I can fully load small models like Gemma-2-2B-it or Phi-3.5-mini-instruct (4B) on the GPU. To run a larger model like Llama-3.1-8B-Q6KL, I can only load 24 of the model's 33 layers, so I use -ngl 24.
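To see how far you can push -ngl, it helps to watch VRAM while a model loads. One way to do that (the model name must match a key in your config.yaml, such as the hypothetical llama-3.1-8b entry sketched above):

# terminal 1: live VRAM usage
radeontop
# terminal 2: trigger a model load through llama-swap's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "hello"}]}'
# then read http://localhost:8080/logs and adjust -ngl in config.yaml until VRAM is nearly full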