Misc. bug: Vulkan backend with 7900XTX has severe performance dropoff at some batch sizes #10966
Comments
It's very difficult to implement efficient small-batch kernels. For speculative decoding, your best bet is to increase the min draft size and keep the draft prob high:
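For illustration only: in recent llama.cpp builds the speculative-decoding knobs are exposed as --draft-min, --draft-max, and --draft-p-min (for example via llama-server). The draft model and the values below are placeholders, not the exact snippet from this comment:
llama-server -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -md <small-draft-model>.gguf -ngl 99 -ngld 99 --draft-min 8 --draft-max 16 --draft-p-min 0.9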
This should still give you some speed-up for low-entropy generations.
Given that even at batch size 8 it's performing worse at token generation, that 'draft-min' should be even higher than 8, right? Especially given that it's unlikely all tokens in a long draft sequence will be accepted. I understand that it's very difficult to optimize small-batch kernels, and that performance can actually go down compared to the non-batched case due to overhead, but an almost 4x drop in performance going from non-batched to batch size 2 sounds like a bug or a major bottleneck somewhere, right?
Just confirmed my own thoughts. With the speculative setup using those settings I am seeing just over 30 tokens/sec, and the quality of the draft looked good, as expected from the settings used. But a simple llama-cli run of the same model generates at above 33 tokens/sec.
The Vulkan backend has two paths for matrix multiplication: a matrix-vector multiply for when N=1, and a matrix-matrix multiply for N>1. The matrix-matrix multiply is really aimed at larger matrices and doesn't do well with small N. We should be able to do better by adapting the matrix-vector multiply to handle a few vectors at a time. I can look into this soon, but we should probably let #10846 land first to avoid conflicts.
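For intuition, here is a rough CPU-side sketch of that idea in plain C. This is not the actual ggml-vulkan shader code; the function name, memory layout, and the limit of 8 columns are all illustrative. The point is that each weight of the matrix is loaded once and reused for all n input vectors, which is what makes a "multi-vector" matrix-vector kernel cheaper than the full matrix-matrix path for small N:

/* Conceptual sketch, not the real ggml-vulkan kernel: compute Y[j] = A * X[j]
 * for a small number n of input vectors, reading each element of A only once. */
void mul_mat_vec_small_n(const float *A,  /* rows x cols weight matrix, row-major */
                         const float *X,  /* n x cols input vectors, row-major    */
                         float *Y,        /* n x rows output vectors, row-major   */
                         int rows, int cols, int n) {
    for (int r = 0; r < rows; ++r) {
        float acc[8] = {0.0f};                      /* n is assumed to be <= 8     */
        for (int c = 0; c < cols; ++c) {
            const float a = A[r * cols + c];        /* one load of the weight...   */
            for (int j = 0; j < n; ++j) {
                acc[j] += a * X[j * cols + c];      /* ...reused for all n vectors */
            }
        }
        for (int j = 0; j < n; ++j) {
            Y[j * rows + r] = acc[j];
        }
    }
}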
Awesome, thanks for your explanation! Let me know when you start working on this; I'd be happy to help benchmark certain setups to see what works best. I know it's only applicable to the 7900 XTX, but maybe it's useful data for you.
@jeffbolznv Will give feedback in that PR to keep the discussion in one place |
Name and Version
[docker@a242c844efbf ~]$ llama-cli-vulkan --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
version: 4384 (14b699e)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Problem description & steps to reproduce
llama-batched-bench-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8,16 -pps
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
build: 4384 (14b699e) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
I understand that scaling at some batch sizes might be less than ideal, but at worst I would expect small regressions if no scaling can be achieved at all (due to the overhead of batched processing). Right now, for batch sizes 2 and 4 especially, there is a massive performance loss. Can anything be done to improve this situation? Poor batched performance unfortunately makes speculative decoding on the Vulkan backend unusable.
First Bad Commit
No response
Relevant log output
No response