Misc. bug: Vulkan backend with 7900XTX has severe performance dropoff at some batch sizes #10966
Comments
It's very difficult to implement efficient small-batch kernels. For speculative decoding, your best bet is to increase the min draft size and keep the draft prob high:
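For illustration only: in recent llama.cpp builds the speculative-decoding knobs are exposed as --draft-min, --draft-max, and --draft-p-min (for example via llama-server). The draft model and the values below are placeholders, not the exact snippet from this comment:
llama-server -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -md <small-draft-model>.gguf -ngl 99 -ngld 99 --draft-min 8 --draft-max 16 --draft-p-min 0.9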
This should still give you some speed-up for low-entropy generations.
Given that even at batch size 8 it's performing worse at token generation, that 'draft-min' should be even higher than 8, right? Especially given that it's unlikely all tokens in a long draft sequence will be accepted. I understand that it's very difficult to optimize small-batch kernels, and that performance can actually go down compared to the non-batched case due to overhead, but an almost 4x drop in performance going from non-batched to batch size 2 sounds like a bug or a major bottleneck somewhere, right?
Just confirmed my own thoughts. With the speculative setup using those settings I am seeing just over 30 tokens/sec, and the quality of the draft looked good, as expected from the settings used. But a simple llama-cli run of the same model generates at above 33 tokens/sec.
The Vulkan backend has two paths for matrix multiplication: a matrix-vector multiply for when N=1, and a matrix-matrix multiply for N>1. The matrix-matrix multiply is really aimed at larger matrices and doesn't do well with small N. We should be able to do better by adapting the matrix-vector multiply to handle a few vectors at a time. I can look into this soon, but we should probably let #10846 land first to avoid conflicts.
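For intuition, here is a rough CPU-side sketch of that idea in plain C. This is not the actual ggml-vulkan shader code; the function name, memory layout, and the limit of 8 columns are all illustrative. The point is that each weight of the matrix is loaded once and reused for all n input vectors, which is what makes a "multi-vector" matrix-vector kernel cheaper than the full matrix-matrix path for small N:

/* Conceptual sketch, not the real ggml-vulkan kernel: compute Y[j] = A * X[j]
 * for a small number n of input vectors, reading each element of A only once. */
void mul_mat_vec_small_n(const float *A,  /* rows x cols weight matrix, row-major */
                         const float *X,  /* n x cols input vectors, row-major    */
                         float *Y,        /* n x rows output vectors, row-major   */
                         int rows, int cols, int n) {
    for (int r = 0; r < rows; ++r) {
        float acc[8] = {0.0f};                      /* n is assumed to be <= 8     */
        for (int c = 0; c < cols; ++c) {
            const float a = A[r * cols + c];        /* one load of the weight...   */
            for (int j = 0; j < n; ++j) {
                acc[j] += a * X[j * cols + c];      /* ...reused for all n vectors */
            }
        }
        for (int j = 0; j < n; ++j) {
            Y[j * rows + r] = acc[j];
        }
    }
}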
Awesome, thanks for your explanation! Let me know when you start working on this; I'd be happy to help benchmark certain setups to see what works best. I know it's only applicable to the 7900 XTX, but maybe it's useful data for you.
@jeffbolznv Will give feedback in that PR to keep the discussion in one place |
Name and Version
[docker@a242c844efbf ~]$ llama-cli-vulkan --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
version: 4384 (14b699e)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Problem description & steps to reproduce
llama-batched-bench-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8,16 -pps
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
build: 4384 (14b699e) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
I understand that scaling at some batch sizes might be less than ideal, but at worst I would expect small regressions if no scaling can be achieved at all (due to the overhead of batched processing). Right now, for batch sizes 2 and 4 especially, there is a massive performance loss. Can anything be done to improve this situation? Poor batched performance unfortunately makes speculative decoding on the Vulkan backend unusable.
First Bad Commit
No response
Relevant log output
No response