vulkan: optimize mul_mat for small values of N #10991
Conversation
Results from
The "before" results with coopmat1 or no coopmat were worse (I can share them if somebody is interested, but it's probably more useful to benchmark another GPU instead). I'm still thinking about where to put the cutoff for switching from mul_mat_vec to mul_mat. It seems like 8 would still be better using mul_mat_vec, and it doesn't cost anything except a little bit of compile time. Let's collect data on some other systems before finalizing anything.
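The cutoff being discussed could be sketched on the host side roughly as below. This is an illustrative sketch, not the PR's actual dispatch code: the helper name `pick_matmul_path` is made up, and only the constant name `mul_mat_vec_max_cols` and the candidate value 8 come from the thread.

```cpp
#include <cstdint>

// Hypothetical sketch of the dispatch cutoff under discussion: for small N
// (number of columns in B), use the mul_mat_vec path with NUM_COLS > 1;
// larger N falls back to the full mul_mat path. The value 8 is the cutoff
// proposed in this thread; the real selection logic in the backend differs.
static constexpr uint32_t mul_mat_vec_max_cols = 8;

enum class matmul_path { MUL_MAT_VEC, MUL_MAT };

// Pick a shader path based on N (illustrative only).
matmul_path pick_matmul_path(uint32_t n_cols) {
    return (n_cols <= mul_mat_vec_max_cols) ? matmul_path::MUL_MAT_VEC
                                            : matmul_path::MUL_MAT;
}
```

As noted above, raising the cutoff costs little besides some extra shader compile time, since each NUM_COLS value is a separate specialization.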
CC @netrunnereve, can you please help with some perf tests?
Results with mul_mat_vec_max_cols == 8:
Here are the numbers on my RX 470; it's much faster with small values of N compared to master. My card prefers a max cols of 8, or maybe something even larger. Master:
PR:
max cols of 8:
Giving my results with a 7900 XTX running RADV.

This PR: main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
Master: main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
Conclusion:
Let me know if you want any additional tests at different batch sizes. Thanks for making this PR!
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.
Force-pushed 905c05d to 0247aaf.
I didn't see a perf regression for N==1. I've updated the limit to 8, and removed "draft".
Thanks @Mushoz. I've updated the limit to 8. Feel free to try 16, but I suspect the mat-mat mul path would work better for 16, at least if we tuned the matrix sizes (the current set of three sizes may be limiting...).
Token generation is looking good at batch size 8 as well now!
Going to try and see if a limit of 16 makes more sense, as N=8 is now outperforming N=16.
What did you mean by this, btw? I can clearly see a 0.5 token/sec drop in my N=1 result on this branch vs the master branch. I think that's outside the margin of error?
I meant in my own local testing. Is this outside the margin of error for you?
Limit at 16:
So it seems like 8 is indeed the sweet spot.
I'm surprised it's worse at 16. Maybe it's using too many registers? You could try changing rm_kq and rm_stdq to 1; it may not make sense to do multiple rows with such a large value of N.
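The register-pressure intuition above can be made concrete with a back-of-the-envelope count. This is my own illustrative arithmetic, not measured data from the PR: it only assumes that each invocation keeps one partial sum live per (row, column) pair it owns.

```cpp
#include <cstdint>

// Rough accumulator count per invocation: the live partial sums grow as
// rows_per_invocation * NUM_COLS, which is one plausible source of the
// register pressure suspected at NUM_COLS == 16.
constexpr uint32_t accumulators(uint32_t rows_per_invocation, uint32_t num_cols) {
    return rows_per_invocation * num_cols;
}

// e.g. dropping rm_kq/rm_stdq from 2 rows to 1 would halve the live
// accumulators at NUM_COLS == 16: 2 * 16 = 32 vs 1 * 16 = 16.
```

Whether 32 accumulators actually spills depends on the GPU's register file and occupancy targets, which is why the suggestion above is worth testing rather than assuming.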
This is what I have in mind to fix #10966. Currently a draft because it needs more perf testing, particularly to make sure that it doesn't regress perf when N==1.
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better.
Share some code for reducing the result values to memory in mul_mat_vec_base.
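A CPU-side sketch of the loop structure described above: NUM_COLS is a compile-time constant (a spec constant in the real GLSL), and the loads from B sit in the innermost loop so consecutive columns reuse the just-loaded A value. The function name, data layout, and scalar (non-workgroup) formulation are all illustrative, not the actual shader code.

```cpp
#include <array>
#include <cstddef>

// Illustrative CPU model of the shader structure: for each row of A, keep
// NUM_COLS accumulators and walk K once, loading the B values for all
// NUM_COLS columns in the innermost loop (which should cache better, since
// the A element is reused across columns).
template <size_t NUM_COLS>
void mul_mat_vec_sketch(const float *A,   // M x K, row-major
                        const float *B,   // K x NUM_COLS, column-major
                        float *dst,       // M x NUM_COLS, row-major
                        size_t M, size_t K) {
    for (size_t row = 0; row < M; ++row) {
        std::array<float, NUM_COLS> acc{};     // one partial sum per B column
        for (size_t k = 0; k < K; ++k) {
            const float a = A[row * K + k];
            // innermost loop over the B columns, as in the shader
            for (size_t c = 0; c < NUM_COLS; ++c) {
                acc[c] += a * B[c * K + k];
            }
        }
        for (size_t c = 0; c < NUM_COLS; ++c) {
            dst[row * NUM_COLS + c] = acc[c];
        }
    }
}
```

With NUM_COLS as a template parameter here (a spec constant in the shader), the inner loop fully unrolls per specialization, which mirrors why each supported N costs a little extra compile time but nothing at runtime.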
I'll put directed perf tests in a separate comment.