vulkan: experimental coalesced read to shared memory before dequantization #10999

netrunnereve · 2024-12-27T22:33:57Z

This is an ugly and hacky experiment that's kept as a draft for a reason, but it makes Q6_K run almost as fast as Q4_K_M. Basically the idea is to first read the superblocks into shared memory using perfectly coalesced loads and then do the actual dequantizations using the fast shared memory. As this completely separates the memory reads from the dequantization code this also allows us to explore different algorithms for dequantization that don't rely on coalescing.

With any other language I'd be able to easily do this, but GLSL has no pointers, memcpy, and all that. For some reason though it lets me index an array of structs beyond it's stated size to access the next element, and that doesn't cause an error or crash my computer. As a result this is not suitable for release unless we have a way of doing this properly, and I'm just leaving this here for feedback and suggestions.

It's possible to make this go faster by using 32-bit loads per thread if the alignment allows for it, but again it's the language that's holding me back. The current code runs with a fixed 64 subgroup size.

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q6_K (Master)	5.53 GiB	7.24 B	Vulkan	100	8	1	none	tg128	20.39 ± 0.00
llama 7B Q6_K (PR)	5.53 GiB	7.24 B	Vulkan	100	8	1	none	tg128	25.15 ± 0.02

Master
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   368.04 us/run - 117.44 MFLOP/run - 319.10 GFLOPS

PR
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   294.77 us/run - 117.44 MFLOP/run - 398.41 GFLOPS

In 158ab15 I also have a proper version of this which runs at 23 t/s with the same Q6_K model. Here the loads are coalesced for each 16 thread superblock only and no hacks are required, but I think it only helps with AMD GCN as it's 64 wide SIMD is actually 4 16 wide units internally (and the loads are done in blocks of 16 threads).

small stuff scale cache there we go float type

netrunnereve · 2024-12-28T03:31:32Z

Turns out the CI is failing as llvmpipe isn't happy with this change, though strangely enough it provides good output and tests clean with RADV.

GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

jeffbolznv · 2024-12-28T05:37:34Z

On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master and commit 158ab15 is maybe half a percent faster than master.

GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it. IIRC the syntax is just:

shared Block1 {
   // members, optionally with offset decorations
};
shared Block2 {
   // members, optionally with offset decorations
};

and then data in each block can potentially alias.

netrunnereve · 2024-12-28T17:30:37Z

On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master

This might actually be fine considering how I've forced it to use a subgroup size of 64. It should go faster with 32, but then again if you're memory bound this may not help much.

Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it.

Thanks. I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index fab4ff5f..dfcae382 100644
@@ -1,6 +1,7 @@
 #version 450
 
 #extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shared_memory_block : require
 
 #include "mul_mat_vec_base.comp"
 
@@ -9,7 +10,9 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
 layout (constant_id = 0) const uint BLOCK_SIZE = 32;
 layout (constant_id = 1) const uint NUM_ROWS = 1;
 
-shared FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+shared shblk {
+	FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+};
 
 void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
     uint a_offset, b_offset, d_offset;

jeffbolznv · 2024-12-28T19:13:04Z

I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

I looked at the SPIR-V that gets generated and I think what's happening is the ArrayStride for the inner array of tmpsh is being set to 128. This is a known limitation of SPIR-V where structure layouts that depend on spec constants don't really work because the layout doesn't get updated based on the spec constant. If you had BLOCK_SIZE = 64 in the shader it would probably work on AMD.

Keep in mind that GL_EXT_shared_memory_block is probably not supported everywhere, so we'd need to have a fallback path. So need to decide whether the perf we get from it is worth the additional maintenance cost.

0cc4m · 2024-12-29T07:21:35Z

We would definitely need a fallback path. I think all modern desktop devices support the required extension VK_KHR_workgroup_memory_explicit_layout, but many phones don't and some people also use Vulkan on older GPU/driver combinations on desktop.

Edit: You can check how broad the support is on https://vulkan.gpuinfo.org/listextensions.php

netrunnereve added 5 commits December 27, 2024 14:25

q6_k extract scale

158ab15

small stuff scale cache there we go float type

seperate threaded read test, slower somehow

0078ae4

hacky edition

a56504f

go all the way

f833b75

go even further

25d7ae4

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 27, 2024

Update mul_mat_vec_q6_k.comp

b3b30e4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vulkan: experimental coalesced read to shared memory before dequantization #10999

vulkan: experimental coalesced read to shared memory before dequantization #10999

netrunnereve commented Dec 27, 2024

netrunnereve commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

netrunnereve commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

0cc4m commented Dec 29, 2024 •

edited

Loading

vulkan: experimental coalesced read to shared memory before dequantization #10999

Are you sure you want to change the base?

vulkan: experimental coalesced read to shared memory before dequantization #10999

Conversation

netrunnereve commented Dec 27, 2024

netrunnereve commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

netrunnereve commented Dec 28, 2024

jeffbolznv commented Dec 28, 2024

0cc4m commented Dec 29, 2024 • edited Loading

0cc4m commented Dec 29, 2024 •

edited

Loading