vulkan: im2col and matmul optimizations for stable diffusion #10942
I've tested these changes on my RX 5700 XT, and the results are a mix of huge improvements and regressions (some specific calculations behave really unreliably).
master:
PR run 1:
PR run 2:
I also see a regression on the Maybe we can find a way to avoid that regression?
RTX 3090:
AMD Radeon Pro VII:
Intel A770:
im2col has a kind of tricky memory access pattern. I tried changing the mapping of invocation->elements in a few different ways and didn't find anything better than the current approach. It's actually lucky that it works out well, the I found that just changing the workgroup to handle 512 elements rather than 256 gives some gains, including for the 3x3 256,256,256 case. I lose some of the gains for the 5x5 filter, but I think 3x3 is much more common. My new results:
stable-diffusion performance is the same as with the previous commit.
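For readers unfamiliar with im2col, here is a minimal CPU-side sketch of the idea discussed above. It is not the Vulkan shader from this PR: the function names, the single-channel layout, and the example sizes are all assumptions chosen for illustration. It shows how each output element maps back to an input element, and how the number of workgroups changes when each workgroup handles 512 elements instead of 256.

```python
def im2col_1ch(image, kh, kw, stride=1, pad=0):
    """Unfold a single-channel 2D image into a (OH*OW) x (KH*KW) matrix.

    Hypothetical reference helper, not the layout used by the real shader.
    """
    ih, iw = len(image), len(image[0])
    oh = (ih + 2 * pad - kh) // stride + 1
    ow = (iw + 2 * pad - kw) // stride + 1
    cols = []
    for oy in range(oh):
        for ox in range(ow):
            row = []
            for ky in range(kh):
                for kx in range(kw):
                    # Input coordinate read by this output element.
                    iy = oy * stride + ky - pad
                    ix = ox * stride + kx - pad
                    inside = 0 <= iy < ih and 0 <= ix < iw
                    # Out-of-bounds reads are treated as zero padding.
                    row.append(image[iy][ix] if inside else 0.0)
            cols.append(row)
    return cols


def num_workgroups(total_elements, elements_per_workgroup):
    """Ceiling division: workgroups needed to cover every output element."""
    return -(-total_elements // elements_per_workgroup)


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    cols = im2col_1ch(image, kh=3, kw=3)
    # Neighbouring rows overlap heavily, so the same input values are
    # read many times at shifted offsets.
    for row in cols:
        print(row)

    # Assumed example size: 256x256 outputs with a 3x3 filter, one channel.
    total = 256 * 256 * 3 * 3
    print(num_workgroups(total, 256), num_workgroups(total, 512))
```

Doubling the elements per workgroup halves the dispatch size in this sketch; whether that is faster on a given GPU still has to be measured, as the results in this thread show.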
Here are the results from the latest commit. It looks better now, I think.
RTX 3090:
Radeon Pro VII:
Intel A770:
I was looking at performance using the command line from leejet/stable-diffusion.cpp#439, just with a larger resolution (640x640). These changes improve perf (with NV_coopmat2) from 3.68it/s to 5.07it/s on RTX 4070.
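To put the quoted RTX 4070 numbers in perspective, this small sketch converts the two it/s figures into a speedup factor and a per-iteration time. Only the two throughput values come from the comment above; the helper name is made up for illustration.

```python
def speedup_summary(before_its, after_its):
    """Return (speedup factor, seconds/iter before, seconds/iter after)."""
    return after_its / before_its, 1.0 / before_its, 1.0 / after_its


if __name__ == "__main__":
    factor, t_before, t_after = speedup_summary(3.68, 5.07)
    print(f"speedup: {factor:.2f}x")
    print(f"time per iteration: {t_before * 1000:.0f} ms -> {t_after * 1000:.0f} ms")
```

That works out to roughly a 1.38x throughput gain for this particular command line and resolution.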