Woo Hoo! .496 seconds per 512x512 image 20 steps #6405
Replies: 6 comments 3 replies
-
Amazing, I dream of getting these optimizations up and running, but just haven't had success with the necessary steps. Looking forward to seeing this go more mainstream.
-
Have you tried the wheel here? #5962
-
You might be right about CUDA. Before I got the compile to work I tried downloading the torch 2.0/CUDA 11.7 nightly build and seem to recall that I saw a good perf boost.
Regarding CUDA 12: NVIDIA's announcement claims "Support for new NVIDIA Hopper and NVIDIA Ada Lovelace" architectures, and my 4090 is Ada Lovelace. There are other optimizations in CUDA 12, but they may have to be explicitly used to gain any advantage.
I'll be doing more testing today to clarify things.
…On Fri, Jan 6, 2023 at 3:07 AM Billy Cao ***@***.***> wrote:
Have you tried the wheel here? #5962
I am interested to see if CUDA 12 brings any improvements after all, as I am seeing big speed-ups from just building xformers with torch 2.0 (and some even reported a double speed boost)
-
As Billy mentioned, CUDA 12 might not help; most of the gain is from PyTorch 2.0, plus the other changes I made to processing.py.
Give me a bit more time to flesh out the details.
…On Fri, Jan 6, 2023 at 4:31 AM hippopotamus1000 ***@***.***> wrote:
Impressive speed. Could you write a short tutorial on how you built pytorch 2.0/xformers with CUDA 12?
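For anyone wanting to attempt this before a tutorial appears, the general shape is the standard from-source build flow documented in the PyTorch and xformers READMEs — a sketch of the generic steps, not the exact recipe used in this thread (it assumes the CUDA 12 toolkit and a matching compiler are already installed):

```shell
# Build PyTorch from source so it compiles against the locally installed CUDA toolkit
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt
python setup.py develop

# Then build xformers against that PyTorch so the CUDA kernels match
cd ..
git clone --recursive https://github.com/facebookresearch/xformers
cd xformers
pip install -e .
```

Both builds compile CUDA kernels, so expect them to take a while; building xformers after PyTorch matters, since its extension must link against the torch you just built.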
-
Waiting...
-
Started discussion #6932
-
I managed to build PyTorch 2.0 with CUDA 12 and then built xformers against it. I also reverted the decode_first_stage change, which causes a perf regression. With this I got .538 seconds per image. Then I added my change to overlap the remaining CPU processing of images with the GPU processing for the next batch. That got me to .496! I finally broke 1/2 second.
Setup: Ubuntu, a 4090, Euler_a, 20 steps, v2-1_512-ema-pruned, 64 images with batch count 4 and batch size 16. Batch size 16 seems optimal.
With 768x768 images and the matching v2.1 ckpt file I average 1.296 seconds.
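The overlap idea — handing finished GPU output to a CPU worker so the next batch can start immediately — can be sketched with a queue and a worker thread. This is a minimal illustration of the technique, not the actual processing.py change; `run_gpu_batch` and `postprocess` are hypothetical stand-ins:

```python
import queue
import threading

def pipeline(batches, run_gpu_batch, postprocess):
    """Overlap CPU post-processing with GPU work on the next batch.

    The main thread runs the (GPU-bound) step back to back; a worker
    thread drains the queue and does the (CPU-bound) post-processing
    concurrently, so the GPU never waits on image encoding/saving.
    """
    q = queue.Queue()
    results = []

    def worker():
        while True:
            raw = q.get()
            if raw is None:        # sentinel: no more batches
                break
            results.extend(postprocess(raw))  # CPU-bound work

    t = threading.Thread(target=worker)
    t.start()
    for batch in batches:
        raw = run_gpu_batch(batch)  # GPU-bound step
        q.put(raw)                  # hand off; loop starts next batch now
    q.put(None)
    t.join()
    return results
```

Because there is a single worker and `queue.Queue` is FIFO, results come back in batch order; the win is that post-processing of batch N runs while batch N+1 is on the GPU.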