The root cause of a huge perf problem many see. #6954
Replies: 14 comments 42 replies
-
For Windows you just download the zip and copy all the DLLs inside it over the ones in site-packages/torch/lib.
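That copy step can be sketched in Python. The paths below are illustrative assumptions (wherever you unzipped the cuDNN archive, and your venv's torch/lib folder), not fixed locations:

```python
# Minimal sketch of the manual workaround: copy the cuDNN DLLs from an
# extracted NVIDIA cuDNN archive into torch's bundled lib folder.
import shutil
from pathlib import Path

def replace_cudnn_dlls(cudnn_bin: Path, torch_lib: Path) -> list[str]:
    """Copy every cudnn*.dll from cudnn_bin over torch_lib; return names copied."""
    copied = []
    for dll in sorted(cudnn_bin.glob("cudnn*.dll")):
        shutil.copy2(dll, torch_lib / dll.name)
        copied.append(dll.name)
    return copied

# Hypothetical paths -- adjust for your own download location and venv:
# replace_cudnn_dlls(Path(r"C:\Downloads\cudnn-archive\bin"),
#                    Path(r"venv\Lib\site-packages\torch\lib"))
```

Back up the original DLLs first if you want to be able to undo the swap.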
-
Google Colab: 14.3% speed-up on Euler A (speeds calculated from the final values reported on a 150-step pass; no in-depth time study).
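Percentage figures like this come from comparing it/s (or total step time) before and after the swap. A minimal helper, using sample numbers taken from figures quoted elsewhere in this thread:

```python
def percent_speedup(before_its: float, after_its: float) -> float:
    """Percent speed-up implied by a before/after pair of it/s readings."""
    return (after_its / before_its - 1.0) * 100.0

def its_from_pass(steps: int, seconds: float) -> float:
    """it/s from the total wall time of a sampling pass (e.g. a 150-step run)."""
    return steps / seconds
```

For example, the 13.48 to 39.08 it/s Linux run shown later in this thread works out to roughly a 190% speed-up, i.e. nearly 3x.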
-
I can confirm. I copied all the .dll files from
-
Will this work on older cards such as the 1080?
-
Worked for me! RTX 2070, Windows 11 (about 10-15% faster).
-
RTX 3090, I see a similar 10-20% speedup doing this. However, those gains are more than offset by the fact that xformers no longer works, because it needs to be manually compiled against the new torch/CUDA. So I guess I need to dive into that 20-step process. :(
-
I convinced the PyTorch folks to upgrade the cuDNN they bundle so that people won't have to apply this workaround manually; some people are having trouble doing the workaround correctly. I'm not sure if the fix will land in cu118 for Torch 2 or in Torch 1.13.x.
-
Building with 2.0 cu118 and replacing the cuDNN files offers a modest performance boost on my RTX 4090, but nowhere near 40 it/s, let alone 20 it/s; I'm averaging 15 it/s. Granted, I am on Windows, but I'm still not sure why there are such huge performance discrepancies.
-
Some may have noticed that certain people on Windows are not getting any benefit from this workaround. It seems to always work on Linux; I'm not sure why it doesn't work 100% of the time on Windows. Let's hope we get a permanent fix soon.
-
I just found something new. I'm working with a CTO using this workaround, and he only gets 30-32 it/s. We are almost certain it is because he is running in a VM. Are any of the people here who see an improvement, but a smaller one than mine, running in a VM?

But just as I was about to send the above, I realized that my computer is both fast and slow: it is a Raptor Lake with two different types of cores. This might be part of the reason we see such a large diversity in the results people report when trying this. I only just found this, so I need to do more research, but it obviously makes a difference. In both cases it is a lot better than the 13.6 it/s before the cuDNN fix, but wow!
-
OK, I tried it again. Replacing the .so files in the venv/lib/python3.10/site-packages/torch/lib folder did nothing, so I tried to pip install cuDNN 8.7... apparently torch 1.13.1 doesn't like cuDNN 8.7; when I try to generate, it dies with:
Reinstalling cuDNN 8.5 makes it work again. I don't know how you guys are using the latest cuDNN with torch 1.13.1...
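Mismatches like this are easier to diagnose once you know which cuDNN torch actually loaded: `torch.backends.cudnn.version()` reports it as a single integer (a sys-info dump later in this thread shows `cudnn: 8700`, i.e. 8.7.0). A small decoder for the 8.x series; the `major*1000 + minor*100 + patch` packing is my assumption about how that integer is encoded for cuDNN 8.x:

```python
def decode_cudnn_version(v: int) -> str:
    """Decode the integer torch.backends.cudnn.version() returns for the
    cuDNN 8.x series (e.g. 8700 -> '8.7.0').
    Assumes the 8.x encoding: major*1000 + minor*100 + patch."""
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"

# In a working install you would feed it the live value, e.g.:
# import torch
# print(decode_cudnn_version(torch.backends.cudnn.version()))
```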
-
For Windows 11 (4090 OC, AMD Ryzen 9 5950X, 64 GB), using the vladmandic fork: I downloaded 'cudnn-windows-x86_64-8.9.4.25_cuda12-archive.zip' and copied the files over into 'venv\Lib\site-packages\torch\lib'. Parameters:
I went from 3-4 it/s to 8-9 it/s. Definitely helped.
-
I've done the file replacement and my 4090 went from around 12 it/s to 15 it/s. I've never been able to hit 20+, ever. Not sure why.
From the sys info tab:
app: stable-diffusion-webui.git
Which of the above should I be looking at?
I also saw someone in a thread earlier this year say that installing xformers (which I now know isn't needed as much anymore, but I didn't know, so I installed it) overwrites something in the torch files, after which torch should be reinstalled. Is that the same as replacing these cuDNN files from the manual download? Or would an NVIDIA driver update from 528 to 531 maybe help? Thanks!
-
Putting a 4090 in a system with a 2.3 GHz processor is a waste of money. I have an i9-13900K at 5.8 GHz. The CPU isn't fast enough to push the GPU to its potential, although for "throughput" processing, using a larger batch size would help. You should look at the GPU usage with NVTOP while running a generation. cuDNN 8.7 is good enough.
…On Wed, Sep 20, 2023 at 1:36 AM EmmaWebGH ***@***.***> wrote:
I've done the file replacement and my 4090 went from around 12its to
15its. I've never been able to hit 20+, ever. Not sure why???
From the sys info tab:
app: stable-diffusion-webui.git
updated: 2023-08-31
hash: 5ef669d
<5ef669d>
url:
https://github.com/AUTOMATIC1111/stable-diffusion-webui.git/tree/master
arch: AMD64
cpu: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
system: Windows
release: Windows-10-10.0.22621-SP0
python: 3.10.8
device: NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9)
cuda: 11.8
cudnn: 8700
driver: 528.02
xformers: 0.0.20
diffusers:
transformers: 4.30.2
Which of the above should I be looking at?
I noticed one poster in a different thread had cudnn 8800, whereas I have
8700.
I also saw someone in a thread earlier in the year said installing
xformers (which I now know isn't needed as much anymore, but I didn't know
so installed) overwrites something in the torch stuff and then torch should
be reinstalled? Or is that the same as replacing these cudnn files from the
manual download?
or would an Nvidia driver update from 528 to 531 maybe, help?
Thanks!
Em
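Besides NVTOP, a quick way to see whether the CPU is starving the GPU is to poll `nvidia-smi`'s utilization query (`--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` are real nvidia-smi options). A sketch, with the parsing split out so it can be checked without a GPU:

```python
import subprocess

def parse_utilization(csv_text: str) -> list[int]:
    """Parse nvidia-smi's one-number-per-line CSV output into percentages."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def gpu_utilization_percent() -> list[int]:
    """Per-GPU utilization in percent (requires an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)
```

If utilization sits well below 100% during a generation, the GPU is waiting on something else (CPU, VM overhead, etc.) rather than being the bottleneck.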
-
Look at this actual output. The it/s on the first 100% line is the warmup; compare the it/s from the second 100% line of each run. See the explanation below the output. This is on a 4090.
(venv) ~/a1111$ python3 -c 'import torch; print(torch.__version__)'
1.12.1+cu113  # This isn't even torch 1.13.1, much less torch 2.0
(venv) ~/a1111$ ./webui.sh --xformers | tail -n +35
To create a public link, set share=True in launch()
0%|          | 0/20 [00:00<?, ?it/s]
100%|██████████| 20/20 [00:02<00:00, 9.71it/s]
100%|██████████| 20/20 [00:01<00:00, 13.48it/s]
^C
(venv) ~/a1111$ mv venv/lib/python3.10/site-packages/torch/lib/libcudnn.so.8 venv/lib/python3.10/site-packages/torch/lib/libcudnn.so.8-sv
(venv) ~/a1111$ ./webui.sh --xformers | tail -n +35
To create a public link, set share=True in launch()
0%|          | 0/20 [00:00<?, ?it/s]
100%|██████████| 20/20 [00:00<00:00, 22.55it/s]
100%|██████████| 20/20 [00:00<00:00, 39.08it/s]
Can it really be that simple, that just removing (or mv-ing) one file nearly triples performance? Yes.
I have found that all the PyTorch bundles you download from the net contain libcudnn.so.8, including the nightly build of PyTorch 2.0.0, and this bundled library is an older version of cuDNN. Even if you have installed the latest cuDNN v8.7 from NVIDIA into your system location, the library search path will find the copy in your venv first. By getting rid of it, you get nearly 3x more performance on a 4090.
If you are on Linux, please get an it/s figure for a simple 20-step image generation after a warmup. Then install cuDNN 8.7, remove any copy in your library search path that might hide the newer v8.7 version, and let me know the before and after it/s and what kind of card you have.
As for Windows, some people say they get ~38 it/s on a 4090 and others say they get the 13.9 it/s I saw before I found this, so it looks like Windows might see the same issue. I'm not a Windows library search path expert, so I'll let others figure out where the system libcudnn.dll and the one that comes with the torch download live.
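The `mv` in the transcript above generalizes to every libcudnn library that torch bundles. A Python sketch of the same move-aside step; `torch_lib` would be something like venv/lib/python3.10/site-packages/torch/lib, as in the transcript:

```python
# Rename torch's bundled cuDNN libraries out of the way so the loader
# falls back to the (newer) system-installed cuDNN, mirroring the
# 'mv libcudnn.so.8 libcudnn.so.8-sv' step shown in the transcript.
from pathlib import Path

def shelve_bundled_cudnn(torch_lib: Path) -> list[str]:
    """Rename every libcudnn*.so* under torch_lib to '<name>-sv'; return names moved."""
    moved = []
    for lib in sorted(torch_lib.glob("libcudnn*.so*")):
        lib.rename(lib.with_name(lib.name + "-sv"))
        moved.append(lib.name)
    return moved
```

Renaming rather than deleting keeps the swap reversible: moving the `-sv` files back restores the original bundled cuDNN.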