The root cause of a huge perf problem many see. #6954
Replies: 14 comments 42 replies
-
For Windows you just download the zip and copy all the DLLs inside it over the ones in site-packages/torch/lib.
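That copy step can be sketched in Python. The paths below are illustrative assumptions (wherever you unzipped the cuDNN archive, and your venv's torch/lib folder), not fixed locations:

```python
# Minimal sketch of the manual workaround: copy the cuDNN DLLs from an
# extracted NVIDIA cuDNN archive into torch's bundled lib folder.
import shutil
from pathlib import Path

def replace_cudnn_dlls(cudnn_bin: Path, torch_lib: Path) -> list[str]:
    """Copy every cudnn*.dll from cudnn_bin over torch_lib; return names copied."""
    copied = []
    for dll in sorted(cudnn_bin.glob("cudnn*.dll")):
        shutil.copy2(dll, torch_lib / dll.name)
        copied.append(dll.name)
    return copied

# Hypothetical paths -- adjust for your own download location and venv:
# replace_cudnn_dlls(Path(r"C:\Downloads\cudnn-archive\bin"),
#                    Path(r"venv\Lib\site-packages\torch\lib"))
```

Back up the original DLLs first if you want to be able to undo the swap.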
-
Google Colab: 14.3% speed-up on Euler A (speeds calculated from the final values reported on a 150-step pass; no in-depth time study).
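Percentage figures like this come from comparing it/s (or total step time) before and after the swap. A minimal helper, using sample numbers taken from figures quoted elsewhere in this thread:

```python
def percent_speedup(before_its: float, after_its: float) -> float:
    """Percent speed-up implied by a before/after pair of it/s readings."""
    return (after_its / before_its - 1.0) * 100.0

def its_from_pass(steps: int, seconds: float) -> float:
    """it/s from the total wall time of a sampling pass (e.g. a 150-step run)."""
    return steps / seconds
```

For example, the 13.48 to 39.08 it/s Linux run shown later in this thread works out to roughly a 190% speed-up, i.e. nearly 3x.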
-
I can confirm. I copied all the .dll files from
-
Will this work on older cards such as the 1080?
-
Worked for me! RTX 2070, Windows 11 (about 10-15% faster).
-
RTX 3090, I see a similar 10-20% speedup doing this. However, those gains are more than offset by the fact that xformers no longer works, because it needs to be manually compiled against the new torch/CUDA. So I guess I need to dive into that 20-step process. :(
-
I convinced the PyTorch folks to upgrade the cuDNN they bundle so that people won't have to apply this workaround manually; some people are having trouble doing the workaround correctly. I'm not sure if the fix will land in cu118 for Torch 2 or in Torch 1.13.x.
-
Building with 2.0 cu118 and replacing the cuDNN files offers a modest performance boost on my RTX 4090, but nowhere near 40 it/s, let alone 20 it/s; I'm averaging 15 it/s. Granted, I am on Windows, but I'm still not sure why there are such huge performance discrepancies.
-
Some may have noticed that certain people on Windows are not getting any benefit from this workaround. It seems to always work on Linux; I'm not sure why it doesn't work 100% of the time on Windows. Let's hope we get a permanent fix soon.
-
I just found something new. I'm working with a CTO using this workaround, and he only gets 30-32 it/s. We are almost certain it is because he is running in a VM. Are any of the people here who see an improvement, but a smaller one than mine, running in a VM?

But just as I was about to send the above, I realized that my computer is both fast and slow: it is a Raptor Lake with two different types of cores. This might be part of the reason we see such a large diversity in the results people report when trying this. I only just found this, so I need to do more research, but it obviously makes a difference. In both cases it is a lot better than the 13.6 it/s before the cuDNN fix, but wow!
-
OK, I tried it again. Replacing the .so files in the venv/lib/python3.10/site-packages/torch/lib folder did nothing, so I tried to pip install cuDNN 8.7... apparently torch 1.13.1 doesn't like cuDNN 8.7; when I try to generate, it dies with:
Reinstalling cuDNN 8.5 makes it work again. I don't know how you guys are using the latest cuDNN with torch 1.13.1...
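Mismatches like this are easier to diagnose once you know which cuDNN torch actually loaded: `torch.backends.cudnn.version()` reports it as a single integer (a sys-info dump later in this thread shows `cudnn: 8700`, i.e. 8.7.0). A small decoder for the 8.x series; the `major*1000 + minor*100 + patch` packing is my assumption about how that integer is encoded for cuDNN 8.x:

```python
def decode_cudnn_version(v: int) -> str:
    """Decode the integer torch.backends.cudnn.version() returns for the
    cuDNN 8.x series (e.g. 8700 -> '8.7.0').
    Assumes the 8.x encoding: major*1000 + minor*100 + patch."""
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"

# In a working install you would feed it the live value, e.g.:
# import torch
# print(decode_cudnn_version(torch.backends.cudnn.version()))
```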
-
For Windows 11 (4090 OC, AMD Ryzen 9 5950X, 64 GB), using the vladmandic fork: I downloaded 'cudnn-windows-x86_64-8.9.4.25_cuda12-archive.zip' and copied the files over into 'venv\Lib\site-packages\torch\lib'. Parameters:
I went from 3-4 it/s to 8-9 it/s. Definitely helped.
-
I've done the file replacement and my 4090 went from around 12 it/s to 15 it/s. I've never been able to hit 20+, ever. Not sure why.
From the sys info tab:
app: stable-diffusion-webui.git
Which of the above should I be looking at?
I also saw someone in a thread earlier this year say that installing xformers (which I now know isn't needed as much anymore, but I didn't know, so I installed it) overwrites something in the torch files, after which torch should be reinstalled. Is that the same as replacing these cuDNN files from the manual download? Or would an NVIDIA driver update from 528 to 531 maybe help? Thanks!
-
Putting a 4090 in a system with a 2.3 GHz processor is a waste of money. I have an i9-13900K at 5.8 GHz. The CPU isn't fast enough to push the GPU to its potential, although for "throughput" processing, using a larger batch size would help. You should look at the GPU usage with NVTOP while running a generation. cuDNN 8.7 is good enough.
…On Wed, Sep 20, 2023 at 1:36 AM EmmaWebGH ***@***.***> wrote:
I've done the file replacement and my 4090 went from around 12its to
15its. I've never been able to hit 20+, ever. Not sure why???
From the sys info tab:
app: stable-diffusion-webui.git
updated: 2023-08-31
hash: 5ef669d
<5ef669d>
url:
https://github.com/AUTOMATIC1111/stable-diffusion-webui.git/tree/master
arch: AMD64
cpu: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
system: Windows
release: Windows-10-10.0.22621-SP0
python: 3.10.8
device: NVIDIA GeForce RTX 4090 (1) (compute_37) (8, 9)
cuda: 11.8
cudnn: 8700
driver: 528.02
xformers: 0.0.20
diffusers:
transformers: 4.30.2
Which of the above should I be looking at?
I noticed one poster in a different thread had cudnn 8800, whereas I have
8700.
I also saw someone in a thread earlier in the year said installing
xformers (which I now know isn't needed as much anymore, but I didn't know
so installed) overwrites something in the torch stuff and then torch should
be reinstalled? Or is that the same as replacing these cudnn files from the
manual download?
or would an Nvidia driver update from 528 to 531 maybe, help?
Thanks!
Em
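Besides NVTOP, a quick way to see whether the CPU is starving the GPU is to poll `nvidia-smi`'s utilization query (`--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` are real nvidia-smi options). A sketch, with the parsing split out so it can be checked without a GPU:

```python
import subprocess

def parse_utilization(csv_text: str) -> list[int]:
    """Parse nvidia-smi's one-number-per-line CSV output into percentages."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def gpu_utilization_percent() -> list[int]:
    """Per-GPU utilization in percent (requires an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)
```

If utilization sits well below 100% during a generation, the GPU is waiting on something else (CPU, VM overhead, etc.) rather than being the bottleneck.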
-
Look at this actual output. The it/s on the first 100% line is the warmup; compare the it/s from the second 100% line of each run. See the explanation below the output. This is on a 4090.
(venv) ~/a1111$ python3 -c 'import torch; print(torch.__version__)'
1.12.1+cu113  # This isn't even torch 1.13.1, much less torch 2.0
(venv) ~/a1111$ ./webui.sh --xformers | tail -n +35
To create a public link, set share=True in launch()
0%|          | 0/20 [00:00<?, ?it/s]
100%|██████████| 20/20 [00:02<00:00, 9.71it/s]
100%|██████████| 20/20 [00:01<00:00, 13.48it/s]
^C
(venv) ~/a1111$ mv venv/lib/python3.10/site-packages/torch/lib/libcudnn.so.8 venv/lib/python3.10/site-packages/torch/lib/libcudnn.so.8-sv
(venv) ~/a1111$ ./webui.sh --xformers | tail -n +35
To create a public link, set share=True in launch()
0%|          | 0/20 [00:00<?, ?it/s]
100%|██████████| 20/20 [00:00<00:00, 22.55it/s]
100%|██████████| 20/20 [00:00<00:00, 39.08it/s]
Can it really be that simple, that just removing (or mv-ing) one file nearly triples performance? Yes.
I have found that all the PyTorch bundles you download from the net contain libcudnn.so.8, including the nightly build of PyTorch 2.0.0, and this bundled library is an older version of cuDNN. Even if you have installed the latest cuDNN v8.7 from NVIDIA into your system location, the library search path will find the copy in your venv first. By getting rid of it, you get nearly 3x more performance on a 4090.
If you are on Linux, please get an it/s figure for a simple 20-step image generation after a warmup. Then install cuDNN 8.7, remove any copy in your library search path that might hide the newer v8.7 version, and let me know the before and after it/s and what kind of card you have.
As for Windows, some people say they get ~38 it/s on a 4090 and others say they get the 13.9 it/s I saw before I found this, so it looks like Windows might see the same issue. I'm not a Windows library search path expert, so I'll let others figure out where the system libcudnn.dll and the one that comes with the torch download live.
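The `mv` in the transcript above generalizes to every libcudnn library that torch bundles. A Python sketch of the same move-aside step; `torch_lib` would be something like venv/lib/python3.10/site-packages/torch/lib, as in the transcript:

```python
# Rename torch's bundled cuDNN libraries out of the way so the loader
# falls back to the (newer) system-installed cuDNN, mirroring the
# 'mv libcudnn.so.8 libcudnn.so.8-sv' step shown in the transcript.
from pathlib import Path

def shelve_bundled_cudnn(torch_lib: Path) -> list[str]:
    """Rename every libcudnn*.so* under torch_lib to '<name>-sv'; return names moved."""
    moved = []
    for lib in sorted(torch_lib.glob("libcudnn*.so*")):
        lib.rename(lib.with_name(lib.name + "-sv"))
        moved.append(lib.name)
    return moved
```

Renaming rather than deleting keeps the swap reversible: moving the `-sv` files back restores the original bundled cuDNN.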