Poor 4090 Performance (~4it/s) #324

Deminisa · 2023-02-07T13:16:02Z

Hi,

I was wondering what kind of performance other people are seeing on 40X0 cards, as my 4090 is not reaching it/s anywhere close to what others have been reporting.

For example, generating "photo of a person" with Euler a, 20 steps, 512x512, I'm getting anywhere between 3.8 and 4.5 it/s with batch count 1 and batch size 1.

System:
i7 9700k, 64GB RAM, M.2 NVMe (the CPU is a few generations old, but I'm assuming it's not making it/s that much worse)
Clean install of Windows 11 22H2
WSL 2 Ubuntu 20.04
Docker 4.16.3
Clean container built from the newest commit of this repo (1e0561c), running AUTO1111 with everything on default settings

Also tried building and using xformers-0.0.17+00afc12.d20230207 but did not see any noticeable improvements.

Any tips would be appreciated! 😀

Cheers!

AbdBarho · 2023-02-07T13:55:14Z

Maintainer

@Deminisa you have to remove the --medvram CLI flag, if it is still there.

1 reply

Deminisa Feb 7, 2023
Author

Great shout! After removing --medvram we're now seeing between 9.0 and 9.5 it/s, roughly double the speed, so many thanks for that! 🙏

Is there anything else that comes to mind to increase the speed further? There have been mentions of it/s in the mid 20s on AUTO without heavy customization.

I've seen some talk about cuDNN, and according to this post it allegedly doubled it/s for some (https://www.reddit.com/r/StableDiffusion/comments/y71q5k/4090_cudnn_performancespeed_fix_automatic1111/). The PyTorch update from cu113 is already implemented, but I'm unsure how one would go about updating the cuDNN files, since the post uses venv and I'm not familiar enough with Docker and this setup compared to doing it in Windows. Also, since the container isn't persistent, I assume a change would have to be made in the Dockerfile to mount the new files and overwrite the old ones before webui.py runs. On the other hand, it seems to have been tested here without any effect (#300).

DevilaN · 2023-02-08T08:47:06Z

@Deminisa:
I've tried updating cuDNN in #300 but didn't get any performance boost. Then again, that was only on my poor man's graphics card.

Download cuDNN for Linux and place the files in your /output directory. This is for convenience: the directory is easily accessible from inside the container, so you can later unpack the files inside the container into the proper destination directory.

You can run docker ps while automatic is running in Docker to get a list of running containers with their corresponding IDs.
Then enter the container with docker exec -it CONTAINER_ID_TAKEN_FROM_DOCKER_PS_COMMAND /bin/bash. This puts you inside the running automatic container, so you can unpack the files and place them in the proper location.

The cuDNN files (libraries) are in the /usr/local/lib/python3.10/site-packages/torch/lib/ directory, so simply unpack the cuDNN package you got from the NVIDIA site and copy the libraries over there. After this I suggest stopping the container and starting it again. It should then be running with the updated cuDNN files.
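
The copy step can be sketched in Python, run inside the container. This is a minimal sketch, not part of this repo: the function name `copy_cudnn_libs`, the `libcudnn*` file pattern, and the example source path are assumptions for illustration; only the destination directory comes from the instructions above. It keeps symlinks as symlinks and does not rename any files.

```python
import shutil
from pathlib import Path


def copy_cudnn_libs(src_dir, dest_dir, pattern="libcudnn*"):
    """Copy cuDNN libraries from src_dir into dest_dir, keeping symlinks as symlinks.

    Returns the sorted list of file names that were copied.
    The default pattern is an assumption about how the library files are named.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    copied = []
    for item in sorted(src.glob(pattern)):
        # Skip real directories; only files and symlinks are copied.
        if item.is_dir() and not item.is_symlink():
            continue
        target = dest / item.name
        # Remove any existing file or link first so the new one replaces it.
        if target.is_symlink() or target.exists():
            target.unlink()
        # follow_symlinks=False recreates a symlink instead of copying its target.
        shutil.copy2(item, target, follow_symlinks=False)
        copied.append(item.name)
    return copied


# Hypothetical usage inside the container (the source path is an assumption):
# copy_cudnn_libs("/output/cudnn/lib",
#                 "/usr/local/lib/python3.10/site-packages/torch/lib/")
```

After restarting the container, running `python -c "import torch; print(torch.backends.cudnn.version())"` inside it is one way to check which cuDNN version PyTorch actually loaded.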

If it works for you (this is a temporary solution to check it out), then we'll think about incorporating it into the regular Dockerfile so that everybody who can benefit from it will.

Deminisa · 2023-02-08T13:27:48Z

Author

@DevilaN

Thank you very much for the instructions! I was originally under the impression that the container wasn't persistent, as I'm having issues trying to properly set up the dreambooth plugin in auto (I can't find a CUDA shared library to point to for bitsandbytes/8-bit Adam, and cuda-libraries installed with apt get removed after each container restart), but I digress; that's something for another discussion.

All tests were done with the v1-5-pruned-emaonly.ckpt [cc6cb27103] checkpoint (https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt)

Prompt: a photo of a person
euler_a
20 Steps
CFG 7
512x512

Default install
100%|██████████| 20/20 [00:02<00:00, 8.29it/s]00:19, 9.52it/s]
100%|██████████| 20/20 [00:02<00:00, 9.27it/s]00:17, 9.07it/s]
100%|██████████| 20/20 [00:02<00:00, 8.80it/s]00:15, 9.09it/s]
100%|██████████| 20/20 [00:02<00:00, 9.21it/s]00:14, 8.35it/s]
100%|██████████| 20/20 [00:02<00:00, 9.32it/s]00:10, 9.34it/s]
100%|██████████| 20/20 [00:02<00:00, 9.13it/s]<00:09, 8.36it/s]
100%|██████████| 20/20 [00:02<00:00, 9.29it/s]<00:06, 9.67it/s]
100%|██████████| 20/20 [00:02<00:00, 9.08it/s]<00:04, 9.01it/s]
100%|██████████| 20/20 [00:02<00:00, 9.18it/s]<00:02, 9.77it/s]
Total progress: 100%|██████████| 200/200 [00:24<00:00, 8.10it/s]

Default install with cudnn-linux-x86_64-8.6.0.163_cuda11.8
100%|██████████| 20/20 [00:02<00:00, 8.99it/s]00:11, 15.38it/s]
100%|██████████| 20/20 [00:01<00:00, 15.41it/s]00:10, 15.25it/s]
100%|██████████| 20/20 [00:01<00:00, 16.61it/s]00:08, 16.39it/s]
100%|██████████| 20/20 [00:01<00:00, 15.70it/s]00:08, 14.78it/s]
100%|██████████| 20/20 [00:01<00:00, 16.46it/s]00:06, 16.20it/s]
100%|██████████| 20/20 [00:01<00:00, 16.25it/s]<00:05, 15.70it/s]
100%|██████████| 20/20 [00:01<00:00, 16.51it/s]<00:03, 16.23it/s]
100%|██████████| 20/20 [00:01<00:00, 14.23it/s]<00:02, 15.08it/s]
100%|██████████| 20/20 [00:01<00:00, 15.59it/s]<00:01, 15.52it/s]
Total progress: 100%|██████████| 200/200 [00:15<00:00, 13.11it/s]

Default install with cudnn-linux-x86_64-8.7.0.84_cuda11
100%|██████████| 20/20 [00:01<00:00, 13.10it/s]00:13, 13.99it/s]
100%|██████████| 20/20 [00:01<00:00, 16.85it/s]00:09, 16.43it/s]
100%|██████████| 20/20 [00:01<00:00, 15.96it/s]00:09, 15.49it/s]
100%|██████████| 20/20 [00:01<00:00, 14.73it/s]00:08, 14.80it/s]
100%|██████████| 20/20 [00:01<00:00, 16.75it/s]00:06, 16.64it/s]
100%|██████████| 20/20 [00:01<00:00, 16.15it/s]<00:05, 15.74it/s]
100%|██████████| 20/20 [00:01<00:00, 15.52it/s]<00:04, 14.44it/s]
100%|██████████| 20/20 [00:01<00:00, 15.13it/s]<00:02, 15.42it/s]
100%|██████████| 20/20 [00:01<00:00, 16.20it/s]<00:01, 16.09it/s]
Total progress: 100%|██████████| 200/200 [00:15<00:00, 13.30it/s]

Also did a couple of tests with xformers 0.0.17 built with TORCH_CUDA_ARCH_LIST="8.6+ptx"

xformers-0.0.17+00afc12.d20230207-cp310-cp310-linux_x86_64.whl with cudnn-linux-x86_64-8.6.0.163_cuda11.8
100%|██████████| 20/20 [00:01<00:00, 13.24it/s]00:12, 14.74it/s]
100%|██████████| 20/20 [00:01<00:00, 15.45it/s]00:11, 14.72it/s]
100%|██████████| 20/20 [00:01<00:00, 17.06it/s]00:08, 16.85it/s]
100%|██████████| 20/20 [00:01<00:00, 16.04it/s]00:07, 15.29it/s]
100%|██████████| 20/20 [00:01<00:00, 15.29it/s]00:06, 14.75it/s]
100%|██████████| 20/20 [00:01<00:00, 15.37it/s]<00:05, 15.50it/s]
100%|██████████| 20/20 [00:01<00:00, 16.23it/s]<00:03, 15.98it/s]
100%|██████████| 20/20 [00:01<00:00, 14.78it/s]<00:02, 15.26it/s]
100%|██████████| 20/20 [00:01<00:00, 14.60it/s]<00:01, 13.99it/s]
Total progress: 100%|██████████| 200/200 [00:15<00:00, 13.06it/s]

xformers-0.0.17+00afc12.d20230207-cp310-cp310-linux_x86_64.whl with cudnn-linux-x86_64-8.7.0.84_cuda11
100%|██████████| 20/20 [00:02<00:00, 8.86it/s]00:18, 9.79it/s]
100%|██████████| 20/20 [00:02<00:00, 9.24it/s]00:17, 9.43it/s]
100%|██████████| 20/20 [00:02<00:00, 9.28it/s]00:15, 9.10it/s]
100%|██████████| 20/20 [00:02<00:00, 9.24it/s]00:13, 9.21it/s]
100%|██████████| 20/20 [00:02<00:00, 9.74it/s]00:10, 9.39it/s]
100%|██████████| 20/20 [00:02<00:00, 9.31it/s]<00:08, 9.26it/s]
100%|██████████| 20/20 [00:02<00:00, 9.57it/s]<00:06, 9.80it/s]
100%|██████████| 20/20 [00:02<00:00, 9.31it/s]<00:04, 9.16it/s]
100%|██████████| 20/20 [00:02<00:00, 9.44it/s]<00:02, 9.65it/s]
Total progress: 100%|██████████| 200/200 [00:24<00:00, 8.03it/s]

Made a few notes regarding the test:

Note 1)
Not sure if this was because I extracted the tar with 7-Zip in Windows rather than with tar in the container, but the lib folder has a lot of symlinks, which first got copied over as a lot of 0-byte files instead of links. So I had to remove all the empty files and then mass-rename "8.6.0" and "8.7.0" to "8" before copying them over.
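
For anyone repeating this, extracting the archive inside the container (Linux) should keep the symlinks intact. A minimal sketch using Python's standard `tarfile` module follows; the function name `extract_keeping_symlinks` and the example paths are hypothetical, and it assumes the filesystem you extract to supports symlinks.

```python
import tarfile
from pathlib import Path


def extract_keeping_symlinks(archive_path, dest_dir):
    """Extract a tar archive and return the names of the symlinks it created."""
    dest = Path(dest_dir)
    # "r:*" lets tarfile auto-detect the compression (for example .tar.xz or .tar.gz).
    with tarfile.open(archive_path, mode="r:*") as archive:
        link_names = [member.name for member in archive.getmembers() if member.issym()]
        archive.extractall(dest)
    # Report only the entries that really ended up as symlinks on disk.
    return [name for name in link_names if (dest / name).is_symlink()]


# Hypothetical usage inside the container (both paths are assumptions):
# extract_keeping_symlinks("/output/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz",
#                          "/output/cudnn")
```

If the returned list is empty for an archive that is known to contain links, the links were probably flattened somewhere along the way, as happened with the 7-Zip extraction.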

Note 2)
"Total progress" always reports a lower it/s than the per-image average, so I'm assuming it is averaged over the whole run.

Note 3)
Tested it multiple times, and the combination of xformers 0.0.17 and cuDNN v8.7.0 showed no improvement, performing the same as a default install.

Note 4)
More of an aside and a bug elsewhere: I noticed that a lot of the time when starting/stopping the Docker container, it claimed not to find the checkpoint and used a fallback, even though nothing had changed in the data directory. So if anyone else is running these tests, make sure the correct checkpoint is being used for each test.

webui-docker-auto-1 | Checkpoint v1-5-pruned-emaonly.ckpt [cc6cb27103] not found; loading fallback sd-v1-5-inpainting.ckpt [c6bbc15e32]
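
One way to catch this before trusting a benchmark run is to scan the container log for that message. The sketch below is only an illustration: the function name `find_checkpoint_fallbacks` is hypothetical, and the pattern is modelled on the single log line above, so the wording may differ in other versions.

```python
import re

# Pattern modelled on the log line quoted above; other versions may word it differently.
FALLBACK_RE = re.compile(
    r"Checkpoint (?P<wanted>.+?) not found; loading fallback (?P<fallback>.+)$"
)


def find_checkpoint_fallbacks(log_lines):
    """Return (wanted, fallback) pairs for every fallback message in the log."""
    hits = []
    for line in log_lines:
        match = FALLBACK_RE.search(line.rstrip())
        if match:
            hits.append((match.group("wanted"), match.group("fallback")))
    return hits
```

An empty result means no fallback message was seen in the lines that were checked; it does not prove which checkpoint was loaded.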

So, in summary: while anecdotal, I saw a great improvement from updating the libraries. Still not reaching the low to mid 20s it/s reported by others, though.
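
To put numbers on Note 2 and on this summary, here is a small sketch that parses the progress lines and compares the configurations. The helper names are hypothetical; the "Total progress" rates are copied from the logs above. The regex only reads the first complete progress summary on each line and ignores whatever trails after it.

```python
import re
from statistics import mean

# Matches the first complete progress summary on a line,
# e.g. "20/20 [00:02<00:00, 8.29it/s]".
TQDM_RE = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<\d+:\d+,\s*([\d.]+)it/s\]")

# "Total progress" rates (it/s) copied from the logs in this post.
TOTAL_ITS = {
    "default": 8.10,
    "cuDNN 8.6.0": 13.11,
    "cuDNN 8.7.0": 13.30,
    "xformers 0.0.17 + cuDNN 8.6.0": 13.06,
    "xformers 0.0.17 + cuDNN 8.7.0": 8.03,
}


def parse_tqdm(line):
    """Return done/total/elapsed seconds/rate for a progress line, or None."""
    match = TQDM_RE.search(line)
    if match is None:
        return None
    done, total, minutes, seconds, rate = match.groups()
    return {
        "done": int(done),
        "total": int(total),
        "elapsed_s": int(minutes) * 60 + int(seconds),
        "rate": float(rate),
    }


def mean_rate(lines):
    """Average it/s over the per-image lines, skipping 'Total progress' lines."""
    rates = []
    for line in lines:
        parsed = parse_tqdm(line)
        if parsed is not None and not line.startswith("Total progress"):
            rates.append(parsed["rate"])
    return mean(rates)


def speedups(total_rates, baseline="default"):
    """Express each configuration's rate as a multiple of the baseline."""
    base = total_rates[baseline]
    return {name: round(rate / base, 2) for name, rate in total_rates.items()}
```

For the default install, the per-image lines shown average about 9.1 it/s, while the reported 8.10 it/s over 200 steps works out to roughly 24.7 s in total versus about 22 s of pure sampling. That gap would be consistent with "Total progress" also counting time spent between images, though the logs do not show where it goes. `speedups(TOTAL_ITS)` puts the three faster configurations at roughly 1.6x the default and xformers 0.0.17 with cuDNN 8.7.0 at about 0.99x.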

DevilaN · 2023-02-08T14:28:01Z

We'll figure it out.
Regarding the 4th point: see #317

