[DO NOT MERGE] All perf improvements bundle #15821
Conversation
* Precompute is_sdxl_inpaint flag
* Fix flag check for SD15
* Prevent unnecessary bias backup
* Fix LoRA bias error

Co-authored-by: AUTOMATIC1111 <[email protected]>
You can mark the PR as a draft to prevent it from being merged accidentally. |
Clicks Generate. Now what? |
Getting a similar error as @Gushousekai195 (no LyCORIS; happens even with all LoRAs disabled). One of these patches is breaking SD1.5. Edit: Narrowed it down to |
SDXL, Nvidia 4090 + Intel 12700K - seeing a 22.04% increase in speed. No (noticeable?) effect on image output. |
SD15 generation issue fixed. |
I don't see a significant performance boost; only SDXL + 2 ControlNets gets about a 10-15% boost. Btw, I see similar numbers in Forge too:
Maybe my CPU doesn't match my GPU, or vice versa. But it's no worse than unpatched, and other users are seeing a boost, so I like this work. 👍🏻 Now I will test it on a 2GB GPU |
I tried to select the best time, but it's definitely slower on a very low VRAM setup. Maybe it conflicts with some optimizations? |
Can you attach traces of your experiment? I am not sure which part of the optimization is affecting low-vram perf. You can record a trace following the instructions in lllyasviel/stable-diffusion-webui-forge#716. Running 2 steps should probably be enough. |
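The exact recording steps live in the linked issue; as a minimal sketch, assuming the instructions use PyTorch's built-in profiler (the `run_two_steps()` helper is hypothetical, standing in for a short 2-step generation):

```python
# Minimal sketch, assuming torch.profiler is the mechanism the linked
# instructions use.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_two_steps()  # hypothetical stand-in: trigger a 2-step generation here

# Export a Chrome-format trace, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```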
Okay @huchenlei
For some reason, the time difference becomes much bigger with tracing enabled |
It is very strange. Both runs were repeated about 3 times and I pasted the last one, to exclude any model loading from disk and ControlNet preprocessing. NB: this GPU has very slow VRAM; maybe that is related. |
Visually the first 2 steps are okay, but the last 2 steps are slower after the patch. I'm also attaching the trace files. |
Just want to add that you can check out any GitHub PR without the GH CLI client, using standard Git commands.
In this case, for example:
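The commands themselves appear to have been lost from this comment; the standard pattern for GitHub's pull refs (the local branch name `perf-bundle` is an arbitrary choice) would be:

```
git fetch origin pull/15821/head:perf-bundle
git checkout perf-bundle
```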
Just to save someone the install if they don't need the GH client otherwise. |
Great job! Merged the remote branch locally, and I'm now seeing faster gens on A1111 than on ComfyUI :) |
This is the only option I started with: @echo off |
Using sdxl-vae-fp16-fix as the VAE seems to fix the black output. Tested using |
I don't have extensive numbers, but 512x512 generates noticeably faster, as fast as Forge if not faster: 2 seconds to generate an image with 40 steps using DPM++ 2M SGM Uniform with a regular 1.5 checkpoint, and 15 seconds at 1024x1024 with SDXL, on a 3080 12GB. I feel like highres fix and img2img are the bigger bottlenecks now, but I dunno how feasible it is to optimize them even further, especially since these fixes did also noticeably increase their speed. Maybe on the side of the upscalers, since some are noticeably slower than others just by their nature, but I guess that just is what it is due to hardware. Also ran into an issue, might be related to --precision half too: it seems to happen when loading an SD 1.5 checkpoint, then loading an SDXL checkpoint and trying to generate. Loading a different SDXL checkpoint seems to fix it, but then it happens all over again when switching between SD 1.5 and SDXL. It also happens if the checkpoint the UI loads by default is SD 1.5. |
Quick test using a 3060, doing a 4-batch at 896x1152 with 2 LoRAs at 20 steps, DPM++ 3M Exponential. Forge: 1:20. A single image with this PR is around 15 sec. Very nice! I'm going to use this from now on unless some issue comes up. |
--precision half does not work in my case; adding it produces tracebacks pointing at several U-Net hijack extensions, even though I have disabled them. Main error: |
I'd love to go back to A1111 from Forge, but the issue where VRAM shoots up drastically on the last step of a generation (I'm assuming it's the VAE decode) makes me not want to touch it anymore. This doesn't happen on any other UI. I wish someone could figure that out. :( |
Can you attach the full stacktrace? |
@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just for upscaling. Forge uses an integrated version of this under the hood, if I remember right. It would be nice to have the other memory-management tricks like moving models, etc., though, since I can pretty much use HR Fix with any size reliably in Forge. |
That doesn't work for this issue, sadly. It still happens. |
Does it happen only during the HiRes Fix stage? That's the only place where I exceed 24GB of VRAM. It doesn't happen with Forge. |
No, it happens every time; of course, the higher the resolution, the worse it is. |
I saw a 10-19% speedup when using . You can see more details here. |
Thank you so much for these improvements! 1111 speed is now on par with forge thanks to you. :) |
I've tried everything ever suggested to get any sort of speed improvement for my RX 6700 XT, and all of it has been fruitless outside of Token Merging, which affects quality. This PR, although not mind-blowing, has succeeded in providing actual speed improvements: on average I've noticed a 5% improvement. Not sure if there is anything else specific I should apply outside of the cmd args, but I'm happy and will keep running this until it's merged. |
@huchenlei |
Closing this PR as all component PRs are merged. |
It's not merged to |
When will this be merged to master? We need performance improvements since Forge is dead now. |
It’s merged in dev. It will be merged into master when the next A1111 update is ready. You can switch to dev to try it out, or just wait a little while longer. |
Could you tell me the command to switch to the dev branch? Also, should I use --precision half with a 1080 Ti? |
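For what it's worth, switching is a standard Git operation; assuming the branch is named dev as mentioned above, something like:

```
git checkout dev
git pull
```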
Still slow as heck. It was faster in Forge. |
It's live now. Thanks @huchenlei for your work! |
DO NOT MERGE THIS PR, merge individual PRs instead. This PR is for users to try out all performance improvements together.
Description
This is a bundle PR of all performance improvement PRs:
How to use
Check out this PR
gh pr checkout 15821
Add --precision half to your command line args if your GPU supports fp16 calculation (see the sketch at the end of this section).
Unpatch the PR
git checkout master
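As referenced in the step above, here is a minimal sketch of passing the flag via webui-user.bat, assuming the stock Windows launcher layout; only useful if your GPU supports fp16 calculation:

```bat
@echo off
rem Assumed stock layout of webui-user.bat; the only change is the args line.
set COMMANDLINE_ARGS=--precision half
call webui.bat
```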
Expected performance improvement
For SDXL, this PR brings performance from 580ms/it to 280ms/it on my machine. However, this covers only the U-Net's denoising steps, not other factors such as VAE encode/decode and saving the image; overall you should expect at least a 20% performance boost.
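As a rough illustration of why a ~2x U-Net speedup translates into a smaller overall gain, assuming (hypothetically) the U-Net accounts for about half of total generation time:

```python
# Illustrative only; the 50% U-Net share is an assumption, while the
# per-iteration figures come from the description above.
unet_before, unet_after = 0.580, 0.280  # s/it before and after the PR
unet_share = 0.5                        # assumed fraction of total time spent in the U-Net
overall = 1 / ((1 - unet_share) + unet_share * unet_after / unet_before)
print(f"~{(overall - 1) * 100:.0f}% overall speedup")  # ~35% under these assumptions
```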
Report issues
Please report any bugs related to this batch of performance improvements to https://github.com/huchenlei/stable-diffusion-webui/issues
My tests on these PRs have limited coverage, so some features might be broken; I would like to get those fixed before merging.
Checklist: