
[DO NOT MERGE] All perf improvements bundle #15821

Closed
wants to merge 13 commits

Conversation

huchenlei
Contributor

@huchenlei huchenlei commented May 17, 2024

DO NOT MERGE THIS PR; merge the individual PRs instead. This PR is for users who want to try out all of the performance improvements together.

Description

This is a bundle PR of all performance improvement PRs:

How to use

  • Check out this PR
    • In your A1111 repo directory, open a terminal
    • Use the GitHub CLI to check out this PR with the command gh pr checkout 15821
  • Add --precision half to your command line args if your GPU supports fp16 calculation (see the sketch after this list).
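
For convenience, here is the whole sequence as one minimal sketch. It assumes a Windows install launched through webui-user.bat and an authenticated GitHub CLI; the path and the --xformers flag are placeholders, and only --precision half comes from this PR.

REM hypothetical path; use your own A1111 repo directory
cd C:\path\to\stable-diffusion-webui
REM fetch and switch to this PR's branch
gh pr checkout 15821
REM then edit webui-user.bat so it contains, e.g.:
REM   set COMMANDLINE_ARGS=--xformers --precision half
webui-user.bat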

Unpatch the PR

  • In your A1111 repo directory, open a terminal
  • git checkout master

Expected performance improvement

For SDXL, this PR brings performance from 580 ms/it to 280 ms/it on my machine. However, this covers only the UNet's denoising steps and excludes other factors such as VAE encode/decode and saving the image; overall you should still expect at least a 20% performance boost.

Report issues

Please report any bugs related to this batch of performance improvements to https://github.com/huchenlei/stable-diffusion-webui/issues

My tests on these PRs have limited coverage, so some features might be broken; I would like to get those fixed before merging.

Checklist:

@huchenlei huchenlei requested a review from AUTOMATIC1111 as a code owner May 17, 2024 00:22
@huchenlei huchenlei changed the base branch from master to dev May 17, 2024 00:23
@light-and-ray
Contributor

You can mark the PR as a draft to prevent it from being merged accidentally.

@Gushousekai195

Gushousekai195 commented May 17, 2024

Clicks Generate

Traceback (most recent call last):
      File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
      File "D:\AI\stable-diffusion-webui\modules\call_queue.py", line 36, in f
        res = func(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\txt2img.py", line 109, in txt2img
        processed = processing.process_images(p)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 839, in process_images
        res = process_images_inner(p)
      File "D:\AI\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 59, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 975, in process_images_inner
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
      File "D:\AI\stable-diffusion-webui\modules\processing.py", line 1322, in sample
        samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in sample
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_common.py", line 272, in launch_sampling
        return func()
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 218, in <lambda>
        samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 594, in sample_dpmpp_2m
        denoised = model(x, sigmas[i] * s_in, **extra_args)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_samplers_cfg_denoiser.py", line 237, in forward
        x_out = self.inner_model(x_in, sigma_in, cond=make_condition_dict(cond_in, image_cond_in))
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
        eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
        return self.inner_model.apply_model(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 22, in <lambda>
        setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_utils.py", line 34, in __call__
        return self.__sub_func(self.__orig_func, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_hijack_unet.py", line 48, in apply_model
        result = orig_func(self, x_noisy.to(devices.dtype_unet), t.to(devices.dtype_unet), cond, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
        x_recon = self.model(x_noisy, t, **cond)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1335, in forward
        out = self.diffusion_model(x, t, context=cc)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\modules\sd_unet.py", line 91, in UNetModel_forward
        return original_forward(self, x, timesteps, context, *args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 797, in forward
        h = module(h, emb, context)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 86, in forward
        x = layer(x)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "D:\AI\stable-diffusion-webui\extensions\a1111-sd-webui-lycoris\l_networks.py", line 524, in network_Conv2d_forward
        return originals.Conv2d_forward(self, input)
      File "D:\AI\stable-diffusion-webui\extensions-builtin\Lora\networks.py", line 523, in network_Conv2d_forward
        return originals.Conv2d_forward(self, input)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "D:\AI\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
        return F.conv2d(input, weight, bias, self.stride,
    RuntimeError: Input type (float) and bias type (struct c10::Half) should be the same

Now what?

@serick4126

serick4126 commented May 17, 2024

In my environment, the SDXL model failed to load because "FP8 weight" was enabled in the Optimizations settings.
image

When I disabled it, the SDXL model loaded without problems.

@feffy380

feffy380 commented May 17, 2024

Getting a similar error to @Gushousekai195's (no LyCORIS; it happens even with all LoRAs disabled). One of these patches is breaking SD 1.5.
Only SDXL works.

Edit: Narrowed it down to --precision half

@bob7l

bob7l commented May 17, 2024

SDXL, Nvidia 4090 + Intel 12700K - seeing a 22.04% increase in speed. No (noticeable?) effect on image output.

@huchenlei huchenlei marked this pull request as draft May 17, 2024 17:30
@huchenlei
Contributor Author

SD15 generation issue fixed.

@light-and-ray
Contributor

light-and-ray commented May 17, 2024

I don't see a significant performance boost. Only SDXL + 2 ControlNets gets about a 10-15% boost. By the way, I see similar numbers in Forge too:

rtx 3060 + 10400f
--medvram-sdxl --xformers --disable-model-loading-ram-optimization
python: 3.11.6  •  torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

First column = non-patched | second = patched

sd1 1 image:
3.0 sec. | 3.0 sec.

sd1 batch_size 10:
22.8 sec. | 23.0 sec.

sd1 + cn canny, depth, batch_size 10:
38.0 sec. | 37.6 sec.

sdxl 1 image:
17.9 sec. | 17.2 sec.

sdxl + cn canny, depth 1 image:
39.0 sec. | 34.6 sec.

AnimateDiff + CN Inpaint + SparseCtrl works

Maybe my CPU doesn't match my GPU, or vice versa. But it's no worse than non-patched, and other users do see a boost, so I like this work. 👍🏻 Now I will test it on a 2 GB GPU.

@light-and-ray
Contributor

I tried to select the best times, but it's definitely slower on a very low-VRAM setup. Maybe it conflicts with some optimizations?

mx150 2gb aka gt 1030
--xformers --lowvram
Optimizations are in the screenshot
python: 3.10.6  •  torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

First column = non-patched | second = patched

sd1 + merged lcm lora:
20.3 sec. | 22.3 sec.

sd1 + merged lcm lora + t2ia canny:
22.1 sec. | 22.8 sec.

sd1 + merged lcm lora + hiresfix 2x + tiled vae:
2 min. 22.1 sec. | 2 min. 40.7 sec.

Screenshot_20240517_223154

@huchenlei
Contributor Author

I tried to select the best times, but it's definitely slower on a very low-VRAM setup. Maybe it conflicts with some optimizations?

mx150 2gb aka gt 1030
--xformers --lowvram
Optimizations are in the screenshot
python: 3.10.6  •  torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

First column = non-patched | second = patched

sd1 + merged lcm lora:
20.3 sec. | 22.3 sec.

sd1 + merged lcm lora + t2ia canny:
22.1 sec. | 22.8 sec.

sd1 + merged lcm lora + hiresfix 2x + tiled vae:
2 min. 22.1 sec. | 2 min. 40.7 sec.

Screenshot_20240517_223154

Can you attach traces from your experiment? I am not sure which part of the optimization is affecting low-VRAM performance. You can record a trace according to the instructions in lllyasviel/stable-diffusion-webui-forge#716

Running 2 steps should probably be enough.

@light-and-ray
Contributor

light-and-ray commented May 17, 2024

Okay @huchenlei

sd1 + merged lcm lora + t2ia canny
4 steps

Non-patched:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference        10.60%        2.483s       100.00%       23.436s       23.436s       0.000us         0.00%       20.588s       20.588s           0 b      -3.69 Gb     993.50 Kb      -9.40 Gb             1  
                                        cudaMemcpyAsync        72.21%       16.922s        72.46%       16.982s       2.585ms      54.251ms         0.27%      54.251ms       8.259us           0 b           0 b           0 b           0 b          6569  
                                               aten::to         0.30%      70.688ms        68.23%       15.990s       1.408ms       0.000us         0.00%        6.300s     554.910us       3.66 Gb      13.43 Mb      14.53 Gb     262.38 Mb         11353  
                                         aten::_to_copy         0.48%     113.589ms        68.05%       15.949s       1.462ms       0.000us         0.00%        6.335s     580.598us       3.66 Gb     598.63 Kb      14.53 Gb           0 b         10912  
                                            aten::copy_         0.74%     172.485ms        67.06%       15.715s       1.401ms        6.000s        29.97%        6.426s     572.681us           0 b           0 b           0 b           0 b         11221  
                                           aten::conv2d         0.05%      11.844ms         9.87%        2.314s       2.738ms       0.000us         0.00%       11.814s      13.981ms           0 b           0 b       2.90 Gb      -4.70 Gb           845  
                                             aten::item         0.00%      89.000us         9.56%        2.241s     149.386ms       0.000us         0.00%      12.138ms     809.200us           0 b           0 b           0 b           0 b            15  
                              aten::_local_scalar_dense         0.00%     241.000us         9.56%        2.241s     149.380ms      15.000us         0.00%      12.138ms     809.200us           0 b           0 b           0 b           0 b            15  
                                      aten::convolution         0.02%       3.958ms         8.77%        2.056s       4.558ms       0.000us         0.00%        6.954s      15.419ms           0 b           0 b       1.97 Gb           0 b           451  
                                     aten::_convolution         0.05%      10.801ms         8.75%        2.052s       4.549ms       0.000us         0.00%        6.954s      15.419ms           0 b           0 b       1.97 Gb      -4.00 Mb           451  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 23.436s
Self CUDA time total: 20.022s


Patched:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference        19.16%        9.898s       100.00%       51.653s       51.653s       0.000us         0.00%       40.367s       40.367s           0 b      -3.68 Gb       7.24 Mb     -18.52 Gb             1  
                                        cudaMemcpyAsync        59.11%       30.531s        60.30%       31.144s       4.745ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          6564  
                                               aten::to         0.19%      98.962ms        58.74%       30.340s       2.701ms       0.000us         0.00%        7.325s     652.075us       3.66 Gb     522.07 Kb      14.50 Gb     237.55 Mb         11233  
                                         aten::_to_copy         0.46%     239.181ms        58.56%       30.249s       2.800ms       0.000us         0.00%        7.349s     680.168us       3.66 Gb           0 b      14.50 Gb           0 b         10804  
                                            aten::copy_         0.71%     365.811ms        57.60%       29.754s       2.677ms        7.471s        18.51%        7.471s     672.260us           0 b           0 b           0 b           0 b         11113  
                                           aten::conv2d         0.04%      20.753ms        16.41%        8.474s      10.040ms       0.000us         0.00%       26.478s      31.372ms           0 b           0 b       2.90 Gb      -4.70 Gb           844  
                                      aten::convolution         0.02%       8.169ms        15.56%        8.038s      17.824ms       0.000us         0.00%       17.953s      39.807ms           0 b           0 b       1.97 Gb           0 b           451  
                                     aten::_convolution         0.04%      23.143ms        15.55%        8.030s      17.806ms       0.000us         0.00%       17.953s      39.807ms           0 b           0 b       1.97 Gb      -4.00 Mb           451  
                                aten::cudnn_convolution         0.85%     437.998ms        15.42%        7.966s      17.662ms       15.939s        39.48%       15.939s      35.341ms           0 b           0 b       1.97 Gb       1.79 Gb           451  
                                               cudaFree        13.98%        7.223s        14.52%        7.499s     299.956ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b            25  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 51.653s
Self CUDA time total: 40.368s

For some reason, the time difference becomes much bigger with tracing enabled.

@light-and-ray
Contributor

It is very strange. Both runs were repeated about 3 times and I pasted the last one, to exclude any model loading from disk and ControlNet preprocessing.

NB: this GPU has very slow VRAM; maybe that is related.

@light-and-ray
Contributor

light-and-ray commented May 17, 2024

Visually, the first 2 steps are okay, but the last 2 steps are slower after the patch.

Also I'm attaching trace files
trace_non_patched.json.gz
trace_patched.json.gz

@strawberrymelonpanda

strawberrymelonpanda commented May 18, 2024

Use the GitHub CLI to check out this PR with the command gh pr checkout 15821

Just want to add that you can check out any GitHub PR without the GH CLI by using standard Git commands:

git fetch origin pull/ID/head:NAME
git checkout NAME

In this case, for example:

git fetch origin pull/15821/head:15821 && git checkout 15821

Just to save someone the install if they don't need the GH client otherwise.
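
To clean up afterwards (mirroring the "Unpatch the PR" section above, and assuming the branch name from the example):

git checkout master
git branch -D 15821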

@not-ski

not-ski commented May 18, 2024

Great job! Merged the remote branch locally, and I'm now seeing faster gens on A1111 than on ComfyUI :)

@FurkanGozukara

I just tested it: there's a speed improvement (around 8% on an RTX 3090), however autocast fails and we get black output on SDXL

image

@huchenlei
Contributor Author

I just tested it: there's a speed improvement (around 8% on an RTX 3090), however autocast fails and we get black output on SDXL

image

--precision half disables autocast, so you should not use the FP8 option in settings either. Casting during inference is a big source of performance overhead.
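
For reference, a webui-user.bat consistent with that advice might look like the sketch below; the --xformers flag is just an example, and "FP8 weight" should stay disabled under Settings > Optimizations:

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
REM run the UNet fully in fp16 with no autocast; do not combine with the FP8 weight setting
set COMMANDLINE_ARGS=--xformers --precision half
call webui.bat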

@FurkanGozukara

FurkanGozukara commented May 18, 2024

I just tested it: there's a speed improvement (around 8% on an RTX 3090), however autocast fails and we get black output on SDXL
image

--precision half disables autocast, so you should not use the FP8 option in settings either. Casting during inference is a big source of performance overhead.

This is the only option I start it with:

@echo off
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers
call webui.bat

image

@b-fission

b-fission commented May 18, 2024

I just tested it: there's a speed improvement (around 8% on an RTX 3090), however autocast fails and we get black output on SDXL

Using sdxl-vae-fp16-fix as the VAE seems to fix the black output (tested with --precision half).
Some checkpoints, like AlbedoBaseXL, will work as-is.
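
In case it saves someone a search, a sketch of dropping that VAE into a standard A1111 layout; the Hugging Face URL and filename are my assumption, so verify them on the madebyollin/sdxl-vae-fp16-fix page first:

REM assumed download URL/filename; check huggingface.co/madebyollin/sdxl-vae-fp16-fix before running
curl -L -o models\VAE\sdxl_vae_fp16_fix.safetensors https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/resolve/main/sdxl_vae.safetensors
REM then pick it as "SD VAE" in the UI instead of Automatic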

@freecoderwaifu

freecoderwaifu commented May 18, 2024

I don't have extensive numbers, but 512x512 generates noticeably faster, as fast as Forge if not faster: 2 seconds to generate an image with 40 steps using DPM++ 2M SGM Uniform with a regular 1.5 checkpoint, and 15 seconds at 1024x1024 with SDXL, on a 3080 12GB.

I feel like highres fix and img2img are the bigger bottlenecks now, but I don't know how feasible it is to optimize them even further, especially since these fixes also noticeably increased their speed. Maybe on the upscaler side, since some upscalers are noticeably slower than others just by their nature, but I guess it is what it is due to hardware.

Also ran into an issue, might be related to --precision half too:
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

It seems to happen when loading an SD 1.5 checkpoint, then loading an SDXL checkpoint and trying to generate. Loading a different SDXL checkpoint seems to fix it, but it happens all over again when switching between SD 1.5 and SDXL. It also happens if the checkpoint the UI loads by default is SD 1.5.

@mweldon

mweldon commented May 19, 2024

Quick test using a 3060, doing a 4-batch at 896x1152 with 2 LORAs at 20 steps, DPM++ 3M Exponential

Forge: 1:20.
A1111 with this PR: 0:58

A single image with this PR is around 15 sec. Very nice! I'm going to use this from now on unless some issue comes up.
Great work!

@enternalsaga

--precision half does not work in my case; adding it triggers tracebacks pointing into several U-Net hijack extensions even though I have disabled them.

Main error
File "I:\stable-diffusion-webui-updated\venv\lib\site-packages\torch\functional.py", line 377, in einsum return _VF.einsum(equation, operands) # type: ignore[attr-defined] RuntimeError: expected scalar type Half but found Float

@Zotikus1001

I'd love to go back to A1111 from Forge, but the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode) makes me not want to touch it anymore. This doesn't happen on any other UI. I wish someone could figure that out. :(

@huchenlei
Contributor Author

--precision half does not work in my case; adding it triggers tracebacks pointing into several U-Net hijack extensions even though I have disabled them.

Main error File "I:\stable-diffusion-webui-updated\venv\lib\site-packages\torch\functional.py", line 377, in einsum return _VF.einsum(equation, operands) # type: ignore[attr-defined] RuntimeError: expected scalar type Half but found Float

Can you attach the full stack trace?

@strawberrymelonpanda

strawberrymelonpanda commented May 19, 2024

the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode),

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just about upscale.

Forge is using an integrated version of this under the hood if I remember right.

It would be nice to have the other memory management tricks like moving models, etc though, since I can pretty much use HR Fix with any size reliably in Forge.
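
For anyone who prefers the command line over the Extensions tab, installing the extension mentioned above should be roughly this (run from the A1111 repo root; repo URL as I recall it):

REM clone the Tiled Diffusion / Tiled VAE extension into the extensions folder
git clone https://github.com/pkuliyi2015/multidiffusion-upscaler-for-automatic1111 extensions\multidiffusion-upscaler-for-automatic1111
REM restart the webui, then enable "Tiled VAE" in the txt2img/img2img accordion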

@Zotikus1001

the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode),

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just about upscale.

Forge is using an integrated version of this under the hood if I remember right.

It would be nice to have the other memory management tricks like moving models, etc though, since I can pretty much use HR Fix with any size reliably in Forge.

That doesn't work for this issue, sadly. It still happens.

@bob7l

bob7l commented May 19, 2024

the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode),

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just about upscale.
Forge is using an integrated version of this under the hood if I remember right.
It would be nice to have the other memory management tricks like moving models, etc though, since I can pretty much use HR Fix with any size reliably in Forge.

That doesn't work for this issue, sadly. It still happens.

Does it happen only during the HiRes Fix stage? That's the only place I exceed 24GB VRAM. Doesn't happen with Forge

@Zotikus1001

the issue where the VRAM shoots up extremely on the last step of a generation (I'm assuming it's the VAE decode),

@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just about upscale.
Forge is using an integrated version of this under the hood if I remember right.
It would be nice to have the other memory management tricks like moving models, etc though, since I can pretty much use HR Fix with any size reliably in Forge.

That doesn't work for this issue, sadly. It still happens.

Does it happen only during the HiRes Fix stage? That's the only place I exceed 24GB VRAM. Doesn't happen with Forge

No, it happens every time; of course the higher the resolution, the worse it is.
Whereas on Forge, ComfyUI, or Invoke there's no increase in VRAM at all, not a single hiccup.

@papuSpartan
Contributor

I saw a 10-19% speedup when using --precision half along with --opt-channelslast after merging this. Newer accelerators will benefit more from these changes, but that's not to say older ones aren't getting an uplift either. There was about a 9% speedup going from the 30 series to the 40 series.

You can see more details here
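
For anyone wanting to reproduce this, the combination would go into COMMANDLINE_ARGS roughly like the line below; the --xformers flag is illustrative, and both new flags are optional and GPU-dependent:

set COMMANDLINE_ARGS=--xformers --precision half --opt-channelslast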

@enternalsaga

--precision half does not work in my case; adding it triggers tracebacks pointing into several U-Net hijack extensions even though I have disabled them.
Main error File "I:\stable-diffusion-webui-updated\venv\lib\site-packages\torch\functional.py", line 377, in einsum return _VF.einsum(equation, operands) # type: ignore[attr-defined] RuntimeError: expected scalar type Half but found Float

Can you attach full stacktrace?
here it is:
trace.txt

@ByteSh0ck

Thank you so much for these improvements! 1111 speed is now on par with forge thanks to you. :)

@Soulreaver90

I've tried everything ever suggested to get any sort of speed improvement for my RX 6700 XT, and all of it has been fruitless outside of Token Merging, which affects quality. This PR, although not mind-blowing, has succeeded in providing an actual speed improvement: on average I've noticed about 5%. Not sure if there is anything else specific I should apply outside of the cmd args, but I'm happy and will keep running this until it's merged.

@ostap667inbox

@huchenlei
Is it possible in principle to add functionality similar to Forge's 'Never OOM' to A1111? That is, a VRAM-saving mode that can be enabled/disabled on the fly.
I regularly run out of my 12 GB of VRAM in A1111, even at low resolutions, when using ControlNet or LoRA.

@huchenlei
Contributor Author

Closing this PR as all component PRs are merged.

@huchenlei huchenlei closed this Jun 10, 2024
@wywywywy
Contributor

It's not merged to master yet. Shouldn't we keep this open for tracking in case things crop up in dev?

@sinand99

When will this be merged to master? We need performance improvements since Forge is dead now.

@Soulreaver90

When will this be merged to master? We need performance improvements since Forge is dead now.

It’s merged in DEV. It will be merged with Master when the next A1111 update is ready. You can switch to DEV to try it out or just wait a little while longer.
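
If it helps, switching an existing install to dev is just the following, run in the A1111 repo directory:

git checkout dev
git pull

(git checkout master switches back later.)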

@ZeroCool22

When will this be merged to master? We need performance improvements since Forge is dead now.

It’s merged in DEV. It will be merged with Master when the next A1111 update is ready. You can switch to DEV to try it out or just wait a little while longer.

Could you tell me the command to switch to the Dev branch?

Also, should I use --precision half with a 1080 Ti?

@Gushousekai195

Still slow as heck. It was faster in Forge.

@ByteSh0ck

It's live now. Thanks @huchenlei for your work!
