[DO NOT MERGE] All perf improvements bundle #15821
Conversation
* Precompute is_sdxl_inpaint flag
* Fix flag check for SD15
* Prevent unnecessary bias backup
* Fix LoRA bias error

Co-authored-by: AUTOMATIC1111 <[email protected]>
You can mark the PR as a draft to prevent it from being merged accidentally. |
Clicks Generate. Now what? |
Getting a similar error as @Gushousekai195 (no LyCORIS; happens even with all LoRAs disabled). One of these patches is breaking SD1.5. Edit: Narrowed it down to |
SDXL, Nvidia 4090 + Intel 12700K - seeing a 22.04% increase in speed. No (noticeable?) effect on image output. |
SD15 generation issue fixed. |
I don't see a significant performance boost; only SDXL + 2 ControlNets gets about a 10-15% boost. Btw, I see similar numbers in Forge too:
Maybe my CPU doesn't match my GPU, or vice versa. But it's no worse than unpatched, and other users are seeing a boost, so I like this work. 👍🏻 Now I will test it on a 2GB GPU |
I tried to select the best time, but it's definitely slower on a very low VRAM setup. Maybe it conflicts with some optimizations? |
Can you attach traces of your experiment? I am not sure which part of the optimization is affecting low-vram perf. You can record a trace following the instructions in lllyasviel/stable-diffusion-webui-forge#716. Running 2 steps should probably be enough. |
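The exact recording steps live in the linked issue; as a minimal sketch, assuming the instructions use PyTorch's built-in profiler (the `run_two_steps()` helper is hypothetical, standing in for a short 2-step generation):

```python
# Minimal sketch, assuming torch.profiler is the mechanism the linked
# instructions use.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_two_steps()  # hypothetical stand-in: trigger a 2-step generation here

# Export a Chrome-format trace, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```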
Okay @huchenlei
For some reason, the time difference becomes much bigger with tracing enabled |
It is very strange. Both runs were repeated about 3 times and I pasted the last one, to exclude any model loading from disk and ControlNet preprocessing. NB: this GPU has very slow VRAM; maybe that is related. |
Visually the first 2 steps are okay, but the last 2 steps are slower after the patch. I'm also attaching the trace files. |
Just want to add that you can check out any GitHub PR without the GH CLI client, using standard Git commands.
In this case, for example:
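The commands themselves appear to have been lost from this comment; the standard pattern for GitHub's pull refs (the local branch name `perf-bundle` is an arbitrary choice) would be:

```
git fetch origin pull/15821/head:perf-bundle
git checkout perf-bundle
```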
Just to save someone the install if they don't need the GH client otherwise. |
Great job! Merged the remote branch locally, and I'm now seeing faster gens on A1111 than on ComfyUI :) |
This is the only option I started with: @echo off |
Using sdxl-vae-fp16-fix as the VAE seems to fix the black output. Tested using |
I don't have extensive numbers, but 512x512 generates noticeably faster, as fast as Forge if not faster: 2 seconds to generate an image with 40 steps using DPM++ 2M SGM Uniform with a regular 1.5 checkpoint, and 15 seconds at 1024x1024 with SDXL, on a 3080 12GB. I feel like highres fix and img2img are the bigger bottlenecks now, but I dunno how feasible it is to optimize them even further, especially since these fixes did also noticeably increase their speed. Maybe on the side of the upscalers, since some are noticeably slower than others just by their nature, but I guess that just is what it is due to hardware. Also ran into an issue, might be related to --precision half too: it seems to happen when loading an SD 1.5 checkpoint, then loading an SDXL checkpoint and trying to generate. Loading a different SDXL checkpoint seems to fix it, but then it happens all over again when switching between SD 1.5 and SDXL. It also happens if the checkpoint the UI loads by default is SD 1.5. |
Quick test using a 3060, doing a 4-batch at 896x1152 with 2 LoRAs at 20 steps, DPM++ 3M Exponential. Forge: 1:20. A single image with this PR is around 15 sec. Very nice! I'm going to use this from now on unless some issue comes up. |
--precision half does not work in my case; adding it produces tracebacks pointing at several U-Net hijack extensions, even though I have disabled them. Main error: |
I'd love to go back to A1111 from Forge, but the issue where VRAM shoots up drastically on the last step of a generation (I'm assuming it's the VAE decode) makes me not want to touch it anymore. This doesn't happen on any other UI. I wish someone could figure that out. :( |
Can you attach the full stacktrace? |
@Zotikus1001 If you haven't, download multidiffusion-upscaler-for-automatic1111 and enable "Tiled VAE". Despite the name, it's not just for upscaling. Forge uses an integrated version of this under the hood, if I remember right. It would be nice to have the other memory-management tricks like moving models, etc., though, since I can pretty much use HR Fix with any size reliably in Forge. |
That doesn't work for this issue, sadly. It still happens. |
Does it happen only during the HiRes Fix stage? That's the only place where I exceed 24GB of VRAM. It doesn't happen with Forge. |
No, it happens every time; of course, the higher the resolution, the worse it is. |
I saw a 10-19% speedup when using . You can see more details here. |
Thank you so much for these improvements! 1111 speed is now on par with forge thanks to you. :) |
I've tried everything ever suggested to get any sort of speed improvement for my RX 6700 XT, and all of it has been fruitless outside of Token Merging, which affects quality. This PR, although not mind-blowing, has succeeded in providing actual speed improvements: on average I've noticed a 5% improvement. Not sure if there is anything else specific I should apply outside of the cmd args, but I'm happy and will keep running this until it's merged. |
@huchenlei |
Closing this PR as all component PRs are merged. |
It's not merged to |
When will this be merged to master? We need performance improvements since Forge is dead now. |
It’s merged in dev. It will be merged into master when the next A1111 update is ready. You can switch to dev to try it out, or just wait a little while longer. |
Could you tell me the command to switch to the dev branch? Also, should I use --precision half with a 1080 Ti? |
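For what it's worth, switching is a standard Git operation; assuming the branch is named dev as mentioned above, something like:

```
git checkout dev
git pull
```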
Still slow as heck. It was faster in Forge. |
It's live now. Thanks @huchenlei for your work! |
DO NOT MERGE THIS PR, merge individual PRs instead. This PR is for users to try out all performance improvements together.
Description
This is a bundle PR of all performance improvement PRs:
How to use
Check out this PR
gh pr checkout 15821
Add --precision half to your command line args if your GPU supports fp16 calculation (see the sketch at the end of this section).
Unpatch the PR
git checkout master
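As referenced in the step above, here is a minimal sketch of passing the flag via webui-user.bat, assuming the stock Windows launcher layout; only useful if your GPU supports fp16 calculation:

```bat
@echo off
rem Assumed stock layout of webui-user.bat; the only change is the args line.
set COMMANDLINE_ARGS=--precision half
call webui.bat
```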
Expected performance improvement
For SDXL, this PR brings performance from 580ms/it to 280ms/it on my machine. However, this covers only the U-Net's denoising steps, not other factors such as VAE encode/decode and saving the image; overall you should expect at least a 20% performance boost.
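As a rough illustration of why a ~2x U-Net speedup translates into a smaller overall gain, assuming (hypothetically) the U-Net accounts for about half of total generation time:

```python
# Illustrative only; the 50% U-Net share is an assumption, while the
# per-iteration figures come from the description above.
unet_before, unet_after = 0.580, 0.280  # s/it before and after the PR
unet_share = 0.5                        # assumed fraction of total time spent in the U-Net
overall = 1 / ((1 - unet_share) + unet_share * unet_after / unet_before)
print(f"~{(overall - 1) * 100:.0f}% overall speedup")  # ~35% under these assumptions
```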
Report issues
Please report any bugs related to this batch of performance improvements to https://github.com/huchenlei/stable-diffusion-webui/issues
My tests on these PRs have limited coverage, so some features might be broken; I would like to get those fixed before merging.
Checklist: