PyTorch 2.0 and Torch Compile general discussion #6932

vladmandic · 2023-01-19T14:30:26Z

vladmandic
Jan 19, 2023
Collaborator

This is (hopefully) start of a thread on PyTorch 2.0 and benefits of model compile which is a new feature available in torch nightly builds

Builds on conversations in #5965, #6455, #6615, #6405

TL;DR

PyTorch 2.0 with Accelerate and XFormers works pretty much out-of-the-box, but it needs newer packages
But only limited luck so far using new torch.compile although made some progress

Install

First, this is written for torch 2.0 with cuda 11.8
If you want to use CUDA 11.7, modify install paths accordingly, but older versions will likely not work
(and neither will CUDA 12 as there is no support for it in torch just yet)

Btw, my environment is RTX3060 inside WSL2 (Ubuntu 22.04) on Windows 11, so your mileage/results may vary

1. CUDA

install CUDA 11.8 with latest cuDNN

2. Triton

If you have default OpenAI version of triton, uninstall it before installing torch as torch 2.0 comes with its own version of triton

pip uninstall triton

3. Torch

Install Torch nightly

pip3 install --pre torch torchvision torchaudio torchtriton --extra-index-url https://download.pytorch.org/whl/nightly/cu118 --force
pip show torch
2.0.0.dev20230113+cu118

4. Accelerate

Update Accelerate for Torch 2.0 compatibility as version specified in requirements_versions.txt

pip install -U accelerate==0.15.0

And don't forget to update requirements_versions.txt so webui doesn't auto-downgrade accelerate version

5. Xformers

Rebuild XFormers

Relying on pre-built wheels is not really an option since xformers get linked to specific torch version which changes daily
Plus rebuild only takes few min, so why bother with wheels
(just make sure you have build requirements before-hand).

pip install ninja setuptools
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
python -m xformers.info

And thats it, WebUI is happy to work with new libs out-of-the-box

Optimize

1. Accelerate

But now onto main reason why even try torch nightlies:

Torch includes dynamic compiler/optimizer which is only available in nightly builds: Dynamo

If you're lucky few, you may be able to configure Accelerate to use Dynamo

accelerate config

Do you wish to optimize your script with torch dynamo?[yes/NO]: yes
Which dynamo backend would you like to use? inductor

accelerate test

I haven't had luck getting accelerate test to complete which means that dynamo will NOT be used.

2. Compile

So lets do a manual config:

We need to setup torch.compile and best spot I've found so far is NOT in SD model load, but slightly afterwards due to function hijacking that happens in WebUI

For example, in modules/sd_hijack function def hijack, just before self.optimization_method = apply_optimizations()

try:
    import torch._dynamo as dynamo
    torch._dynamo.config.verbose = True
    torch.backends.cudnn.benchmark = True
    m.model = torch.compile(m.model, mode="max-autotune", fullgraph=False)
    print("Model compiled set")
except Exception as err:
    print(f"Model compile not supported: {err}")

Notes:

dynamo must be imported explicitly or namespace is not available in runtime
compile applies to actual model,
not to parent sd_model as that is entire pipeline, not model itself
compile time is irrelevant as it only marks model for compilation during first execution
fullgraph cannot be forced as there are parts of diffusion model that cannot be compiled,
so this internally allows it to split model into compiled+uncompiled graphs
cudnn.benchmark should be set so optimizer can perform precise execution timings
initial execution will always be slower as that is when compile actually happens
any benefits would be seen in subsequent calls

Result? In my case its the same error as with accelerate test
Not great...

3. Digging Deeper

Default (and recommended) dynamo backend for torch.compile is inductor,
but no matter what I cannot get inductor to work on my system

Error is in triton which fails with silly error:

RuntimeError: CUDA: Error- no device

And at this point I'm not sure if triton is broken for torch 2.0, even if its installed from the same nightly

So I wrote a standalone test script to evaluate all the different backens:
https://github.com/vladmandic/automatic/blob/master/cli/modules/dynamotest.py

This tests and benchmarks all possible dynamo backends, but I'm focusing on couple only:

default: eval in 4.247 ms
ofi: eval in 3.820 ms
uses TorchScript set for optimize_for_inference
this is basically same as default, but with some voodoo-magic regarding just-in-time ops and freeze, etc.
most likely not compatible with training, so cannot be used with dreambooth
aot_cudagraphs: eval in 6.460 ms
uses cudagraphs with AotAutograd
seems slower as no-compile
inductor: fail
uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels
error: RuntimeError: CUDA: Error- no device
fx2trt: fail
uses nVidia TensorRT
error: ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
seems like tensorrt is not yet compatible with torch 2.0 (yes, shared library does exist)

Now...All this uses an off-the-shelf model (resnet18) to evaluate,
next step would be to apply it to stable diffusion itself...

And I'd be curious to hear what your test results look like?

Btw, good getting-started doc is in torch code:
https://github.com/pytorch/pytorch/blob/4f4b62e4a255708e928445b6502139d5962974fa/docs/source/dynamo/get-started.rst

tuangd · 2023-01-19T15:03:26Z

tuangd
Jan 19, 2023

Can't find torchtriton with the nightly build.

ERROR: Could not find a version that satisfies the requirement torchtriton (from versions: none)
ERROR: No matching distribution found for torchtriton

6 replies

Marcophono2 Jan 22, 2023

Can't find torchtriton with the nightly build.

ERROR: Could not find a version that satisfies the requirement torchtriton (from versions: none)
ERROR: No matching distribution found for torchtriton

The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton.

Marcophono2 Jan 22, 2023

The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton.
I this is really the missing link - I was able to encode xformers with Pytorch 20. and all that other stuff as mentioned from the OP but SD let me not start with xformers. It doesn't like something at triton.

No module 'xformers'. Proceeding without it.
Cannot import xformers
Traceback (most recent call last):
  File "C:\AI\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 20, in <module>
    import xformers.ops
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\__init__.py", line 8, in <module>
    from .fmha import (
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\fmha\__init__.py", line 10, in <module>
    from . import cutlass, flash, small_k, triton
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\fmha\triton.py", line 15, in <module>
    if TYPE_CHECKING or _is_triton_available():
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\__init__.py", line 33, in func_wrapper
    value = func()
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\__init__.py", line 44, in _is_triton_available
    from xformers.triton.softmax import softmax as triton_softmax  # noqa
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\triton\__init__.py", line 12, in <module>
    from .dropout import FusedDropoutBias, dropout  # noqa
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\triton\dropout.py", line 13, in <module>
    import triton
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\triton\__init__.py", line 1, in <module>
    raise RuntimeError("Should never be installed")
RuntimeError: Should never be installed

LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Loading weights [4711ff4dd2] from C:\AI\stable-diffusion-webui\models\Stable-diffusion\model.ckpt
Model compiled set
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0):
Model loaded in 4.5s (0.2s create model, 1.9s load weights).
Running on local URL:  http://127.0.0.1:7860

From 42it/s down to 1.5it/s - Juppiiiiie!!

sashasubbbb Jan 22, 2023

The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton.
I this is really the missing link - I was able to encode xformers with Pytorch 20. and all that other stuff as mentioned from the OP but SD let me not start with xformers. It doesn't like something at triton.

No module 'xformers'. Proceeding without it.
Cannot import xformers
Traceback (most recent call last):
  File "C:\AI\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 20, in <module>
    import xformers.ops
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\__init__.py", line 8, in <module>
    from .fmha import (
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\fmha\__init__.py", line 10, in <module>
    from . import cutlass, flash, small_k, triton
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\ops\fmha\triton.py", line 15, in <module>
    if TYPE_CHECKING or _is_triton_available():
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\__init__.py", line 33, in func_wrapper
    value = func()
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\__init__.py", line 44, in _is_triton_available
    from xformers.triton.softmax import softmax as triton_softmax  # noqa
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\triton\__init__.py", line 12, in <module>
    from .dropout import FusedDropoutBias, dropout  # noqa
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\xformers\triton\dropout.py", line 13, in <module>
    import triton
  File "C:\progs\anaconda3\envs\SD\lib\site-packages\triton\__init__.py", line 1, in <module>
    raise RuntimeError("Should never be installed")
RuntimeError: Should never be installed

LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Loading weights [4711ff4dd2] from C:\AI\stable-diffusion-webui\models\Stable-diffusion\model.ckpt
Model compiled set
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0):
Model loaded in 4.5s (0.2s create model, 1.9s load weights).
Running on local URL:  http://127.0.0.1:7860

From 42it/s down to 1.5it/s - Juppiiiiie!!

Hi, have you figured it out how to fix it?

brucethemoose Jan 22, 2023

File "C:\progs\anaconda3\envs\SD\lib\site-packages\triton_init_.py", line 1, in
raise RuntimeError("Should never be installed")
RuntimeError: Should never be installed

👀 The python package is literally saying it shouldnt be installed on line one of the init.

I think you have the wrong Triton package installed on conda. Maybe an old one that has been depreciated?

Also you can check if xformers works by itself with python -c "import xformers.ops"

And Triton for that matter: python -c "import triton"

Marcophono2 Jan 23, 2023

@sashasubbbb @brucethemoose Stupidly I had switched from Ubuntu to Windows (specially bought for this a new NVme 1TB Samsung 980 Pro.. 😄 ) because I seem to have confused something. There is soooo often to maintain 1111 becuase with every new version another thin stopped working. Staying at a working commit only works if you do not depend on using certain extensions. I really would love to use dreambooth vor processing more than one concept but that poor guy of that wonderful extention is killed with every new 1111 commit. Anyway, that's life and the prize one have to pay for such a dynamic repo like A1111.
So, on Windows triton doesn't work per se. Obvioulsy there was a working version in the past but at this time Pytorch is switching to the next level and a triton version for Windows is awaited later. That's what a guy of them wrote anywhere. Now I am back on Ubuntu and will try my luck.
By the way I bought a second server with the exactly same hadware, just to make sure that any change in the A1111 code will kill my server at runtime. On my second server I can very relaxed try out everything.

ataa · 2023-01-19T15:10:16Z

ataa
Jan 19, 2023

{
    "default": 2.815999984741211,
    "ansor": 2.815135955810547,
    "aot_cudagraphs": 2.8175359964370728,
    "aot_eager": 2.8190720081329346,
    "aot_inductor_debug": 2.8180480003356934,
    "aot_torchxla_trace_once": 2.8190720081329346,
    "aot_torchxla_trivial": 2.8190720081329346,
    "aot_ts": 2.821120023727417,
    "aot_ts_nvfuser": 2.817199945449829,
    "aot_ts_nvfuser_nodecomps": 2.8190720081329346,
    "cudagraphs": 2.8180480003356934,
    "cudagraphs_ts": 2.818560004234314,
    "cudagraphs_ts_ofi": 2.8180480003356934,
    "eager": 2.8187040090560913,
    "fx2trt": 2.8139519691467285,
    "inductor": 2.805759906768799,
    "ipex": 2.7996160984039307,
    "nnc": 2.798080086708069,
    "nnc_ofi": 2.8002400398254395,
    "nvprims_aten": 2.8016480207443237,
    "nvprims_nvfuser": 2.7972320318222046,
    "ofi": 2.800800085067749,
    "onednn": 2.803199887275696,
    "onnx2tensorrt": 2.801664113998413,
    "onnx2tf": 2.798080086708069,
    "onnxrt": 2.8001281023025513,
    "onnxrt_cpu": 2.801664113998413,
    "onnxrt_cpu_numpy": 2.8011521100997925,
    "onnxrt_cuda": 2.7985920906066895,
    "static_runtime": 2.7997440099716187,
    "taso": 2.7985920906066895,
    "tensorrt": 2.795184016227722,
    "torch2trt": 2.8078079223632812,
    "torchxla_trace_once": 2.8175359964370728,
    "torchxla_trivial": 2.815999984741211,
    "ts": 2.8165119886398315,
    "ts_nvfuser": 2.8139519691467285,
    "ts_nvfuser_ofi": 2.822144031524658,
    "tvm": 2.8156319856643677,
    "tvm_meta_schedule": 2.8149759769439697
}

6 replies

ataa Jan 19, 2023

I didn't! I am on Windows.

adrianpuiu Jan 19, 2023

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-852/quick-start-guide/index.html#install

vladmandic Jan 19, 2023
Collaborator Author

How did you install the TensorRT backends? Did you compile from source, or are you using some kind of container?

looks like a false-positive, see #6932 (reply in thread)

ataa Jan 19, 2023

I set torch._dynamo.config.suppress_errors = False and I am still getting almost identical results:

{
    "default": 2.816175937652588,
    "ansor": 2.8165119886398315,
    "aot_cudagraphs": 2.8155200481414795,
    "aot_eager": 2.819584012031555,
    "aot_inductor_debug": 2.819584012031555,
    "aot_torchxla_trace_once": 2.817728042602539,
    "aot_torchxla_trivial": 2.818560004234314,
    "aot_ts": 2.818560004234314,
    "aot_ts_nvfuser": 2.817023992538452,
    "aot_ts_nvfuser_nodecomps": 2.819440007209778,
    "cudagraphs": 2.8175359964370728,
    "cudagraphs_ts": 2.817631959915161,
    "cudagraphs_ts_ofi": 2.815999984741211,
    "eager": 2.8206080198287964,
    "fx2trt": 2.819584012031555,
    "inductor": 2.8182560205459595,
    "ipex": 2.820096015930176,
    "nnc": 2.8175359964370728,
    "nnc_ofi": 2.8172160387039185,
    "nvprims_aten": 2.8197439908981323,
    "nvprims_nvfuser": 2.8190720081329346,
    "ofi": 2.815999984741211,
    "onednn": 2.821984052658081,
    "onnx2tensorrt": 2.814463973045349,
    "onnx2tf": 2.8047358989715576,
    "onnxrt": 2.8016639947891235,
    "onnxrt_cpu": 2.801151990890503,
    "onnxrt_cpu_numpy": 2.802175998687744,
    "onnxrt_cuda": 2.8006240129470825,
    "static_runtime": 2.787328004837036,
    "taso": 2.784559965133667,
    "tensorrt": 2.782415986061096,
    "torch2trt": 2.783743977546692,
    "torchxla_trace_once": 2.784767985343933,
    "torchxla_trivial": 2.7857919931411743,
    "ts": 2.7868160009384155,
    "ts_nvfuser": 2.7842559814453125,
    "ts_nvfuser_ofi": 2.78220796585083,
    "tvm": 2.78220796585083,
    "tvm_meta_schedule": 2.7822879552841187
}

brucethemoose Jan 19, 2023

@ataa Vlad is right though, there is no way all those backends are working with the bench. Many require convoluted installs, or only run on specific devices, or are (seemingly) fundamentally incompatible with the test.

vladmandic · 2023-01-19T15:11:16Z

vladmandic
Jan 19, 2023
Collaborator Author

@ataa impressive that every single backend is working! what is your environment?

3 replies

ataa Jan 19, 2023

Windows 10 Home, Python 3.10.9, Dell 3070 (OC)

clip-anytorch           2.5.0
open-clip-torch         2.7.0
pytorch-lightning       1.7.7
torch                   2.0.0.dev20230118+cu118
torchaudio              2.0.0.dev20230118+cu118
torchdiffeq             0.2.3
torchmetrics            0.11.0
torchsde                0.2.5
torchvision             0.15.0.dev20230118+cu118
xformers                0.0.16+6f3c20f.d20230116
accelerate              0.15.0

ataa Jan 19, 2023

Not sure why the results are almost identical on mine, Also I get lower numbers for cudagraphs after running the test for the third time (hotter gpu?!)

vladmandic Jan 19, 2023
Collaborator Author

Not sure why the results are almost identical on mine, Also I get lower numbers for cudagraphs after running the test for the third time (hotter gpu?!)

actually, thats a good point - dynamo allows for silent fallback to eager if backend fails, but thats normally disabled by default.
if its enabled, that would explain why all backends actually appear to work when in reality, i don't think some stand a chance!

brucethemoose · 2023-01-19T16:06:00Z

brucethemoose
Jan 19, 2023

Good stuff!

So I can get accelerate test to complete with inductor by adding the following environment variables, maybe some similar lib is missing from your path?

export LD_LIBRARY_PATH="/opt/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export CPATH="/opt/cuda/targets/x86_64-linux/include:$CPATH"

My initial results are quite a failure, gonna mess with my environment and see if I can fix them:

{
    "default": 7.9911839962005615,
    "ansor": "error",
    "aot_cudagraphs": 11.574031829833984,
    "aot_eager": 6.984784126281738,
    "aot_inductor_debug": 11.78111982345581,
    "aot_torchxla_trace_once": "error",
    "aot_torchxla_trivial": 9.356000423431396,
    "aot_ts": 9.001967906951904,
    "aot_ts_nvfuser": 9.061408042907715,
    "aot_ts_nvfuser_nodecomps": 9.050415992736816,
    "cudagraphs": "error",
    "cudagraphs_ts": "error",
    "cudagraphs_ts_ofi": "error",
    "eager": 6.555520057678223,
    "fx2trt": "error",
    "inductor": 7.852895975112915,
    "ipex": "error",
    "nnc": "error",
    "nnc_ofi": "error",
    "nvprims_aten": 43.6898078918457,
    "nvprims_nvfuser": "error",
    "ofi": 6.675135850906372,
    "onednn": "error",
    "onnx2tensorrt": "error",
    "onnx2tf": "error",
    "onnxrt": "error",
    "onnxrt_cpu": "error",
    "onnxrt_cpu_numpy": "error",
    "onnxrt_cuda": "error",
    "static_runtime": "error",
    "taso": "error",
    "tensorrt": "error",
    "torch2trt": "error",
    "torchxla_trace_once": "error",
    "torchxla_trivial": 8.063135623931885,
    "ts": 6.08353590965271,
    "ts_nvfuser": "error",
    "ts_nvfuser_ofi": "error",
    "tvm": "error",
    "tvm_meta_schedule": "error"
}

0 replies

brucethemoose · 2023-01-19T16:10:03Z

brucethemoose
Jan 19, 2023

Also, will accelerate automatically optimize inference? I thought it only tried to optimize training.

2 replies

vladmandic Jan 19, 2023
Collaborator Author

Also, will accelerate automatically optimize inference? I thought it only tried to optimize training.

i think so...but...testing required :)

brucethemoose Jan 19, 2023

It doesn't seem to. There's no difference between accelerate/non accelerate speeds, no indication of compilation or anything like that.

adrianpuiu · 2023-01-19T16:17:51Z

adrianpuiu
Jan 19, 2023

Running: accelerate-launch --config_file=None /home/agp/stable-diffusion-webui/venvtorch20-cu118/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
stderr: /home/agp/stable-diffusion-webui/venvtorch20-cu118/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:391: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
stderr: warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
stdout: [2023-01-19 18:16:13,944] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
stdout: Initialization
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DEEPSPEED Backend: nccl
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'nvme'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
stdout:
stdout:
stdout: Test random number generator synchronization
stdout: All rng are properly synched.
stdout:
stdout: DataLoader integration test
stdout: 0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
stdout: device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
Test is a success! You are ready for your distributed training!

noice !!!

0 replies

brucethemoose · 2023-01-19T17:07:29Z

brucethemoose
Jan 19, 2023

Gonna come back to this later, but here are the issues I am having with the various backends:

[2023-01-19 11:53:39,742] torch._dynamo.optimizations.backends: [ERROR] tvm error
Traceback (most recent call last):
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/optimizations/backends.py", line 605, in tvm_compile_inner
    mod, params = relay.frontend.from_pytorch(jit_mod, shape_list)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 4627, in from_pytorch
    converter.report_missing_conversion(op_names)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 3831, in report_missing_conversion
    raise NotImplementedError(msg)
NotImplementedError: The following operators are not implemented: ['aten::format', 'aten::__is__', 'aten::conv2d', 'prim::unchecked_cast', 'aten::dim']
dyanmo backend failed: ansor ansor raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dynamo initial eval: aot_cudagraphs 2097.584228515625
dynamo initial eval: aot_eager 1503.4400634765625
dynamo initial eval: aot_inductor_debug 3220.664306640625
dyanmo backend failed: aot_torchxla_trace_once No module named 'torch_xla'
dynamo initial eval: aot_torchxla_trivial 1640.848876953125
dynamo initial eval: aot_ts 1639.3748779296875
dynamo initial eval: aot_ts_nvfuser 1830.6702880859375
dynamo initial eval: aot_ts_nvfuser_nodecomps 1669.760986328125
dyanmo backend failed: cudagraphs cudagraphs raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dyanmo backend failed: cudagraphs_ts cudagraphs_ts raised RuntimeError: The following operation failed in the TorchScript interpreter.
dyanmo backend failed: cudagraphs_ts_ofi cudagraphs_ts_ofi raised RuntimeError: Freezing is currently only implemented for modules in eval mode. Please call .eval() on your module before freezing.
dynamo initial eval: eager 309.3714904785156
dyanmo backend failed: fx2trt fx2trt raised ModuleNotFoundError: No module named 'torch_tensorrt'
dynamo initial eval: inductor 4831.7314453125
dyanmo backend failed: ipex ipex raised ModuleNotFoundError: No module named 'intel_extension_for_pytorch'
dyanmo backend failed: nnc nnc raised RuntimeError: The following operation failed in the TorchScript interpreter.
dyanmo backend failed: nnc_ofi nnc_ofi raised RuntimeError: The following operation failed in the TorchScript interpreter.
dynamo initial eval: nvprims_aten 20330.44921875
dyanmo backend failed: nvprims_nvfuser producer->getMemoryType() == MemoryType::Global INTERNAL ASSERT FAILED at "../torch/csrc/jit/codegen/cuda/lower_sync_information.cpp":437, please report a bug to PyTorch. Inconsistent parallelization found between TV31 (T31_l[ rblockIdx.x350{( ceilDiv(( ceilDiv(1, 4) ), blockDim.x) )}, iblockIdx.y353{( ceilDiv(( 1 * i1 ), 1) )}, iUS354{1}, rS349{4}, rthreadIdx.x351{blockDim.x} ] produce_pos( 3)) and TV34(T34_l[ iblockIdx.y356{( ceilDiv(( 1 * i1 ), 1) )}, iUS357{1} ] ca_pos( 2 )). Producer is required to be in Global Memory based on parallelization strategy.
dynamo initial eval: ofi 526.3748779296875
dyanmo backend failed: onednn onednn raised RuntimeError: The following operation failed in the TorchScript interpreter.
dyanmo backend failed: onnx2tensorrt onnx2tensorrt raised AssertionError:
dyanmo backend failed: onnx2tf onnx2tf raised ModuleNotFoundError: No module named 'onnx_tf'
dyanmo backend failed: onnxrt onnxrt raised AssertionError:
dyanmo backend failed: onnxrt_cpu onnxrt_cpu raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dyanmo backend failed: onnxrt_cpu_numpy onnxrt_cpu_numpy raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dyanmo backend failed: onnxrt_cuda onnxrt_cuda raised AssertionError:
dyanmo backend failed: static_runtime static_runtime raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dyanmo backend failed: taso taso raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
dyanmo backend failed: tensorrt tensorrt raised AssertionError:
dyanmo backend failed: torch2trt torch2trt raised ModuleNotFoundError: No module named 'torch2trt'
dyanmo backend failed: torchxla_trace_once No module named 'torch_xla'
dynamo initial eval: torchxla_trivial 307.8309326171875
dynamo initial eval: ts 472.9351806640625
dyanmo backend failed: ts_nvfuser ts_nvfuser raised RuntimeError: The following operation failed in the TorchScript interpreter.
dyanmo backend failed: ts_nvfuser_ofi ts_nvfuser_ofi raised RuntimeError: The following operation failed in the TorchScript interpreter.
[2023-01-19 11:54:58,949] torch._dynamo.optimizations.backends: [ERROR] tvm error
Traceback (most recent call last):
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/optimizations/backends.py", line 605, in tvm_compile_inner
    mod, params = relay.frontend.from_pytorch(jit_mod, shape_list)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 4627, in from_pytorch
    converter.report_missing_conversion(op_names)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 3831, in report_missing_conversion
    raise NotImplementedError(msg)
NotImplementedError: The following operators are not implemented: ['aten::format', 'aten::__is__', 'aten::conv2d', 'prim::unchecked_cast', 'aten::dim']
dyanmo backend failed: tvm tvm raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:
[2023-01-19 11:54:59,485] torch._dynamo.optimizations.backends: [ERROR] tvm error
Traceback (most recent call last):
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/optimizations/backends.py", line 605, in tvm_compile_inner
    mod, params = relay.frontend.from_pytorch(jit_mod, shape_list)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 4627, in from_pytorch
    converter.report_missing_conversion(op_names)
  File "/home/alpha/.local/lib/python3.10/site-packages/tvm/relay/frontend/pytorch.py", line 3831, in report_missing_conversion
    raise NotImplementedError(msg)
NotImplementedError: The following operators are not implemented: ['aten::format', 'aten::__is__', 'aten::conv2d', 'prim::unchecked_cast', 'aten::dim']
dyanmo backend failed: tvm_meta_schedule tvm_meta_schedule raised Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.convolution.default(*(FakeTensor(FakeTensor(..., device='meta', size=(16, 3, 128, 128)), cuda:0), Parameter containing:

Apache TVM in particular seems to be an issue, even with pip install apache-tvm --pre

2 replies

vladmandic Jan 19, 2023
Collaborator Author

Apache TVM in particular seems to be an issue, even with pip install apache-tvm --pre

i wouldn't bother with tvm anyhow...

adrianpuiu Jan 20, 2023

dyanmo backend failed: fx2trt fx2trt raised ModuleNotFoundError: No module named 'torch_tensorrt'

dyanmo backend failed: fx2trt fx2trt raised ModuleNotFoundError: No module named 'torch_tensorrt'

Marcophono2 · 2023-01-19T17:46:21Z

Marcophono2
Jan 19, 2023

@vladmandic Wow, that promises an enourmous punch! 😃 Unfortunatelly I got an error when trying to install xformers. I am on Ubuntu 22.04, RTX 4090 and using the cuda 11.8 cores since a while with xformers and also Torch 2.0 (but unoptimized) at 42it/s for SD 1.5. But obviously this is now also possible on a RTX 3060 so I must do something. 😄

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

$ nvidia-smi
Thu Jan 19 18:45:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:2D:00.0  On |                  Off |
|  0%   50C    P8    23W / 450W |   1127MiB / 24564MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1511      G   /usr/lib/xorg/Xorg                560MiB |
|    0   N/A  N/A      1820      G   /usr/bin/gnome-shell               95MiB |
|    0   N/A  N/A      2754    C+G   ...875137023623965593,131072      446MiB |
|    0   N/A  N/A      3478      G   ...mviewer/tv_bin/TeamViewer       20MiB |
+-----------------------------------------------------------------------------+

I reinstalled everything as described, but:

$ pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Using pip 22.3.1 from /home/marc/.local/lib/python3.10/site-packages/pip (python 3.10)
Collecting xformers
  Cloning https://github.com/facebookresearch/xformers.git (to revision main) to /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f
  Running command git version
  git version 2.34.1
  Running command git clone --filter=blob:none https://github.com/facebookresearch/xformers.git /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f
  Klone nach '/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f' …
  Aktualisiere Dateien:   0% (2/666)
  Aktualisiere Dateien:   1% (7/666)
  Aktualisiere Dateien:   2% (14/666)
  Aktualisiere Dateien:   3% (20/666)
  Aktualisiere Dateien:   4% (27/666)
  Aktualisiere Dateien:   5% (34/666)
  Aktualisiere Dateien:   6% (40/666)
  Aktualisiere Dateien:   7% (47/666)
  Aktualisiere Dateien:   8% (54/666)
  Aktualisiere Dateien:   9% (60/666)
  Aktualisiere Dateien:  10% (67/666)
  Aktualisiere Dateien:  11% (74/666)
  Aktualisiere Dateien:  12% (80/666)
  Aktualisiere Dateien:  13% (87/666)
  Aktualisiere Dateien:  14% (94/666)
  Aktualisiere Dateien:  15% (100/666)
  Aktualisiere Dateien:  16% (107/666)
  Aktualisiere Dateien:  17% (114/666)
  Aktualisiere Dateien:  18% (120/666)
  Aktualisiere Dateien:  19% (127/666)
  Aktualisiere Dateien:  20% (134/666)
  Aktualisiere Dateien:  21% (140/666)
  Aktualisiere Dateien:  22% (147/666)
  Aktualisiere Dateien:  23% (154/666)
  Aktualisiere Dateien:  24% (160/666)
  Aktualisiere Dateien:  25% (167/666)
  Aktualisiere Dateien:  26% (174/666)
  Aktualisiere Dateien:  27% (180/666)
  Aktualisiere Dateien:  28% (187/666)
  Aktualisiere Dateien:  29% (194/666)
  Aktualisiere Dateien:  30% (200/666)
  Aktualisiere Dateien:  31% (207/666)
  Aktualisiere Dateien:  32% (214/666)
  Aktualisiere Dateien:  33% (220/666)
  Aktualisiere Dateien:  34% (227/666)
  Aktualisiere Dateien:  35% (234/666)
  Aktualisiere Dateien:  36% (240/666)
  Aktualisiere Dateien:  37% (247/666)
  Aktualisiere Dateien:  38% (254/666)
  Aktualisiere Dateien:  39% (260/666)
  Aktualisiere Dateien:  40% (267/666)
  Aktualisiere Dateien:  41% (274/666)
  Aktualisiere Dateien:  42% (280/666)
  Aktualisiere Dateien:  43% (287/666)
  Aktualisiere Dateien:  44% (294/666)
  Aktualisiere Dateien:  45% (300/666)
  Aktualisiere Dateien:  46% (307/666)
  Aktualisiere Dateien:  47% (314/666)
  Aktualisiere Dateien:  48% (320/666)
  Aktualisiere Dateien:  49% (327/666)
  Aktualisiere Dateien:  50% (333/666)
  Aktualisiere Dateien:  51% (340/666)
  Aktualisiere Dateien:  52% (347/666)
  Aktualisiere Dateien:  53% (353/666)
  Aktualisiere Dateien:  54% (360/666)
  Aktualisiere Dateien:  55% (367/666)
  Aktualisiere Dateien:  56% (373/666)
  Aktualisiere Dateien:  57% (380/666)
  Aktualisiere Dateien:  58% (387/666)
  Aktualisiere Dateien:  59% (393/666)
  Aktualisiere Dateien:  60% (400/666)
  Aktualisiere Dateien:  61% (407/666)
  Aktualisiere Dateien:  62% (413/666)
  Aktualisiere Dateien:  63% (420/666)
  Aktualisiere Dateien:  64% (427/666)
  Aktualisiere Dateien:  65% (433/666)
  Aktualisiere Dateien:  66% (440/666)
  Aktualisiere Dateien:  67% (447/666)
  Aktualisiere Dateien:  68% (453/666)
  Aktualisiere Dateien:  69% (460/666)
  Aktualisiere Dateien:  70% (467/666)
  Aktualisiere Dateien:  71% (473/666)
  Aktualisiere Dateien:  72% (480/666)
  Aktualisiere Dateien:  73% (487/666)
  Aktualisiere Dateien:  74% (493/666)
  Aktualisiere Dateien:  75% (500/666)
  Aktualisiere Dateien:  76% (507/666)
  Aktualisiere Dateien:  77% (513/666)
  Aktualisiere Dateien:  78% (520/666)
  Aktualisiere Dateien:  79% (527/666)
  Aktualisiere Dateien:  80% (533/666)
  Aktualisiere Dateien:  81% (540/666)
  Aktualisiere Dateien:  82% (547/666)
  Aktualisiere Dateien:  83% (553/666)
  Aktualisiere Dateien:  84% (560/666)
  Aktualisiere Dateien:  85% (567/666)
  Aktualisiere Dateien:  86% (573/666)
  Aktualisiere Dateien:  87% (580/666)
  Aktualisiere Dateien:  88% (587/666)
  Aktualisiere Dateien:  89% (593/666)
  Aktualisiere Dateien:  90% (600/666)
  Aktualisiere Dateien:  91% (607/666)
  Aktualisiere Dateien:  92% (613/666)
  Aktualisiere Dateien:  93% (620/666)
  Aktualisiere Dateien:  94% (627/666)
  Aktualisiere Dateien:  95% (633/666)
  Aktualisiere Dateien:  96% (640/666)
  Aktualisiere Dateien:  97% (647/666)
  Aktualisiere Dateien:  98% (653/666)
  Aktualisiere Dateien:  99% (660/666)
  Aktualisiere Dateien: 100% (666/666)
  Aktualisiere Dateien: 100% (666/666), fertig.
  Running command git show-ref main
  814314dfc207836839c57613c0354fef6e07fa2d refs/heads/main
  814314dfc207836839c57613c0354fef6e07fa2d refs/remotes/origin/main
  Running command git symbolic-ref -q HEAD
  refs/heads/main
  Resolved https://github.com/facebookresearch/xformers.git to commit 814314dfc207836839c57613c0354fef6e07fa2d
  Running command git submodule update --init --recursive -q
  Running command git rev-parse HEAD
  814314dfc207836839c57613c0354fef6e07fa2d
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/setup.py", line 23, in <module>
      import torch
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/__init__.py", line 868, in <module>
      from ._tensor import Tensor
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/_tensor.py", line 12, in <module>
      import torch.utils.hooks as hooks
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/utils/__init__.py", line 6, in <module>
      from .cpp_backtrace import get_cpp_backtrace
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/utils/cpp_backtrace.py", line 1, in <module>
      from torch._C import _get_cpp_backtrace
  ImportError: cannot import name '_get_cpp_backtrace' from 'torch._C' (/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/marc/anaconda3/envs/SD2/bin/python -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' egg_info --egg-base /tmp/pip-pip-egg-info-ku79i74d
  cwd: /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

14 replies

adrianpuiu Jan 19, 2023

add this to webui-user.bat (in case you dont have them already defined as env variables )
``
set LD_LIBRARY_PATH=/path/to/cuda-11.8/lib64
set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cuda-11.8/include
set PATH="/path/to/cuda-11.8/bin:$PATH"

AyoKeito Jan 19, 2023

I've solved it by installing the different way:
git clone xformers
Then from xformers directory:
python setup.py build
python setup.py bdist_wheel
pip install (your created wheel file)
No errors this way.
Can't see any speed improvements tho on native Windows 10.
Maybe it's time to upgrade to 11 and go WSL...

adrianpuiu Jan 19, 2023

I've solved it by installing the different way: git clone xformers Then from xformers directory: python setup.py build python setup.py bdist_wheel pip install (your created wheel file) No errors this way. Can't see any speed improvements tho on native Windows 10. Maybe it's time to upgrade to 11 and go WSL...

go ubuntu and leave windows ... :))

AyoKeito Jan 19, 2023

go ubuntu and leave windows ... :))

O hell naaw, i need a lot of windows-exclusive software.

kou201 Mar 2, 2023

@vladmandic Wow, that promises an enourmous punch! 😃 Unfortunatelly I got an error when trying to install xformers. I am on Ubuntu 22.04, RTX 4090 and using the cuda 11.8 cores since a while with xformers and also Torch 2.0 (but unoptimized) at 42it/s for SD 1.5. But obviously this is now also possible on a RTX 3060 so I must do something. 😄

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

$ nvidia-smi
Thu Jan 19 18:45:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:2D:00.0  On |                  Off |
|  0%   50C    P8    23W / 450W |   1127MiB / 24564MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1511      G   /usr/lib/xorg/Xorg                560MiB |
|    0   N/A  N/A      1820      G   /usr/bin/gnome-shell               95MiB |
|    0   N/A  N/A      2754    C+G   ...875137023623965593,131072      446MiB |
|    0   N/A  N/A      3478      G   ...mviewer/tv_bin/TeamViewer       20MiB |
+-----------------------------------------------------------------------------+

I reinstalled everything as described, but:

$ pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Using pip 22.3.1 from /home/marc/.local/lib/python3.10/site-packages/pip (python 3.10)
Collecting xformers
  Cloning https://github.com/facebookresearch/xformers.git (to revision main) to /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f
  Running command git version
  git version 2.34.1
  Running command git clone --filter=blob:none https://github.com/facebookresearch/xformers.git /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f
  Klone nach '/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f' …
  Aktualisiere Dateien:   0% (2/666)
  Aktualisiere Dateien:   1% (7/666)
  Aktualisiere Dateien:   2% (14/666)
  Aktualisiere Dateien:   3% (20/666)
  Aktualisiere Dateien:   4% (27/666)
  Aktualisiere Dateien:   5% (34/666)
  Aktualisiere Dateien:   6% (40/666)
  Aktualisiere Dateien:   7% (47/666)
  Aktualisiere Dateien:   8% (54/666)
  Aktualisiere Dateien:   9% (60/666)
  Aktualisiere Dateien:  10% (67/666)
  Aktualisiere Dateien:  11% (74/666)
  Aktualisiere Dateien:  12% (80/666)
  Aktualisiere Dateien:  13% (87/666)
  Aktualisiere Dateien:  14% (94/666)
  Aktualisiere Dateien:  15% (100/666)
  Aktualisiere Dateien:  16% (107/666)
  Aktualisiere Dateien:  17% (114/666)
  Aktualisiere Dateien:  18% (120/666)
  Aktualisiere Dateien:  19% (127/666)
  Aktualisiere Dateien:  20% (134/666)
  Aktualisiere Dateien:  21% (140/666)
  Aktualisiere Dateien:  22% (147/666)
  Aktualisiere Dateien:  23% (154/666)
  Aktualisiere Dateien:  24% (160/666)
  Aktualisiere Dateien:  25% (167/666)
  Aktualisiere Dateien:  26% (174/666)
  Aktualisiere Dateien:  27% (180/666)
  Aktualisiere Dateien:  28% (187/666)
  Aktualisiere Dateien:  29% (194/666)
  Aktualisiere Dateien:  30% (200/666)
  Aktualisiere Dateien:  31% (207/666)
  Aktualisiere Dateien:  32% (214/666)
  Aktualisiere Dateien:  33% (220/666)
  Aktualisiere Dateien:  34% (227/666)
  Aktualisiere Dateien:  35% (234/666)
  Aktualisiere Dateien:  36% (240/666)
  Aktualisiere Dateien:  37% (247/666)
  Aktualisiere Dateien:  38% (254/666)
  Aktualisiere Dateien:  39% (260/666)
  Aktualisiere Dateien:  40% (267/666)
  Aktualisiere Dateien:  41% (274/666)
  Aktualisiere Dateien:  42% (280/666)
  Aktualisiere Dateien:  43% (287/666)
  Aktualisiere Dateien:  44% (294/666)
  Aktualisiere Dateien:  45% (300/666)
  Aktualisiere Dateien:  46% (307/666)
  Aktualisiere Dateien:  47% (314/666)
  Aktualisiere Dateien:  48% (320/666)
  Aktualisiere Dateien:  49% (327/666)
  Aktualisiere Dateien:  50% (333/666)
  Aktualisiere Dateien:  51% (340/666)
  Aktualisiere Dateien:  52% (347/666)
  Aktualisiere Dateien:  53% (353/666)
  Aktualisiere Dateien:  54% (360/666)
  Aktualisiere Dateien:  55% (367/666)
  Aktualisiere Dateien:  56% (373/666)
  Aktualisiere Dateien:  57% (380/666)
  Aktualisiere Dateien:  58% (387/666)
  Aktualisiere Dateien:  59% (393/666)
  Aktualisiere Dateien:  60% (400/666)
  Aktualisiere Dateien:  61% (407/666)
  Aktualisiere Dateien:  62% (413/666)
  Aktualisiere Dateien:  63% (420/666)
  Aktualisiere Dateien:  64% (427/666)
  Aktualisiere Dateien:  65% (433/666)
  Aktualisiere Dateien:  66% (440/666)
  Aktualisiere Dateien:  67% (447/666)
  Aktualisiere Dateien:  68% (453/666)
  Aktualisiere Dateien:  69% (460/666)
  Aktualisiere Dateien:  70% (467/666)
  Aktualisiere Dateien:  71% (473/666)
  Aktualisiere Dateien:  72% (480/666)
  Aktualisiere Dateien:  73% (487/666)
  Aktualisiere Dateien:  74% (493/666)
  Aktualisiere Dateien:  75% (500/666)
  Aktualisiere Dateien:  76% (507/666)
  Aktualisiere Dateien:  77% (513/666)
  Aktualisiere Dateien:  78% (520/666)
  Aktualisiere Dateien:  79% (527/666)
  Aktualisiere Dateien:  80% (533/666)
  Aktualisiere Dateien:  81% (540/666)
  Aktualisiere Dateien:  82% (547/666)
  Aktualisiere Dateien:  83% (553/666)
  Aktualisiere Dateien:  84% (560/666)
  Aktualisiere Dateien:  85% (567/666)
  Aktualisiere Dateien:  86% (573/666)
  Aktualisiere Dateien:  87% (580/666)
  Aktualisiere Dateien:  88% (587/666)
  Aktualisiere Dateien:  89% (593/666)
  Aktualisiere Dateien:  90% (600/666)
  Aktualisiere Dateien:  91% (607/666)
  Aktualisiere Dateien:  92% (613/666)
  Aktualisiere Dateien:  93% (620/666)
  Aktualisiere Dateien:  94% (627/666)
  Aktualisiere Dateien:  95% (633/666)
  Aktualisiere Dateien:  96% (640/666)
  Aktualisiere Dateien:  97% (647/666)
  Aktualisiere Dateien:  98% (653/666)
  Aktualisiere Dateien:  99% (660/666)
  Aktualisiere Dateien: 100% (666/666)
  Aktualisiere Dateien: 100% (666/666), fertig.
  Running command git show-ref main
  814314dfc207836839c57613c0354fef6e07fa2d refs/heads/main
  814314dfc207836839c57613c0354fef6e07fa2d refs/remotes/origin/main
  Running command git symbolic-ref -q HEAD
  refs/heads/main
  Resolved https://github.com/facebookresearch/xformers.git to commit 814314dfc207836839c57613c0354fef6e07fa2d
  Running command git submodule update --init --recursive -q
  Running command git rev-parse HEAD
  814314dfc207836839c57613c0354fef6e07fa2d
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/setup.py", line 23, in <module>
      import torch
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/__init__.py", line 868, in <module>
      from ._tensor import Tensor
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/_tensor.py", line 12, in <module>
      import torch.utils.hooks as hooks
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/utils/__init__.py", line 6, in <module>
      from .cpp_backtrace import get_cpp_backtrace
    File "/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/utils/cpp_backtrace.py", line 1, in <module>
      from torch._C import _get_cpp_backtrace
  ImportError: cannot import name '_get_cpp_backtrace' from 'torch._C' (/home/marc/anaconda3/envs/SD2/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/marc/anaconda3/envs/SD2/bin/python -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' egg_info --egg-base /tmp/pip-pip-egg-info-ku79i74d
  cwd: /tmp/pip-install-fx2w8lcq/xformers_2dba1119a270458db96128cb2f85c51f/
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

I got the same problem. I tried to uninstall the ninja in my system,then in worked. How strange.

brucethemoose · 2023-01-19T18:12:37Z

brucethemoose
Jan 19, 2023

Using the manual method with torchdynamo gives me this error:

  0%|                                                                             | 0/20 [00:02<?, ?it/s]
Error completing request
Arguments: ('task(vucdv5cuks7sza3)', 'test', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 768, 768, False, 0.7, 2, 'Latent', 0, 0, 0, 0, False, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'Refresh models', False, False, False, False, '', 1, '', 0, '', True, False, False, False, 4.0, '', 10.0, False, False, True, 30.0, True, False, False, 0, 0.0) {}
Traceback (most recent call last):
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 440, in _compile
    check_fn = CheckFunctionManager(
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 536, in __init__
    guard.create(local_builder, global_builder)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_guards.py", line 163, in create
    return self.create_fn(self.source.select(local_builder, global_builder), self)
  File "/home/alpha/.local/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 309, in LIST_LENGTH
    code.append(f"len({ref}) == {len(value)}")
TypeError: object of type 'tuple_iterator' has no len()


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

17 replies

brucethemoose Jan 19, 2023

Ah yeah I am having no issue running PyTorch 2.0, just issues running torch.compile() as described in the OP with (seemingly) any backend.

I can't seem to get it working in a simple diffusers script either, but maybe I am doing something wrong there too.

Is torch.compile() working for you? If so, you would be the first here, I think.

adrianpuiu Jan 19, 2023

[2023-01-19 23:22:39,477] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /home/agp/stable-diffusion-webui/venvtorch20-cu118/lib/python3.10/site-packages/timm/models/resnet.py:732
[2023-01-19 23:22:39,477] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x []
[2023-01-19 23:22:39,477] torch._dynamo.symbolic_convert: [DEBUG] TRACE RETURN_VALUE None [TensorVariable()]
[2023-01-19 23:22:39,477] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing forward (RETURN_VALUE)
[2023-01-19 23:22:39,478] torch._dynamo.symbolic_convert: [DEBUG] RETURN_VALUE triggered compile
[2023-01-19 23:22:39,478] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to None
[2023-01-19 23:22:39,490] torch._dynamo.output_graph: [INFO] Step 2: calling compiler function debug_wrapper
/home/agp/stable-diffusion-webui/venvtorch20-cu118/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:92: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled.Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
[2023-01-19 23:22:48,514] torch._inductor.compile_fx: [INFO] Step 3: torchinductor compiling FORWARDS graph 0

adrianpuiu Jan 19, 2023

TORCH_COMPILE_DEBUG=1 python script.py

tuangd Jan 20, 2023

Is this actually a extensions issue?
I am just trying to run interence. Also, am using global python packages and have automatic's venv/updating stiff disabled.

same here. i have a wsl instance dedicated for this, so no point of having venv in it.

If I don't have wsl dedicated for that the same as you should I activate venv before follow the install instructions?

RedRayz Feb 11, 2023

I found a solution.
Replace lines 113-114 in stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py with the following code:
return CheckpointFunction.apply(func, len(inputs), *inputs, *params)

EIDT:
I got 23.63s with torch.compile() (27.9s without compile).
parameters: Batch count 5, Batch size 2, Steps: 28, Sampler: Euler, CFG scale: 12, Seed: 2870305590, Size: 512x512, Model hash: 2ec49f7c35, Eta: 0.67, Clip skip: 2, ENSD: 31337

However, since it takes more than 20 seconds to change the model, resolution, and batch size each time, torch.compile() is not very practical at the moment.

adrianpuiu · 2023-01-19T21:51:46Z

adrianpuiu
Jan 19, 2023

why not use ONNX Runtime as backend for Dynamo, which will export to ONNX and run it
Check it out this unit test https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/test/python/orttraining_test_dort.py ?

2 replies

vladmandic Jan 19, 2023
Collaborator Author

why not use ONNX Runtime as backend for Dynamo, which will export to ONNX and run it Check it out this unit test https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/test/python/orttraining_test_dort.py ?

Do you have any numbers that would show that ONNX runtime would be faster than Torch?

adrianpuiu Jan 20, 2023

why not use ONNX Runtime as backend for Dynamo, which will export to ONNX and run it Check it out this unit test https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/test/python/orttraining_test_dort.py ?

Do you have any numbers that would show that ONNX runtime would be faster than Torch?

was reading a Microsoft research paper the other day and for inference it was 3x times faster, while on the other hand for training it was slower but much dependent on hardware architecture

vladmandic · 2023-01-19T22:43:58Z

vladmandic
Jan 19, 2023
Collaborator Author

update on ofi (optimize_for_inference) dynamo backend - like docs state, it does use torchscript to run analysis, but that basically prevents it from being usable here...

using either straight torch.compile or passing it through torch.jit.script(sd_model.model.eval()) results in:

DiffusionWrapper is not attached to a Trainer

only way forward i can think of would be to run in in trace mode and then pass that to compile, but thats just not worth it.

so onto other possible backends...

0 replies

atensity · 2023-01-20T10:25:23Z

atensity
Jan 20, 2023

For updating to PyTorch 2.0 the installation instructions do seem to help quite a lot in speeding up the it/s. I had previously tried it with PyTorch2.0-cu117 nightly installed as well as re-compiling/installing xformers afterwards, and I only was able to achieve the same speeds as before (~ 5it/s). Somehow, after following the above instructions I got my speeds up to ~10it/s which was pretty nice. My guess is that previously I was using cu117 as on the link given on their website, but changed it to cu118 this time and it seems to be a good improvement.

One issue I do have with PyTorch 2.0, which I had with the previous PyTorch 2.0 installation as well, is that when I try to increase the batch size to anything more than 1 when wanting to train an embedding, I'm hit with:

Traceback (most recent call last):
  File "stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 489, in train_embedding
    scaler.scale(loss).backward()
  File "stable-diffusion-webui/.direnv/python-3.10.9/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "stable-diffusion-webui/.direnv/python-3.10.9/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

I've tried looking around but I haven't really been able to find anything to fix it. I saw this (https://github.com/Birch-san/stable-diffusion#patch) and tried doing something similar but it had no effect sadly.

In case it helps:
OS: Linux
GPU: Nvidia 2070 Super
Generation Params: DPM++ SDE Karras

EDIT:
Weirdly enough after a few hours, the speeds have dropped back to ~5it/s, 0 clue why as nothing has changed. The only thing I did in between was do a simple round of TI on an embed to check if there was any improvement there, but that was it. Thought it might be thermal issues and left it for awhile and it still seems to be back to 5it/s. Maybe something about TI that made it regress?

EDIT2:
Restarting the computer after doing TI seemed to have fixed the speed, it seems to be back at 10it/s. No idea what it might be, possibly some resources not getting release properly?

Testing it out again, it definitely seems to be happening due to something regarding the Train tab things. If I try to generate something right after I run into the error above, the speed drops to ~5it/s and persists during a WebUI restart. A full computer restart is required to get the speeds back. From a short test even just pressing "Create Embedding" seems to shoot the speed back down to 5it/s.

If anybody has any clue on how to solve the error or a direction to explore I'm all up for debugging it a bit (incase anyone else runs into it)

EDIT3:
To play around I also did the whole accelerate/dynamo set up and was able to carry out the accelerate test okay (after adjusting LD_LIBRARY_PATH to point to the correct location for cuda.h). T̶h̶i̶s̶ ̶s̶e̶e̶m̶ ̶t̶o̶ ̶h̶a̶v̶e̶ ̶(̶a̶t̶ ̶l̶e̶a̶s̶t̶ ̶t̶e̶m̶p̶o̶r̶a̶r̶i̶l̶y̶ ̶f̶o̶r̶ ̶n̶o̶w̶)̶ ̶f̶i̶x̶e̶d̶ ̶t̶h̶e̶ ̶i̶s̶s̶u̶e̶ ̶w̶i̶t̶h̶ ̶t̶h̶e̶ ̶h̶a̶l̶v̶i̶n̶g̶ ̶o̶f̶ ̶t̶h̶e̶ ̶s̶p̶e̶e̶d̶ ̶e̶v̶e̶r̶y̶t̶i̶m̶e̶ ̶a̶n̶y̶t̶h̶i̶n̶g̶ ̶w̶i̶t̶h̶ ̶̶T̶r̶a̶i̶n̶̶ ̶i̶s̶ ̶d̶o̶n̶e̶. Scratch that still seems to be there.

BUT the actual batch size increase problem is still there, with the exact same error, so still wondering on how to approach that.

EDIT4:
Seems to be the halving problem disapears sometimes for who knows what reason and comes back at some point. No good way to reproduce it (before it was always after doing some TI or something). Adding the code in sd_hijack does indeed not allow any generation (but WebUI launches okay)

I saw something on the pytorch github issues page recently, a bug that said that TorchInductor doesn't support generators (issue 92633) so it could maybe be related to why the inductor backend doesn't seem to work? Mainly since a lot of the errors that appear mention assert "source" in options and options["source"] is not None.
Or I might be totally off base haha

2 replies

tuangd Jan 20, 2023

Have you tried replace your cudnn with the dll from nvidia? Checkout this one #6954

atensity Jan 20, 2023

Yeahp saw that. It doesn't really seem to affect the speeds on my system (I think it might only have more of an effect on newer cards? i.e like 30 or 40series). The only thing that seems to solve it is a full system restart.

sonphantrung · 2023-01-21T07:17:00Z

sonphantrung
Jan 21, 2023

I've made a GitHub Actions workflow that compiles xformers, but uses nightly torch2: https://github.com/sonphantrung/abc/blob/main/.github/workflows/xformers.yml. You may want to edit the TORCH_CUDA_ARCH_LIST variable to the one [matching your GPU] since I specifically built it for Kaggle T4 and P100

0 replies

Interpause · 2023-01-22T08:33:58Z

Interpause
Jan 22, 2023

EDIT:

Red herring. Seems the two important steps are:

Disable --medvram and --lowvram
Ensure CUDA is located at /usr/local/cuda (on Gentoo, its installed to /opt/cuda)
- Seems the include path for cuda is hardcoded.
Not sure which ones are strictly necessary:
- Add /usr/local/cuda/bin to PATH
- Add /usr/local/cuda/lib64 & /usr/local/cuda/include to LD_LIBRARY_PATH (and maybe LIBRARY_PATH?)

However, ultimately, you will still get the same error as #6932 (comment)

Test script:

{
"default": 5.612623929977417,
"ansor": "error",
"aot_cudagraphs": 8.528383731842041,
"aot_eager": 5.482496023178101,
"aot_inductor_debug": 8.616447925567627,
"aot_torchxla_trace_once": "error",
"aot_torchxla_trivial": 6.626816034317017,
"aot_ts": 6.360575914382935,
"aot_ts_nvfuser": 5.486592054367065,
"aot_ts_nvfuser_nodecomps": 6.346751928329468,
"cudagraphs": "error",
"cudagraphs_ts": "error",
"cudagraphs_ts_ofi": "error",
"eager": 5.060096025466919,
"fx2trt": "error",
"inductor": 5.426176071166992,
"ipex": "error",
"nnc": "error",
"nnc_ofi": "error",
"nvprims_aten": 38.688255310058594,
"nvprims_nvfuser": "error",
"ofi": 5.050944089889526,
"onednn": "error",
"onnx2tensorrt": "error",
"onnx2tf": "error",
"onnxrt": "error",
"onnxrt_cpu": "error",
"onnxrt_cpu_numpy": "error",
"onnxrt_cuda": "error",
"static_runtime": "error",
"taso": "error",
"tensorrt": "error",
"torch2trt": "error",
"torchxla_trace_once": "error",
"torchxla_trivial": 5.648384094238281,
"ts": 5.2705278396606445,
"ts_nvfuser": "error",
"ts_nvfuser_ofi": "error",
"tvm": "error",
"tvm_meta_schedule": "error"
}

2. Compile

So lets do a manual config:

We need to setup torch.compile and best spot I've found so far is NOT in SD model load, but slightly afterwards due to function hijacking that happens in WebUI

For example, in modules/sd_hijack function def hijack, just before self.optimization_method = apply_optimizations()
try:
    import torch._dynamo as dynamo
    torch._dynamo.config.verbose = True
    torch.backends.cudnn.benchmark = True
    m.model = torch.compile(m.model, mode="max-autotune", fullgraph=False)
    print("Model compiled set")
except Exception as err:
    print(f"Model compile not supported: {err}")
Notes:

dynamo must be imported explicitly or namespace is not available in runtime

compile applies to actual model,
not to parent sd_model as that is entire pipeline, not model itself

compile time is irrelevant as it only marks model for compilation during first execution

fullgraph cannot be forced as there are parts of diffusion model that cannot be compiled,
so this internally allows it to split model into compiled+uncompiled graphs

cudnn.benchmark should be set so optimizer can perform precise execution timings

initial execution will always be slower as that is when compile actually happens
any benefits would be seen in subsequent calls

Result? In my case its the same error as with accelerate test Not great...

It seem to work on initial load for me, just that subsequent model swaps don't. See screenshot for Torch version used:

3 replies

brucethemoose Jan 22, 2023

Eh it looks like the compile is failing even on the initial load, but that "cannot asign".error is new to me.

vladmandic Jan 22, 2023
Collaborator Author

Eh it looks like the compile is failing even on the initial load, but that "cannot asign".error is new to me.

looks like we need to "unwrap" the sd_model pipeline and apply compile to individual models where applicable and not to the parent diffuser.

Interpause Jan 25, 2023

I just tested all the backends with a lower time than "eager". None of them seem to work unfortunately.

vladmandic · 2023-01-24T17:19:21Z

vladmandic
Jan 24, 2023
Collaborator Author

if you have ampere or higher gpu, hardware l2 cache can be persisted (its not by default) and it does help with performance

export CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT=100

can anyone think of why allowing cache persistence on all l2 cache memory (value of 100%) would be a bad thing? afaik, if new data is needed, it still goes through cache, this just allows persistence if same data is requested again?
yes, i can think of mps scenarios where this is a bad thing as value may be modified by secondary gpu, but otherwise?

2 replies

hithereai Jan 27, 2023
Collaborator

no speed difference on my 4090 with latest pt2 and xformers for pt2 which I compiled

vladmandic Jan 27, 2023
Collaborator Author

no speed difference on my 4090 with latest pt2 and xformers for pt2 which I compiled

thanks. on rtx3000 cards its seems like it gives 2-4% boost. not huge, just worth noting.

EfourC · 2023-03-07T01:38:21Z

EfourC
Mar 7, 2023

@aifartist
How do sdp and xformers compare at 1024x1024 on the 4090? I imagine either way you're cpu limited at 512x512, per your earlier testing.

1 reply

aifartist Mar 7, 2023

@aifartist How do sdp and xformers compare at 1024x1024 on the 4090? I imagine either way you're cpu limited at 512x512, per your earlier testing.

xformers 9.55 it/s
sdp 9.52 it/s
Basically the same

winthropharvey · 2023-03-08T03:08:13Z

winthropharvey
Mar 8, 2023

I installed torch 2.1 and cuda 11.8, as well as latest cudnn, and built my own xformers. Extremely good speedup on 4090. However, I was annoyed that I had to do the install and build TWICE because unless you add in --skip-install (and change the requirements file in launch.py to requirements.txt instead of requirements_version, I did both just in case) to the webui_user.bat the thing will immediately move to overwrite your shiny newly built stuff with torch 1.13 and old xformers on startup.

Is there a more elegant way to get webui to NOT do this besides this? I'm sure it will cause issues with future updates and extensions and I'd prefer a different way.

5 replies

aliencaocao Mar 8, 2023

Just comment out the function in launch.py saying setup environment

hallatore Mar 8, 2023

Got a short writeup what you did to install and set up the latest versions? And are you on windows?
Would love some speedup for my 4090.

hithereai Mar 8, 2023
Collaborator

Basically, clone auto to a new folder, change pip pytorch command in launch.py to fit the new one for pt2, then click the webui-user.bat.

Then replace cudnn files in venv-->torch-->lib with cudnn 8.7/8.7 zip file's folder BIN contents. OK to overwrite.

Then compile xformers for that same nighly pt2 build (might not be needed any-more, I just don't know better) and install it manually using pip install xformers____.whl

And add --xformers flag to webui-user.bat.

Done.

vladmandic Mar 8, 2023
Collaborator Author

packaged xformers that can be installed from pypi only ship for stable version of pytorch (1.13.1).
anything else requires manually building them.

btw, if it helps, those are my notes: https://github.com/vladmandic/automatic/wiki/Torch-Optimizations

winthropharvey Mar 9, 2023

Got a short writeup what you did to install and set up the latest versions? And are you on windows? Would love some speedup for my 4090.

Basically just followed the instructions at the top of the thread - had to slightly modify install instruction removing triton since no triton for window (also it got name changed anyway due to some namespace squatting or something). Also had to do some setting changes to get the compile to go through (file path was too long for git https://stackoverflow.com/questions/22575662/filename-too-long-in-git-for-windows), and needed to install some c++ build tools using a link from an error message. Then I modified the launch.py to point to requirements.txt instead of requirments_version.txt for requirements so that it wouldn't auto downgrade. I didn't bother to do anything in the optimize section as a lot of it assumes linux/triton.

pamparamm · 2023-03-10T08:36:00Z

pamparamm
Mar 10, 2023

SDP (#8367) can now be run with deterministic results (thanks to Sakura-Luna for pointing out), You can test it with --opt-sdp-no-mem-attention argument.

16 replies

Cyberbeing Mar 11, 2023

@pamparamm I looked into this a bit, and it seems that the reason why sdp_attnblock_forward results in no change in memory allocations compared to not using it (as you can see in your graphs) is because as implemented scaled_dot_product_attention_forward is unable to use the Memory Efficient or Flash kernels with attnblock_forward.

  File "F:\AI\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 492, in sdp_attnblock_forward
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
RuntimeError: No available kernel.  Aborting execution.

The above error is produced if you disable every kernel except mem efficient for sdp_attnblock_forward. Tried the same with the flash kernel, same error. Only with the sdp math kernel does sdp_attnblock_forward function. This likely means there may not be much of a point in including sdp_attnblock_forward as-implemented. And rather hope in the future pytorch will eventually expand the support of their SDP Memory Efficient kernel so it can suport things such as ldm.modules.diffusionmodules.model.AttnBlock.forward. Otherwise it seems you may need implement a custom Memory Efficient hijack wrapper for the SDP Math Kernel, if that is even possible.

And yes, out of interest that error only appears after 100% when all steps have completed and all that remains in the final VAE decode and merge processing, so it's definitely related somehow or rather only used during those final memory allocations. This likely makes it an important spot to have a Memory Efficient implementation, since that is where most users with low VRAM will get OOM errors.

Sakura-Luna Mar 11, 2023
Collaborator

I tested that mem attention does not work in attnblock, so implementing sdp attnblock is useless to save VRAM. You may need to remove it to avoid a bad situation.

Sakura-Luna Mar 11, 2023
Collaborator

Have you compared the VRAM consumption before and after attnblock is turned on?

pamparamm Mar 11, 2023

  File "F:\AI\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 492, in sdp_attnblock_forward
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
RuntimeError: No available kernel.  Aborting execution.
The above error is produced if you disable every kernel except mem efficient for sdp_attnblock_forward. Tried the same with the flash kernel, same error. Only with the sdp math kernel does sdp_attnblock_forward function. This likely means there may not be much of a point in including sdp_attnblock_forward as-implemented. And rather hope in the future pytorch will eventually expand the support of their SDP Memory Efficient kernel so it can suport things such as ldm.modules.diffusionmodules.model.AttnBlock.forward. Otherwise it seems you may need implement a custom Memory Efficient hijack wrapper for the SDP Math Kernel, if that is even possible.

Same error can be produced if you repeat this process to scaled_dot_product_no_mem_attention_forward - it requires math kernel as well. Yet there is a speedup boost if you enable other kernels along with math kernel. So the described method of testing is not sufficient enough to make conclusions, I think the best way would be to count total calls to scaled_dot_product_attention and then count successful calls to each individual kernel.

Sakura-Luna Mar 11, 2023
Collaborator

Yet there is a speedup boost if you enable other kernels along with math kernel.

Attnblock will only execute one or two times (Hires ON) for a picture, and it can only work in the Math kernel, so if Attnblock cannot guarantee to save VRAM consumption when enabled, it is useless.

youngercloud · 2023-03-13T11:53:02Z

youngercloud
Mar 13, 2023

Sharing some updates of accelerate 0.17.0/0.18.0.dev0 from my side.

Graphics Card: PNY Nvidia 4080
Arch: AMD64
CPU: The AMD Ryzen 9 3900X 3.8 GHz
System: Linux Mint 21 5.15.0-67-generic
Python: 3.10.6

torch: 2.1.0.dev20230313+cu118
xformers: 0.0.17+b89a493.d20230310
accelerate: 0.18.0.dev0

I've noticed accelerate==0.17.0 was just released 3 days ago. The community said that it would have good support for Pytorch 2. However, it turns out that errors are encountered when I run the accelerate test command. The following configurations for accelerate config have been used

Do you wish to optimize your script with torch dynamo?[yes/NO]:yes                                                                                                                                                                            
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Which dynamo backend would you like to use?                                                                                                                                                                                                   
fx2trt                                                                                                                                                                                                                                        
Do you want to customize the defaults sent to torch.compile? [yes/NO]: yes                                                                                                                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Which mode do you want to use?                                                                                                                                                                                                                
max-autotune

Other options use default settings.

accelerate test result:

Error
assert idxs[0] == 0, "Main process was not first."
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

I noticed that there is a problem with the testing script that might probably cause a false negative. Developers soon merged a fixed testing script to the main branch.

Updated my requirements_versions.txt points to the GitHub main branch, and finally, installed accelerate==0.18.0.dev0

By using the same config and doing the accelerate test again, the result finally reported Test is a success!

Disappointingly, there has been no improvement in my performance on inference. Before enabling accelerate (set export ACCELERATE="True"), and with arguments: --xformers --opt-channelslast, model v1-5-pruned-emaonly.safetensors, default settings (empty prompt/512x512, etc), with 10 batches of images, I am able to get 24.90 it/s averagely from 8 images (starting to calculate from the 3rd image)

After enabling the accelerate with the above settings, I only get 25.58375 it/s. I don't think this level of performance enhancement is good, and I'm not even sure I'm getting a "real" performance enhancement since this is just a sample set of comparisons.

Anyone want to share their journey with the latest version of accelerate, many thanks for the discussion and contribution.

1 reply

youngercloud Mar 13, 2023

Updated in here

vladmandic · 2023-03-13T14:39:24Z

vladmandic
Mar 13, 2023
Collaborator Author

Unless you can use accelerate to run on multi-gpu or distributed system, I don't think it has any effect at all

Here are some of my non-extensive tests:
Note that all results are pretty much within margin of error...

No accelerate -> 8.67 it/s <- baseline

pip uninstall accelerate && python launch.py --xformers
accelerate==0.12 -> 8.69 it/s
accelerate==0.16 -> 8.74 it/s
accelerate==0.17 -> 8.72 it/s
accelerate==0.18.0.dev0 dynamo: no -> 8.77 it/s
accelerate==0.18.0.dev0 dynamo: inductor -> 8.74 it/s
accelerate==0.18.0.dev0 dynamo: fx2trt -> 8.73
accelerate==0.18.0.dev0 dynamo: nvfuser -> 8.72

As it is, I don't think accelerate dynamo support has any effect on sd model at all

Note:

accelerate config was run before each test
accelerate test is passing all tests using 0.18.0.dev0

Env:

OS: Windows 11 with WSL2 running Ubuntu 22.04
Python: 3.10.6
Torch: 2.1.0.dev20230310+cu118
CUDA: 11.8
cuDNN: 8.8
Xformers: 0.0.17+b89a493.d20230310

2 replies

youngercloud Mar 13, 2023

Thank you for testing and for your reply. I just saw this PR, so I may know why there is no increase in speed. This PR provides an easy way to use the accelerate launch, but it does not provide the necessary code that is used by accelerate, as the comments indicate.

While accelerate can play an important role in multi-GPU and distributed systems, for a single GPU, if the purpose is just to use Dynamo in Torch 2 via accelerate, in my humble opinion, there should be a speed-up.

Accelerate developers released version 0.17.1 about an hour ago, so we can use this version instead of 0.18.0.dev0 now.

I will try to write accelerate code into launch.py to see if I can get a speed-up once I get some time.

By the way, on another topic, I noticed your system. On my side, Windows 11 22H2 + WSL2 (or just using Windows 11 22H2) is a serious performance bottleneck for me. With my environment, I only get 14it/s, but after switching to Linux Mint with the exact same environment, I can get 24it/s."

vladmandic Mar 13, 2023
Collaborator Author

Comparing to other ppl using same GPU (there's quite a few in performance DB that grew during past few weeks since I created it), I'm getting better performance than any, so I'd say WSL works just fine ;)

Cyberbeing · 2023-03-15T01:29:51Z

Cyberbeing
Mar 15, 2023

While testing --opt-sdp-no-mem-attention on a photorealistic model I noticed by chance that one of the cuda matmul options which is enabled by default in torch-2.0.0.dev20230228+CUDNN 8.8.0 results in moderate to minor distortions of fine details. Testing a bit further, I noticed this occurs no matter what attention method is used, so it's not unique to sdp.

The offending option is torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction

Set this to torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction=False to resolve the issue.

For good measure also set torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction=False, but for inference this likely won't matter unless you have a full BF16 mixed precision workflow.

Note that this will result a minor change to existing seeds (some details, edges, objects will slightly change in shape), but from my quick tests all changes made were objective improvements. On my RTX A4000, I did not notice any reduction of FP16 interference speed setting both options to false. YMMV

For reference, the following is what I currently do to manage all these torch cudnn/cuda variables:

def enable_misc_optimizations():
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.benchmark_limit = 1
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
    torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
    if torch.backends.cudnn.benchmark:
        print("Enabled CUDNN Benchmark Sucessfully")
    else:
        print("CUDNN Benchmark Disabled")
    if torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32:
        print("Enabled CUDA & CUDNN TF32 Sucessfully")
    else:
        print("CUDA & CUDNN TF32 Disabled")
    if not torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction and not torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction:
        print("CUDA Matmul fp16/bf16 Reduced Precision Reduction Disabled")
    else:
        print("CUDA Matmul fp16/bf16 Reduced Precision Reduction Expected Value Mismatch")

enable_misc_optimizations()

I just delete the entire enable_tf32 def below in devices.py and replace it with my code above in the same location:

stable-diffusion-webui/modules/devices.py

Lines 63 to 76 in a9fed7c

    
           def enable_tf32(): 
        
               if torch.cuda.is_available(): 
        
                   # enabling benchmark option seems to enable a range of cards to do fp16 when they otherwise can't 
        
                   # see https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4407 
        
                   if any([torch.cuda.get_device_capability(devid) == (7, 5) for devid in range(0, torch.cuda.device_count())]): 
        
                       torch.backends.cudnn.benchmark = True 
        
                   torch.backends.cuda.matmul.allow_tf32 = True 
        
                   torch.backends.cudnn.allow_tf32 = True 
        
           errors.run(enable_tf32, "Enabling TF32")

9 replies

Cyberbeing Mar 15, 2023

Indeed. I tried generating a 1024x768 image and the difference was indeed huge (while maintaining the same overall structure).
Here is a better example (but not a great image) from one of my failed generations:

Compare Link: https://slow.pics/c/OPg4m5DW

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction=False

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction=True

You can clearly see that the torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction=False version is more detailed (especially in shadows) while the True version appears more deformed and off looking in general when examining the details.

dsully Mar 15, 2023

I took a look at the python docs https://peps.python.org/pep-0008/#programming-recommendations, and it seems that both == > True and is True and they even consider is True as worse. They recommend just using if booleanvariable which will only succeed if the boolean is True. Checking for None I don't believe applies to these booleans, since when I checked the pytorch code awhile back the default values were True. I've fixed the above code snippet.

Correct. All Python linters (pylint, flake8, ruff) will warn about this. Thanks!

Panchovix Mar 18, 2023

Really thanks for this! It did the difference, and also thanks for the code itself.

Cyberbeing Apr 20, 2023

@CCpt5 It looks like you deleted your comment as you must have discovered it was unrelated, but just to confirm, I tested with the latest ControlNet 1.1 a2a414a (Thu Apr 20 05:52:13 2023) extension update with various ControlNet models in PyTorch 2.1.0 nightly and didn't see any issue with reduced precision reduction set to False. It behaved the same as without ControlNet, with the output of reduced precision reduction disabled having some fine details and edges of objects sharper and more detailed then enabled.

It's likely you just ended up pulling a bad commit, as the ControlNet extension has been going through rapid major updates over the past few days in the process of migrating from ControlNet 1.0 to 1.1 as well as the introduction of new 1.1 models, though 1.0 models should continue to work.

CCpt5 Apr 20, 2023

@Cyberbeing - Yea I did delete my comment as it didn't have to do w/ your code change here (which I went back to and definitely looks better). Not sure what my issue was/is. Sessions w/ CN start out great/as intented, but often it starts producing incorrect generations over time. This was happening on two installs I have (one pytorch 2 and one on the suggested version)......regrabbing all of the models/yamls and uninstalling/resinstalling the NVIDIA stuff seems to have it working properly for now.....hopefully it maintains through long sessions.

Sorry for any confusion - and appreciate your original comment and followup.

aifartist · 2023-03-15T05:51:49Z

aifartist
Mar 15, 2023

This is (hopefully) start of a thread on PyTorch 2.0 and benefits of model compile which is a new feature available in torch nightly builds

Builds on conversations in #5965, #6455, #6615, #6405

This thread has split off into many different paths. Most having nothing to do with Torch v2.x compile.

2 replies

Cyberbeing Mar 15, 2023

Do you think Pytorch 2.0 General Discussion (non-compile related) should be branched off into a separate Discussion?

vladmandic Mar 15, 2023
Collaborator Author

that's the nature of any unmoderated discussion with high number of views. it trends high on discussions list, so people post anything there.
if you branch out to new discussions, it may solve the problem temporarily or those branches may sizzle out. either way.

FurkanGozukara · 2023-03-18T15:38:21Z

FurkanGozukara
Mar 18, 2023

24.) Automatic1111 Web UI - PC - Free
For downgrade to older version if you don't like Torch 2 : first delete venv, let it reinstall, then activate venv and run this command pip install -r "path_of_SD_Extension\requirements.txt"
How To Install New DREAMBOOTH & Torch 2 On Automatic1111 Web UI PC For Epic Performance Gains Guide

this method installs latest cuda dll files too

test py

import torch

print(f"torch {torch.__version__}, cuda {torch.version.cuda}, cudnn {torch.backends.cudnn.version()}")

5 replies

aifartist Mar 18, 2023

This has nothing to do with "torch.compile()"
It is about torch itself and it is incorrect that torch 2.0 provides an EPIC perf gain.

FurkanGozukara Mar 18, 2023

This has nothing to do with "torch.compile()" It is about torch itself and it is incorrect that torch 2.0 provides an EPIC perf gain.

this also installs updated cuda dlls. so it is an overall statement

aifartist Mar 18, 2023

I was just pointing out that the EPIC perf gain has been known for several months since a couple of us figured out the cuDNN(not cuda) problem. Also, only one of the two torch 2.0.0's bundle the correct libraries. The default repo will NOT give you the cuDNN 8.7.
But all is good. No problem.

FurkanGozukara Mar 18, 2023

I was just pointing out that the EPIC perf gain has been known for several months since a couple of us figured out the cuDNN(not cuda) problem. Also, only one of the two torch 2.0.0's bundle the correct libraries. The default repo will NOT give you the cuDNN 8.7. But all is good. No problem.

I see thank you. I didn't know these before and many of my viewers

FurkanGozukara Mar 19, 2023

@aifartist i have tested and 8.7 is installed with my shown method

updated my post

aifartist · 2023-03-19T23:11:31Z

aifartist
Mar 19, 2023

Not bad for my first try with the GA torch.compile(). The second try if you count the torch/_dynamo/guards.py bug I have to fix to get it to work.
100%|████████████| 100/100 [00:01<00:00, 50.04it/s]
100%|████████████| 100/100 [00:01<00:00, 50.29it/s]
100%|████████████| 100/100 [00:01<00:00, 50.39it/s]
100%|████████████| 100/100 [00:01<00:00, 50.45it/s]
100%|████████████| 100/100 [00:01<00:00, 50.18it/s]
100%|████████████| 100/100 [00:01<00:00, 50.10it/s]
100%|████████████| 100/100 [00:01<00:00, 50.15it/s]
100%|████████████| 100/100 [00:01<00:00, 50.28it/s]

3 replies

vladmandic Mar 20, 2023
Collaborator Author

Compared to?
Where did you add compile statements?
Which backend (I assume inductor?)

aifartist Mar 20, 2023

Compared to the about 40 it/s I get without a compile. Basic sd2.1 model, euler_a, 512x512 and a modest prompt. My usual baseline test.
In sd_hijack.py with the try block you provided previously. I assume inductor is the default.
m.model = torch.compile(m.model, mode="max-autotune", fullgraph=False)

Also, I finally figured out the nuances of Turbo Boost 3.0 and why I was only seeing 5.5 GHz and not 5.8(Too many chrome windows). This gives me:
100%|███████████| 20/20 [00:00<00:00, 51.31it/s]
100%|███████████| 20/20 [00:00<00:00, 51.28it/s]
100%|███████████| 20/20 [00:00<00:00, 51.14it/s]
100%|███████████| 20/20 [00:00<00:00, 51.28it/s]

Cyberbeing Mar 22, 2023

Out of curiosity, how long does the torch.compile process take before inference starts? During my failed tests on Windows native last month where compile fell back from CUDA to CPU eager, torch took a really really long time to compile the model on my i5-3570K @4.4Ghz before inference even started.

Does CUDA inductor have faster compile times than CPU eager prior to inference starting?

Are model compile results cached, or does it always compile over and over every time you run inference with the same model and settings? It seemed to not cache anything during my prior test.

The next time you post results it may be useful to also compare the webui Time taken: with and without torch.compile, to get a better idea of the real-world speed advantage. It may also be interesting if you could artificially limit your CPU to only 4 and 8 active cores in your BIOS to see what kind of impact that has on compile.

I may make another attempt at seeing if I can get Triton working on Windows native with the newly released LLVM 16.0.x in a few weeks (currently I'm awaiting some P0 Clang bugs to be fixed first). I saw in late February that Triton accepted a pull request with a few Windows fixes, so I wonder if it's functional now. Previously my issue was that Triton attempting to load a compiled GPU kernel crashed Python.

vladmandic · 2023-03-20T12:13:40Z

vladmandic
Mar 20, 2023
Collaborator Author

@aifartist i've tried again using torch 2.0 ga and sd 1.5 default model, still no luck at all with pretty much any backend...

torch.compile(m.model, mode="max-autotune", backend="inductor", fullgraph=False, dynamic=False)

torch._inductor.utils: [WARNING] not enough cuda cores to use max_autotune mode

torch.compile(m.model, mode="default", backend="inductor", fullgraph=False, dynamic=True)

File "repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/util.py", line 166, in timestep_embedding
  -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half

m.model = torch.compile(m.model, mode="default", backend="aot_ts_nvfuser", fullgraph=False, dynamic=False)

benchmark error: 4 The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.35", line 14, in forward
    _to_copy_1 = torch.ops.aten.to(unsqueeze, dtype = torch.float32);  unsqueeze = None
    unsqueeze_1 = torch.ops.aten.unsqueeze(_to_copy, 0);  _to_copy = None
    mul_1 = torch.ops.aten.mul(_to_copy_1, unsqueeze_1);  _to_copy_1 = unsqueeze_1 = None

m.model = torch.compile(m.model, mode="default", backend="nvprims_nvfuser", fullgraph=False, dynamic=False)

torch/_prims/nvfuser_executor.py:416: RuntimeWarning: No partition found for the graph. This is likely because the graph is not supported by nvFuser. Please use the eager ATen mode to execute the graph.

torch.compile(m.model, mode="default", backend="inductor", fullgraph=False, dynamic=False)

no error, but silent failure

torch.compile(m.model, mode="default", backend="cudagraphs", fullgraph=False, dynamic=False)

no error, but silent failure

re: turbo boost - it's pretty much expected, single core boost is typically higher than multicore boost. i have 12900k with aio and set my single core to 52x and multicore to 50x.

4 replies

aifartist Mar 20, 2023

@vladmandic FYI, the majority of the features listed on the Torch GA 2.0.0 release page are listed as beta or of prototype quality. I've worked 40 years for commercial software companies and this seems like a rushed-out-the-door joke. I commented about this on https://discuss.pytorch.org/

Given the cuda core error message are you using a v100? On a 4090:

import torch
torch.cuda.get_device_properties(0).multi_processor_count
128

Does the v100 have less than 80 which would trigger the error message? If you look at the source code this might be the only(?) GPU that it doesn't like. The is 12 years old but might help https://stackoverflow.com/questions/5217167/how-many-cuda-cores-does-each-multiprocessor-of-a-gpu-have
I don't know much about this.

To make inductor work without a hard crash I make the change seen below. Could this be your silent crash?
venv/lib/python3.10/site-packages/torch/_dynamo/guards.py

    rename_implicit,
    tuple_iterator,   # I ADDED THIS
    tuple_iterator_getitem,

AND replace the code.append() line commented out below with the if/else below. This bug has been reported to pytorch github.

def LIST_LENGTH(self, guard):
        ...
        #code.append(f"len({ref}) == {len(value)}")
        if istype(value, tuple_iterator):
            code.append(f"___tuple_iterator_len({ref}) == {tuple_iterator_len(value)}")
        else:
            code.append(f"len({ref}) == {len(value)}")

vladmandic Mar 20, 2023
Collaborator Author

@aifartist i'm not currently testing dynamo in my lab with a100, only on my home pc which has rtx3060 and torch reports multi_processor_count=28 which is number of streaming multiprocessors, nothing to do with cuda cores (which happens to be 3584). seems that dynamo is basically checking for the wrong thing as number of SMs is irrelevant here.

and regarding the torch bug you've mentioned pytorch/pytorch#93405 - its nearly 2 months old with a known root cause for 3 weeks, i would expect it to be more than "triaged" by now, especially since compile is the biggest announcement in torch 2.0

vladmandic Mar 20, 2023
Collaborator Author

created a torch issue pytorch/pytorch#97179 to fix cuda cores detection.
seems like i'll give it a month or two before revisiting torch.compile()

ivan-kulikov-dev Apr 3, 2023

@vladmandic pytorch/pytorch#97179 was merged (pytorch/pytorch@6dded5d)

if-ai · 2023-03-21T15:03:30Z

if-ai
Mar 21, 2023

This is not possible without WSL because torch can't compile on windows yet. I am tired of caped performance on windows, I know cero about Linux but is time for me to learn I saw @aifartist comments on reddit and I agree it would be too hard to explain all this, the big companies let people like me out in the cold is a bad practice all I can do is rely on threads like this to get things done.

1 reply

vladmandic Mar 21, 2023
Collaborator Author

This is not possible without WSL because torch can't compile on windows yet

yes it can, but its complex.
even on wsl or linux, compiling torch is a nightmare as its build process depends on anaconda which i simply refuse to use.

I am tired of caped performance on windows

there is no caped performance on windows and for sure you don't need torch 2.0 with torch.compile() for that.
so instead of attempting things that you may not even need, focus on fixing whats broken.

the big companies let people like me out in the cold is a bad practice all I can do is rely on threads like this to get things done.

anything cutting edge is complex and/or buggy exactly because experienced people did not yet have time to polish it.
if you don't feel comfortable with that, use older/proven tech. sorry for the tone, its just there are too many ppl experimenting with untested/unproven tech without any understanding what it involves.

if-ai · 2023-03-21T18:56:31Z

if-ai
Mar 21, 2023

Yes, you got my meaning all wrong, I know, I have compiled torch before to get dreambooth running and yes it was hard but that was Torch not Torch 2. All people say Linux is easier and I bet it is. So I thought to say thank you for your effort putting this guide out but forget it. I guess I get this for simping. Never again.

…

On Tue, Mar 21, 2023, 4:23 PM Vladimir Mandic ***@***.***> wrote: This is not possible without WSL because torch can't compile on windows yet yes it can, but its complex. even on wsl or linux, compiling torch is a nightmare as its build process depends on anaconda which i simply refuse to use. I am tired of caped performance on windows there is no caped performance on windows and for sure you don't need torch 2.0 with torch.compile() for that. so instead of attempting things that you may not even need, focus on fixing whats broken. the big companies let people like me out in the cold is a bad practice all I can do is rely on threads like this to get things done. anything cutting edge is complex and/or buggy exactly because experienced people did not yet have time to polish it. if you don't feel comfortable with that, use older/proven tech. sorry for the tone, its just there are too many ppl experimenting with untested/unproven tech without any understanding what it involves. — Reply to this email directly, view it on GitHub <#6932 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFBUFQSRX6IYXNBYXQ2YRB3W5HIYDANCNFSM6AAAAAAUAM6I4Y> . You are receiving this because you commented.Message ID: <AUTOMATIC1111/stable-diffusion-webui/repo-discussions/6932/comments/5383301 @github.com>

4 replies

vladmandic Mar 21, 2023
Collaborator Author

sorry if i misunderstood you

if-ai Mar 21, 2023

Is okay, you are also correct in most of what you said.

if-ai Apr 8, 2023

Hi, did you delete your dynamo test script? I can seem to access it. So I wrote a standalone test script to evaluate all the different backens:
https://github.com/vladmandic/automatic/blob/master/cli/modules/dynamotest.py

vladmandic Apr 8, 2023
Collaborator Author

No, just moved it, cli/random

FurkanGozukara · 2023-03-29T18:58:07Z

FurkanGozukara
Mar 29, 2023

@vladmandic i am trying to run dreambooth on runpod

unfortunately pytorch team removed xformers older version
i cant believe how smart they are
now we have to use torch 2
however it is not working on runpod

here the errors and steps i tried to solve the problem

I have installed Torch 2 via this command on RunPod io instance

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Everything installed perfectly fine

With Torch 1 and Cuda 11.7, I was not getting any error but with Torch 2 the below error produced

Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory

How to fix?

It is using unix

On Windows same prodecure working very well

Using Automatic1111 web UI to use Stable Diffusion

this above i couldnt solve

therefore i have done the following things

apt update
apt install sudo
sudo apt install nvidia-cudnn
sudo apt-get install python3-dev

after installing all above

now i have this warning and training never progress

Steps: 0%| | 0/170 [00:00<?, ?it/s][2023-03-29 18:50:26,163] torch._inductor.utils: [WARNING] not enough cuda cores to use max_autotune mode

now when i run below python code i see everything looking good

import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available")
    # Display the current GPU name
    print("GPU name: ", torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print("CUDA is not available")

# Verify the PyTorch version
print("PyTorch version: ", torch.__version__)

import torch
print(torch.cuda.get_device_properties(0).multi_processor_count)

test.py result

CUDA is available
GPU name:  NVIDIA RTX A4500
PyTorch version:  2.0.0+cu118
56

it is able to generate images with 15.58it which is very fast

any help appreciated very much

0 replies

vladmandic · 2023-03-29T19:23:55Z

vladmandic
Mar 29, 2023
Collaborator Author

libnvrtc.so is part of cuda, so it should be in torch. but if its auto-installed on your system, you may need to download and install cuda 11.8 manually (download from nvidia).

regarding inductor warning - its just a warning so it can be ignored. but whats much stranger, is that warning only happens when you try to run torch.compile which is NOT done by default. that means you've added that part yourself - i'd advise not to do it unless you know exactly what it does.

4 replies

FurkanGozukara Mar 29, 2023

Have you read all the actions I have taken? I used that command to install torch 2. It works on Windows but not on runpod

I didn't add any part manually these are all steps I did

So any ideas how to add libnvrtc.so on runpod?

vladmandic Mar 29, 2023
Collaborator Author

That is EXACTLY what I wrote.

FurkanGozukara Mar 29, 2023

That is EXACTLY what I wrote.

OK tell me the steps I have to make please if you know

vladmandic Mar 29, 2023
Collaborator Author

go to nvidia.com, find cuda 11.8, follow the steps to download and install.

EbenezerWS · 2023-10-12T17:33:01Z

EbenezerWS
Oct 12, 2023

This is (hopefully) start of a thread on PyTorch 2.0 and benefits of model compile which is a new feature available in torch nightly builds

Builds on conversations in #5965, #6455, #6615, #6405

TL;DR

PyTorch 2.0 with Accelerate and XFormers works pretty much out-of-the-box, but it needs newer packages But only limited luck so far using new torch.compile although made some progress

Install

First, this is written for torch 2.0 with cuda 11.8 If you want to use CUDA 11.7, modify install paths accordingly, but older versions will likely not work (and neither will CUDA 12 as there is no support for it in torch just yet)

Btw, my environment is RTX3060 inside WSL2 (Ubuntu 22.04) on Windows 11, so your mileage/results may vary

1. CUDA

install CUDA 11.8 with latest cuDNN

2. Triton

If you have default OpenAI version of triton, uninstall it before installing torch as torch 2.0 comes with its own version of triton

pip uninstall triton

3. Torch

Install Torch nightly

pip3 install --pre torch torchvision torchaudio torchtriton --extra-index-url https://download.pytorch.org/whl/nightly/cu118 --force
pip show torch
2.0.0.dev20230113+cu118

4. Accelerate

Update Accelerate for Torch 2.0 compatibility as version specified in requirements_versions.txt

pip install -U accelerate==0.15.0

And don't forget to update requirements_versions.txt so webui doesn't auto-downgrade accelerate version

5. Xformers

Rebuild XFormers

Relying on pre-built wheels is not really an option since xformers get linked to specific torch version which changes daily Plus rebuild only takes few min, so why bother with wheels (just make sure you have build requirements before-hand).

pip install ninja setuptools
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
python -m xformers.info

And thats it, WebUI is happy to work with new libs out-of-the-box

Optimize

1. Accelerate

But now onto main reason why even try torch nightlies:

Torch includes dynamic compiler/optimizer which is only available in nightly builds: Dynamo

If you're lucky few, you may be able to configure Accelerate to use Dynamo

accelerate config

Do you wish to optimize your script with torch dynamo?[yes/NO]: yes

Which dynamo backend would you like to use? inductor

accelerate test

I haven't had luck getting accelerate test to complete which means that dynamo will NOT be used.

2. Compile

So lets do a manual config:

We need to setup torch.compile and best spot I've found so far is NOT in SD model load, but slightly afterwards due to function hijacking that happens in WebUI

For example, in modules/sd_hijack function def hijack, just before self.optimization_method = apply_optimizations()
try:
    import torch._dynamo as dynamo
    torch._dynamo.config.verbose = True
    torch.backends.cudnn.benchmark = True
    m.model = torch.compile(m.model, mode="max-autotune", fullgraph=False)
    print("Model compiled set")
except Exception as err:
    print(f"Model compile not supported: {err}")
Notes:

dynamo must be imported explicitly or namespace is not available in runtime

compile applies to actual model,
not to parent sd_model as that is entire pipeline, not model itself

compile time is irrelevant as it only marks model for compilation during first execution

fullgraph cannot be forced as there are parts of diffusion model that cannot be compiled,
so this internally allows it to split model into compiled+uncompiled graphs

cudnn.benchmark should be set so optimizer can perform precise execution timings

initial execution will always be slower as that is when compile actually happens
any benefits would be seen in subsequent calls

Result? In my case its the same error as with accelerate test Not great...

3. Digging Deeper

Default (and recommended) dynamo backend for torch.compile is inductor, but no matter what I cannot get inductor to work on my system

Error is in triton which fails with silly error:

RuntimeError: CUDA: Error- no device

And at this point I'm not sure if triton is broken for torch 2.0, even if its installed from the same nightly

So I wrote a standalone test script to evaluate all the different backens: https://github.com/vladmandic/automatic/blob/master/cli/modules/dynamotest.py

This tests and benchmarks all possible dynamo backends, but I'm focusing on couple only:

default: eval in 4.247 ms

ofi: eval in 3.820 ms
uses TorchScript set for optimize_for_inference
this is basically same as default, but with some voodoo-magic regarding just-in-time ops and freeze, etc.
most likely not compatible with training, so cannot be used with dreambooth

aot_cudagraphs: eval in 6.460 ms
uses cudagraphs with AotAutograd
seems slower as no-compile

inductor: fail
uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels
error: RuntimeError: CUDA: Error- no device

fx2trt: fail
uses nVidia TensorRT
error: ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
seems like tensorrt is not yet compatible with torch 2.0 (yes, shared library does exist)

Now...All this uses an off-the-shelf model (resnet18) to evaluate, next step would be to apply it to stable diffusion itself...

And I'd be curious to hear what your test results look like?

Btw, good getting-started doc is in torch code: https://github.com/pytorch/pytorch/blob/4f4b62e4a255708e928445b6502139d5962974fa/docs/source/dynamo/get-started.rst

Hello there,
I'm a big newbie in the Stable diffusion universe aha but I would like to improve some performances.
I guess facing a big issue with a1111, my 3070 Ti is getting 2its/s when processing img2img when I read some getting at least 10 to 40 it/s...

I'm using 576*1024, euler a, few controlnet ( softedge and temporalnet)
is that normal or should I have better performances ?
when i try to use img2img batch i't's so slow than it's showin 10seconds/it
If no,

I'd like to understand how/where to install/execute all your command lines for exemple
I get this resut when trying to know my torch version : " pip3 show torch
WARNING: Package(s) not found: torch"
Should I show a precise path in the cmd line ?

Thx for the reading and hope getting better results :)

0 replies

PyTorch 2.0 and Torch Compile general discussion #6932

vladmandic Jan 19, 2023 Collaborator

TL;DR

Install

1. CUDA

2. Triton

3. Torch

4. Accelerate

5. Xformers

Optimize

1. Accelerate

2. Compile

3. Digging Deeper

Replies: 37 comments · 164 replies

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

vladmandic Jan 19, 2023 Collaborator Author

2. Compile

vladmandic Jan 22, 2023 Collaborator Author

vladmandic
Jan 19, 2023
Collaborator

Replies: 37 comments 164 replies

vladmandic Jan 19, 2023
Collaborator Author

vladmandic
Jan 19, 2023
Collaborator Author

vladmandic Jan 19, 2023
Collaborator Author

vladmandic Jan 19, 2023
Collaborator Author

vladmandic Jan 19, 2023
Collaborator Author

vladmandic Jan 19, 2023
Collaborator Author

vladmandic
Jan 19, 2023
Collaborator Author

vladmandic Jan 22, 2023
Collaborator Author