PyTorch 2.0 and Torch Compile general discussion #6932

---
Can't find

---
```json
{
"default": 2.815999984741211,
"ansor": 2.815135955810547,
"aot_cudagraphs": 2.8175359964370728,
"aot_eager": 2.8190720081329346,
"aot_inductor_debug": 2.8180480003356934,
"aot_torchxla_trace_once": 2.8190720081329346,
"aot_torchxla_trivial": 2.8190720081329346,
"aot_ts": 2.821120023727417,
"aot_ts_nvfuser": 2.817199945449829,
"aot_ts_nvfuser_nodecomps": 2.8190720081329346,
"cudagraphs": 2.8180480003356934,
"cudagraphs_ts": 2.818560004234314,
"cudagraphs_ts_ofi": 2.8180480003356934,
"eager": 2.8187040090560913,
"fx2trt": 2.8139519691467285,
"inductor": 2.805759906768799,
"ipex": 2.7996160984039307,
"nnc": 2.798080086708069,
"nnc_ofi": 2.8002400398254395,
"nvprims_aten": 2.8016480207443237,
"nvprims_nvfuser": 2.7972320318222046,
"ofi": 2.800800085067749,
"onednn": 2.803199887275696,
"onnx2tensorrt": 2.801664113998413,
"onnx2tf": 2.798080086708069,
"onnxrt": 2.8001281023025513,
"onnxrt_cpu": 2.801664113998413,
"onnxrt_cpu_numpy": 2.8011521100997925,
"onnxrt_cuda": 2.7985920906066895,
"static_runtime": 2.7997440099716187,
"taso": 2.7985920906066895,
"tensorrt": 2.795184016227722,
"torch2trt": 2.8078079223632812,
"torchxla_trace_once": 2.8175359964370728,
"torchxla_trivial": 2.815999984741211,
"ts": 2.8165119886398315,
"ts_nvfuser": 2.8139519691467285,
"ts_nvfuser_ofi": 2.822144031524658,
"tvm": 2.8156319856643677,
"tvm_meta_schedule": 2.8149759769439697
}
```

---
@ataa impressive that every single backend is working! what is your environment?

---
Good stuff! So I can get
My initial results are quite a failure, gonna mess with my environment and see if I can fix them:

---
Also, will

---
Running: `accelerate-launch --config_file=None /home/agp/stable-diffusion-webui/venvtorch20-cu118/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py`

noice !!!

---
Gonna come back to this later, but here are the issues I am having with the various backends:
Apache TVM in particular seems to be an issue, even with

---
@vladmandic Wow, that promises an enormous punch! 😃 Unfortunately I got an error when trying to install xformers. I am on Ubuntu 22.04 with an RTX 4090, and have been using the CUDA 11.8 cores for a while with xformers and also Torch 2.0 (but unoptimized), at 42 it/s for SD 1.5. But apparently this is now also possible on an RTX 3060, so I must do something. 😄
I reinstalled everything as described, but:

---
Using the manual method with torchdynamo gives me this error:

---
Why not use ONNX Runtime as a backend for Dynamo, which will export the model to ONNX and run it?
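That suggestion is actually already wired up as one of the dynamo backends. A minimal sketch of the idea, assuming the `onnxruntime` package is installed (`onnxrt` is the backend name that shows up in the benchmark lists in this thread):

```python
import torch

# toy stand-in module; any nn.Module goes through the same path
model = torch.nn.Linear(16, 16).eval()

# dynamo exports the captured graph to ONNX and runs it via ONNX Runtime
compiled = torch.compile(model, backend="onnxrt")
out = compiled(torch.randn(2, 16))
```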

---
update on using either straight
only way forward i can think of would be to run it in

so onto other possible backends...

---
For updating to PyTorch 2.0, the installation instructions do seem to help quite a lot in speeding up the

One issue I do have with PyTorch 2.0, which I had with the previous PyTorch 2.0 installation as well, is that when I try to increase the batch size to anything more than 1 when wanting to train an embedding, I'm hit with:
I've tried looking around but I haven't really been able to find anything to fix it. I saw this (https://github.com/Birch-san/stable-diffusion#patch) and tried doing something similar, but sadly it had no effect. In case it helps:

EDIT:

EDIT2: Testing it out again, it definitely seems to be happening due to something regarding the

If anybody has any clue on how to solve the error, or a direction to explore, I'm all up for debugging it a bit (in case anyone else runs into it).

EDIT3: BUT the actual batch size increase problem is still there, with the exact same error, so still wondering how to approach that.

EDIT4: I saw something on the

---
I've made a GitHub Actions workflow that compiles xformers, but uses nightly torch2: https://github.com/sonphantrung/abc/blob/main/.github/workflows/xformers.yml. You may want to edit the

---
EDIT: Red herring. Seems the two important steps are:
However, ultimately, you will still get the same error as #6932 (comment)

Test script:

```json
{
"default": 5.612623929977417,
"ansor": "error",
"aot_cudagraphs": 8.528383731842041,
"aot_eager": 5.482496023178101,
"aot_inductor_debug": 8.616447925567627,
"aot_torchxla_trace_once": "error",
"aot_torchxla_trivial": 6.626816034317017,
"aot_ts": 6.360575914382935,
"aot_ts_nvfuser": 5.486592054367065,
"aot_ts_nvfuser_nodecomps": 6.346751928329468,
"cudagraphs": "error",
"cudagraphs_ts": "error",
"cudagraphs_ts_ofi": "error",
"eager": 5.060096025466919,
"fx2trt": "error",
"inductor": 5.426176071166992,
"ipex": "error",
"nnc": "error",
"nnc_ofi": "error",
"nvprims_aten": 38.688255310058594,
"nvprims_nvfuser": "error",
"ofi": 5.050944089889526,
"onednn": "error",
"onnx2tensorrt": "error",
"onnx2tf": "error",
"onnxrt": "error",
"onnxrt_cpu": "error",
"onnxrt_cpu_numpy": "error",
"onnxrt_cuda": "error",
"static_runtime": "error",
"taso": "error",
"tensorrt": "error",
"torch2trt": "error",
"torchxla_trace_once": "error",
"torchxla_trivial": 5.648384094238281,
"ts": 5.2705278396606445,
"ts_nvfuser": "error",
"ts_nvfuser_ofi": "error",
"tvm": "error",
"tvm_meta_schedule": "error"
}
```
It seems to work on initial load for me, it's just that subsequent model swaps don't. See screenshot for the Torch version used:
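One thing that might be worth trying for the model-swap case (an assumption on my part, not a verified fix) is clearing dynamo's cached compilation state before loading the next checkpoint, then re-wrapping the new model:

```python
import torch
import torch._dynamo

torch._dynamo.reset()  # drop cached graphs/guards from the previous model
# ...load the new checkpoint here, then re-compile it, e.g.:
# model = torch.compile(model)
```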

---
if you have an ampere or higher gpu, the hardware l2 cache can be persisted (it's not by default) and it does help with performance

can anyone think of why allowing cache persistence on all l2 cache memory (value of 100%) would be a bad thing? afaik, if new data is needed, it still goes through cache, this just allows persistence if the same data is requested again?
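for anyone wanting to experiment: stock torch doesn't expose this directly as far as i know, but it can be set through the cuda runtime api. a sketch using nvidia's `cuda-python` bindings (the package and call pattern are my assumptions, so verify against your version):

```python
from cuda import cudart  # pip install cuda-python

# query the hardware cap on how much l2 can be set aside for persisting accesses
err, prop = cudart.cudaGetDeviceProperties(0)
max_persist = prop.persistingL2CacheMaxSize

# allow persistence over the full supported window (the "100%" case above)
err, = cudart.cudaDeviceSetLimit(
    cudart.cudaLimit.cudaLimitPersistingL2CacheSize, max_persist)
```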

---
@aifartist

---
I installed torch 2.1 and cuda 11.8, as well as the latest cudnn, and built my own xformers. Extremely good speedup on a 4090.

However, I was annoyed that I had to do the install and build TWICE, because unless you add --skip-install to webui_user.bat (and change the requirements file in launch.py to requirements.txt instead of requirements_version; I did both just in case), the thing will immediately overwrite your shiny newly built stuff with torch 1.13 and old xformers on startup. Is there a more elegant way to get webui to NOT do this? I'm sure this approach will cause issues with future updates and extensions and I'd prefer a different way.

---
SDP (#8367) can now be run with deterministic results (thanks to Sakura-Luna for pointing it out). You can test it with
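For reference, at the torch level this maps to disabling the non-deterministic memory-efficient kernel so the deterministic SDP kernels handle the call. A minimal sketch (shapes arbitrary, and the flag-to-kernel mapping is my reading of the discussion, not verified against the webui patch):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)

# keep the flash and math kernels, drop the mem-efficient one
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```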

---
Sharing some updates on accelerate.

Graphics Card: PNY Nvidia 4080
torch: 2.1.0.dev20230313+cu118

I've noticed
Other options use default settings.
I noticed that there is a problem with the testing script that might cause a false negative. The developers soon merged a fixed testing script to the main branch. Updated my

By using the same config and doing the

Disappointingly, there has been no improvement in my inference performance. Before enabling accelerate (set

After enabling accelerate with the above settings, I only get

Anyone want to share their journey with the latest version of accelerate? Many thanks for the discussion and contribution.
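For anyone wiring this up outside of `accelerate config`, the dynamo backend can also be requested programmatically. A sketch assuming an accelerate build with torch-2.0 dynamo support (the `dynamo_backend` argument; check your installed version):

```python
import torch
from accelerate import Accelerator

model = torch.nn.Linear(8, 8)  # toy stand-in for the real model

accelerator = Accelerator(dynamo_backend="inductor")  # or "no" to disable
model = accelerator.prepare(model)  # returned model is dynamo-wrapped
```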

---
Unless you can use

Here are some of my non-extensive tests:
As it is, I don't think accelerate dynamo support has any effect on the sd model at all.

Note:
Env:

---
While testing --opt-sdp-no-mem-attention on a photorealistic model, I noticed by chance that one of the cuda matmul options which is enabled by default in torch-2.0.0.dev20230228 + cuDNN 8.8.0 results in moderate to minor distortions of fine details. Testing a bit further, I noticed this occurs no matter what attention method is used, so it's not unique to sdp.

The offending option is

Set this to

For good measure also set

Note that this will result in a minor change to existing seeds (some details, edges, and objects will slightly change in shape), but from my quick tests all changes made were objective improvements. On my RTX A4000, I did not notice any reduction in FP16 inference speed from setting both options to false. YMMV.

For reference, the following is what I currently do to manage all these torch cudnn/cuda variables:
I just delete the entire enable_tf32 def below in devices.py and replace it with my code above in the same location: stable-diffusion-webui/modules/devices.py, lines 63 to 76 (at a9fed7c).
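The snippet itself didn't survive the copy here, but a minimal sketch of pinning these flags, assuming the options in question are the TF32 switches that enable_tf32 normally turns on, would be:

```python
import torch

# trade a little matmul/conv speed for precision: disable the TF32 paths
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```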

---
This thread has split off into many different paths, most having nothing to do with Torch v2.x compile.

---
24.) Automatic1111 Web UI - PC - Free

this method installs the latest cuda dll files too

test py

---
Not bad for my first try with the GA torch.compile(). The second try, if you count the torch/_dynamo/guards.py bug I had to fix to get it to work.

---
@aifartist i've tried again using torch 2.0 ga and sd 1.5 default model, still no luck at all with pretty much any backend...
re: turbo boost - it's pretty much expected, single core boost is typically higher than multicore boost. i have a 12900k with an aio and set my single core to 52x and multicore to 50x.

---
This is not possible without WSL because torch can't compile on Windows yet. I am tired of capped performance on Windows; I know zero about Linux but it is time for me to learn. I saw @aifartist's comments on reddit and I agree it would be too hard to explain all this. The big companies leaving people like me out in the cold is a bad practice; all I can do is rely on threads like this to get things done.

---
Yes, you got my meaning all wrong. I know, I have compiled torch before to get dreambooth running, and yes it was hard, but that was Torch, not Torch 2. All people say Linux is easier and I bet it is. So I thought to say thank you for your effort putting this guide out, but forget it. I guess I get this for simping. Never again.

On Tue, Mar 21, 2023, 4:23 PM Vladimir Mandic wrote:

> > This is not possible without WSL because torch can't compile on windows yet
>
> yes it can, but it's complex. even on wsl or linux, compiling torch is a nightmare as its build process depends on anaconda, which i simply refuse to use.
>
> > I am tired of capped performance on windows
>
> there is no capped performance on windows, and for sure you don't need torch 2.0 with torch.compile() for that. so instead of attempting things that you may not even need, focus on fixing what's broken.
>
> > the big companies let people like me out in the cold is a bad practice, all I can do is rely on threads like this to get things done
>
> anything cutting edge is complex and/or buggy exactly because experienced people did not yet have time to polish it. if you don't feel comfortable with that, use older/proven tech. sorry for the tone, it's just that there are too many ppl experimenting with untested/unproven tech without any understanding of what it involves.

---
@vladmandic i am trying to run dreambooth on runpod. unfortunately the pytorch team removed the older xformers version. here are the errors and the steps i tried to solve the problem.

I have installed Torch 2 via this command on a RunPod io instance:
Everything installed perfectly fine. With Torch 1 and CUDA 11.7 I was not getting any error, but with Torch 2 the below error is produced:
How to fix? It is using unix. On Windows the same procedure is working very well, using the Automatic1111 web UI to run Stable Diffusion.

The above I couldn't solve, therefore i have done the following things: apt update

after installing all of the above, now i have this warning and training never progresses:
now when i run the below python code i see everything looking good:
test.py result
it is able to generate images at 15.58 it/s which is very fast. any help appreciated very much

---
regarding

---
Hello there, I'm using 576*1024, euler a, and a few controlnets (softedge and temporalnet). I'd like to understand how/where to install/execute all your command lines, for example

Thanks for the reading, and I hope to get better results :)

---
This is (hopefully) the start of a thread on PyTorch 2.0 and the benefits of model compile, which is a new feature available in torch nightly builds.
Builds on conversations in #5965, #6455, #6615, #6405
TL;DR
PyTorch 2.0 with Accelerate and XFormers works pretty much out-of-the-box, but it needs newer packages
But I've had only limited luck so far using the new torch.compile, although I've made some progress
Install
First, this is written for torch 2.0 with cuda 11.8
If you want to use CUDA 11.7, modify install paths accordingly, but older versions will likely not work
(and neither will CUDA 12 as there is no support for it in torch just yet)
Btw, my environment is RTX3060 inside WSL2 (Ubuntu 22.04) on Windows 11, so your mileage/results may vary
1. CUDA
install CUDA 11.8 with latest cuDNN
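A quick sanity check that torch actually picked up the intended CUDA/cuDNN afterwards (just a sketch; exact version strings will vary by nightly):

```python
import torch

print(torch.__version__)               # expect a 2.x dev/nightly build
print(torch.version.cuda)              # expect "11.8"
print(torch.backends.cudnn.version())  # cuDNN build torch was compiled against
assert torch.cuda.is_available()
```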
2. Triton
If you have the default OpenAI version of `triton`, uninstall it before installing `torch`, as torch 2.0 comes with its own version of triton

3. Torch
Install Torch nightly
4. Accelerate
Update Accelerate for Torch 2.0 compatibility, as the version specified in `requirements_versions.txt` is older
And don't forget to update `requirements_versions.txt` so `webui` doesn't auto-downgrade the `accelerate` version

5. Xformers
Rebuild XFormers
Relying on pre-built wheels is not really an option since `xformers` gets linked to a specific `torch` version, which changes daily
Plus a rebuild only takes a few minutes, so why bother with wheels (just make sure you have the build requirements beforehand)

And that's it, WebUI is happy to work with the new libs out-of-the-box
Optimize
1. Accelerate
But now onto the main reason to even try `torch` nightlies: Torch includes a dynamic compiler/optimizer which is only available in nightly builds: Dynamo
If you're one of the lucky few, you may be able to configure Accelerate to use Dynamo
I haven't had luck getting `accelerate test` to complete, which means that dynamo will NOT be used.
2. Compile
So let's do a manual config:
We need to set up `torch.compile`, and the best spot I've found so far is NOT in SD model load, but slightly afterwards, due to the function hijacking that happens in WebUI
For example, in `modules/sd_hijack`, function `def hijack`, just before `self.optimization_method = apply_optimizations()` (see the sketch after the notes below)
Notes:
- apply it to the inner model, not to the parent `sd_model`, as that is the entire pipeline, not the model itself
- this internally allows dynamo to split the model into compiled+uncompiled graphs
- any benefits would be seen in subsequent calls
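A minimal sketch of that placement (the `hijack` method and `apply_optimizations()` are from the webui module mentioned above; exactly which attribute to compile is my assumption):

```python
import torch

def hijack(self, m):
    # ...existing hijack logic...

    if hasattr(torch, "compile"):  # only on torch >= 2.0 nightlies
        # compile the inner model, not the whole sd_model pipeline,
        # so dynamo can split it into compiled + uncompiled graphs
        m.model = torch.compile(m.model)

    self.optimization_method = apply_optimizations()
```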
Result? In my case it's the same error as with `accelerate test`
Not great...
3. Digging Deeper
The default (and recommended) dynamo backend for `torch.compile` is `inductor`, but no matter what, I cannot get `inductor` to work on my system
The error is in `triton`, which fails with a silly error:
And at this point I'm not sure if `triton` is broken for torch 2.0, even if it's installed from the same nightly
So I wrote a standalone test script to evaluate all the different backends:
https://github.com/vladmandic/automatic/blob/master/cli/modules/dynamotest.py
This tests and benchmarks all possible dynamo backends, but I'm focusing on a couple only:
- `default`: eval in 4.247 ms
- `ofi`: eval in 3.820 ms
  - uses `TorchScript` set for `optimize_for_inference`
  - this is basically the same as `default`, but with some voodoo-magic regarding just-in-time ops and freeze, etc.
  - most likely not compatible with training, so cannot be used with `dreambooth`
- `aot_cudagraphs`: eval in 6.460 ms
  - uses `cudagraphs` with `AotAutograd`
  - seems slower than no-compile
- `inductor`: fail
  - uses the `TorchInductor` backend with `AotAutograd` and `cudagraphs` by leveraging codegened `Triton` kernels
  - error: `RuntimeError: CUDA: Error- no device`
- `fx2trt`: fail
  - uses nVidia `TensorRT`
  - error: `ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory`
  - seems like tensorrt is not yet compatible with torch 2.0 (yes, the shared library does exist)
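For anyone who wants the gist without reading the full script, a stripped-down sketch of the same kind of loop (not the linked script itself, just the idea):

```python
import time
import torch
import torch._dynamo as dynamo
import torchvision

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

results = {}
for backend in dynamo.list_backends():
    dynamo.reset()  # clear compilation state between backends
    try:
        compiled = torch.compile(model, backend=backend)
        with torch.no_grad():
            compiled(x)  # first call triggers compilation
            torch.cuda.synchronize()
            start = time.time()
            compiled(x)
            torch.cuda.synchronize()
        results[backend] = f"{(time.time() - start) * 1000:.3f} ms"
    except Exception:
        results[backend] = "error"

print(results)
```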
Now... All this uses an off-the-shelf model (`resnet18`) to evaluate; the next step would be to apply it to stable diffusion itself...
And I'd be curious to hear what your test results look like.
Btw, a good getting-started doc is in the torch code:
https://github.com/pytorch/pytorch/blob/4f4b62e4a255708e928445b6502139d5962974fa/docs/source/dynamo/get-started.rst