Is torch 2.0 any better than torch 1.13.1? #6455
Replies: 23 comments 128 replies
-
torch 1.13.1 is not as good. I have been using 1.13.1 since this repo existed, and I only saw a performance improvement when I upgraded to 2.0 AND compiled xformers against 2.0. Did you compile xformers with 2.0 too?
-
I wonder if @brkirch can weigh in here and say whether upgrading to Torch 2.0 could make a meaningful difference on macOS + Apple Silicon, too.
-
I was able to update the webui to torch==1.13.1+cu117 and torchvision==0.14.1+cu117 (Windows 10 + Python 3.10.9). Now xformers doesn't work anymore after this update. I tried to rebuild xformers, looked for new wheels, etc. I have spent literally 7 hours trying to update xformers to 0.0.16 and I'm not able to do it; I have searched all of GitHub, Reddit, Google, followed 21374628746 tutorials and steps... and nothing. I just can't. And I need xformers because I only have a 1660 Ti with 6 GB VRAM, so xformers helps me a lot. Does anyone know how to do it? And please, something for dummies; I'm not an expert, as you can see. Thanks in advance... my head is going to blow...
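Before rebuilding anything, it can help to confirm what the venv actually contains. A minimal diagnostic sketch (run it with the venv's python; the package names are the usual ones, not taken from this thread):

```python
import importlib.util

def report(pkg):
    """Return whether a package is importable in the current environment."""
    spec = importlib.util.find_spec(pkg)
    return f"{pkg}: {'installed' if spec else 'MISSING'}"

# check the three packages this thread keeps juggling
for pkg in ("torch", "torchvision", "xformers"):
    print(report(pkg))
```

If xformers shows as MISSING after the torch upgrade, the old wheel was most likely built against the previous torch ABI and got uninstalled or broken by the upgrade.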
-
Is anyone still able to train after upgrading to torch 2.0? I am considering upgrading, but don't want to waste time if it breaks training.
-
Ok gents, I have downtime and motivation, so I'm going to give this a go.
-
Ok, I'm here with my test and results. OS Name: Microsoft Windows 10 Pro. Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, 2592 MHz, 6 core(s), 12 logical processor(s). I have 4 Stable Diffusion WebUI venvs installed:
Prompt: portrait photo of a asia old warrior chief <- (random prompt from the internet). Restore faces: NO
My conclusion of the test in my config: Vanilla (Torch 1.13.1) < Torch 2.0.0
Everything was installed correctly following the steps, with no errors in the process, just the Triton one:
Does anyone have a clue about the noise generation in 2.0.0? Could it be that the --precision full --no-half arguments are not optimized for torch 2.0.0 and/or the new xformers?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Troubleshooting#green-or-black-screen As I have a 1660 Ti, I always use those arguments and they have always worked; they still work with previous torch versions. I know my graphics card is not ideal for SD, but it's what I have. I hope this information helps.
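For context, on GTX 16xx cards these flags typically live in webui-user.bat. A sketch of such a config (the flag names are the webui's own; whether you also want --xformers depends on your install):

```bat
rem webui-user.bat  (GTX 1660 Ti example; adjust to taste)
set COMMANDLINE_ARGS=--xformers --precision full --no-half
```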
-
Has anyone tried to run those checks on Linux and install Triton too?
-
Looks like I was doing it wrong, or maybe it just doesn't work with the T4 yet. I was following the instructions from https://pytorch.org/get-started/pytorch-2.0/#requirements, then cloned xformers from its GitHub and compiled it. After it was done, I created a wheel file so I would not have to compile it again.
-
4090 here
-
Dell 3070 (8GB) OC, Win 10 Home, 16GB DDR4. Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 3508515359, Size: 512x512, Model hash: a9263745, Model: v1-5-pruned
-
Gave it a test myself with my RTX 3070. Fresh installs for both (commit 8850fc2):
Torch 2.0.0 + xformers compiled with Torch 2.0.0 + cuda116 (no Triton), versus Torch 1.13.1 + xformers from Automatic. No major change, but it didn't absolutely break things. Edit: Changed some of the compiler options, set it to just
-
Hey guys, I'm a novice when it comes to all the lingo (I was fine 10+ years ago, but a lot has changed that I haven't caught up on...). Is installing torch 2.0 better than the torch I have now (whatever version I was told to install)? My card is a GTX 1080; if I want to install the new torch, what steps do I need to follow? Install some version of Python, then do I need to change any files in my Auto WebUI folder? I read that xformers is installed automatically when I just add --xformers to my bat file; is that not true, or if it is true, after installing torch 2.0, do I need to manually rebuild/reinstall it? I haven't updated because all the information that I've read/watched has told me the webui was built around Python 3.10 (or whatever). Are we updating Python, or just torch? Does this affect any other extensions I've installed recently? Thanks y'all!
-
xformers is far more important than torch 2.0. I see no perf improvement with torch 2.0, but others claim otherwise. The problem is that no one quotes image generation times, only it/s. I won't even try to test torch 2.0 anymore until someone says they can generate, on a 4090, 96 images (nbatches=6 x batchsize=16) in under 57 seconds. I get this with torch 1.13.1; torch 2.0 is no faster. --opt-channelslast only slows things down for me.
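To make the wall-clock benchmark above comparable with the it/s numbers people quote, one can convert it, assuming 20 sampler steps per image (the step count is my assumption; the comment doesn't state it):

```python
def effective_its(n_images, steps_per_image, seconds):
    """Single-image-equivalent iterations per second for a batched run."""
    return n_images * steps_per_image / seconds

# 6 batches x batch size 16 = 96 images in 57 s
print(f"{effective_its(96, 20, 57):.1f} it/s")  # ~33.7 it/s
```

This is why quoting only per-batch it/s is misleading: it hides the batch size, while total wall-clock time for a fixed number of images does not.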
…On Tue, Jan 10, 2023 at 3:54 PM Croestalker ***@***.***> wrote: [quoted message above]
-
Status for anybody interested... Premise 2: if you build Torch 2 locally, you can get a 3x perf improvement on a 4090. Something is wrong with the nightly builds that pytorch.org provides. I had claimed this before, and today someone else confirmed it. I was contacted on Facebook by an ex-CTO who had access to various GPUs in the cloud and offered me hardware to test, plus his expertise, in exchange for my help in speeding up his Linux-based SD service. I spent 12 hours online figuring out how to build Torch 2.0 locally, and when it worked it was indeed as fast as I had hoped. We also confirmed some speedup on a 3090, and tomorrow he will test an A4000 GPU, which is currently only doing 7 it/s. Because I spent all day doing this, I'm still trying to figure out why I can't build Torch 2.0 a second time so I can document all the steps for everyone on Linux.
-
Well, this is fun! NOT!!!
-
So this is a long discussion, but I will post random observations from CachyOS Linux/a 2060 laptop:
-
Started discussion #6932
-
The performance differences in pytorch have more to do with the libcudnn version than anything else. I have found the root cause of why some see 13 it/s on a 4090 and others see ~40 it/s; see #6954. I've let the pytorch community know.
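If libcudnn is the variable, the cuDNN build PyTorch actually loaded can be checked directly. A small diagnostic sketch (it degrades gracefully when torch or CUDA isn't present):

```python
def cudnn_info():
    """Report the torch version and the cuDNN build it loaded, if any."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cudnn": None}
    cudnn = torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else None
    return {"torch": torch.__version__, "cudnn": cudnn}

print(cudnn_info())
```

Comparing the "cudnn" value between a fast and a slow install is a quick way to test the claim above on your own machine.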
-
Torch 2.0.0 is >>NOT<< a lot faster, nor even a little bit faster, than 1.13.1. If I am wrong, show me results where you get over 45 it/s with a regular 4090. Torch 2 is NOT faster. xformers, on the other hand, does give me a 28% speedup over the 30.5 it/s I see when it is not used; that is a real example of "a lot faster". I get very consistent results. Of course, there might be some edge case that isn't what someone doing normal image generation will see. Maybe by GA (March?) we'll see some improvements with inference.
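For the numbers quoted above, the xformers gain works out as:

```python
baseline = 30.5                    # it/s on a 4090 without xformers, per the comment above
with_xformers = baseline * 1.28    # the reported 28% speedup
print(f"{with_xformers:.1f} it/s")  # 39.0 it/s, still short of the 45 it/s bar
```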
-
I agree, torch 2.0 is not any better on its own than torch 1.13.1 if everything is configured correctly. torch 2.0 has long-term potential due to the inclusion of…
-
How would I get this to work with the new launch.py, where lines 182 and 242 are the relevant lines? The xformers install is different. Clean install of Windows and I'm starting from scratch.
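One hedged suggestion: rather than patching launch.py line by line, versions of it from around this time read the torch install command from the TORCH_COMMAND environment variable; if your copy does too, webui-user.bat can override it without touching the script. A sketch, where the exact package spec is an assumption:

```bat
rem webui-user.bat  (sketch; assumes your launch.py honours TORCH_COMMAND)
set TORCH_COMMAND=pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
set COMMANDLINE_ARGS=--xformers
```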
-
24.) Automatic1111 Web UI - PC - Free
-
For my setup I can confirm the superiority of torch 2.0 / cuda 11.8 over torch 1.13.1 / cuda 11.7. This seems to be mostly due to improvements in cuda 11.8, which is supported by torch 2.0 (source). Performance went up from 5.97 it/s to 8.59 it/s, with VRAM usage much lower. I wasn't using xformers, so that explains the larger gain. System:
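The relative gain reported above:

```python
before, after = 5.97, 8.59           # it/s figures from the comment above
gain = (after / before - 1) * 100
print(f"{gain:.0f}% faster")         # 44% faster
```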
-
Yesterday I was experimenting with building torch 2.0 so that I could ALSO use CUDA 12.0. I was excited that it was so fast. Today I explored deeper; to do so, I did a clean install of A1111, and it was very fast without torch 2. ??? I found an older A1111 install I still had, and it was slow. After a lot of work to figure out the difference, I found the old version had torch 1.12.1+cu113; if you remove and reinstall, you get 1.13.1+cu117.
Indeed it was much faster. I have now learned that "git pull" to update A1111 doesn't upgrade python packages to newer versions.
Then I installed torch 2.0.0.dev20230106+cu117 and I no longer see a perf improvement over 1.13.1. I have a feeling that CUDA 12.0 won't make any further difference. A valuable lesson learned.
I can still get another 16% perf improvement with the two other changes I mentioned in another post, but now torch 2.0 has nothing to do with it. I was going to post those changes today, but figuring out why the baseline was so much faster consumed most of the day.
@hippopotamus1000
@DustyCooper
@aliencaocao See if torch 1.13.1 is just as good as torch 2.
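The lesson above hinges on the "+cuXXX" suffix in the version string, which is the local build tag PyTorch wheels use to record the CUDA toolkit they were built against. A tiny sketch of reading it:

```python
def parse_torch_version(v):
    """Split a PyTorch wheel version like '1.13.1+cu117' into (release, cuda_tag)."""
    release, _, local = v.partition("+")
    return release, local or None

print(parse_torch_version("1.12.1+cu113"))  # ('1.12.1', 'cu113')  - the slow old install
print(parse_torch_version("1.13.1+cu117"))  # ('1.13.1', 'cu117')  - the fast reinstall
print(parse_torch_version("2.0.0.dev20230106+cu117"))
```

Checking torch.__version__ for this tag is a quick way to tell whether two "identical" webui installs are really running the same torch + CUDA combination.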