[WIP] Asynchronous model mover for lowvram #14855
base: dev
Conversation
Smart mover: The smart mover does something similar to forge; it only moves tensors from CPU to GPU, not back. At some point I was somehow able to get the same or even 2x faster speed than sd-webui-forge under … I'm gonna leave it as is and come back when I get interested again. |
The broken images seem to be caused by not synchronizing the tensors back to the creation stream after usage. Fixed. Also changed to layer-wise movement. |
There might be a problem with extra networks. Haven't looked into that. |
This looks very cool, but please don't change the formatting of those existing lines in lowvram.py (newlines and quotes), put those new classes into a separate file, and write a bit of comment there on how the performance gain is achieved. Also, maybe add an option to use the old method even if streaming is supported. |
Need some help on making this support LoRA/ControlNet. As these things probably alter weights and biases, the tensors cached in the mover may be outdated, and a slow path will be taken. |
I'll be honest with you, I don't know how it works, so I can't help either. The "not moving from GPU to CPU" part is smart and reasonable, and it can be implemented with ease, but the CUDA streams stuff would need me to get a lot more involved to understand. Plus, there is FP8 support now; maybe that one can work better than lowvram for people who need it? |
The CUDA stream thing is used because I want to overlap memcpy with compute. It can be seen as threads. Briefly speaking, it does several things (all in a non-blocking way to Python):
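Roughly, the overlap pattern looks like the following minimal sketch, assuming PyTorch CUDA streams and pinned host memory (the names here are illustrative, not the PR's actual code):

```python
import torch

copy_stream = torch.cuda.Stream()             # dedicated stream for host-to-device copies
compute_stream = torch.cuda.current_stream()  # stream the model actually runs on

def prefetch_weight(cpu_weight: torch.Tensor) -> torch.Tensor:
    """Start copying a layer's weight while earlier layers are still computing."""
    with torch.cuda.stream(copy_stream):
        # non_blocking only overlaps with compute when cpu_weight is in pinned memory
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
    # compute must not touch the weight before the copy has finished
    compute_stream.wait_stream(copy_stream)
    # tell the caching allocator this tensor is consumed on a different stream
    # than the one it was created on, so its memory is not reused too early
    gpu_weight.record_stream(compute_stream)
    return gpu_weight
```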
Apart from the moving things, I have to do these things in addition:
Can the …? Regarding FP8, I think it does not hurt if there are more options. |
Actually, there are 2 main pain points that drive me here: … |
A better way is implemented here, which uses the async nature of CUDA. One thing to note is that for the acceleration to work, the weights and biases of the unet must be placed in non-pageable (pinned) memory (they will go back to pageable if the module is somehow …). However, should any extension / module touch the weights and biases of the model (by using … |
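For illustration only, a small sketch of that pinned-memory point in plain PyTorch (the `layer` here is a stand-in module, not the PR's code):

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)

# copy the parameters into page-locked (pinned) host memory so that
# non_blocking host-to-device copies can actually overlap with compute
layer._apply(lambda t: t.pin_memory())
print(layer.weight.is_pinned())  # True -> fast path

# if an extension replaces the tensor data (for example when patching weights),
# the replacement lives in ordinary pageable memory again and the fast path is lost
layer.weight.data = layer.weight.data * 1.0
print(layer.weight.is_pinned())  # False -> back to the slow path
```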
As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default? Also, thanks to fp8, 2 GB is enough for medvram, and lowvram saves only about 500 MB. With this patch, that already small VRAM difference will get even smaller. |
I profiled with … Also, the two streams do not go out of sync by a big margin.
False. This takes significantly less VRAM. 890 MB vs 350 MB. The speed difference is 200 ms per step vs 260 ms per step. |
I saw in Discord that async lowvram keeps more than one layer on the GPU. But maybe it really requires even less VRAM, idk.
Is it total VRAM usage? I tested on my 2 GB card; it ate about 1.7 GB in fp16 lowvram mode. I will test this patch, the original lowvram, and medvram on it. |
It is the peak usage recorded by Nsight. |
A closer look shows that it is the horizontal scale of the diagram. The actual usage is smaller. See the tooltips on the new screenshots. |
Hm, this patch really requires more VRAM for me. Maybe it ignores … |
The same VRAM usage, but slower... |
Maybe your actual compute work is lagging behind. Use … I can add a synchronize mark there to constrain it, but it hurts the performance by a lot. Without xformers it will be slower. One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile? |
Yes, but hires vram usage. 93% vs 99% XD |
Can't install... Installed |
You can use the nsys CLI. Collect these data: … |
I have only … I will try to collect these data. |
@wfjsw check discord PM |
IPEX does not seem to support … |
To fix for default options: … |
modules/lowvram.py (outdated):
```python
if use_streamlined_lowvram:
    # put it into pinned memory to achieve data transfer overlap
    diff_model.time_embed._apply(lambda x: x.pin_memory())
```
Specifying the `device` parameter will let `pin_memory` offload to other non-CUDA backends (e.g. IPEX):

```diff
- diff_model.time_embed._apply(lambda x: x.pin_memory())
+ diff_model.time_embed._apply(lambda x: x.pin_memory(device=devices.get_optimal_device_name()))
```
Intel A750 8G (IPEX backend): this improves the performance from 0.7it/s to 1.5it/s with no significant VRAM usage increase. |
Someone says the LoRA is not actually working. Pending test. UPDATE: I cannot reproduce. UPDATE: For FP16 LoRAs, it will have a hard time trying to apply them on CPUs. Need some cast here. |
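A hedged sketch of the kind of cast that would be needed (hypothetical helper, not code from this PR): fp16 matmuls are very slow on CPU, so the LoRA delta can be computed in fp32 and cast back afterwards:

```python
import torch

def apply_lora_delta_on_cpu(weight: torch.Tensor, lora_up: torch.Tensor,
                            lora_down: torch.Tensor, scale: float) -> torch.Tensor:
    """Merge a LoRA delta into a CPU-resident fp16 weight by upcasting to fp32."""
    delta = (lora_up.float() @ lora_down.float()) * scale  # compute in fp32 on CPU
    return (weight.float() + delta).to(weight.dtype)       # cast back to the original dtype
```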
TODO: add a queue somewhere to constrain the speed |
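One possible shape for such a queue, sketched with PyTorch CUDA events (illustrative only; `MAX_BLOCKS_IN_FLIGHT` is a hypothetical stand-in for the max-block setting mentioned later in the thread):

```python
from collections import deque
import torch

MAX_BLOCKS_IN_FLIGHT = 2           # cap on how far the copy stream may run ahead
copy_stream = torch.cuda.Stream()
inflight = deque()                 # completion events of pending prefetches

def prefetch_block(block_params_cpu):
    """Queue a block's H2D copies, never running more than MAX_BLOCKS_IN_FLIGHT ahead."""
    while len(inflight) >= MAX_BLOCKS_IN_FLIGHT:
        inflight.popleft().synchronize()        # wait for the oldest prefetch to finish
    with torch.cuda.stream(copy_stream):
        gpu_params = [p.to("cuda", non_blocking=True) for p in block_params_cpu]
    done = torch.cuda.Event()
    done.record(copy_stream)
    inflight.append(done)
    # the compute stream still has to wait on copy_stream before using gpu_params,
    # as in the earlier overlap sketch
    return gpu_params
```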
@light-and-ray can you try this? It should no longer OOM now. |
It still uses more VRAM than medvram. |
There is a new setting in the optimization folder. Reduce it and see what happens. You can go with 1 or 2. |
Any progress on this? |
I still need an Nsight Systems profile from low-end cards to find out why the max block limit does not seem to work. |
Description

Speeds up `--lowvram` by taking the model moving out of the forward loop. I'm getting 3.7 it/s on a 3060 Laptop with half of the VRAM compared to `--medvram`. It was originally 1.65 it/s. As a reference, the medvram speed was 5.8 it/s.

Concerns

Checklist: