I've made an extension, "Giant Kitten", to show how the new System Fallback Policy of recent Nvidia drivers can be further optimized #14077
aleksusklim started this conversation in Show and tell
-
Hey! Just want to say I randomly found your extension on a github search yesterday (searching A1111 extensions sorted by most recent). It works great for me!! I am running a 4090 with 24GB VRAM + 64GB system RAM, but would still get OOMs in a lot of instances. I really like how you explained everything on your page, and in the end it's a simple on/off toggle process. I hope others pick up on this! I posted a reddit thread an hour ago so perhaps some will see that: https://old.reddit.com/r/StableDiffusion/comments/182w7q6/new_unlisted_extension_trick_to_use_the_new/ Great work though - Appreciate it!
-
This article explains what System Fallback Policy is:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion
Basically, it allocates RAM instead of VRAM, but that RAM is still used as GPU memory for CUDA calculations – nothing runs on the CPU.
This happens when your "dedicated GPU memory" is already full; from that point on, all new tensors are allocated in "shared GPU memory", which is capped at half of your installed RAM.
This allows WebUI to continue working without throwing OOM errors, which is good.
The speed quickly degrades, becoming approximately 10 times slower, which is bad.
(But note that without the GPU, cpu-only inference would have been more than 100 times slower than your usual speeds!)
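To see the fallback in action outside of WebUI, here is a minimal sketch (my own illustration, not part of the extension), assuming a driver with the fallback policy left at its default; the exact sizes are arbitrary:

```python
import time
import torch

dev = torch.device('cuda:0')
free_b, total_b = torch.cuda.mem_get_info(dev)
print(f'dedicated VRAM: total {total_b / 1024**3:.1f} GiB, free {free_b / 1024**3:.1f} GiB')

# Fill almost all of the dedicated memory, leaving roughly 256 MiB free
# (on some systems a slightly larger margin may be needed)...
filler = torch.empty(max(free_b - 256 * 1024**2, 0), dtype=torch.uint8, device=dev)

# ...then allocate more. Without the fallback policy this would raise an OOM
# error; with it, the new tensors silently land in "shared GPU memory" (RAM).
a = torch.randn(8192, 8192, device=dev)   # 256 MiB, likely spilled
b = torch.randn(8192, 8192, device=dev)   # 256 MiB, likely spilled

# CUDA still computes on them directly - no .to() call needed - just slower,
# because the data is read over the PCIe bus instead of from VRAM.
torch.cuda.synchronize()
t0 = time.perf_counter()
c = a @ b
torch.cuda.synchronize()
print(f'matmul on (likely) spilled tensors: {time.perf_counter() - t0:.3f} s')
```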
If pytorch had a way to move tensors between dedicated and shared GPU memory – that could be a huge optimization. And by huge I mean REALLY HUGE.
For example, instead of keeping CLIP and VAE on CPU and moving tensors back and forth – they could be stored in shared memory on purpose. To use tensors in shared memory in calculations with tensors stored in dedicated memory, there is no need to move them explicitly.
Meaning, WebUI could load the whole Stable Diffusion model into shared memory, but then only the most important parts of it should be transferred to dedicated memory as they are needed.
On the other hand, the current operational tensors (that represent the noised latent) should always reside in dedicated memory, since they are used in all operations.
But that is not possible yet, since pytorch does not have any tensor.shared(), tensor.dedicated() or tensor.to(device='cuda:0:dedicated') methods. Pytorch cannot even see whether the Nvidia driver has moved a new tensor to RAM or not.
But you know what? If a big tensor is allocated on CUDA as the very first one – then it goes into the "fast" dedicated area. Then we can load the SD model so that a part of it overflows into the shared "slow" area. But what if we free that first dummy tensor now?
As long as pytorch's caching allocator is not asked to release that memory back to the driver (e.g. via torch.cuda.empty_cache()) – the dedicated memory stays reserved inside pytorch, and all new tensors will use that memory (which is fast) instead of asking the driver for new bytes (which would be slow).
This gives a one-time opportunity to move the least important tensors of the SD model out of the way, so that the real GPU memory stays available for the "operational" tensors during generation!
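For the curious, here is a rough sketch of how that dummy-tensor trick could look in plain pytorch (just my illustration of the idea above, not the extension's actual code; the reserved size and the loader are placeholders):

```python
import torch

dev = torch.device('cuda:0')

# 1. Reserve a chunk of *dedicated* VRAM first: the very first CUDA allocation
#    in the process goes into the fast dedicated area.
reserve_gib = 4  # placeholder size; the useful amount depends on your GPU
dummy = torch.empty(reserve_gib * 1024**3, dtype=torch.uint8, device=dev)

# 2. Now load the model. Whatever does not fit into the remaining dedicated
#    memory overflows into the slow shared area (system RAM).
# model = load_stable_diffusion_checkpoint(...)  # placeholder for WebUI's loader

# 3. Free the dummy tensor, but do NOT let pytorch hand that memory back to
#    the driver: the caching allocator keeps the freed block reserved, so the
#    "operational" tensors created during generation reuse the fast memory.
del dummy
# (Calling torch.cuda.empty_cache() here would release the reserved block
#  back to the driver and defeat the whole trick.)
```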
My extension shows in practice that this actually improves speed in cases where your generation uses the shared GPU memory. Here is the link:
https://github.com/klimaleksus/stable-diffusion-webui-giant-kitten
I encourage you guys to try benchmarking your speeds, so we can independently confirm that this approach holds up across various RAM and VRAM sizes, bus speeds and GPU models.
So, if your generations sometimes spill into shared GPU memory – then please try my extension and share your findings!
With more evidence that this approach works and that shared memory "is not that slow" when the most important tensors are kept in dedicated memory instead – we can ask for official pytorch support at https://github.com/pytorch/pytorch/issues
(If they say they would need a new API from the Nvidia SDK – then we will ask NVIDIA to create that too, because it is a real game changer)
After that, everyone would call model.to(device='cuda:0:shared') instead of model.to(device='cpu') and forget about moving tensors between devices, since shared memory is transparent inside CUDA.
The speed behavior would mirror the state-of-the-art approach from the local-LLM world, where some layers of a quantized model are kept in RAM while others are offloaded to the GPU – so the more VRAM you have, the faster your inference is (because you can offload more layers onto fast memory).
We will have the same for diffusion: the less dedicated GPU memory you have, the more blocks of the model would be offloaded to shared memory, decreasing your speed LINEARLY!
You won't need a 3090 to run something big, and I suspect models will only get bigger in the near future.
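To make the wish concrete, here is roughly what such an API could look like from user code – purely hypothetical, since neither cuda:0:shared nor cuda:0:dedicated exist in pytorch today, and the choice of which blocks to pin is just an illustration:

```python
import torch

unet = ...  # some large diffusion model, e.g. the SD UNet

# Today's workaround: park the model in system RAM and pay an explicit
# transfer every time it is needed on the GPU.
unet.to(device='cpu')

# The proposal: park it in shared GPU memory instead. CUDA can address it
# directly, so no transfers are needed - only the bandwidth is lower.
unet.to(device='cuda:0:shared')              # hypothetical device string

# And pin just the hottest blocks in fast dedicated memory, LLM-offload style:
# the more VRAM you have, the more blocks fit here and the faster you run.
for block in list(unet.children())[:4]:      # "first few blocks" is illustrative
    block.to(device='cuda:0:dedicated')      # hypothetical device string
```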