I've made an extension, "Giant Kitten", to show how the new System Fallback Policy of recent Nvidia drivers can be further optimized #14077
aleksusklim started this conversation in Show and tell
-
Hey! Just want to say I randomly found your extension on a github search yesterday (searching A1111 extensions sorted by most recent). It works great for me!! I am running a 4090 with 24GB VRAM + 64GB system RAM, but would still get OOMs in a lot of instances. I really like how you explained everything on your page, and in the end it's a simple on/off toggle process. I hope others pick up on this! I posted a reddit thread an hour ago so perhaps some will see that: https://old.reddit.com/r/StableDiffusion/comments/182w7q6/new_unlisted_extension_trick_to_use_the_new/ Great work though - Appreciate it!
-
This article explains what System Fallback Policy is:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion
Basically, it allocates RAM instead of VRAM, but that RAM is still used as GPU memory for CUDA calculations – nothing runs on the CPU.
This happens when your "dedicated GPU memory" is already full; from that point on, all new tensors are allocated in "shared GPU memory", which is capped at half of your installed RAM.
This allows WebUI to continue working without throwing OOM errors, which is good.
The speed quickly degrades, becoming approximately 10 times slower, which is bad.
(But note that without the GPU, cpu-only inference would have been more than 100 times slower than your usual speeds!)
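To see the fallback in action outside of WebUI, here is a minimal sketch (my own illustration, not part of the extension), assuming a driver with the fallback policy left at its default; the exact sizes are arbitrary:

```python
import time
import torch

dev = torch.device('cuda:0')
free_b, total_b = torch.cuda.mem_get_info(dev)
print(f'dedicated VRAM: total {total_b / 1024**3:.1f} GiB, free {free_b / 1024**3:.1f} GiB')

# Fill almost all of the dedicated memory, leaving roughly 256 MiB free
# (on some systems a slightly larger margin may be needed)...
filler = torch.empty(max(free_b - 256 * 1024**2, 0), dtype=torch.uint8, device=dev)

# ...then allocate more. Without the fallback policy this would raise an OOM
# error; with it, the new tensors silently land in "shared GPU memory" (RAM).
a = torch.randn(8192, 8192, device=dev)   # 256 MiB, likely spilled
b = torch.randn(8192, 8192, device=dev)   # 256 MiB, likely spilled

# CUDA still computes on them directly - no .to() call needed - just slower,
# because the data is read over the PCIe bus instead of from VRAM.
torch.cuda.synchronize()
t0 = time.perf_counter()
c = a @ b
torch.cuda.synchronize()
print(f'matmul on (likely) spilled tensors: {time.perf_counter() - t0:.3f} s')
```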
If pytorch had a way to move tensors between dedicated and shared GPU memory – that could be a huge optimization. And by huge I mean REALLY HUGE.
For example, instead of keeping CLIP and VAE on CPU and moving tensors back and forth – they could be stored in shared memory on purpose. To use tensors in shared memory in calculations with tensors stored in dedicated memory, there is no need to move them explicitly.
Meaning, WebUI could load the whole Stable Diffusion model into shared memory, but then only the most important parts of it should be transferred to dedicated memory as they are needed.
On the other hand, the current operational tensors (that represent the noised latent) should always reside in dedicated memory, since they are used in all operations.
But that is not possible yet, since pytorch does not have any tensor.shared(), tensor.dedicated() or tensor.to(device='cuda:0:dedicated') methods. Pytorch cannot even see whether the Nvidia driver has moved a new tensor to RAM or not.
But you know what? If a big tensor is allocated on CUDA as the very first one – then it goes into the "fast" dedicated area. Then we can load the SD model so that a part of it overflows into the shared "slow" area. But what if we free that first dummy tensor now?
As long as pytorch's caching allocator is not asked to release that memory back to the driver (e.g. via torch.cuda.empty_cache()) – the dedicated memory stays reserved inside pytorch, and all new tensors will use that memory (which is fast) instead of asking the driver for new bytes (which would be slow).
This gives a one-time opportunity to move the least important tensors of the SD model out of the way, so that the real GPU memory stays available for the "operational" tensors during generation!
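For the curious, here is a rough sketch of how that dummy-tensor trick could look in plain pytorch (just my illustration of the idea above, not the extension's actual code; the reserved size and the loader are placeholders):

```python
import torch

dev = torch.device('cuda:0')

# 1. Reserve a chunk of *dedicated* VRAM first: the very first CUDA allocation
#    in the process goes into the fast dedicated area.
reserve_gib = 4  # placeholder size; the useful amount depends on your GPU
dummy = torch.empty(reserve_gib * 1024**3, dtype=torch.uint8, device=dev)

# 2. Now load the model. Whatever does not fit into the remaining dedicated
#    memory overflows into the slow shared area (system RAM).
# model = load_stable_diffusion_checkpoint(...)  # placeholder for WebUI's loader

# 3. Free the dummy tensor, but do NOT let pytorch hand that memory back to
#    the driver: the caching allocator keeps the freed block reserved, so the
#    "operational" tensors created during generation reuse the fast memory.
del dummy
# (Calling torch.cuda.empty_cache() here would release the reserved block
#  back to the driver and defeat the whole trick.)
```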
My extension shows in practice that this actually improves speed in cases where your generation uses the shared GPU memory. Here is the link:
https://github.com/klimaleksus/stable-diffusion-webui-giant-kitten
I encourage you guys to try benchmarking your speeds, so we can independently confirm that this approach holds up across various RAM and VRAM sizes, bus speeds and GPU models.
So, if your generations sometimes spill into shared GPU memory – then please try my extension and share your findings!
With more evidence that this approach works and that shared memory "is not that slow" when the most important tensors are kept in dedicated memory instead – we can ask for official pytorch support at https://github.com/pytorch/pytorch/issues
(If they say they would need a new API from the Nvidia SDK – then we will ask NVIDIA to create that too, because it is a real game changer)
After that, everyone would call model.to(device='cuda:0:shared') instead of model.to(device='cpu') and forget about moving tensors between devices, since shared memory is transparent inside CUDA.
The speed behavior would mirror the state-of-the-art approach from the local-LLM world, where some layers of a quantized model are kept in RAM while others are offloaded to the GPU – so the more VRAM you have, the faster your inference is (because you can offload more layers onto fast memory).
We will have the same for diffusion: the less dedicated GPU memory you have, the more blocks of the model would be offloaded to shared memory, decreasing your speed LINEARLY!
You won't need a 3090 to run something big, and I suspect models will only get bigger in the near future.
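To make the wish concrete, here is roughly what such an API could look like from user code – purely hypothetical, since neither cuda:0:shared nor cuda:0:dedicated exist in pytorch today, and the choice of which blocks to pin is just an illustration:

```python
import torch

unet = ...  # some large diffusion model, e.g. the SD UNet

# Today's workaround: park the model in system RAM and pay an explicit
# transfer every time it is needed on the GPU.
unet.to(device='cpu')

# The proposal: park it in shared GPU memory instead. CUDA can address it
# directly, so no transfers are needed - only the bandwidth is lower.
unet.to(device='cuda:0:shared')              # hypothetical device string

# And pin just the hottest blocks in fast dedicated memory, LLM-offload style:
# the more VRAM you have, the more blocks fit here and the faster you run.
for block in list(unet.children())[:4]:      # "first few blocks" is illustrative
    block.to(device='cuda:0:dedicated')      # hypothetical device string
```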