NVIDIA and Microsoft Drive Innovation for Windows PCs in New Era of Generative AI #10684
-
There is some more info from Microsoft: https://github.com/microsoft/Olive/blob/main/examples/directml/stable_diffusion/README.md
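For reference, an Olive-optimized model is just an ONNX file, so it can be loaded with ONNX Runtime's DirectML execution provider. A minimal sketch (the model path here is a placeholder, and the onnxruntime-directml package is assumed):

```python
# Minimal sketch: open an Olive-optimized ONNX model with ONNX Runtime's
# DirectML execution provider and list the inputs the graph expects.
# "unet.onnx" is a placeholder path; requires the onnxruntime-directml package.
import onnxruntime as ort

session = ort.InferenceSession(
    "unet.onnx",
    providers=["DmlExecutionProvider"],  # DirectML backend on Windows
)

for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```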
-
So we're going to have to wait for people to make optimized models, or maybe for an extension to convert them.
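Until then, one way to convert a checkpoint yourself is the Hugging Face Optimum library, which can export a diffusers-format model to ONNX on load. A sketch, assuming Optimum is installed (the model ID and paths are just examples, and this does not apply Olive's additional graph optimizations):

```python
# Sketch: export a Stable Diffusion checkpoint to ONNX with Hugging Face
# Optimum and run it via ONNX Runtime. Model ID and output path are examples.
from optimum.onnxruntime import ORTStableDiffusionPipeline

pipe = ORTStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any diffusers-format checkpoint
    export=True,                       # convert the PyTorch weights to ONNX
)
pipe.save_pretrained("./sd15-onnx")    # save the ONNX copy for reuse

image = pipe("a lighthouse at dusk").images[0]
image.save("out.png")
```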
-
I made an extension that can be used to test the effects of Olive (DirectML).
-
This UI runs ONNX models: https://github.com/ForserX/StableDiffusionUI. I haven't tested it with the Olive models, though. TensorRT and these optimizations have been out for so long, but so far no one has cared enough about the performance boost to integrate it properly, which is a great pity. But I'm sure the future holds many more surprises, and one of them might just be real-time image generation.
-
Okay, I'm obviously missing something. NVIDIA claims about a 2x performance gain with optimized models "with the popular Automatic1111 distribution", but in practice these models are not compatible with Auto1111, and using them requires installing some other obscure fork or UI. Why mention Auto1111, then, if it doesn't work with it?
-
My question exactly. Seems odd they mention it, and not a specific fork...
-
Anyway, after I updated the driver, there was practically no change in speed.
-
NVIDIA is working on releasing a webui modification with TensorRT and DirectML support built in. They say they can't release it yet because of approval issues. Meanwhile, I made an extension to build and use TensorRT engines for the Unet: https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt My performance gain for 512x512 pictures is about 50-100% (depending on the weather) compared to the sdp-no-mem optimization. At larger resolutions, the gains are smaller. After NVIDIA releases their version, I would probably integrate whatever differences improve performance (according to the doc they have shown me, TensorRT was three times as fast as xformers). Edit: the TensorRT support in the extension is unrelated to Microsoft Olive.
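For anyone curious what the extension does under the hood, compiling an exported Unet ONNX graph into a TensorRT engine looks roughly like this with the TensorRT Python API (the file paths are placeholders; a real build of the Unet also needs an optimization profile for its dynamic batch and latent dimensions, which the extension sets up itself):

```python
# Sketch: build a serialized FP16 TensorRT engine from an ONNX Unet.
# "unet.onnx" / "unet.trt" are placeholder paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # half precision targets the Tensor Cores

engine = builder.build_serialized_network(network, config)
with open("unet.trt", "wb") as f:
    f.write(engine)
```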
-
Can someone please summarize all this in English? Finally, is it possible for this to be implemented into Automatic1111? It's open source; can someone put it in? Or does it need a different fork? What is a fork? Is it basically Automatic1111 with changes that differ from the community-maintained version of this open-source project?
-
So, no speed-up for Pascal GPUs like the 1080 Ti.
-
Once deployed, generative AI models demand incredible inference performance. RTX Tensor Cores deliver up to 1,400 Tensor TFLOPS for AI inferencing. Over the last year, NVIDIA has worked to improve DirectML performance to take full advantage of RTX hardware.
On May 24, we’ll release our latest optimizations in Release 532.03 drivers that combine with Olive-optimized models to deliver big boosts in AI performance. Using an Olive-optimized version of the Stable Diffusion text-to-image generator with the popular Automatic1111 distribution, performance is improved over 2x with the new driver.
https://blogs.nvidia.com/blog/2023/05/23/microsoft-build-nvidia-ai-windows-rtx/
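If you want to check the 2x claim on your own machine, a crude timing loop around a plain diffusers pipeline, run before and after the driver update, is enough for a rough comparison (model ID, prompt, and step count here are arbitrary):

```python
# Sketch: crude throughput benchmark for comparing driver versions.
# Model ID, prompt, and step count are arbitrary choices.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps, runs = 20, 5
pipe("warmup", num_inference_steps=steps)  # exclude one-time setup cost

start = time.perf_counter()
for _ in range(runs):
    pipe("a 512x512 test image", num_inference_steps=steps)
elapsed = time.perf_counter() - start
print(f"{runs * steps / elapsed:.2f} steps/sec averaged over {runs} runs")
```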