30%+ Speedup for AMD RDNA3/ROCm using Flash Attention w/ SDP Fallback #7172
Replies: 4 comments 9 replies
- cc @sayakpaul
- Apparently it's possible for card names to not include "AMD", so I added another check for "Radeon" in the hijack and made it prettier.
- Hi there. I'm very impressed with the performance of my 7800 XT with quickdif and the memory optimization you've done. Is there a way to bring this optimization, I mean the flash attention part, to the auto1111 webui? Regardless, I appreciate your contribution to the AMD community! I created this account to share this comment :)
- Hi. As you said, official FA2 support on ROCm has been released, I guess via PyTorch 2.3. What is its significance for consumer cards like my 7800 XT? It broke my existing FA build when I installed PyTorch 2.3.
Yes, now you too can have memory efficient attention on AMD with some (many) caveats.
## Numbers
*[Chart] Throughput for the diffusers default (SDP), my SubQuad port, and the presented Flash Attention + SDP fallback method.*
All numbers were measured on a 7900 XTX with PyTorch nightly + ROCm 6.0. Additionally, I have my card power-limited to 300 W, so your numbers may be higher.
## Okay, how?
First you need to install Flash Attention, which requires the ROCm SDK to be installed; ideally your PyTorch ROCm version should match it. If your distro is on ROCm 6.0, use the nightly Torch to match. With that setup I've never had install issues on either ROCm 5.7 or 6.0.
To install Flash Attention, activate your virtual environment (if you use one, which you should) and execute the install command.
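Something along these lines should do it; note that the repo path and branch name here are my assumption about AMD's Navi-support branch rather than a verbatim command, so check the fork for the branch that matches your card:

```sh
# Builds against the torch in your active venv; the branch name is an
# assumption, check AMD's flash-attention fork for the current Navi branch.
pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support
```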
This installs AMD's Flash Attention 2 fork with Navi support. There's a very real chance it only works on 7000-series GPUs, as older cards lack WMMA instructions and I'm not sure this build has any fallback for them.
Then, to actually use Flash Attention in Diffusers, you need to implement it in an attention processor and provide a fallback for unsupported head dimensions, which I've already done here.
To use it, simply place the `flash_attn_rocm` file in your tree and import `FlashAttnProcessor`, as in the sketch below. With the processor set, inference should be much faster, use less memory (usually), and more.
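A minimal usage sketch, assuming the file is importable as a module named `flash_attn_rocm` and that `FlashAttnProcessor` takes no constructor arguments (both assumptions on my part); the SDXL checkpoint is only an example:

```py
import torch
from diffusers import StableDiffusionXLPipeline

# FlashAttnProcessor comes from the flash_attn_rocm file dropped into your tree.
from flash_attn_rocm import FlashAttnProcessor

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # any SD/SDXL checkpoint works
    torch_dtype=torch.float16,  # Flash Attention only runs in fp16/bf16
).to("cuda")  # ROCm still presents the GPU as the "cuda" device

# Route every UNet attention block through Flash Attention; head dims the
# Navi build can't handle drop back to SDP inside the processor.
pipe.unet.set_attn_processor(FlashAttnProcessor())

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```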
## You mentioned caveats?
Oh, there are plenty. Read about them here, but I'll summarize:
Firstly, the reason I keep mentioning "SDP fallback" is that the Navi branch currently does not support head dimensions > 128. I chose to fall back to good ole SDP for those, since the two functions are basically the same minus a few transposes.
The 128 head-dim limit results in memory spikes when it falls back to SDP, particularly in the VAE. This means that for large renders you'll probably have to use VAE tiling (snippet after these caveats; SubQuad might work too, but it'll be slow).
Second, this is forward pass only. No training. Like at all.
Finally, there's no masking support in the function. So far it seems to run ok, but this limitation might adversely affect some workflows.
Also, the AMD fork is like a billion versions behind the Dao-AILab master, so newer functions aren't available either. 2.0.4 is all we get. On top of this, it appears to be very unoptimized. It barely works but that's a lot better than the last year of nothing.
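On the VAE memory spikes from the first caveat: tiling is a one-liner on most diffusers pipelines, assuming a `pipe` object like the one in the earlier sketch:

```py
# Decode the latents in tiles so the VAE (whose attention always takes the
# SDP fallback due to its >128 head dim) doesn't spike VRAM on large renders.
pipe.enable_vae_tiling()
```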
## What about other models?
A lot of models use SDPA and don't expose an easy way to set attention themselves. What I'd recommend instead is to simply monkey-patch torch's SDPA function with your own wrapper that hijacks it into Flash Attention where supported. Example:
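A minimal sketch of the idea, assuming the flash_attn 2.0.x API; the exact gating conditions (no mask, no dropout, head dim <= 128, fp16/bf16) and the AMD/Radeon device-name check follow what's described in this thread rather than a verbatim copy of the original snippet:

```py
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

# Keep a handle to the stock implementation for the fallback path.
_sdpa = F.scaled_dot_product_attention

def sdpa_hijack(query, key, value, attn_mask=None, dropout_p=0.0,
                is_causal=False, scale=None):
    # Route to Flash Attention only when the Navi build can handle it:
    # 4D fp16/bf16 tensors, no mask, no dropout, head dim <= 128.
    if (
        attn_mask is None
        and dropout_p == 0.0
        and query.ndim == 4
        and query.size(-1) <= 128
        and query.dtype in (torch.float16, torch.bfloat16)
    ):
        # SDPA uses (batch, heads, seq, head_dim); flash_attn_func wants
        # (batch, seq, heads, head_dim), hence the transposes.
        out = flash_attn_func(
            query.transpose(1, 2),
            key.transpose(1, 2),
            value.transpose(1, 2),
            softmax_scale=scale,
            causal=is_causal,
        )
        return out.transpose(1, 2)
    # Everything else (VAE head dims, masked attention, fp32) stays on SDP.
    return _sdpa(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p,
                 is_causal=is_causal, scale=scale)

# Only install the hijack on AMD hardware; some cards report as "Radeon"
# rather than "AMD" in the device name.
if any(s in torch.cuda.get_device_name() for s in ("AMD", "Radeon")):
    F.scaled_dot_product_attention = sdpa_hijack
```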
Enjoy that Stable Cascade speedup.