Any plans about flash attention v2? #25
Comments
Are you referring to FlashAttention v3? This repo was upgraded to v2 of the FlashAttention (FA) algorithm just this summer. I think the author said he's looking into FA v3, PyTorch's FlexAttention, or something like that.
No, v2. Sorry, from the last released version I saw (v1.0.1), the commit history, and a search of the repo, I couldn't find any indication of whether this fork supports v2 of flash-attention. That's why I asked.
And I think I'm just shooting in the dark. Sorry, I thought I could use this repo as a replacement for the Python flash-attention project/package on macOS. But seeing that this is a Swift implementation of the algorithm, I don't think that is possible. I was following the README at https://github.com/QwenLM/Qwen2-VL, which mentions that you can use flash_attention_2 to speed up inference, but the Python project seems to run only on CUDA.
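For readers in the same situation, here is a minimal sketch of guarding that setting so a script still runs on a machine without CUDA. It assumes the Hugging Face Transformers `attn_implementation` argument, whose accepted values include "flash_attention_2" and "sdpa". The helper names `pick_attn_implementation` and `detect` are hypothetical and come from neither this repo nor the Qwen2-VL README.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool, flash_attn_installed: bool) -> str:
    """Return a value for the Transformers `attn_implementation` argument.

    Hypothetical helper: "flash_attention_2" needs both a CUDA device and the
    flash_attn package; otherwise fall back to PyTorch's built-in "sdpa".
    """
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


def detect() -> str:
    """Probe the local environment, importing torch only if it is installed."""
    flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        import torch
        cuda_available = torch.cuda.is_available()
    return pick_attn_implementation(cuda_available, flash_attn_installed)
```

The result would be passed as `attn_implementation=detect()` when loading the model with `from_pretrained`. On a Mac this falls back to "sdpa"; it does not make use of the Metal kernels in this repo.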
FlashAttention v3 was an algorithm specialized for the H100 chip. It doesn't support backward pass or other hardware. You could argue that the
You can just translate the code to your desired language. That's been done before: I've had someone translate both the GEMM and the forward FlashAttention code to C++.
This is how the repo differs from FlashAttention v2: Dao-AILab/flash-attention#1172
The "v2" of this repository has nothing to do with the versioning in Dao-AILab/flash-attention. The "v1" of this repository was an implementation of Dao-AILab "v2", but forward pass only. The "v2" of this repository is an implementation of Dao-AILab "v2" with both forward and backward pass. For MFA v2, I removed the pre-compiled
Is there a public repo for the translation to C++?
This repo, under the Documentation archive folder: a C++ translation of an older version of GEMM.
github.com/philipturner/metal-flash-attention: somebody else's C++ translation of the newer GEMM and only the forward part of FlashAttention. Look through the commit history or PR history and you'll find what you're looking for.
github.com/liuliu/ccv: C++ attention with the backward gradient for training models (the whole point of doing this, because forward inference is easy AF). It is not explicitly translated, but you could do it with enough time to invest.
Like any code, it will not compile verbatim right away in whatever compiler you have. It is a reference that you read through and customize for your application. Liu customized the kernels a bit, so they deviate from the source tree's original goal of eliminating the fluff (batching, multi-head attention, masks, attention with linear bias, GQA, block sparsity, and a few dozen others I don't know about). Hence I am not holding anything but my own personal translations in the source tree.
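When translating the forward pass, it can help to have a tiny reference to compare outputs against. The following is a sketch under stated assumptions, not code from this repository or from any of the translations mentioned: plain Python lists stand in for matrices, there is a single head with no masking or batching, and the function names and `block` parameter are hypothetical. It compares naive attention with a blockwise version that keeps a running maximum and a running softmax denominator, which is the online-softmax idea behind the FlashAttention forward pass.

```python
import math


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V using the full row of scores."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, visiting K/V in blocks with a running max and sum."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = float("-inf")        # running row maximum
        l = 0.0                  # running softmax denominator
        acc = [0.0] * len(V[0])  # unnormalized output accumulator
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            correction = math.exp(m - m_new)  # rescales earlier partial results
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Running both functions on the same small inputs and checking that the outputs agree within a tolerance gives a quick sanity check before moving on to a translated kernel. The backward pass is not covered here.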
Any plans on upgrading this repo for v2 of flash-attention?