Can GTE Models Achieve Good Performance on Turing Hardware? #453

Open
superchar opened this issue Dec 13, 2024 · 0 comments

Comments

@superchar
Feature request

Hello @OlivierDehaene,

First of all, thank you for creating such an incredible framework! The performance of my GTE model on an A100 GPU is exceptional and unmatched.
That said, I’m looking to reduce costs and tried running my GTE model on TEI with a T4 GPU. Unfortunately, this revealed some challenges:

  • Version 1.5.0 with Flash Attention: This setup delivered great performance but was unstable. It occasionally returned null vectors, which I suspect is caused by floating-point overflow in the custom Flash Attention v1 implementation in TEI (a minimal illustration of this failure mode is sketched just after this list).
  • Version 1.6.0 without Flash Attention: This release added support for GTE models without Flash Attention and runs stably in my tests. However, the performance drop is massive: in my setup it is roughly 50x slower than the Flash Attention path.
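
To make the overflow suspicion concrete, here is a minimal, self-contained illustration. It is a sketch of plain attention math, not TEI's actual Flash Attention v1 kernel, and the shapes and magnitudes are arbitrary; it only shows how an fp16 score overflow can turn a whole attention row into NaN, which a client would then see as a null embedding.

```python
# Sketch only: plain attention arithmetic, not TEI's custom kernel. A score
# that exceeds the fp16 range becomes inf, and the softmax's max-subtraction
# then turns it into NaN, which propagates through the value projection and
# surfaces downstream as a "null" vector.
import torch

d = 64
# A query/key pair whose dot product exceeds the fp16 maximum (~65504).
q = torch.full((d,), 40.0)
k = torch.full((d,), 40.0)

score_fp32 = torch.dot(q, k)     # 64 * 40 * 40 = 102400, representable in fp32
score_fp16 = score_fp32.half()   # overflows to +inf in fp16
print(score_fp32.item(), score_fp16.item())   # 102400.0 inf

# softmax(x) is evaluated as exp(x - max(x)) / sum(...); with x = +inf the
# subtraction is inf - inf = NaN, so the whole attention row becomes NaN.
print((score_fp16 - score_fp16).item())       # nan
```

Keeping the score/softmax accumulation in fp32 for the same inputs avoids the overflow, which is why I suspect the problem is in the fp16 handling of the custom kernel rather than in the model itself.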

I’m willing to contribute to improving the framework, but I’m relatively new to this field and would greatly appreciate your insights:

  • Can the Flash Attention v1 implementation in TEI be fixed?
    From my understanding, the original Flash Attention v1 supports Turing GPUs, so I wonder whether it can be adapted for TEI to resolve these issues. This is just an assumption, but it seems promising. (A small verification script I would use to confirm a fix is sketched after this list.)
  • Is there potential to optimize the current GTE implementation without Flash Attention for better performance?
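
For whoever picks this up, here is the client-side check I would use to verify whether a fix removes the instability: it asks a running TEI instance for embeddings and flags any vector that is all zeros or non-finite. The `/embed` route and `{"inputs": ...}` payload follow TEI's REST API; the URL and the test sentences are placeholders for my setup.

```python
# Sketch of a verification script against a running TEI instance.
# Assumes TEI's REST API: POST /embed with {"inputs": [...]} returning one
# embedding (a list of floats) per input. URL and sentences are placeholders.
import math
import requests

TEI_URL = "http://localhost:8080/embed"  # adjust to your deployment

def embed(texts):
    resp = requests.post(TEI_URL, json={"inputs": texts}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def is_null_vector(vec, eps=1e-12):
    # "Null" here means all-zero or containing NaN/inf.
    return all(abs(x) < eps for x in vec) or any(not math.isfinite(x) for x in vec)

if __name__ == "__main__":
    texts = ["sanity check sentence number %d" % i for i in range(32)]
    bad = [i for i, vec in enumerate(embed(texts)) if is_null_vector(vec)]
    print(f"{len(bad)}/{len(texts)} null vectors at indices {bad}")
```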

I’d love to dive deeper into this, but I want to make sure these limitations on T4 GPUs can realistically be addressed before investing significant time.
Thank you for your time and any guidance you can provide.

Motivation

Run GTE models on Turing GPUs (such as the T4) with good performance.

Your contribution

I am ready to contribute but need some insights.
