Feature request

Hello @OlivierDehaene,
First of all, thank you for creating such an incredible framework! The performance of my GTE model on an A100 GPU is exceptional and unmatched.
That said, I’m looking to reduce costs and tried running my GTE model on TEI with a T4 GPU. Unfortunately, this revealed some challenges:
Version 1.5.0 with Flash Attention: This setup delivered great performance but was unstable. It occasionally returned null vectors, which I suspect is caused by floating-point overflow in TEI's custom Flash Attention v1 implementation.
Version 1.6.0 without Flash Attention: This release added support for GTE models without Flash Attention and works stably in my tests. However, the performance drop is massive: in my setup it is approximately 50x slower than the Flash Attention path.
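As a stopgap for the 1.5.0 instability, the null vectors can at least be caught on the client side before they reach downstream code. A minimal sketch, assuming the embedding arrives as a plain list of floats; the retry/fallback policy is left to the caller:

```python
import math

def is_degenerate(vec, eps=1e-12):
    """Return True if an embedding looks like a failed computation:
    any NaN/inf component, or an (effectively) all-zero "null vector"."""
    if any(math.isnan(x) or math.isinf(x) for x in vec):
        return True
    # A null vector cannot be normalized and is useless for cosine similarity.
    return sum(x * x for x in vec) < eps

# Usage (hypothetical client call):
# embeddings = client.embed(batch)
# bad = [i for i, v in enumerate(embeddings) if is_degenerate(v)]
# ...retry `bad` indices, or route them to the non-Flash deployment.
```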
I’m willing to contribute to improving the framework, but I’m relatively new to this field and would greatly appreciate your insights:
Can the Flash Attention v1 implementation in TEI be fixed?
From my understanding, the original Flash Attention v1 supports Turing GPUs, so I wonder if it can be adapted for TEI to resolve these issues. This is just an assumption, but it seems promising.
Is there potential to optimize the current GTE implementation without Flash Attention for better performance?
I’d love to dive deeper into this, but I want to make sure these limitations on T4 GPUs can realistically be addressed before investing significant time.
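Before committing to either path, it may help to pin down the 50x figure with a small latency harness. A sketch where the embed request itself is a hypothetical callable (nothing here is TEI-specific):

```python
import time

def benchmark(fn, warmup=3, iters=20):
    """Time a callable and report p50/p95 wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm up caches and connection pools first
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "p50": samples[iters // 2],
        "p95": samples[max(0, int(iters * 0.95) - 1)],
    }

# Usage (hypothetical): wrap one request to each deployment and compare.
# flash = benchmark(lambda: embed_via_tei_150("some text"))
# no_flash = benchmark(lambda: embed_via_tei_160("some text"))
# print(no_flash["p50"] / flash["p50"])  # observed slowdown factor
```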
Thank you for your time and any guidance you can provide.
Motivation
Run GTE models on Turing with good performance.
Your contribution
I am ready to contribute but need some insights.