Feature request

Hello @OlivierDehaene,
First of all, thank you for creating such an incredible framework! The performance of my GTE model on an A100 GPU is exceptional and unmatched.
That said, I’m looking to reduce costs and tried running my GTE model on TEI with a T4 GPU. Unfortunately, this revealed some challenges:
Version 1.5.0 with Flash Attention: This setup delivered great performance but was unstable. It occasionally returned null vectors, which I suspect is caused by floating-point overflow in TEI's custom Flash Attention v1 implementation.
Version 1.6.0 without Flash Attention: This release added support for GTE models without Flash Attention and works stably in my tests. However, the performance drop is massive: in my setup it is approximately 50x slower than the Flash Attention path.
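As a stopgap for the 1.5.0 instability, the null vectors can at least be caught on the client side before they reach downstream code. A minimal sketch, assuming the embedding arrives as a plain list of floats; the retry/fallback policy is left to the caller:

```python
import math

def is_degenerate(vec, eps=1e-12):
    """Return True if an embedding looks like a failed computation:
    any NaN/inf component, or an (effectively) all-zero "null vector"."""
    if any(math.isnan(x) or math.isinf(x) for x in vec):
        return True
    # A null vector cannot be normalized and is useless for cosine similarity.
    return sum(x * x for x in vec) < eps

# Usage (hypothetical client call):
# embeddings = client.embed(batch)
# bad = [i for i, v in enumerate(embeddings) if is_degenerate(v)]
# ...retry `bad` indices, or route them to the non-Flash deployment.
```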
I’m willing to contribute to improving the framework, but I’m relatively new to this field and would greatly appreciate your insights:
Can the Flash Attention v1 implementation in TEI be fixed?
From my understanding, the original Flash Attention v1 supports Turing GPUs, so I wonder if it can be adapted for TEI to resolve these issues. This is just an assumption, but it seems promising.
Is there potential to optimize the current GTE implementation without Flash Attention for better performance?
I’d love to dive deeper into this, but I want to make sure these limitations on T4 GPUs can realistically be addressed before investing significant time.
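Before committing to either path, it may help to pin down the 50x figure with a small latency harness. A sketch where the embed request itself is a hypothetical callable (nothing here is TEI-specific):

```python
import time

def benchmark(fn, warmup=3, iters=20):
    """Time a callable and report p50/p95 wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm up caches and connection pools first
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "p50": samples[iters // 2],
        "p95": samples[max(0, int(iters * 0.95) - 1)],
    }

# Usage (hypothetical): wrap one request to each deployment and compare.
# flash = benchmark(lambda: embed_via_tei_150("some text"))
# no_flash = benchmark(lambda: embed_via_tei_160("some text"))
# print(no_flash["p50"] / flash["p50"])  # observed slowdown factor
```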
Thank you for your time and any guidance you can provide.
Motivation
Run GTE models on Turing with good performance.
Your contribution
I am ready to contribute but need some insights.