Implements FlashDecoding with Sparsity Support #899

hanzhi713 · 2024-12-18T00:45:41Z

To be merged in 2025

This PR implements GPU FlashDecoding. See the docstring of gpu_decoding.py for a description of the kernel.
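For readers unfamiliar with the idea, here is a minimal sketch of the split-KV reduction that FlashDecoding is generally built on, written in plain Python for a single query vector. It is not the kernel in gpu_decoding.py: the function names, the 1/sqrt(d) scaling, and the fixed chunk size are assumptions made for illustration, and masking/sparsity is left out entirely.

```python
import math


def partial_attention(q, keys, values):
    """Attend over one KV chunk.

    Returns (max score, softmax denominator, unnormalized output), all
    computed relative to the chunk-local max for numerical stability.
    """
    scale = 1.0 / math.sqrt(len(q))  # assumed standard scaling
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return m, denom, out


def split_kv_decode(q, keys, values, chunk_size):
    """Hypothetical helper: process KV chunks independently, then merge."""
    partials = [
        partial_attention(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    # Rescale every chunk to a common max before summing.
    m_all = max(m for m, _, _ in partials)
    denom = sum(d * math.exp(m - m_all) for m, d, _ in partials)
    dim = len(partials[0][2])
    return [
        sum(o[j] * math.exp(m - m_all) for m, _, o in partials) / denom
        for j in range(dim)
    ]
```

Because each chunk only needs its own max and denominator, the chunks can run in parallel and be merged afterwards. Calling `split_kv_decode` with `chunk_size=len(keys)` reduces to ordinary softmax attention, which makes it a convenient reference for checking other chunk sizes.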

Also, gpu_attention_benchmark.py has been completely rewritten to use the JAX Mosaic profiling tools. Previously, the Triton testing suite was used, which gave meaningless results: Triton times kernels with cudaEvent, which records to the default stream, and that is not the cudaStream that JAX uses.
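To illustrate the stream pitfall, here is a rough host-side timing helper that only stops the clock after an explicit synchronization step. This is not the profiler-based approach used in gpu_attention_benchmark.py; `time_fn` and its parameters are hypothetical, and wall-clock timing like this includes dispatch overhead, so treat it as a sanity check at best.

```python
import time


def time_fn(fn, *args, sync=lambda out: out, warmup=3, iters=10):
    """Average host-side seconds per call.

    `sync` must block until the work behind its argument has finished on
    whatever stream actually ran it; the default is a no-op for plain
    Python functions.
    """
    for _ in range(warmup):
        sync(fn(*args))  # finish compilation and warmup work before timing
    start = time.perf_counter()
    # Asynchronous backends may return before the device work is done.
    outs = [fn(*args) for _ in range(iters)]
    sync(outs)  # block until every result is ready, then stop the clock
    return (time.perf_counter() - start) / iters
```

With JAX, passing `sync=jax.block_until_ready` should work, since it accepts a pytree such as the list of outputs. The point is that the wait is tied to the results themselves rather than to an event recorded on a stream the work never ran on.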

apghml

It looks like there is still an unresolved internal comment.

PR has been updated.

Implements FlashDecoding

fa687cc

hanzhi713 requested review from ruomingp, markblee and a team as code owners December 18, 2024 00:45

hanzhi713 changed the title ~~Implements FlashDecoding.~~ Implements FlashDecoding with Sparsity Support Dec 18, 2024

apghml previously requested changes Dec 18, 2024

hanzhi713 added 2 commits December 18, 2024 01:24

require kv_seq_len

becccdb

update

93120dd

ruomingp approved these changes Dec 18, 2024

hanzhi713 requested a review from apghml December 18, 2024 17:40
