
[WIP] Add support for flex attention (paged attention) #35419

Draft
blzheng wants to merge 15 commits into main

Conversation


@blzheng commented on Dec 26, 2024

What does this PR do?

This PR integrates PyTorch FlexAttention (paged attention) into Llama and GPT-J (greedy search).
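For context, here is a minimal sketch of the FlexAttention API this PR builds on (`torch.nn.attention.flex_attention`, available in PyTorch >= 2.5 and assumed here to run on a CUDA device). The shapes and the causal `mask_mod` are illustrative only, not code from this PR:

```python
# Minimal FlexAttention sketch: attention with a mask expressed as a
# block mask, which is also the mechanism paged attention builds on.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 128, 64  # batch, heads, sequence length, head dim (illustrative)
q = torch.randn(B, H, S, D, device="cuda")
k = torch.randn(B, H, S, D, device="cuda")
v = torch.randn(B, H, S, D, device="cuda")

def causal(b, h, q_idx, kv_idx):
    # Standard causal masking: a query attends only to positions at or before it.
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```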

TODOs

  • Add flex attention for the first token. The current implementation uses SDPA for the first token and flex attention for subsequent tokens (see the sketch after this list).
  • Address re-compile issues to eliminate hard-coded logic.
  • Extend support to beam search.
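A rough sketch of the prefill/decode split described in the first TODO: SDPA handles the first (prefill) step, and flex attention handles subsequent decode steps. The `dispatch_attention` helper is hypothetical and not taken from this PR:

```python
# Hedged sketch of the current first-token/next-token dispatch, under the
# assumption that paging is expressed through a precomputed block mask.
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention

def dispatch_attention(q, k, v, block_mask=None, is_prefill=True):
    if is_prefill:
        # First token: full causal prefill via SDPA.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Next tokens: decode step against the (paged) KV cache via flex attention.
    return flex_attention(q, k, v, block_mask=block_mask)
```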

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
