MultiModalityCausalLM does not support Flash Attention 2.0 yet #35383

Open

AlanPonnachan opened this issue Dec 21, 2024 · 0 comments

System Info

transformers version: 4.47.1
Platform: Google Colab
Python version: 3.10.12


I attempted to use Flash Attention 2 with the Janus-1.3B model, but encountered the following error:

ValueError: MultiModalityCausalLM does not support Flash Attention 2.0 yet.

The error is raised by this check in transformers/modeling_utils.py:

if not cls._supports_flash_attn_2:
    raise ValueError(
        f"{cls.__name__} does not support Flash Attention 2.0 yet. Please request to add support where"
        f" the model is hosted, on its model hub page: https://huggingface.co/{config._name_or_path}/discussions/new"
        " or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new"
    )
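
As a sanity check (just a sketch, not part of my original run; it relies on the private _supports_flash_attn_2 attribute, which may change between transformers versions), the same flag can be inspected on a model loaded without Flash Attention:

from transformers import AutoModelForCausalLM

# Load without requesting Flash Attention 2 so the check above is not triggered,
# then inspect the class-level flag that modeling_utils.py reads.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/Janus-1.3B", trust_remote_code=True
)
print(getattr(type(model), "_supports_flash_attn_2", False))  # expected: False, matching the error above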

I installed FlashAttention-2 using the command:
pip install flash-attn --no-build-isolation
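
The installation itself completed. A quick check along these lines (a sketch, assuming a CUDA GPU runtime in Colab) confirms whether transformers detects the package:

import flash_attn
from transformers.utils import is_flash_attn_2_available

print(flash_attn.__version__)        # FlashAttention-2 package version
print(is_flash_attn_2_available())   # whether transformers sees a usable FA2 install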

Here is the code I used:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, attn_implementation="flash_attention_2"
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["images/equation.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)