Some Questions About 'accelerate.prepare()' #3278

Open
Klein-Lan opened this issue Dec 7, 2024 · 0 comments
Klein-Lan commented Dec 7, 2024

Hello, I have some questions about the behavior of accelerate.prepare(), and I hope you can help clarify them.

During my experiments, I noticed the following:

from transformers import AutoModelForCausalLM
import accelerate

accelerator = accelerate.Accelerator()
model = AutoModelForCausalLM.from_pretrained("llama2-7b-hf").half()
model = accelerator.prepare(model)
print(model)

My hardware is an NVIDIA A5000 GPU with 24GB of VRAM. Theoretically, loading a 7B model in half precision should only require around 14GB of VRAM.

However, I encountered an out-of-memory error when executing model = accelerator.prepare(model), while using model.to(accelerator.device) did not result in an error.
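For comparison, the variant that did not run out of memory looks like this (only the final step differs from the script above):

import accelerate
from transformers import AutoModelForCausalLM

accelerator = accelerate.Accelerator()
model = AutoModelForCausalLM.from_pretrained("llama2-7b-hf").half()
# Moving the fp16 model to the GPU directly stays within the 24GB of VRAM,
# whereas accelerator.prepare(model) above triggers an out-of-memory error.
model = model.to(accelerator.device)
print(model)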

This outcome is quite puzzling to me. I don't understand why this happens, or how I should use accelerate to run multi-GPU bf16 inference with llama2-7b-hf on my hardware.
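To make the goal concrete, the sketch below is roughly the kind of bf16 inference I am after. It assumes a local llama2-7b-hf checkpoint and uses device_map="auto" (which, as I understand it, relies on accelerate under the hood to shard the weights across the available GPUs), but I am not sure whether this or accelerator.prepare() is the intended way:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights directly in bf16 and let accelerate place them
# across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("llama2-7b-hf")

# Put the inputs on the device that holds the first model shard.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))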
