Hello, I have some questions about what accelerate.prepare() does to the model, and I hope you can help me understand it.
During my experiments, I noticed the following:
from transformers import AutoModelForCausalLM
import accelerate
accelerator = accelerate.Accelerator()
model = AutoModelForCausalLM.from_pretrained("llama2-7b-hf").half()
model = accelerator.prepare(model)
print(model)
My hardware is an NVIDIA A5000 GPU with 24 GB of VRAM. In theory, a 7B model in half precision should only need around 14 GB of VRAM (7B parameters × 2 bytes per parameter).
However, I encountered an out-of-memory error when executing model = accelerator.prepare(model), while using model.to(accelerator.device) did not result in an error.
This outcome is quite puzzling to me. I don't understand why it happens, or how I can use accelerate to run multi-GPU bf16 inference for llama2-7b-hf with this hardware.
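For reference, this is the kind of bf16 inference setup I would expect to fit in 24 GB. It is only a sketch, and it assumes that loading the weights directly in bf16 via torch_dtype (instead of calling .half() on an fp32 model) and moving them with .to() avoids whatever extra allocation prepare() triggers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import accelerate

accelerator = accelerate.Accelerator()

# Load the weights directly in bf16 so a full fp32 copy is never materialized
# (7B parameters x 2 bytes ≈ 14 GB, which should fit on a 24 GB A5000).
model = AutoModelForCausalLM.from_pretrained("llama2-7b-hf", torch_dtype=torch.bfloat16)
model.eval()

# Move the model with .to() for inference; prepare() is the step that runs out of memory for me.
model = model.to(accelerator.device)

tokenizer = AutoTokenizer.from_pretrained("llama2-7b-hf")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(accelerator.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The torch_dtype argument and the .to() call are just my guess at a lighter path; the original question of why accelerator.prepare() allocates more memory still stands.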