Information
Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
Run 'accelerate launch pippy_example2.py' on 2 nodes; each node has 2 GPUs with 24 GB of VRAM each.
The script goes out of memory before reaching the 'model.eval()' line.
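For context, the per-node launch looks roughly like the command below (the IP, port, and rank values are placeholders for my cluster, not taken from the script):

accelerate launch --multi_gpu --num_machines 2 --num_processes 4 \
    --machine_rank 0 --main_process_ip 10.0.0.1 --main_process_port 29500 \
    pippy_example2.py
# on the second node, the same command with --machine_rank 1

The contents of pippy_example2.py: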
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from accelerate import PartialState, prepare_pippy, init_empty_weights, load_checkpoint_and_dispatch
from torch.distributed import init_process_group
import os


def main():
    model_name = "google/gemma-2-27b-it"  # Replace with the correct Hugging Face model ID
    with init_empty_weights():
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model.tie_weights()

    print('\n\n###########################################\n###########################################\nload_checkpoint_and_dispatch2\n###########################################\n###########################################\n\n')
    model = load_checkpoint_and_dispatch(
        model,
        device_map="auto",
        checkpoint='/storage/.cache/huggingface/hub/models--google--gemma-2-27b-it/snapshots/aaf20e6b9f4c0fcf043f6fb2a2068419086d77b0',
    )
    # model.tie_weights()

    print('\n\n###########################################\n###########################################\neval\n###########################################\n###########################################\n\n')
    model.eval()

    # Input configs
    # Create example inputs for the model
    print('\n\n###########################################\n###########################################\ntest\n###########################################\n###########################################\n\n')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompts = ("I would like to", "I really like to")  # bs = 2, sending 2 per process
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    prompts = ("I would like to", "I really like to", "The weather is pretty")  # bs = 3
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    inputs = inputs.to(0)

    with torch.no_grad():
        output = model(**inputs)

    # The outputs are only on the final process by default
    if PartialState().is_last_process:
        next_token_logits = output[0][:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        print(tokenizer.batch_decode(next_token))

    PartialState().destroy_process_group()


if __name__ == "__main__":
    main()
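For what it's worth, a variant of the dispatch call that caps what device_map="auto" may place on each GPU would look like the sketch below. max_memory is a real load_checkpoint_and_dispatch parameter, but the limits here are untested placeholders I picked for 24 GB cards:

import torch
from transformers import AutoModelForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Same empty-weights init as above, then dispatch with explicit per-device caps
# so "auto" offloads whatever does not fit to CPU instead of going OOM.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b-it", torch_dtype=torch.float16)
model.tie_weights()
model = load_checkpoint_and_dispatch(
    model,
    checkpoint='/storage/.cache/huggingface/hub/models--google--gemma-2-27b-it/snapshots/aaf20e6b9f4c0fcf043f6fb2a2068419086d77b0',
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"},  # placeholder limits
)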
Expected behavior
Load the model distributed across the 4 GPUs on the 2 nodes. I want to run inference across multiple nodes, since no single node can load the model completely.
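For reference, my understanding of the intended pipeline-parallel path, based on the pippy examples shipped with Accelerate, is roughly the sketch below (each rank loads the weights normally and prepare_pippy splits the model into one stage per process; I have not verified that loading fits in 24 GB per GPU):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import PartialState, prepare_pippy

model_name = "google/gemma-2-27b-it"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(("I would like to", "I really like to"), return_tensors="pt", padding=True)

# Trace the model with example inputs and split it into one stage per process.
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)

inputs = inputs.to(0)
with torch.no_grad():
    output = model(**inputs)

# Outputs land on the last pipeline stage by default.
if PartialState().is_last_process:
    next_token = torch.argmax(output[0][:, -1, :], dim=-1)
    print(tokenizer.batch_decode(next_token))
PartialState().destroy_process_group()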