Information
Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
Run 'accelerate launch pippy_example2.py' on 2 nodes; each node has 2 GPUs with 24 GB of VRAM each.
The script goes out of memory before reaching the 'model.eval()' line.
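For context, the per-node launch looks roughly like the command below (the IP, port, and rank values are placeholders for my cluster, not taken from the script):

accelerate launch --multi_gpu --num_machines 2 --num_processes 4 \
    --machine_rank 0 --main_process_ip 10.0.0.1 --main_process_port 29500 \
    pippy_example2.py
# on the second node, the same command with --machine_rank 1

The contents of pippy_example2.py: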
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from accelerate import PartialState, prepare_pippy, init_empty_weights, load_checkpoint_and_dispatch
from torch.distributed import init_process_group
import os


def main():
    model_name = "google/gemma-2-27b-it"  # Replace with the correct Hugging Face model ID
    with init_empty_weights():
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model.tie_weights()

    print('\n\n###########################################\n###########################################\nload_checkpoint_and_dispatch2\n###########################################\n###########################################\n\n')
    model = load_checkpoint_and_dispatch(
        model,
        device_map="auto",
        checkpoint='/storage/.cache/huggingface/hub/models--google--gemma-2-27b-it/snapshots/aaf20e6b9f4c0fcf043f6fb2a2068419086d77b0',
    )
    # model.tie_weights()

    print('\n\n###########################################\n###########################################\neval\n###########################################\n###########################################\n\n')
    model.eval()

    # Input configs
    # Create example inputs for the model
    print('\n\n###########################################\n###########################################\ntest\n###########################################\n###########################################\n\n')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompts = ("I would like to", "I really like to")  # bs = 2, sending 2 per process
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    prompts = ("I would like to", "I really like to", "The weather is pretty")  # bs = 3
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    inputs = inputs.to(0)

    with torch.no_grad():
        output = model(**inputs)

    # The outputs are only on the final process by default
    if PartialState().is_last_process:
        next_token_logits = output[0][:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        print(tokenizer.batch_decode(next_token))

    PartialState().destroy_process_group()


if __name__ == "__main__":
    main()
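For what it's worth, a variant of the dispatch call that caps what device_map="auto" may place on each GPU would look like the sketch below. max_memory is a real load_checkpoint_and_dispatch parameter, but the limits here are untested placeholders I picked for 24 GB cards:

import torch
from transformers import AutoModelForCausalLM
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Same empty-weights init as above, then dispatch with explicit per-device caps
# so "auto" offloads whatever does not fit to CPU instead of going OOM.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b-it", torch_dtype=torch.float16)
model.tie_weights()
model = load_checkpoint_and_dispatch(
    model,
    checkpoint='/storage/.cache/huggingface/hub/models--google--gemma-2-27b-it/snapshots/aaf20e6b9f4c0fcf043f6fb2a2068419086d77b0',
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"},  # placeholder limits
)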
Expected behavior
Load the model distributed across the 4 GPUs on the 2 nodes. I want to run inference across multiple nodes, since no single node can load the model completely.
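For reference, my understanding of the intended pipeline-parallel path, based on the pippy examples shipped with Accelerate, is roughly the sketch below (each rank loads the weights normally and prepare_pippy splits the model into one stage per process; I have not verified that loading fits in 24 GB per GPU):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import PartialState, prepare_pippy

model_name = "google/gemma-2-27b-it"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(("I would like to", "I really like to"), return_tensors="pt", padding=True)

# Trace the model with example inputs and split it into one stage per process.
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)

inputs = inputs.to(0)
with torch.no_grad():
    output = model(**inputs)

# Outputs land on the last pipeline stage by default.
if PartialState().is_last_process:
    next_token = torch.argmax(output[0][:, -1, :], dim=-1)
    print(tokenizer.batch_decode(next_token))
PartialState().destroy_process_group()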