
distributed GPU inference across multiple machines on the same network #3313

Open
MonkeeMan1 commented Dec 25, 2024

System Info

- `Accelerate` version: 1.2.1
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
- `accelerate` bash location: /mnt/c/Users/benn/OneDrive/Desktop/aii/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.2.1
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 30.92 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hello.

I am currently attempting to get multi-machine GPU inference working across two computers on my local network. The goal is to load 50% of a model on the 4090 in one machine and the other 50% on the 4090 in the second machine, both to speed up inference and possibly to allow larger models to be loaded.

from accelerate import Accelerator
from accelerate.utils import gather_object
from transformers import AutoModelForCausalLM, AutoTokenizer
from statistics import mean
import torch, time, json

accelerator = Accelerator()

# Prompt list (a single prompt here for testing). Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
prompts_all=[
    "The King is dead. Long live the Queen.",
]

# load a base model and tokenizer
model_path = "Qwen/Qwen2.5-Coder-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,    
    device_map={"": accelerator.process_index},
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)   

# sync GPUs and start the timer
accelerator.wait_for_everyone()
start=time.time()
print("STARTING")

# divide the prompt list onto the available GPUs 
with accelerator.split_between_processes(prompts_all) as prompts:
    # store output of generations in dict
    results=dict(outputs=[], num_tokens=0)

    # have each GPU do inference, prompt by prompt
    for prompt in prompts:
        prompt_tokenized=tokenizer(prompt, return_tensors="pt").to("cuda")
        output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

        # remove prompt from output 
        output_tokenized=output_tokenized[len(prompt_tokenized["input_ids"][0]):]

        # store outputs and number of tokens in result{}
        results["outputs"].append( tokenizer.decode(output_tokenized) )
        results["num_tokens"] += len(output_tokenized)
        print("1here")

    results=[ results ] # transform to list, otherwise gather_object() will not collect correctly
    print("2here")

# collect results from all the GPUs
results_gathered=gather_object(results)

if accelerator.is_main_process:
    timediff=time.time()-start
    num_tokens=sum([r["num_tokens"] for r in results_gathered ])

    print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")
Master machine:

accelerate launch --num_machines 2 --machine_rank 0 --main_process_ip 192.168.0.21 --main_process_port 29500 main.py

Worker machine:

accelerate launch --num_machines 2 --machine_rank 1 --main_process_ip 192.168.0.21 --main_process_port 29500 main.py
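
To check whether these flags are being picked up at all, a minimal sketch (assuming accelerate launch exports the standard torch.distributed rendezvous variables for multi-node runs) just prints them on each machine:

import os

# If the --num_machines / --main_process_ip flags take effect, I would expect
# WORLD_SIZE=2 and MASTER_ADDR=192.168.0.21 on both machines.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var, "<unset>"))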

Expected behavior

The expected behaviour is that the two machines work together on the prompts and speed up inference. However, when the launch commands are started, the two processes appear to run completely independently and make no attempt to connect to each other. There is no evidence that the --num_machines and --main_process_ip arguments are having any effect.
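
For reference, a minimal diagnostic (a sketch using the standard Accelerate / torch.distributed APIs) that should report num_processes=2 on both machines if they had actually joined the same process group:

from accelerate import Accelerator
import torch.distributed as dist

accelerator = Accelerator()
print(f"process_index={accelerator.process_index}, "
      f"num_processes={accelerator.num_processes}, "
      f"distributed_type={accelerator.distributed_type}")

# If the two launches joined a single process group, world_size should be 2;
# if torch.distributed was never initialized, the machines never connected.
if dist.is_available() and dist.is_initialized():
    print("world_size =", dist.get_world_size())
else:
    print("torch.distributed is not initialized")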
