ValueError: Default process group has not been initialized, please make sure to call init_process_group #3315

wangyu-ustc opened this issue Dec 27, 2024 · 0 comments

System Info

accelerate == 1.2.0
deepspeed == 0.16.2

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. I copied the example from examples/nlp_example.py into a new folder, renamed it to main.py, and replaced the optimizer and scheduler in the file with DummyOptim and DummyScheduler to make it compatible with DeepSpeed configs. This works fine and the code runs properly. But once I add the following, it starts to raise the error from the title (a minimal sketch of the DummyOptim/DummyScheduler substitution is included after the reproduction steps below):
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict()
accelerator.unwrap_model(model).save_pretrained(
    f"{args.output_dir}",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=state_dict,
)
  2. Put the following stage2.json file into the same folder:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
  3. Put the following content into config.yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: stage2.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: default
  dynamo_use_dynamic: true
  dynamo_use_fullgraph: true
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Then run accelerate launch --config_file config.yaml main.py; it raises ValueError: Default process group has not been initialized, please make sure to call init_process_group.
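
For context, here is a minimal sketch of the DummyOptim/DummyScheduler substitution described in step 1. The names model and train_dataloader are assumed to come from nlp_example.py, and the concrete hyperparameter values are illustrative only; Accelerate uses these dummy objects to fill the "auto" fields in stage2.json when accelerator.prepare is called.

from accelerate import Accelerator
from accelerate.utils import DummyOptim, DummyScheduler

accelerator = Accelerator()

# Placeholder objects: the real AdamW optimizer and WarmupDecayLR scheduler are
# built by DeepSpeed from stage2.json; the values below feed its "auto" entries.
optimizer = DummyOptim(model.parameters(), lr=3e-5, weight_decay=0.0)
lr_scheduler = DummyScheduler(optimizer, total_num_steps=1000, warmup_num_steps=100)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)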

Expected behavior

I expect accelerator.wait_for_everyone() and the subsequent state-dict saving to work fine.
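
For reference, a sketch of the ZeRO-aware saving pattern as shown in Accelerate's DeepSpeed usage guide; note that get_state_dict is given the prepared model as an argument there, so the partitioned weights can be consolidated on the main process:

# Gather the full (un-partitioned) state dict and save it from the main process.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    args.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)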
