Accelerate fails to initialize on Cloud TPUs #3304

Open

tengyifei opened this issue Dec 18, 2024 · 2 comments

@tengyifei

### System Info

We have a CI test in PyTorch/XLA that runs `accelerate test`. When `accelerate` is installed from the main branch, the command fails with `ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc)`.


[2024-12-18, 14:28:43 UTC] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='unowned' AIRFLOW_CTX_DAG_ID='pytorchxla-nightly' AIRFLOW_CTX_TASK_ID='pt-nightly-accelerate-smoke-v2-8-1vm.run_model' AIRFLOW_CTX_EXECUTION_DATE='2024-12-17T14:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='3' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-12-17T14:00:00+00:00'
[2024-12-18, 14:28:43 UTC] {tpu.py:375} INFO - Connecting to IP addresses of workers: ['10.128.0.130']
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Authentication (publickey) successful!
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + sudo echo 'accelerator_type=${1}
  if [[ ${accelerator_type} =~ ^v5.* ]]
  then
    device_name=vfio/*
  else
    device_name=accel*
  fi
  echo "Terminating all processes utilizing the TPU (if any)."
  sudo lsof -t /dev/${device_name} | xargs -r kill -9
  '
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + bash /tmp/kill_process.sh v2-8
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} INFO - Terminating all processes utilizing the TPU (if any).
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PJRT_DEVICE=TPU
+ PJRT_DEVICE=TPU
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
+ accelerate test
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: Traceback (most recent call last):
stderr:   File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
stderr:     launch_command(args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:     tpu_launcher(args)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
stderr:     xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
stderr:     return pjrt.spawn(fn, nprocs, start_method, args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
stderr:     raise ValueError(
stderr: ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} INFO - 
Running:  accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/test.py", line 53, in test_command
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING -     result = execute_subprocess_async(cmd)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/testing.py", line 607, in execute_subprocess_async
    raise RuntimeError(
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - RuntimeError: 'accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
    launch_command(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
    tpu_launcher(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
    raise ValueError(
ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1826} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/gcs/dags/xlml/utils/tpu.py", line 404, in ssh_tpu
    ssh_group.run(cmds, env=env)
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 116, in run
    return self._do("run", *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 282, in _do
    raise GroupException(results)
fabric.exceptions.GroupException: {<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1346} INFO - Marking task as FAILED. dag_id=pytorchxla-nightly, task_id=pt-nightly-accelerate-smoke-v2-8-1vm.run_model, execution_date=20241217T140000, start_date=20241218T142842, end_date=20241218T142901
[2024-12-18, 14:29:01 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 6266661 for task pt-nightly-accelerate-smoke-v2-8-1vm.run_model ({<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}; 3375752)
[2024-12-18, 14:29:02 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-12-18, 14:29:03 UTC] {taskinstance.py:2656} INFO - 1 downstream tasks scheduled from follow-on schedule check
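
For context, the check that raises this error (torch_xla/_internal/pjrt.py line 209 in the traceback) appears to dispatch on `nprocs` roughly as in the standalone sketch below; this is a paraphrase for illustration, not the actual torch_xla source:

```python
from typing import Optional

def check_nprocs(nprocs: Optional[int]) -> str:
    """Mirror how the PJRT spawn path appears to dispatch on nprocs."""
    if nprocs == 1:
        return "run the function in a single process on one device"
    if nprocs is not None:
        raise ValueError(
            f"Unsupported nprocs ({nprocs}). Please use the environment "
            "variable for the hardware you are using (X_NUM_DEVICES where "
            "X is CPU, GPU, TPU, NEURONCORE, etc).")
    return "spawn one process per local device"

print(check_nprocs(None))  # path taken when no explicit process count is given
print(check_nprocs(1))     # single-process path
try:
    check_nprocs(8)        # the value accelerate passes from num_processes
except ValueError as e:
    print(e)               # matches the error in the log above
```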


### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)

### Reproduction

if [ -d "$HOME/.local/bin" ] ; then
export PATH="$HOME/.local/bin:$PATH"
fi

Install the dependencies of `accelerate` manually; unfortunately there is no requirements.txt in accelerate.

pip install pytest
git clone https://github.com/huggingface/accelerate.git
pip install ./accelerate

mkdir -p ~/.cache/huggingface/accelerate/
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'HF_CONFIG_EOF'
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
HF_CONFIG_EOF

accelerate env

accelerate test
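
Alternatively, a smaller Python-only reproduction that skips the accelerate CLI (assuming torch_xla is installed on a TPU VM with `PJRT_DEVICE=TPU`; `_mp_fn` is just an illustrative name) mirrors the `nprocs=args.num_processes` call from the traceback:

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    print(f"started process {index}")

if __name__ == "__main__":
    # Expected to raise:
    # ValueError: Unsupported nprocs (8). Please use the environment variable
    # for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU,
    # NEURONCORE, etc).
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```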


### Expected behavior

Test passes
@tengomucho

Hi @tengyifei,
I think the error comes from torch_xla's usage of xmp.spawn, in particular from here.
The error says the nprocs argument should be set to either 1 or the number of devices, but the code raises an error when it is not None.
I would check whether setting it to the number of devices makes it work; otherwise, try setting it to None. I do not know if that is something that has changed in torch_xla.
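
For example, a minimal sketch like the one below (assuming a TPU VM with torch_xla installed; `_mp_fn` is just an illustrative name) shows the call with `nprocs=None`, which spawns one process per local TPU device:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the ordinal of the spawned process
    print(f"process {index} is using device {xm.xla_device()}")

if __name__ == "__main__":
    # nprocs=None spawns one process per local TPU device; nprocs=1 runs a
    # single process. Other values currently raise the ValueError above.
    xmp.spawn(_mp_fn, args=(), nprocs=None)
```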

@radna0

radna0 commented Dec 21, 2024

@tengomucho Can you update the accelerate launch code for TPU VMs to the following? xmp.spawn only accepts an nprocs of either 1 or None; None uses all of the devices.

xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=1 if args.num_processes == 1 else None)
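
With that mapping, any `num_processes` greater than 1 (such as the 8 in the config above) falls back to `nprocs=None`, so one process is spawned per local TPU device, which appears to be the only multi-process mode the PJRT spawn path currently accepts.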
