We have a CI test in PyTorch/XLA that runs `accelerate test`. When `accelerate` is installed from the main branch, the command fails with `ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc)`.
[2024-12-18, 14:28:43 UTC] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='unowned' AIRFLOW_CTX_DAG_ID='pytorchxla-nightly' AIRFLOW_CTX_TASK_ID='pt-nightly-accelerate-smoke-v2-8-1vm.run_model' AIRFLOW_CTX_EXECUTION_DATE='2024-12-17T14:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='3' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-12-17T14:00:00+00:00'
[2024-12-18, 14:28:43 UTC] {tpu.py:375} INFO - Connecting to IP addresses of workers: ['10.128.0.130']
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Authentication (publickey) successful!
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + sudo echo'accelerator_type=${1} if [[ ${accelerator_type} =~ ^v5.* ]] then device_name=vfio/* else device_name=accel* fi echo "Terminating all processes utilizing the TPU (if any)." sudo lsof -t /dev/${device_name} | xargs -r kill -9'
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + bash /tmp/kill_process.sh v2-8
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} INFO - Terminating all processes utilizing the TPU (if any).
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PJRT_DEVICE=TPU
+ PJRT_DEVICE=TPU
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
+ accelerate test
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: Traceback (most recent call last):
stderr: File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
stderr: launch_command(args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: tpu_launcher(args)
stderr: File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
stderr: xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
stderr: File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
stderr: return pjrt.spawn(fn, nprocs, start_method, args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
stderr: raise ValueError(
stderr: ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} INFO -
Running: accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
File "/home/ml-auto-solutions/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/test.py", line 53, in test_command
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - result = execute_subprocess_async(cmd)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/testing.py", line 607, in execute_subprocess_async
raise RuntimeError(
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - RuntimeError: 'accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
launch_command(args)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
tpu_launcher(args)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
return pjrt.
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - spawn(fn, nprocs, start_method, args)
File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
raise ValueError(
ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1826} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.11/lib/python3.11/site-packages/airflow/decorators/base.py", line 220, in execute
return_value = super().execute(context)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
return_value = self.execute_callable()
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/gcs/dags/xlml/utils/tpu.py", line 404, in ssh_tpu
ssh_group.run(cmds, env=env)
File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 116, in run
return self._do("run", *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 282, in _do
raise GroupException(results)
fabric.exceptions.GroupException: {<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1346} INFO - Marking task as FAILED. dag_id=pytorchxla-nightly, task_id=pt-nightly-accelerate-smoke-v2-8-1vm.run_model, execution_date=20241217T140000, start_date=20241218T142842, end_date=20241218T142901
[2024-12-18, 14:29:01 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 6266661 for task pt-nightly-accelerate-smoke-v2-8-1vm.run_model ({<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}; 3375752)
[2024-12-18, 14:29:02 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-12-18, 14:29:03 UTC] {taskinstance.py:2656} INFO - 1 downstream tasks scheduled from follow-on schedule check
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
### Reproduction
if [ -d "$HOME/.local/bin" ] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
# Dependency of accelerate; unfortunately there is no requirements.txt in accelerate.
Hi @tengyifei,
I think the error comes from torch_xla's usage of xmp.spawn, in particular from here.
The error says the `nprocs` argument should be set to either 1 or the number of devices, but the code raises an error when it is not `None`.
I would check whether setting it to the number of devices makes it work; otherwise, try setting it to `None`. I do not know whether this is something that has changed in torch_xla.
@tengomucho Can you update the accelerate launch code for TPU VMs to the following? `xmp.spawn` only accepts an `nprocs` of either 1 or `None`; `None` uses all of the devices.
```python
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=1 if args.num_processes == 1 else None)
```
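For anyone who wants to sanity-check the mapping without a TPU, here is a minimal sketch of the same idea in plain Python. Both `resolve_nprocs` and `check_nprocs` are hypothetical names introduced only for illustration; `check_nprocs` is a stand-in that mimics the behavior described in this thread (only 1 or `None` is accepted) and is not torch_xla's actual code.

```python
from typing import Optional


def resolve_nprocs(num_processes: int) -> Optional[int]:
    """Hypothetical helper: map the configured process count to an accepted nprocs value."""
    # 1 requests single-process execution; None asks for all available devices.
    return 1 if num_processes == 1 else None


def check_nprocs(nprocs: Optional[int]) -> None:
    """Stand-in for the check seen in the traceback; not torch_xla's actual code."""
    if nprocs is not None and nprocs != 1:
        raise ValueError(f"Unsupported nprocs ({nprocs}).")


if __name__ == "__main__":
    for configured in (1, 8):
        nprocs = resolve_nprocs(configured)
        check_nprocs(nprocs)  # does not raise for 1 or None
        print(f"num_processes={configured} -> nprocs={nprocs}")
```

With this mapping, a config that sets `num_processes: 8` no longer passes 8 through to the spawn call. The trade-off is that any value other than 1 is treated as "use all devices", so a request for, say, 4 of 8 devices would have to be expressed through the environment variable mentioned in the error message instead.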
### System Info
if [ -d "$HOME/.local/bin" ] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
# Dependency of accelerate; unfortunately there is no requirements.txt in accelerate.
pip install pytest
git clone https://github.com/huggingface/accelerate.git
pip install ./accelerate
mkdir -p ~/.cache/huggingface/accelerate/
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'HF_CONFIG_EOF'
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
HF_CONFIG_EOF
accelerate env
accelerate test