RuntimeError: synStatus=26 [Generic failure] Device acquire failed. #1611

VinayHN1365466 · 2024-12-16T08:29:38Z

System Info

HL-SMI Version:hl-1.18.0-fw-53.1.1.1
Driver Version:1.18.0-ee698fb 
Docker: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

I'm getting the error below when running the text-generation example:
python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"



/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6898.53it/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4760.84it/s]
12/16/2024 08:28:10 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

cd examples/text-generation/
python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"

Expected behavior

The script should execute successfully.


regisss · 2024-12-16T08:31:43Z

@VinayHN1365466 It looks like your devices are already busy or are somehow unavailable. Can you run hl-smi and paste the output here please?
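For anyone who wants to script this check, here is a minimal sketch that runs `hl-smi` and returns its raw output. The function name `run_hl_smi` and its parameters are hypothetical, introduced only for illustration. The only command it invokes is plain `hl-smi` (optionally through `sudo`, since on shared machines elevated privileges may be needed to see other users' processes). It does not parse the output, because the exact layout is not shown in this thread.

```python
import shutil
import subprocess


def run_hl_smi(use_sudo=False, runner=subprocess.run, which=shutil.which):
    """Return the text printed by hl-smi, or None if hl-smi is not on PATH.

    `runner` and `which` are injectable (hypothetical parameters) so the
    function can be exercised on a machine without Gaudi hardware.
    """
    if which("hl-smi") is None:
        return None
    # Assumption: running through sudo is only needed on some shared hosts.
    cmd = ["sudo", "hl-smi"] if use_sudo else ["hl-smi"]
    result = runner(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `print(run_hl_smi())` inside the container and again on the host can help show whether the devices are visible in both places.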

VinayHN1365466 · 2024-12-16T08:33:32Z

No processes are running.

VinayHN1365466 · 2024-12-16T08:34:29Z

regisss · 2024-12-16T09:01:24Z

Can you try adding --privileged to your docker run command?

VinayHN1365466 · 2024-12-16T09:10:37Z

Thanks regisss. I tried adding --privileged to the docker run command, but it's still the same error:

docker run --privileged -it --name optimum_118_8cards_vinay_new_1234 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /mode_file/:/root/.cache/ -v /optimum-habana:/root/optimum-habana vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:lates

regisss · 2024-12-16T09:12:58Z

Can you paste here the complete logs you're getting?

VinayHN1365466 · 2024-12-16T09:13:42Z

~/optimum-habana/examples/text-generation# python run_generation.py \
    --model_name_or_path gpt2 \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --prompt "Here is my prompt"
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14665.40it/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6316.72it/s]
12/16/2024 09:09:06 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

regisss · 2024-12-16T09:29:28Z

Does running

import torch
import habana_frameworks.torch.hpu

a = torch.tensor(1, device="hpu")

work?
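A slightly more defensive variant of the same check is sketched below. It separates "the Habana PyTorch bridge could not be imported" from "the device could not be acquired", which are different problems. The function name `probe_hpu` and the status strings are hypothetical. The only library calls used are the ones in the snippet above.

```python
def probe_hpu():
    """Try to place a tensor on the HPU and report which stage failed.

    Returns a (status, detail) tuple; the status strings are illustrative.
    """
    try:
        import torch
        import habana_frameworks.torch.hpu  # noqa: F401 (registers the "hpu" device)
    except ImportError as exc:
        # The framework itself is missing or broken in this environment.
        return ("import_failed", str(exc))
    try:
        torch.tensor(1, device="hpu")
    except RuntimeError as exc:
        # This is where "synStatus=26 ... Device acquire failed." would surface.
        return ("acquire_failed", str(exc))
    return ("ok", "")


if __name__ == "__main__":
    status, detail = probe_hpu()
    print(status, detail)
```

If this reports an acquire failure even though nothing else is using the cards, the cause is likely outside the Python script (container setup, driver, or another user's process).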

VinayHN1365466 · 2024-12-16T09:33:49Z

I got the same error
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/sample.py", line 4, in <module>
    a = torch.tensor(1, device="hpu")
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

regisss · 2024-12-16T09:41:52Z

Can you reboot this instance?

VinayHN1365466 · 2024-12-16T09:44:09Z

Sorry, I don't have access to reboot the instance :(

libinta · 2024-12-16T16:48:54Z

@VinayHN1365466 Can you capture the output of dmesg -T? Thanks.
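Since the full kernel log can be long, a small filter may make it easier to paste only the relevant part. The sketch below assumes that driver messages contain the string `habanalabs` (the kernel driver's name). The function names and the keyword default are hypothetical, and reading `dmesg` may require root on some systems.

```python
import subprocess


def capture_dmesg(runner=subprocess.run):
    """Return the output of `dmesg -T` (kernel log with readable timestamps)."""
    result = runner(["dmesg", "-T"], capture_output=True, text=True, check=False)
    return result.stdout


def filter_driver_lines(dmesg_text, keywords=("habanalabs",)):
    """Keep only the lines that mention one of the keywords (case-insensitive).

    Assumption: the default keyword matches the Gaudi kernel driver's messages.
    """
    lowered = tuple(k.lower() for k in keywords)
    return [
        line
        for line in dmesg_text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]
```

For example, `print("\n".join(filter_driver_lines(capture_dmesg())))` prints just the matching lines. Attaching the unfiltered log as well is safer, in case the relevant message does not match the keyword.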

yuanwu2017 · 2024-12-18T05:25:43Z

> No processes are running.

On some cloud machines, you need to use sudo to see the processes of other users.

VinayHN1365466 added the bug Something isn't working label Dec 16, 2024
