RuntimeError: synStatus=26 [Generic failure] Device acquire failed. #1611

Open
2 of 4 tasks
VinayHN1365466 opened this issue Dec 16, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@VinayHN1365466

System Info

HL-SMI Version:hl-1.18.0-fw-53.1.1.1
Driver Version:1.18.0-ee698fb 
Docker: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

I'm getting the error below while running the text-generation example script:
python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"



/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6898.53it/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4760.84it/s]
12/16/2024 08:28:10 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. cd examples/text-generation/
  2. python run_generation.py --model_name_or_path gpt2 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"

Expected behavior

Execute Successfully

VinayHN1365466 added the bug label Dec 16, 2024
@regisss
Collaborator

regisss commented Dec 16, 2024

@VinayHN1365466 It looks like your devices are already busy or are somehow unavailable. Can you run hl-smi and paste the output here please?
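If it is easier to collect the output from a script than from a terminal, the request above could be handled with something like the following minimal sketch. It only assumes that `hl-smi` is on `PATH`; the helper name `capture_hl_smi` is made up for illustration and is not part of any Habana tooling.

```python
import shutil
import subprocess


def capture_hl_smi(binary="hl-smi", timeout=30):
    """Run hl-smi with no arguments and return its combined output as text."""
    path = shutil.which(binary)
    if path is None:
        # Nothing to run: report it instead of raising.
        return f"{binary} not found on PATH"
    try:
        result = subprocess.run(
            [path],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages together with the table
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"{binary} did not finish within {timeout} seconds"
    return result.stdout


if __name__ == "__main__":
    print(capture_hl_smi())
```

The returned text can be pasted into the issue as-is, which avoids screenshots that cannot be searched.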

@VinayHN1365466
Author

No processes are running.

@VinayHN1365466
Author

image

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you try adding --privileged to your docker run command?

@VinayHN1365466
Author

VinayHN1365466 commented Dec 16, 2024

Thanks regisss, I tried adding --privileged to the docker run command, but it's still the same error:

docker run --privileged -it --name optimum_118_8cards_vinay_new_1234 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /mode_file/:/root/.cache/ -v /optimum-habana:/root/optimum-habana vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:lates

image

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you paste here the complete logs you're getting?

@VinayHN1365466
Author

~/optimum-habana/examples/text-generation# python run_generation.py \
    --model_name_or_path gpt2 \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --prompt "Here is my prompt"
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Fetching 1 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14665.40it/s]
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6316.72it/s]
12/16/2024 09:09:06 - INFO - __main__ - Single-device run.
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 773, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 384, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 720, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 297, in setup_model
    model = model.eval().to(args.device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1177, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in convert
    return t.to(
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

@regisss
Collaborator

regisss commented Dec 16, 2024

Does running

import torch
import habana_frameworks.torch.hpu

a = torch.tensor(1, device="hpu")

work?
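A slightly more defensive variant of the same check is sketched below. It reports the outcome instead of stopping at the first exception, which makes it easier to paste a one-line result into the thread. The function name `hpu_smoke_test` is hypothetical; the imports and the `device="hpu"` call are the ones from the snippet above.

```python
def hpu_smoke_test():
    """Try to place a tensor on the HPU; return (ok, message) instead of raising."""
    try:
        import torch
        import habana_frameworks.torch.hpu  # noqa: F401  (same import as the snippet above)
    except ImportError as exc:
        # Either PyTorch or the Habana bridge is missing from this environment.
        return False, f"import failed: {exc}"
    try:
        a = torch.tensor(1, device="hpu")
    except RuntimeError as exc:
        # Device acquire failures surface as RuntimeError, as in the traceback in this issue.
        return False, f"RuntimeError: {exc}"
    return True, f"tensor created on {a.device}"


if __name__ == "__main__":
    ok, message = hpu_smoke_test()
    print("OK" if ok else "FAILED", "-", message)
```

If this fails with the same `synStatus=26` message, the problem is in device acquisition itself rather than in `run_generation.py`.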

@VinayHN1365466
Author

I got the same error:
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(object, types.FunctionType)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/sample.py", line 4, in <module>
    a = torch.tensor(1, device="hpu")
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

@regisss
Collaborator

regisss commented Dec 16, 2024

Can you reboot this instance?

@VinayHN1365466
Author

Sorry, I don't have access to reboot the instance :(

@libinta
Collaborator

libinta commented Dec 16, 2024

@VinayHN1365466 Can you capture the output of dmesg -T? Thanks.
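The full kernel log can be long, so it may help to attach only the driver-related lines alongside it. The sketch below filters `dmesg -T` output by keyword. The default keyword `habanalabs` is an assumption based on the kernel driver's name, and both helper names are invented for this example; adjust the keywords to whatever your log actually contains.

```python
import subprocess


def filter_kernel_log(text, keywords=("habanalabs",)):
    """Return the lines of a kernel log that mention any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        line
        for line in text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


def read_dmesg():
    """Return the output of `dmesg -T`; this may need root where the kernel log is restricted."""
    result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=False)
    return result.stdout
```

A typical use would be `print("\n".join(filter_kernel_log(read_dmesg())))`. If the result is empty, post the unfiltered log instead, since the assumed keyword may simply not match.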

@yuanwu2017
Contributor

No processes are running.

On some cloud machines, you need to add sudo to see processes that belong to other users.
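One way to see why sudo matters here is to scan `/proc` for processes that hold a device file open. Without root, the file descriptors of other users' processes cannot be read, so a busy device can look idle. The sketch below counts how many processes could not be inspected. The device path prefix `/dev/accel` is a guess, not something confirmed in this thread, and `find_device_holders` is a made-up name; check `ls /dev` on your host and change the prefix accordingly.

```python
import os


def find_device_holders(prefix="/dev/accel", proc_root="/proc"):
    """Return (holders, denied).

    holders maps PID -> set of open paths starting with prefix.
    denied counts processes whose fd table could not be read (usually other users').
    """
    holders = {}
    denied = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            names = os.listdir(fd_dir)
        except PermissionError:
            denied += 1
            continue
        except OSError:
            continue  # process exited while we were scanning
        for name in names:
            try:
                target = os.readlink(os.path.join(fd_dir, name))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return holders, denied


if __name__ == "__main__":
    holders, denied = find_device_holders()
    print("holders:", holders)
    print("processes that could not be inspected:", denied)
```

If `denied` is non-zero, an empty `holders` result does not prove the devices are free; rerunning with sudo gives the complete picture.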
