"hbldv_modify_qp(INIT) failed: 22, nic: 0" Error on Gaudi 3 Accelerator #1673

ajscalers · 2024-12-27T12:34:30Z

System Info

Docker image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561

Optimum-Habana version: 1.14.1

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Install the dependencies for the examples/text-generation folder as specified in its README.md (i.e. install DeepSpeed and the packages in requirements.txt).
Run the flash attention example from the text-generation folder, i.e.:

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--use_hpu_graphs \
--limit_hpu_graphs \
--use_kv_cache \
--bf16 \
--trim_logits \
--attn_softmax_bf16 \
--bucket_size=128 \
--bucket_internal \
--batch_size 8 \
--max_input_tokens 40960 \
--max_new_tokens 5120 \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask \
--book_source

The command does not run to completion. After the model is loaded onto the accelerators, it fails with the following error:

/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/ibverbs/hcl_ibverbs.cpp::295(create_qp): The condition [ rc == 0 ] failed. hbldv_modify_qp(INIT) failed: 22, nic: 11

I have uploaded the full file below:

habana_error.txt
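
For triage, the numeric fields in the error line quoted above can be pulled out programmatically. The following is a minimal sketch, assuming the return code 22 is a standard POSIX errno value (in which case it would be EINVAL, "Invalid argument"). The helper name `decode_hcl_qp_error` and the regular expression are illustrative only; they are not part of HCL or Optimum-Habana, and the pattern is inferred from this single log line.

```python
import errno
import re

# Pattern inferred from the one failure line in this report; other HCL
# messages may use a different layout (assumption).
HCL_QP_ERROR = re.compile(
    r"(?P<call>\w+)\((?P<state>\w+)\) failed: (?P<code>\d+), nic: (?P<nic>\d+)"
)


def decode_hcl_qp_error(line):
    """Extract call name, QP state, return code, and NIC index, or return None."""
    match = HCL_QP_ERROR.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "call": match.group("call"),
        "state": match.group("state"),
        "code": code,
        # Assumption: the code is a POSIX errno value.
        "errno_name": errno.errorcode.get(code, "unknown"),
        "nic": int(match.group("nic")),
    }


# Tail of the error message reported above.
line = "The condition [ rc == 0 ] failed. hbldv_modify_qp(INIT) failed: 22, nic: 11"
info = decode_hcl_qp_error(line)
print(info)
```

Running the same helper over every line of the attached log would show whether the failure is limited to one NIC or occurs on several, which may help narrow down the cause.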

Expected behavior

The example should run to completion.

ajscalers added the bug label on Dec 27, 2024
