Qwen1.5-14B finetune error #1336

Open
2 of 4 tasks
Zjq9409 opened this issue Sep 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

Zjq9409 (Contributor) commented Sep 17, 2024

System Info

optimum-habana: 1.13.2
HL-SMI Version: hl-1.17.1-fw-51.5.0
Driver Version: 1.17.1-78932ae

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Download the Qwen1.5-14B weights from: https://huggingface.co/Qwen/Qwen1.5-14B
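
For reference, a sketch of one way to fetch the weights locally (this assumes the huggingface_hub CLI is available; the target directory is only an example and should match the --model_name_or_path passed below):

# Assumption: huggingface_hub is installed with its CLI extra.
pip install -U "huggingface_hub[cli]"
# Download the Qwen1.5-14B weights to a local directory (example path).
huggingface-cli download Qwen/Qwen1.5-14B --local-dir /data/models/Qwen1.5-14B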

cd optimum-habana/examples/language-modeling
python ../gaudi_spawn.py \
    --world_size 8 --use_deepspeed run_clm.py \
    --model_name_or_path /data/models/Qwen1.5-14B/ \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm-xl-1 \
    --gaudi_config_name ./gaudi_config.json \
    --use_habana \
    --logging_steps 1  \
    --use_lazy_mode \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --overwrite_output_dir \
    --deepspeed ./llama2_ds_zero3_config.json
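
The command references two local config files that the report does not include. The sketches below show what such files typically contain; they are assumptions, and the actual gaudi_config.json and llama2_ds_zero3_config.json shipped with the optimum-habana examples may differ:

# Sketch (assumption): a minimal DeepSpeed ZeRO stage-3 config of the kind the
# language-modeling example uses; only standard DeepSpeed keys are shown.
cat > llama2_ds_zero3_config.json <<'EOF'
{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": false
  }
}
EOF

# Sketch (assumption): a typical gaudi_config.json enabling the fused HPU ops.
cat > gaudi_config.json <<'EOF'
{
  "use_fused_adam": true,
  "use_fused_clip_norm": true
}
EOF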

The error log from the run is as follows:

[2024-09-17 07:57:31,077] [INFO] [checkpointing.py:542:forward] Activation Checkpointing Information
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:543:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:544:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:546:forward] ----Synchronization False
[2024-09-17 07:57:31,078] [INFO] [checkpointing.py:547:forward] ----Profiling time in checkpointing False
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/jane/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
[rank3]:     main()
[rank3]:   File "/home/jane/optimum-habana/examples/language-modeling/run_clm.py", line 641, in main
[rank3]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 978, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1575, in training_step
[rank3]:     loss = self.compute_loss(model, inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3363, in compute_loss
[rank3]:     outputs = model(**inputs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1544, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1885, in forward
[rank3]:     loss = self.module(*inputs, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 789, in forward
[rank3]:     outputs = self.model(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 677, in forward
[rank3]:     layer_outputs = self._gradient_checkpointing_func(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 692, in hpu_deepspeed_checkpointing
[rank3]:     CheckpointFunction.apply(function, all_outputs, *checkpoint_args)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
[rank3]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 568, in forward
[rank3]:     outputs = run_function(*inputs_cuda)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 464, in forward
[rank3]:     hidden_states, self_attn_weights, present_key_value = self.pre_attn(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 515, in pre_attn
[rank3]:     hidden_states, attn_weights, present_key_value = self.self_attn.pre_attn_forward(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 401, in pre_attn_forward
[rank3]:     attn_output = self.o_proj(attn_output)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1574, in _call_impl
[rank3]:     args_result = hook(self, args)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank3]:     self.pre_sub_module_forward_function(module)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank3]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 290, in fetch_sub_module
[rank3]:     self.__all_gather_params(params_to_fetch, forward)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 434, in __all_gather_params
[rank3]:     self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 463, in __all_gather_params_
[rank3]:     handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1241, in all_gather_coalesced
[rank3]:     handles = _dist_allgather_fn(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
[rank3]:     return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
[rank3]:     return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank3]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 218, in all_gather_into_tensor
[rank3]:     return self.all_gather_function(output_tensor=output_tensor,
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2949, in all_gather_into_tensor
[rank3]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank3]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[Ranks 6, 1, 4, 5, 7, and 0 printed identical tracebacks, each ending in the same error: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. The rank 0 trace was truncated in the captured output.]
[rank0]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 290, in fetch_sub_module
[rank0]:     self.__all_gather_params(params_to_fetch, forward)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 434, in __all_gather_params
[rank0]:     self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 463, in __all_gather_params_
[rank0]:     handle = param_group[0].all_gather_coalesced(param_group, quantize=quantize)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1241, in all_gather_coalesced
[rank0]:     handles = _dist_allgather_fn(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 95, in _dist_allgather_fn
[rank0]:     return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 320, in allgather_fn
[rank0]:     return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 305, in all_gather_into_tensor
[rank0]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 218, in all_gather_into_tensor
[rank0]:     return self.all_gather_function(output_tensor=output_tensor,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2949, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
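The traceback bottoms out in ZeRO-3's pre-forward parameter fetch: the `all_gather_into_tensor` collective that re-materializes the sharded `o_proj` weight triggers an HPU graph compilation, which fails with synStatus 26. One way to check whether the collective itself compiles on HPU, independent of the trainer, is a standalone script like the sketch below. This is not from the report; the shard size (based on Qwen1.5-14B's hidden size of 5120) and the launch command are assumptions.

```python
# allgather_check.py -- hypothetical standalone check, not part of optimum-habana.
# Launch (assumption): python -um torch.distributed.run --nproc_per_node 8 allgather_check.py
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # noqa: F401  registers the "hccl" backend

def main():
    dist.init_process_group(backend="hccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = torch.device("hpu")

    # Shard sized like one ZeRO-3 partition of a 5120x5120 bf16 o_proj weight.
    numel = 5120 * 5120 // world_size
    shard = torch.full((numel,), float(rank), dtype=torch.bfloat16, device=device)
    gathered = torch.empty(numel * world_size, dtype=torch.bfloat16, device=device)

    # Same call the DeepSpeed stack ends in: async all-gather, then wait.
    work = dist.all_gather_into_tensor(gathered, shard, async_op=True)
    work.wait()
    print(f"rank {rank}: all_gather_into_tensor compiled and ran, numel={gathered.numel()}")

if __name__ == "__main__":
    main()
```

If this standalone collective compiles cleanly, the failure is more likely specific to the larger graph the trainer builds (for example, the interaction with activation checkpointing) than to the collective itself.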

Expected behavior

Full-parameter fine-tuning of Qwen1.5-14B should run successfully.
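A complementary check (again a sketch, not from the report; the model path is a placeholder) is a single bf16 forward pass on one HPU without DeepSpeed, to separate a Qwen2 graph-compile problem from a ZeRO-3 all-gather problem:

```python
# Hypothetical single-device sanity check, not part of the original report.
import torch
import habana_frameworks.torch.core as htcore  # enables HPU support in PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/Qwen1.5-14B"  # placeholder: point at the local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Graph compile sanity check", return_tensors="pt").to("hpu")
with torch.no_grad():
    out = model(**inputs)
htcore.mark_step()  # flush the lazy-mode graph so compilation actually happens
print(out.logits.shape)
```

If this forward pass compiles and runs, the graph-compile failure is specific to the distributed training path rather than to the model itself.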

regisss (Collaborator) commented Oct 21, 2024

I can reproduce it, cc @libinta

skaulintel (Collaborator) commented

@Zjq9409 Have you tried fine-tuning Qwen with the examples/trl scripts?
