PaliGemma2 peft + accelerate evaluation during training fails with TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to _pad_across_processes (#3277)
Reproduction
Traceback (most recent call last):
  File "/data1/beniz/code/llmbox/multimodal/ft_paligemma2.py", line 213, in <module>
    trainer.train(resume_from_checkpoint=(args.resume > 0))
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
    return inner_training_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2589, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3047, in _maybe_log_save_evaluate
    metrics = self._evaluate(trial, ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3001, in _evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
    output = eval_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4267, in evaluation_loop
    logits = self.accelerator.pad_across_processes(logits, dim=1, pad_index=-100)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2602, in pad_across_processes
    return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 412, in wrapper
    return function(*args, **kwargs)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 682, in pad_across_processes
    return recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 108, in recursively_apply
    return honor_type(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 111, in <genexpr>
    recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
    raise TypeError(
TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
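To make the failure mode in the trace easier to reason about, here is a minimal sketch of the kind of recursion the error message describes: nested lists, tuples, and dicts are walked, and any leaf that is not a tensor raises `TypeError`. This is a simplified stand-in, not accelerate's actual code; `FakeTensor`, `FakeCache`, `recursively_apply_sketch`, and `_pad` are hypothetical names chosen so the example runs without torch or transformers.

```python
# Simplified stand-in for the recursion described by the error message.
# This is NOT accelerate's implementation; all names here are hypothetical.


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without torch."""


class FakeCache:
    """Stand-in for transformers.cache_utils.HybridCache."""


def recursively_apply_sketch(func, data, is_leaf):
    """Apply func to every supported leaf of a nested list/tuple/dict."""
    if isinstance(data, (list, tuple)):
        # Rebuild the same container type from the processed items.
        return type(data)(
            recursively_apply_sketch(func, item, is_leaf) for item in data
        )
    if isinstance(data, dict):
        return {
            key: recursively_apply_sketch(func, value, is_leaf)
            for key, value in data.items()
        }
    if is_leaf(data):
        return func(data)
    # Anything else (for example a cache object) is rejected.
    raise TypeError(f"Unsupported types ({type(data)}) passed to `{func.__name__}`.")


def _pad(tensor):
    """Placeholder for the real padding step; returns its input unchanged."""
    return tensor


def is_tensor(obj):
    return isinstance(obj, FakeTensor)


# A structure made only of tensors is processed without error.
ok = recursively_apply_sketch(_pad, (FakeTensor(), [FakeTensor()]), is_tensor)

# A cache object sitting next to the logits triggers the TypeError.
try:
    recursively_apply_sketch(_pad, (FakeTensor(), FakeCache()), is_tensor)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Under this reading, the error would mean that the `logits` value handed to `pad_across_processes` is a tuple that still contains the model's cache object alongside the tensors, which is consistent with the `honor_type` and `<genexpr>` frames in the trace.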
Now, my goal here is to gather feedback on whether this could be a bug in accelerate. I ask because the error seems to occur before my own code is called, in the pad_across_processes function.
Happy to help dig further.
Expected behavior
With PaliGemma, the same script works fine: training steps are OK, evaluation steps are OK.
With PaliGemma2, training steps are OK, evaluation fails with the error above.
@beniz It would be great if you could provide a complete reproducer. The script you linked seems to rely on a local dataset. Can this be substituted with a publicly available dataset?
Otherwise, it may help if you could start a debugger and report back what the arguments are that are passed to pad_across_processes in this part of the code:
Happy to help with this. I've searched for HybridCache, including within accelerate, but I could not locate even a starting point from which to debug.
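One possible starting point, offered as a sketch rather than a confirmed diagnosis: a small helper that walks a nested structure and reports where the unsupported leaves are. The helper name `find_unsupported_leaves`, its `is_supported` parameter, and the sample data are all hypothetical; the sample uses plain Python objects so it runs without torch.

```python
def find_unsupported_leaves(obj, is_supported, path="logits"):
    """Return (path, type name) pairs for every leaf that is_supported rejects."""
    if isinstance(obj, (list, tuple)):
        found = []
        for index, item in enumerate(obj):
            found.extend(
                find_unsupported_leaves(item, is_supported, f"{path}[{index}]")
            )
        return found
    if isinstance(obj, dict):
        found = []
        for key, value in obj.items():
            found.extend(
                find_unsupported_leaves(value, is_supported, f"{path}[{key!r}]")
            )
        return found
    if is_supported(obj):
        return []
    return [(path, type(obj).__name__)]


# Illustration with stand-ins: floats play the role of tensors,
# and a bare object() plays the role of the cache.
sample = (1.0, {"cache": object()})
report = find_unsupported_leaves(sample, lambda x: isinstance(x, float))
```

In a real run, the idea would be to set a breakpoint just before the `pad_across_processes` call shown in the trace (`trainer.py`, line 4267 in this install) and call the helper on `logits` with `torch.is_tensor` as `is_supported`. If the cache turns out to come from an output key such as `past_key_values` (an assumption, not verified here), passing `ignore_keys_for_eval=["past_key_values"]` to `trainer.train` may be worth trying, since that parameter is forwarded to evaluation in the trace above.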
The traceback above is the full trace. I have not yet found a way to provide code for easy reproduction. The script in use is https://github.com/beniz/llmbox/blob/main/multimodal/ft_paligemma.py.