
PaliGemma2 peft + accelerate evaluation during training fails with TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to _pad_across_processes #3277

beniz opened this issue Dec 7, 2024 · 3 comments


beniz commented Dec 7, 2024

System Info

Accelerate 1.2.0, peft 0.14.0, transformers 4.47.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Traceback (most recent call last):
  File "/data1/beniz/code/llmbox/multimodal/ft_paligemma2.py", line 213, in <module>
    trainer.train(resume_from_checkpoint=(args.resume > 0))
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
    return inner_training_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2589, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3047, in _maybe_log_save_evaluate
    metrics = self._evaluate(trial, ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3001, in _evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
    output = eval_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4267, in evaluation_loop
    logits = self.accelerator.pad_across_processes(logits, dim=1, pad_index=-100)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2602, in pad_across_processes
    return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 412, in wrapper
    return function(*args, **kwargs)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 682, in pad_across_processes
    return recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 108, in recursively_apply
    return honor_type(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 111, in <genexpr>
    recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
    raise TypeError(
TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.

This is the full trace. I have yet to work out how to provide code for easy reproduction. The script in use is https://github.com/beniz/llmbox/blob/main/multimodal/ft_paligemma.py.

My goal here is to gather feedback on whether this could be a bug in accelerate, since the error seems to occur inside the pad_across_processes function, before my own code is called.

Happy to help dig further.
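
In the meantime, one workaround that might be worth trying (untested here; it assumes the cache is returned under the usual past_key_values output key, which is what the Trainer's ignore_keys_for_eval filter operates on) is to drop that key from the evaluation outputs before they reach accelerate:

# Workaround sketch, not a confirmed fix: keep the HybridCache out of the
# outputs that Trainer pads across processes during evaluation.
trainer.train(
    resume_from_checkpoint=(args.resume > 0),
    ignore_keys_for_eval=["past_key_values"],
)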

Expected behavior

With PaliGemma, the same script works fine: both training and evaluation steps run OK.
With PaliGemma2, training steps run OK, but evaluation fails with the error above.


scris commented Dec 15, 2024

I ran into a similar issue while migrating Mistral code to Gemma 2, and it also happens only during evaluation.
Looking forward to any updates.

BenjaminBossan (Member) commented

@beniz It would be great if you could provide a complete reproducer. The script you linked seems to rely on a local dataset. Can this be substituted with a publicly available dataset?

Otherwise, it may help if you could start a debugger and report back the arguments that are passed to pad_across_processes in this part of the code:

pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
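
For example, a minimal sketch (the wrapper below is purely illustrative and not part of either library's API) that prints the structure of whatever the Trainer hands to accelerate during evaluation:

# Illustrative debugging helper: wrap accelerator.pad_across_processes, print the
# types/shapes of the nested structure it receives, then call the original.
import torch

_original_pad = trainer.accelerator.pad_across_processes

def _logged_pad(tensor, dim=0, pad_index=0, pad_first=False):
    def describe(obj):
        if isinstance(obj, (list, tuple)):
            return [describe(o) for o in obj]
        if isinstance(obj, dict):
            return {k: describe(v) for k, v in obj.items()}
        return tuple(obj.shape) if torch.is_tensor(obj) else type(obj).__name__
    print("pad_across_processes received:", describe(tensor))
    return _original_pad(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)

trainer.accelerator.pad_across_processes = _logged_pad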


beniz commented Dec 18, 2024

Hi @BenjaminBossan, thanks for your answer.

Please find a script to easily reproduce the problem: https://github.com/beniz/llmbox/blob/debug_accelerate_paligemma2/multimodal/ft_paligemma2_debug.py

This is a very simple task that classifies images of dogs and cats by fine-tuning PaliGemma2.

To reproduce the issue:

python3 ft_paligemma2_debug.py --model-id google/paligemma2-3b-pt-448 --batch-size 1 --iter-size 8 --save-steps 100 --eval-steps 2 --output-dir test_cats_dogs --nepochs 3 --min-img-size 448

The error occurs at evaluation.

If you roll back to v1 of PaliGemma, the same script works fine:

python3 ft_paligemma2_debug.py --model-id google/paligemma-3b-pt-448 --batch-size 1 --iter-size 8 --save-steps 100 --eval-steps 2 --output-dir test_cats_dogs --nepochs 3 --min-img-size 448

Happy to help with this. I've looked at the HybridCache code, and within accelerate as well, but I could not really find a starting point for debugging.
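
One thing that might be worth checking first (an assumption based on the traceback, not a confirmed fix: it supposes the cache only shows up in the outputs when use_cache is enabled, and that PaliGemma2 exposes a text_config) is whether disabling the cache keeps the HybridCache out of the evaluation outputs:

# Illustrative check: if the HybridCache in the padded tuple comes from the model
# returning past_key_values, turning the cache off should let evaluation run.
# The attribute names below are assumptions about the PaliGemma2 config layout.
model.config.use_cache = False
if hasattr(model.config, "text_config"):
    model.config.text_config.use_cache = False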
