
PaliGemma2 peft + accelerate evaluation during training fails with TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to _pad_across_processes #3277

beniz opened this issue Dec 7, 2024 · 3 comments


beniz commented Dec 7, 2024

System Info

Accelerate 1.2.0, peft 0.14.0, transformers 4.47.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Traceback (most recent call last):
  File "/data1/beniz/code/llmbox/multimodal/ft_paligemma2.py", line 213, in <module>
    trainer.train(resume_from_checkpoint=(args.resume > 0))
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2164, in train
    return inner_training_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2589, in _inner_training_loop
    self._maybe_log_save_evaluate(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3047, in _maybe_log_save_evaluate
    metrics = self._evaluate(trial, ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3001, in _evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4051, in evaluate
    output = eval_loop(
  File "/home/beniz/.local/lib/python3.10/site-packages/transformers/trainer.py", line 4267, in evaluation_loop
    logits = self.accelerator.pad_across_processes(logits, dim=1, pad_index=-100)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2602, in pad_across_processes
    return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 412, in wrapper
    return function(*args, **kwargs)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 682, in pad_across_processes
    return recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 108, in recursively_apply
    return honor_type(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 111, in <genexpr>
    recursively_apply(
  File "/home/beniz/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
    raise TypeError(
TypeError: Unsupported types (<class 'transformers.cache_utils.HybridCache'>) passed to `_pad_across_processes`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.

This is the full trace. I have yet to work out how to provide code for easy reproduction. The script in use is https://github.com/beniz/llmbox/blob/main/multimodal/ft_paligemma.py.

My goal here is to gather feedback on whether this could be a bug in accelerate, since the error seems to occur inside the pad_across_processes function, before my own code is called.

Happy to help dig further.
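
In the meantime, one workaround that might be worth trying (untested here; it assumes the cache is returned under the usual past_key_values output key, which is what the Trainer's ignore_keys_for_eval filter operates on) is to drop that key from the evaluation outputs before they reach accelerate:

# Workaround sketch, not a confirmed fix: keep the HybridCache out of the
# outputs that Trainer pads across processes during evaluation.
trainer.train(
    resume_from_checkpoint=(args.resume > 0),
    ignore_keys_for_eval=["past_key_values"],
)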

Expected behavior

With PaliGemma, the same script works fine: both training and evaluation steps run OK.
With PaliGemma2, training steps run OK, but evaluation fails with the error above.


scris commented Dec 15, 2024

I ran into a similar issue while migrating Mistral code to Gemma 2, and it also happens only during evaluation.
Looking forward to any updates.

BenjaminBossan (Member) commented

@beniz It would be great if you could provide a complete reproducer. The script you linked seems to rely on a local dataset. Can this be substituted with a publicly available dataset?

Otherwise, it may help if you could start a debugger and report back the arguments that are passed to pad_across_processes in this part of the code:

pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
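
For example, a minimal sketch (the wrapper below is purely illustrative and not part of either library's API) that prints the structure of whatever the Trainer hands to accelerate during evaluation:

# Illustrative debugging helper: wrap accelerator.pad_across_processes, print the
# types/shapes of the nested structure it receives, then call the original.
import torch

_original_pad = trainer.accelerator.pad_across_processes

def _logged_pad(tensor, dim=0, pad_index=0, pad_first=False):
    def describe(obj):
        if isinstance(obj, (list, tuple)):
            return [describe(o) for o in obj]
        if isinstance(obj, dict):
            return {k: describe(v) for k, v in obj.items()}
        return tuple(obj.shape) if torch.is_tensor(obj) else type(obj).__name__
    print("pad_across_processes received:", describe(tensor))
    return _original_pad(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)

trainer.accelerator.pad_across_processes = _logged_pad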


beniz commented Dec 18, 2024

Hi @BenjaminBossan, thanks for your answer.

Please find a script to easily reproduce the problem: https://github.com/beniz/llmbox/blob/debug_accelerate_paligemma2/multimodal/ft_paligemma2_debug.py

This is a very simple task that classifies images of dogs and cats by fine-tuning PaliGemma2.

To reproduce the issue:

python3 ft_paligemma2_debug.py --model-id google/paligemma2-3b-pt-448 --batch-size 1 --iter-size 8 --save-steps 100 --eval-steps 2 --output-dir test_cats_dogs --nepochs 3 --min-img-size 448

The error occurs at evaluation.

If you roll back to v1 of PaliGemma, the same script works fine:

python3 ft_paligemma2_debug.py --model-id google/paligemma-3b-pt-448 --batch-size 1 --iter-size 8 --save-steps 100 --eval-steps 2 --output-dir test_cats_dogs --nepochs 3 --min-img-size 448

Happy to help with this. I've looked at the HybridCache code, and within accelerate as well, but I could not really find a starting point for debugging.
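
One thing that might be worth checking first (an assumption based on the traceback, not a confirmed fix: it supposes the cache only shows up in the outputs when use_cache is enabled, and that PaliGemma2 exposes a text_config) is whether disabling the cache keeps the HybridCache out of the evaluation outputs:

# Illustrative check: if the HybridCache in the padded tuple comes from the model
# returning past_key_values, turning the cache off should let evaluation run.
# The attribute names below are assumptions about the PaliGemma2 config layout.
model.config.use_cache = False
if hasattr(model.config, "text_config"):
    model.config.text_config.use_cache = False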
