
TypeError when inference with different LoRA adapters in the same batch #2283

Open · yuxiang-guo opened this issue Dec 15, 2024 · 8 comments

Comments

@yuxiang-guo commented Dec 15, 2024

System Info

transformers 4.41.0
peft 0.13.2

Who can help?

@BenjaminBossan
I tried to apply [Inference with different LoRA adapters in the same batch] to an encoder-decoder T5 model.
Specifically, I load the base model, a first LoRA adapter, and a second LoRA adapter, and run inference with all three in the same batch. However, an error occurred.

BTW, does [inference with different LoRA adapters in the same batch] support beam search when using generate()?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Code:

from transformers import MT5ForConditionalGeneration
from peft import PeftModel

base_model = MT5ForConditionalGeneration.from_pretrained(base_model_path, cache_dir='cache')
peft_model = PeftModel.from_pretrained(base_model, <lora_path1>, adapter_name="l1")
peft_model.load_adapter(<lora_path2>, adapter_name="l2")
adapter_names = ["__base__", "l1", "l2"]
output = peft_model.generate(
    input_ids=inputs['input_ids'],
    adapter_names=adapter_names,
    max_length=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    early_stopping=True,
)

The error message:

Traceback (most recent call last):
  File "/home/user/user1/GR/trainer.py", line 1025, in prediction_step
    doc_ids = model.generate(
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/peft_model.py", line 1972, in generate
    with self._enable_peft_forward_hooks(**kwargs):
  File "/home/user/anaconda3/envs/test/lib/python3.8/contextlib.py", line 113, in __enter__
    python-BaseException
return next(self.gen)
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/peft_model.py", line 798, in _enable_peft_forward_hooks
    with self.base_model._enable_peft_forward_hooks(*args, **kwargs):
  File "/home/user/anaconda3/envs/test/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 441, in _enable_peft_forward_hooks
    handle = module.register_forward_pre_hook(pre_forward, with_kwargs=True)
TypeError: register_forward_pre_hook() got an unexpected keyword argument 'with_kwargs'

Expected behavior

I expect [inference with different LoRA adapters in the same batch] to support T5 models with LoRA adapters and to work with beam search during generation.

yuxiang-guo changed the title from "Inference with different LoRA adapters in the same batch has a TypeError: register_forward_pre_hook() got an unexpected keyword argument 'with_kwargs'" to "TypeError when inference with different LoRA adapters in the same batch" on Dec 15, 2024
@yuxiang-guo (Author)

The code is as follows:

base_model = MT5ForConditionalGeneration.from_pretrained(base_model_path, cache_dir='cache')
peft_model = PeftModel.from_pretrained(base_model, <lora_path1>, adapter_name="l1")
peft_model.load_adapter(<lora_path2>, adapter_name="l2")
adapter_names = ["__base__", "l1", "l2"]
output = peft_model.generate(
    input_ids=inputs['input_ids'],
    adapter_names=adapter_names,
    max_length=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    early_stopping=True,
)

@BenjaminBossan (Member)

What PyTorch version are you using?

@yuxiang-guo (Author)

1.13.1+cu117

@BenjaminBossan (Member)

That's the reason: your torch version is quite old and does not support this argument yet. Would it be possible for you to upgrade to a newer torch version?
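
A quick way to check the installed version (as far as I can tell, the with_kwargs argument for forward pre-hooks was added in PyTorch 2.0, so anything older will raise this TypeError):

import torch
# with_kwargs for register_forward_pre_hook needs a recent PyTorch (2.0+, to the best of my knowledge)
print(torch.__version__)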

@yuxiang-guo (Author)

Thanks for the suggestion. After upgrading to torch 2.4.1, the error during normal generation is resolved. However, when using beam search during inference, I run into a new ValueError. I suspect this is because num_beams is set to 20.

How can I adapt the model to handle beam search during inference?

Additionally, I’m curious whether the inference for different adapters in a batch runs serially or in parallel. Since different adapters share the same base model, they could, in theory, run in parallel. However, the current implementation seems to run serially, judging by the execution time.

Is optimizing for parallel execution feasible, and are there any plans to support this functionality in the future?

batch_beams = model.generate(
    input_ids=inputs['input_ids'].to(self.args.device),
    max_length=20,
    num_beams=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    adapter_names=adapter_names,
    num_return_sequences=20,
    early_stopping=True,
)

......
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/transformers/models/mt5/modeling_mt5.py", line 176, in forward
    hidden_gelu = self.act(self.wi_0(hidden_states))
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 560, in forward
    self._check_forward_args(x, *args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 339, in _check_forward_args
    raise ValueError(msg)
ValueError: Length of adapter_names should be the same as the number of inputs, but got 3 and 60 respectively.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this issue Dec 17, 2024
See huggingface#2283

Right now, using mixed adapter batches with beam search generations does
not work. This is because users need to pass the adapter names
associated with each sample, i.e. the number of adapter names should be
identical to the number of samples in the input.

When applying beam search, transformers internally repeats the samples
once per beam (or so it looks like). Therefore, we have more samples
during generation than samples in the input. Consequently, the adapter
names have to be extended accordingly. This is now taken care of.

Unfortunately, this does not work for encoder-decoder models yet. With
these models, there is always a size mismatch, whether adapter names are
extended or not. What I suspect is happening is that only the decoder
needs to be extended, but right now I don't see a way to implement this
distinction in PEFT. Therefore, encoder-decoder + beam search
generation is not supported for the time being.
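
A minimal sketch of the idea described in the commit message (an illustration only, not the actual PEFT implementation; the exact repetition pattern transformers uses internally is an assumption here):

adapter_names = ["__base__", "l1", "l2"]   # one adapter name per input sample
num_beams = 20
# beam search repeats every sample once per beam, so the adapter names need to
# be repeated the same way for the lengths to line up during generation
expanded_adapter_names = [name for name in adapter_names for _ in range(num_beams)]
assert len(expanded_adapter_names) == 60  # 3 samples * 20 beams, matching the error above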
@BenjaminBossan (Member)

@yuxiang-guo Thanks for reporting back. Indeed, beam search is currently not supported but I created a PR that should enable it: #2287. If you have time, you could try installing PEFT from that branch and see if it fixes your issue.

Is optimizing for parallel execution feasible, and are there any plans to support this functionality in the future?

You are right that PEFT does not parallelize here. This is somewhat out of scope for PEFT, as there are already many parallelization methods in the torch ecosystem (DDP, FSDP, DeepSpeed, etc.). If we added our own parallelization into PEFT, it would most likely interfere with these existing methods and hamper performance.

@yuxiang-guo (Author)

@BenjaminBossan Many thanks! I will try it.

This is somewhat out of scope for PEFT, as there are already many parallelization methods in the torch ecosystem (DDP, FSDP, DeepSpeed, etc.). If we added our own parallelization into PEFT, it would most likely interfere with these existing methods and hamper performance.

In the current implementation, are different adapters loaded into the base model and then unloaded to perform serial inference within a batch? I’m wondering whether it’s possible to run inference with multiple adapters in parallel, where the output of each adapter is added to the base model’s output. That way, inference could be conducted without merging the LoRA adapters into the base model, and the inference time would not grow linearly with the number of adapters in a batch.

@BenjaminBossan (Member)

I’m wondering if it’s possible to use multiple adapters to perform inference in parallel, where the outputs of each adapter are then added to the base model’s output.

The best way to achieve that would be to merge those adapters into the base model using the model.merge_adapter(<adapter-names>) method. If you want to always use a specific combination of LoRA adapters, you can also create a new one that is a merge of those adapters using model.add_weighted_adapter.
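
For illustration, a minimal sketch of the second option using PEFT's add_weighted_adapter (the adapter names and equal weights are placeholders based on this issue's setup, not a recommendation):

peft_model.add_weighted_adapter(
    adapters=["l1", "l2"],
    weights=[0.5, 0.5],           # illustrative weights
    adapter_name="l1_l2_merged",  # hypothetical name for the combined adapter
    combination_type="linear",
)
peft_model.set_adapter("l1_l2_merged")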

That being said, inference could be conducted without merging the LoRA adapters into the base model. In this way, the inference time won't increase linearly to the number of adapters within a batch.

I don't understand this part. If you don't merge the weights, it means that there is always a LoRA overhead.
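
For reference, a rough sketch of why an unmerged LoRA layer always adds compute on every forward call (names and shapes are illustrative, not PEFT internals):

import torch

def lora_linear_forward(x, base_weight, lora_A, lora_B, scaling):
    # frozen base projection: with merged weights this would be the only matmul
    base_out = x @ base_weight.T
    # unmerged LoRA adds two extra low-rank matmuls per call
    lora_out = (x @ lora_A.T) @ lora_B.T
    return base_out + scaling * lora_out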
