
TypeError when inference with different LoRA adapters in the same batch #2283

Open · yuxiang-guo opened this issue Dec 15, 2024 · 8 comments

Comments

@yuxiang-guo commented Dec 15, 2024

System Info

transformers 4.41.0
peft 0.13.2

Who can help?

@BenjaminBossan
I tried to apply [Inference with different LoRA adapters in the same batch] to an encoder-decoder T5 model.
Specifically, I load the base model, a first LoRA adapter, and a second LoRA adapter, and run inference with all three in the same batch. However, an error occurred.

BTW, does [inference with different LoRA adapters in the same batch] support beam search when using generate()?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Code:

from transformers import MT5ForConditionalGeneration
from peft import PeftModel

base_model = MT5ForConditionalGeneration.from_pretrained(base_model_path, cache_dir='cache')
peft_model = PeftModel.from_pretrained(base_model, <lora_path1>, adapter_name="l1")
peft_model.load_adapter(<lora_path2>, adapter_name="l2")
adapter_names = ["__base__", "l1", "l2"]
output = peft_model.generate(
    input_ids=inputs['input_ids'],
    adapter_names=adapter_names,
    max_length=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    early_stopping=True,
)

The error message:

Traceback (most recent call last):
  File "/home/user/user1/GR/trainer.py", line 1025, in prediction_step
    doc_ids = model.generate(
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/peft_model.py", line 1972, in generate
    with self._enable_peft_forward_hooks(**kwargs):
  File "/home/user/anaconda3/envs/test/lib/python3.8/contextlib.py", line 113, in __enter__
    python-BaseException
return next(self.gen)
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/peft_model.py", line 798, in _enable_peft_forward_hooks
    with self.base_model._enable_peft_forward_hooks(*args, **kwargs):
  File "/home/user/anaconda3/envs/test/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/user/anaconda3/envs/test/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 441, in _enable_peft_forward_hooks
    handle = module.register_forward_pre_hook(pre_forward, with_kwargs=True)
TypeError: register_forward_pre_hook() got an unexpected keyword argument 'with_kwargs'

Expected behavior

I expect [inference with different LoRA adapters in the same batch] to support T5 models with LoRA adapters and to work with beam search during generation.

yuxiang-guo changed the title from "Inference with different LoRA adapters in the same batch has a TypeError: register_forward_pre_hook() got an unexpected keyword argument 'with_kwargs'" to "TypeError when inference with different LoRA adapters in the same batch" on Dec 15, 2024
@yuxiang-guo (Author)

The code is as follows:

base_model = MT5ForConditionalGeneration.from_pretrained(base_model_path, cache_dir='cache')
peft_model = PeftModel.from_pretrained(base_model, <lora_path1>, adapter_name="l1")
peft_model.load_adapter(<lora_path2>, adapter_name="l2")
adapter_names = ["__base__", "l1", "l2"]
output = peft_model.generate(
    input_ids=inputs['input_ids'],
    adapter_names=adapter_names,
    max_length=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    early_stopping=True,
)

@BenjaminBossan (Member)

What PyTorch version are you using?

@yuxiang-guo (Author)

1.13.1+cu117

@BenjaminBossan (Member)

That's the reason: your torch version is quite old and does not support this argument yet. Would it be possible for you to upgrade to a newer torch version?
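
A quick way to check the installed version (as far as I can tell, the with_kwargs argument for forward pre-hooks was added in PyTorch 2.0, so anything older will raise this TypeError):

import torch
# with_kwargs for register_forward_pre_hook needs a recent PyTorch (2.0+, to the best of my knowledge)
print(torch.__version__)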

@yuxiang-guo (Author)

Thanks for the suggestion. After upgrading to torch 2.4.1, the error during normal generation is resolved. However, when using beam search during inference, I run into a new ValueError. I suspect this is because num_beams is set to 20.

How can I adapt the model to handle beam search during inference?

Additionally, I’m curious whether the inference for different adapters in a batch runs serially or in parallel. Since different adapters share the same base model, they could, in theory, run in parallel. However, the current implementation seems to run serially, judging by the execution time.

Is optimizing for parallel execution feasible, and are there any plans to support this functionality in the future?

batch_beams = model.generate(
    input_ids=inputs['input_ids'].to(self.args.device),
    max_length=20,
    num_beams=20,
    prefix_allowed_tokens_fn=self.restrict_decode_vocab,
    adapter_names=adapter_names,
    num_return_sequences=20,
    early_stopping=True,
)

......
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/transformers/models/mt5/modeling_mt5.py", line 176, in forward
    hidden_gelu = self.act(self.wi_0(hidden_states))
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 560, in forward
    self._check_forward_args(x, *args, **kwargs)
  File "/home/user/anaconda3/envs/DSIQG_new/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 339, in _check_forward_args
    raise ValueError(msg)
ValueError: Length of adapter_names should be the same as the number of inputs, but got 3 and 60 respectively.

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this issue Dec 17, 2024
See huggingface#2283

Right now, using mixed adapter batches with beam search generations does
not work. This is because users need to pass the adapter names
associated with each sample, i.e. the number of adapter names should be
identical to the number of samples in the input.

When applying beam search, transformers internally repeats the samples
once per beam (or so it looks like). Therefore, we have more samples
during generation than samples in the input. Consequently, the adapter
names have to be extended accordingly. This is now taken care of.

Unfortunately, this does not work for encoder-decoder models yet. With
these models, there is always a size mismatch, whether adapter names are
extended or not. What I suspect is happening is that only the decoder
needs to be extended, but right now I don't see a way to implement this
distinction in PEFT. Therefore, encoder-decoder + beam search
generation is not supported for the time being.
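
A minimal sketch of the idea described in the commit message (an illustration only, not the actual PEFT implementation; the exact repetition pattern transformers uses internally is an assumption here):

adapter_names = ["__base__", "l1", "l2"]   # one adapter name per input sample
num_beams = 20
# beam search repeats every sample once per beam, so the adapter names need to
# be repeated the same way for the lengths to line up during generation
expanded_adapter_names = [name for name in adapter_names for _ in range(num_beams)]
assert len(expanded_adapter_names) == 60  # 3 samples * 20 beams, matching the error above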
@BenjaminBossan (Member)

@yuxiang-guo Thanks for reporting back. Indeed, beam search is currently not supported but I created a PR that should enable it: #2287. If you have time, you could try installing PEFT from that branch and see if it fixes your issue.

Is optimizing for parallel execution feasible, and are there any plans to support this functionality in the future?

You are right that PEFT does not parallelize here. This is somewhat out of scope for PEFT, as there are already many parallelization methods in the torch ecosystem (DDP, FSDP, DeepSpeed, etc.). If we added our own parallelization into PEFT, it would most likely interfere with these existing methods and hamper performance.

@yuxiang-guo (Author)

@BenjaminBossan Many thanks! I will try it.

This is somewhat out of scope for PEFT, as there are already many parallelization methods in the torch ecosystem (DDP, FSDP, DeepSpeed, etc.). If we added our own parallelization into PEFT, it would most likely interfere with these existing methods and hamper performance.

In the current implementation, are different adapters loaded into the base model and then unloaded to perform serial inference within a batch? I’m wondering whether it’s possible to run inference with multiple adapters in parallel, where the output of each adapter is added to the base model’s output. That way, inference could be conducted without merging the LoRA adapters into the base model, and the inference time would not grow linearly with the number of adapters in a batch.

@BenjaminBossan (Member)

I’m wondering if it’s possible to use multiple adapters to perform inference in parallel, where the outputs of each adapter are then added to the base model’s output.

The best way to achieve that would be to merge those adapters into the base model using the model.merge_adapter(<adapter-names>) method. If you want to always use a specific combination of LoRA adapters, you can also create a new one that is a merge of those adapters using model.add_weighted_adapter.
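
For illustration, a minimal sketch of the second option using PEFT's add_weighted_adapter (the adapter names and equal weights are placeholders based on this issue's setup, not a recommendation):

peft_model.add_weighted_adapter(
    adapters=["l1", "l2"],
    weights=[0.5, 0.5],           # illustrative weights
    adapter_name="l1_l2_merged",  # hypothetical name for the combined adapter
    combination_type="linear",
)
peft_model.set_adapter("l1_l2_merged")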

That being said, inference could be conducted without merging the LoRA adapters into the base model. In this way, the inference time won't increase linearly to the number of adapters within a batch.

I don't understand this part. If you don't merge the weights, it means that there is always a LoRA overhead.
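
For reference, a rough sketch of why an unmerged LoRA layer always adds compute on every forward call (names and shapes are illustrative, not PEFT internals):

import torch

def lora_linear_forward(x, base_weight, lora_A, lora_B, scaling):
    # frozen base projection: with merged weights this would be the only matmul
    base_out = x @ base_weight.T
    # unmerged LoRA adds two extra low-rank matmuls per call
    lora_out = (x @ lora_A.T) @ lora_B.T
    return base_out + scaling * lora_out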
