Release v0.10.0: Fine-tune larger QLoRA models with DeepSpeed and FSDP, layer replication, enhance DoRA · huggingface/peft

Highlights

Support for QLoRA with DeepSpeed ZeRO3 and FSDP

We added a couple of changes to allow QLoRA to work with DeepSpeed ZeRO3 and Fully Sharded Data Parallel (FSDP). For instance, this allows you to fine-tune a 70B Llama model on two GPUs with 24GB memory each. Besides the latest version of PEFT, this requires bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11. Check out our docs on DeepSpeed and FSDP with PEFT, as well as this blogpost from answer.ai, for more details.
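To see why sharding the quantized weights makes the 70B example plausible, here is a rough back-of-the-envelope sketch. It counts only the base-model weights and assumes they are split evenly across GPUs. Quantization constants, LoRA parameters, optimizer state and activations are deliberately ignored, so real memory usage will be higher. The helper name `weight_gib_per_gpu` is made up for this illustration.

```python
def weight_gib_per_gpu(num_params: float, bits_per_param: int, num_gpus: int) -> float:
    """Approximate GiB of base-model weights per GPU when sharded evenly."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / num_gpus / 2**30


params_70b = 70e9  # nominal parameter count of a 70B model

# Weights kept in 16-bit precision, sharded over two GPUs.
fp16_per_gpu = weight_gib_per_gpu(params_70b, bits_per_param=16, num_gpus=2)

# Weights quantized to 4 bits (as in QLoRA), sharded over two GPUs.
int4_per_gpu = weight_gib_per_gpu(params_70b, bits_per_param=4, num_gpus=2)

print(f"16-bit weights: {fp16_per_gpu:.1f} GiB per GPU")
print(f"4-bit weights:  {int4_per_gpu:.1f} GiB per GPU")
```

Under these simplifying assumptions, only the 4-bit, sharded variant leaves headroom on a 24GB card, which is why both quantization and sharding are needed.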

Layer replication

First-time contributor @siddartha-RE added support for layer replication with LoRA. This allows you to duplicate layers of a model and apply LoRA adapters to them. Since the base weights are shared, this costs very little extra memory, but it can noticeably improve model performance. Find out more in our docs.
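As a rough illustration of how a replication spec maps to a new layer stack, the sketch below assumes the spec is a list of half-open `[start, end)` ranges over the original layers, which are concatenated in order. The helper `expand_layer_replication` is hypothetical and not part of PEFT. It only shows the index arithmetic, not how the weights are shared.

```python
def expand_layer_replication(layer_replication):
    """Return the original-layer index used at each position of the new stack.

    Each (start, end) pair is treated as a half-open range [start, end).
    """
    order = []
    for start, end in layer_replication:
        order.extend(range(start, end))
    return order


# A 5-layer model (layers 0..4) expanded to 7 layers by repeating layers 2 and 3.
layer_map = expand_layer_replication([(0, 4), (2, 5)])
print(layer_map)
```

Positions that point to the same original layer share its base weights, while each position can still receive its own LoRA adapter.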

Improving DoRA

Last release, we added the option to enable DoRA in PEFT by simply adding use_dora=True to your LoraConfig. However, this only worked for non-quantized linear layers. With this PEFT release, we now also support Conv2d layers, as well as linear layers quantized with bitsandbytes.
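For readers unfamiliar with DoRA, the following is a minimal pure-Python sketch of the idea behind it: the updated weight is split into a learned magnitude and a normalized direction. It assumes the norm is taken per output row of the weight matrix and uses plain lists instead of tensors. The function names are made up for this illustration and do not reflect PEFT internals.

```python
import math


def row_norms(matrix):
    """L2 norm of each row."""
    return [math.sqrt(sum(x * x for x in row)) for row in matrix]


def dora_weight(base, delta, magnitude):
    """Rescale each row of (base + delta) so its norm equals the given magnitude."""
    combined = [[b + d for b, d in zip(brow, drow)] for brow, drow in zip(base, delta)]
    norms = row_norms(combined)
    return [[m * x / n for x in row] for row, m, n in zip(combined, magnitude, norms)]


base = [[3.0, 4.0], [0.0, 2.0]]
magnitude = row_norms(base)  # magnitude starts out as the norm of the base weight

# At initialization the low-rank update is zero, so the weight is unchanged.
zero_delta = [[0.0, 0.0], [0.0, 0.0]]
merged_at_init = dora_weight(base, zero_delta, magnitude)

# A non-zero update changes the direction; the row norm stays at the magnitude.
delta = [[0.0, 0.0], [2.0, 0.0]]
merged_after_update = dora_weight(base, delta, magnitude)
```

In this sketch, the low-rank update only steers the direction of each row, while the magnitude is a separate trainable vector.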

Mixed LoRA adapter batches

If you have a PEFT model with multiple LoRA adapters attached to it, it's now possible to apply different adapters (or, in fact, no adapter) on different samples in the same batch. To do this, pass a list of adapter names as an additional argument. For example, if you have a batch of three samples:

output = model(**inputs, adapter_names=["adapter1", "adapter2", "__base__"])

Here, "adapter1" and "adapter2" must match the names of your LoRA adapters, and "__base__" is a special name that refers to the base model without any adapter. Find more details in our docs.

Without this feature, running inference with different LoRA adapters meant either processing samples one at a time or grouping samples by adapter and switching between adapters with set_adapter, which is inefficient and inconvenient. We therefore recommend the new, faster method for this scenario.
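To make the cost of the manual workaround concrete, here is a small plain-Python sketch of the bookkeeping it requires: group sample indices by adapter, run one sub-batch per adapter, then restore the original order. It does not use PEFT. `run_with_adapter` is a hypothetical callback standing in for "activate the adapter (or disable adapters for the base model), then run the model on the sub-batch".

```python
from collections import defaultdict


def group_by_adapter(adapter_names):
    """Map each adapter name to the indices of the samples that use it."""
    groups = defaultdict(list)
    for index, name in enumerate(adapter_names):
        groups[name].append(index)
    return dict(groups)


def run_grouped(samples, adapter_names, run_with_adapter):
    """Run one sub-batch per adapter, then put results back in input order."""
    outputs = [None] * len(samples)
    for name, indices in group_by_adapter(adapter_names).items():
        sub_batch = [samples[i] for i in indices]
        for i, result in zip(indices, run_with_adapter(name, sub_batch)):
            outputs[i] = result
    return outputs


def fake_run(name, sub_batch):
    # Hypothetical stand-in for switching adapters and calling the model.
    return [f"{name}:{sample}" for sample in sub_batch]


names = ["adapter1", "adapter2", "adapter1", "__base__"]
results = run_grouped(["a", "b", "c", "d"], names, fake_run)
print(results)
```

With adapter_names, all of this collapses into a single forward call on the full batch.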

New LoftQ initialization function

We added an alternative way to initialize LoRA weights for a quantized model using the LoftQ method, which can be more convenient than the existing method. Right now, using LoftQ requires you to go through multiple steps as shown here. Furthermore, it's necessary to keep a separate copy of the quantized weights, as those are not identical to the quantized weights from the default model.

Using the new replace_lora_weights_loftq function, it's now possible to apply LoftQ initialization in a single step and without the need for extra copies of the weights. Check out the docs and this example notebook to see how it works. Right now, this method only supports 4bit quantization with bitsandbytes, and the model has to be stored in the safetensors format.

Deprecations

The function prepare_model_for_int8_training had been deprecated for quite some time and has now been removed completely. Use prepare_model_for_kbit_training instead.

What's Changed

Besides these highlights, we added many small improvements and fixed a couple of bugs. All these changes are listed below. As always, we thank all the awesome contributors who helped us improve PEFT.

Bump version to 0.9.1.dev0 by @BenjaminBossan in #1517
Fix for "leaf Variable that requires grad" Error in In-Place Operation by @DopeorNope-Lee in #1372
FIX [CI / Docker] Follow up from #1481 by @younesbelkada in #1487
CI: temporary disable workflow by @younesbelkada in #1534
FIX [Docs/ bnb / DeepSpeed] Add clarification on bnb + PEFT + DS compatibilities by @younesbelkada in #1529
Expose bias attribute on tuner layers by @BenjaminBossan in #1530
docs: highlight difference between num_parameters() and get_nb_trainable_parameters() in PEFT by @kmehant in #1531
fix: fail when required args not passed when prompt_tuning_init==TEXT by @kmehant in #1519
Fixed minor grammatical and code bugs by @gremlin97 in #1542
Optimize levenshtein_distance algorithm in peft_lora_seq2seq_accelera… by @SUNGOD3 in #1527
Update prompt_based_methods.md by @insist93 in #1548
FIX Allow AdaLoRA rank to be 0 by @BenjaminBossan in #1540
FIX: Make adaptation prompt CI happy for transformers 4.39.0 by @younesbelkada in #1551
MNT: Use BitsAndBytesConfig as load_in_* is deprecated by @BenjaminBossan in #1552
Add Support for Mistral Model in Llama-Adapter Method by @PrakharSaxena24 in #1433
Add support for layer replication in LoRA by @siddartha-RE in #1368
QDoRA: Support DoRA with BnB quantization by @BenjaminBossan in #1518
Feat: add support for Conv2D DoRA by @sayakpaul in #1516
TST Report slowest tests by @BenjaminBossan in #1556
Changes to support fsdp+qlora and dsz3+qlora by @pacman100 in #1550
Update style with ruff 0.2.2 by @BenjaminBossan in #1565
FEAT Mixing different LoRA adapters in same batch by @BenjaminBossan in #1558
FIX [CI] Fix test docker CI by @younesbelkada in #1535
Fix LoftQ docs and tests by @BenjaminBossan in #1532
More convenient way to initialize LoftQ by @BenjaminBossan in #1543

New Contributors

@DopeorNope-Lee made their first contribution in #1372
@kmehant made their first contribution in #1531
@gremlin97 made their first contribution in #1542
@SUNGOD3 made their first contribution in #1527
@insist93 made their first contribution in #1548
@PrakharSaxena24 made their first contribution in #1433
@siddartha-RE made their first contribution in #1368

Full Changelog: v0.9.0...v0.10.0
