PEFT LoRA training on multiple datasets in parallel #1061
Replies: 1 comment
-
To make sure: you want to train the different LoRA adapters all independently of each other, so not LoRA 5 on top of LoRA 4 on top of ... (incremental training)? If that is so, it would seem to me that the "naive" approach 2 should be both the easiest (less custom code, less debugging) and the most efficient. I would only go for approach 1 if either (1) the model doesn't fit on one GPU or (2) you want some kind of incremental training. However, there is no perfect answer that always applies, as it can depend on so many factors. I would suggest following the general advice for training neural nets; I don't think the fact that LoRA is being used here changes the general wisdom.
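A minimal sketch of that naive route, assuming each LoRA run fits on a single A100: launch N completely independent training processes, each pinned to its own GPU via `CUDA_VISIBLE_DEVICES`, so no changes to `Trainer` are needed. The script name `train_lora.py`, its `--dataset`/`--output_dir` flags, and the dataset list are hypothetical placeholders, not anything from this thread.

```python
import os
import subprocess

# Hypothetical per-task datasets; one independent LoRA run per GPU (N = 5).
DATASETS = ["task_0", "task_1", "task_2", "task_3", "task_4"]

procs = []
for gpu_id, dataset in enumerate(DATASETS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this job to one A100
    # train_lora.py is a hypothetical single-GPU PEFT training script
    # that takes a dataset name and an output directory.
    cmd = [
        "python", "train_lora.py",
        "--dataset", dataset,
        "--output_dir", f"lora_adapter_{dataset}",
    ]
    procs.append(subprocess.Popen(cmd, env=env))

# Wait for all N runs to finish; each writes its own adapter checkpoint.
for p in procs:
    p.wait()
```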
-
Hello, I want to find the most efficient way to train N different LoRA weight adapters separately on N different datasets/tasks. I have access to up to 8 A100 GPUs with 40 GB VRAM each and want to optimize for speed.
Right now, training on just one dataset (dolly-15k) using PEFT LoRA on a single GPU is going well; we are using around 80% of one GPU's memory.
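For reference, a single-GPU PEFT LoRA run along these lines looks roughly like the sketch below. The base model, LoRA hyperparameters, prompt formatting, and training arguments are assumptions for illustration, not my actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

base_model = "openlm-research/open_llama_7b"  # assumed base model, for illustration only

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with a LoRA adapter; only the adapter weights are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target modules for a Llama-style model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# dolly-15k: concatenate instruction and response into a single training text.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora_dolly", per_device_train_batch_size=4,
        num_train_epochs=1, learning_rate=2e-4, fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora_dolly")  # writes only the adapter weights
```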
To train N different sets of LoRA adapters on N different datasets as efficiently as possible (optimizing for speed), there are a few approaches I was considering:
For this, I am assuming N=5.
For both of these approaches, would I need to edit Trainer directly? How would you approach this? Thank you!