
Releases: huggingface/trl

v0.13.0

16 Dec 18:31

Major and breaking changes

🐾 Process-supervised RM Trainer

We introduced a new trainer to train Process-supervised Reward Models (PRMs) in TRL. A PRM rewards the quality of intermediate steps, promoting structured reasoning over focusing solely on the final outcome. With this trainer, we introduce a new dataset type: Stepwise supervision, a variant of the prompt-completion type in which the completion is divided into several intermediate steps, each associated with a label. Find out more in the stepwise-supervision section of the TRL documentation.

Here is an example of how to use the PRMTrainer to train a PRM on the Math Shepherd dataset:

# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")

training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd", logging_steps=10)
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()

For more information, check out the PRMTrainer documentation.

by @qgallouedec and @gaetanlop in #2127 and #2148

🔀 Add MergeModelCallback

Various works show that model merging can non-trivially improve performance, especially if the models belong to the same architecture. TRL now features a callback that merges the reference model with the current policy and optionally pushes the merged checkpoint to the Hub. This can be done at the end of each step/epoch and/or at the end of training. This callback uses Arcee's mergekit library: https://github.com/arcee-ai/mergekit

from trl import DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

config = MergeConfig()
merge_callback = MergeModelCallback(config)
trainer = DPOTrainer(...,  callbacks=[merge_callback])

by @August-murr in #2282

🔨 Support for tools for data utils

TRL preprocessing utils now support tools, a first step toward agent fine-tuning.

from transformers import AutoTokenizer
from trl import apply_chat_template

def get_current_temperature(location: str):
    """
    Gets the temperature at a given location.

    Args:
        location: The location to get the temperature for
    """
    return 22.0

# Illustrative setup: any chat-model tokenizer and any conversational example work here
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
example = {"prompt": [{"role": "user", "content": "What is the temperature in Paris?"}]}

example = apply_chat_template(example, tokenizer, tools=[get_current_temperature])

by @August-murr in #2455

🌋 Add support for LLaVA-Next in DPOTrainer

VLMs have specific requirements that call for special treatment in the trainer. DPOTrainer now supports LLaVA-Next models natively.

from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
trainer = DPOTrainer(model=model, ...)

by @chenweize1998 in #2413

🕹️ CLI and TRLParser refactor

The TRL CLI has been refactored to be more user-friendly and easier to extend. We plan to extend support to all trainers soon.

(simplified output, for readability)

$ trl dpo --help
usage: trl dpo [-h] --dataset_name DATASET_NAME [--dataset_config DATASET_CONFIG] --output_dir OUTPUT_DIR [--loss_type {sigmoid,hinge,ipo}]

options:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME, --dataset-name DATASET_NAME
  --dataset_config DATASET_CONFIG, --dataset-config DATASET_CONFIG
  --output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written. (default: None)
  --loss_type {sigmoid,hinge,ipo}, --loss-type {sigmoid,hinge,ipo}

by @qgallouedec in #2380 and #2412

🤝 Mixture of judges

TRL features a new AllTrueJudge judge that unifies the decisions of multiple binary judges. It implements the Mixture of Judges approach described in the CGPO paper.

import random

from trl import AllTrueJudge, BaseBinaryJudge

class RandomBinaryJudge(BaseBinaryJudge):
    """
    Random binary judge, for testing purposes.
    """

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [random.choice([0, 1, -1]) for _ in range(len(prompts))]


prompts = ["The capital of France is", "The biggest planet in the solar system is"]
completions = [["Paris", "Marseille"], ["Saturn", "Jupiter"]]
judge = AllTrueJudge(judges=[RandomBinaryJudge(), RandomBinaryJudge()])
judgements = judge.judge(prompts=prompts, completions=completions)
print(judgements)  # [0, 1]

by @gaetanlop in #2159

❄️ DPO trainer supports num_logits_to_keep to save memory

Save memory in the DPO trainer by computing only the logits that are actually needed, via num_logits_to_keep.

training_args = DPOConfig(..., use_num_logits_to_keep=True)

by @xyangk in #2129

🗺️ Implementation of the DiscoPOP Loss

The DiscoPOP paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other preference optimization losses on several tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0).

training_args = DPOConfig(..., loss_type="discopop", discopop_tau=0.05)

by @fanconic in #2323

🧑‍🍳 Add precompute batch size argument in DPOTrainer for reference model

We can now control the batch size used when precomputing the reference model's log probabilities.

training_args = DPOConfig(
    ...,
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=4,
)

by @SwayamInSync in #2426

📦 Support for packing tokenized datasets for SFT

SFTTrainer has long supported packing datasets for faster training. Now, it supports packing tokenized datasets as well, as shown in the sketch below.
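
A minimal sketch of what this looks like (the dataset name, tokenization step, and hyperparameters below are illustrative; the key point is that the dataset already contains an input_ids column):

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

# Tokenize the dataset ourselves; SFTTrainer uses the resulting "input_ids" column as-is
dataset = load_dataset("trl-lib/tldr", split="train[:1%]")
dataset = dataset.map(lambda example: tokenizer(example["completion"]), remove_columns=dataset.column_names)

training_args = SFTConfig(output_dir="Qwen2-0.5B-SFT-packed", packing=True, max_seq_length=512)
trainer = SFTTrainer(model="Qwen/Qwen2-0.5B", args=training_args, processing_class=tokenizer, train_dataset=dataset)
trainer.train()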

by @kmehant in #2011

📉 Add PEFT support for PPOTrainer

PPOTrainer now supports PEFT for efficient training.

PPOTrainer(
    ...,
    peft_config=peft_config,
)

by @ccs96307 in #2344

💾 Deprecate config in favor of args in PPOTrainer

config has been deprecated in favor of args in PPOTrainer.

  PPOTrainer(
-   config=training_args,
+   args=training_args,
  )

by @qgallouedec in #2384

👮 Deprecate policy in favor of model in PPOTrainer

policy has been deprecated in favor of model in PPOTrainer.

  PPOTrainer(
-   policy=model,
+   model=model,
  )

by @qgallouedec in #2386

What's Changed


v0.12.2

06 Dec 13:01

What's Changed

Full Changelog: v0.12.1...v0.12.2

v0.12.1

15 Nov 12:30

What's Changed

  • 👈 Add tokenizer arg back and add deprecation guidelines by @qgallouedec in #2348

Full Changelog: v0.12.0...v0.12.1

v0.12.0

04 Nov 13:54

Major and breaking changes

General reward model support for Online DPO

Online DPO initially only supported a reward model that had the same tokenizer and chat template as the trained model. Now, you can use any reward model.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

model_name = "..."         # model to train
reward_model_name = "..."  # any reward model; its tokenizer may differ from the trained model's
dataset_name = "..."       # prompt dataset

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, truncation=True, truncation_side="left")

dataset = load_dataset(dataset_name)

training_args = OnlineDPOConfig(output_dir="...")
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
)
trainer.train()

by @qgallouedec in #2276

Migration PPOv2 -> PPO

The PPOv2Trainer has been renamed to PPOTrainer, and the legacy PPOTrainer implementation has been removed. The PPOv2Trainer name is now deprecated and will be removed in the next release.

- trainer = PPOv2Trainer(...)
+ trainer = PPOTrainer(...)

by @qgallouedec in #2174

Refactor ScriptArguments

We had ScriptArguments, SFTScriptArguments, DPOScriptArguments and RewardScriptArguments. Since they all share mostly the same fields, we've merged them into a single ScriptArguments class.
SFTScriptArguments, DPOScriptArguments and RewardScriptArguments still exist but are deprecated and will be removed in the next release.

- script_args = DPOScriptArguments(...)
+ script_args = ScriptArguments(...)

by @qgallouedec in #2145

Soft judges for PairRM

When called via its judge method, the PairRMJudge now accepts a return_scores flag that returns the probability score of the first completion of each pair (instead of the rank of the preferred completion). The logits used for the probability score can be scaled by an optional temperature parameter.

from trl import PairRMJudge

pairrm_judge = PairRMJudge()
prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
results = pairrm_judge.judge(prompts, completions, return_scores=True)
print(results)  # [0.7492601275444031, 0.0005497377132996917]

by @kashif in #2221

Use pairwise judges for online methods

The OnlineDPOTrainer and any trainers that inherit from it (NashMDTrainer and XPOTrainer) can now accept an initialized PairwiseJudge instead of a reward model.

from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()

by @kashif in #2243

Rename trainer arg tokenizer to processing_class

The tokenizer argument in the trainers has been renamed to processing_class to better reflect the fact that it can be not only a tokenizer but also a processor.

- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)

tokenizer is still supported for SFTTrainer and DPOTrainer but deprecated and will be removed in the next release.

by @qgallouedec in #2162

Adding weighted preference optimization (WPO) to DPO

The WPO paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the use_weighting flag to True in the DPOConfig.

DPOConfig(..., use_weighting=True)

by @gaetanlop in #2141

🃏 Model card for TRL

Using trainer.push_to_hub() now automatically creates a model card that includes:

  • A link to the base model used
  • A link to the dataset used for training
  • A link to the TRL repository
  • Sample demo code
  • A link to the associated Weights & Biases run
  • A link to the paper detailing the training procedure
  • Versions of dependencies
  • BibTeX citations for both the training procedure and TRL

All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper’s page).
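
For example (a minimal sketch; any TRL trainer works the same way):

trainer = DPOTrainer(...)
trainer.train()
trainer.push_to_hub()  # generates and uploads the model card described above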


by @qgallouedec in #2123

Minor

Conversational dataset support

You can now use conversational datasets directly, without needing to apply a chat template beforehand, for several trainers (shown here with DPOTrainer):

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")

# Not needed anymore:
#
# def process(example):
#     prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False, add_generation_prompt=True)
#     prompt_chosen = tokenizer.apply_chat_template(example["prompt"] + example["chosen"], tokenize=False)
#     chosen = prompt_chosen[len(prompt) :]
#     prompt_rejected = tokenizer.apply_chat_template(example["prompt"] + example["rejected"], tokenize=False)
#     rejected = prompt_rejected[len(prompt) :]
#     return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)

training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()

Refactor DPO data processing

For more information, see PR #2209.

trl env for printing system info

You can now use trl env to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.

$ trl env

Copy-paste the following information when reporting an issue:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2

by @qgallouedec in #2104

Sequence-Level KD

From the GKD paper:

Sequence-Level KD (Kim & Rush, 2016). SeqKD maximizes the likelihood of high probability sequences generated by the teacher, and can be viewed as supervised FT on teacher-generated outputs.

SeqKD is taken as a baseline in the paper. It is now possible to use Sequence-Level KD in the GKDTrainer by setting seq_kd=True in the GKDConfig.

training_args = GKDConfig(..., seq_kd=True)

by @mst272 in #2220

Default dataset_text_field to "text"

Since many users use "text" as the column name for textual data in datasets, we've made it the default (previously a required argument) in SFTConfig. Now, specifying dataset_text_field="text" is no longer necessary.

  SFTConfig(
      ...,
-     dataset_text_field="text",
  )

by @qgallouedec in #2078

What's Changed


v0.11.4

15 Oct 08:50

What's Changed

New Contributors

Full Changelog: v0.11.3...v0.11.4

v0.11.3

10 Oct 16:33

What's Changed

New Contributors

Full Changelog: v0.11.2...v0.11.3

v0.11.2

07 Oct 17:21

What's Changed

Full Changelog: v0.11.1...v0.11.2

v0.11.1

24 Sep 16:13

Bug fix

  • allow parse-args as list of floats for Online DPO, XPO and Nash-MD configs by @kashif in #2108

Full Changelog: v0.11.0...v0.11.1

v0.11.0

19 Sep 08:46

We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:

New post-training methods

Generalized Knowledge Distillation


Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.

To train models with this method, check out the GKDTrainer.
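
Here is a minimal sketch of a GKD run (the model names and the tiny dummy dataset below are illustrative; the tokenizer argument shown here was later renamed to processing_class in v0.12):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")          # student
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")  # teacher

# Tiny dummy conversational dataset, just to illustrate the expected format
train_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "Hi, how are you?"},
                {"role": "assistant", "content": "I'm great, thanks!"},
            ]
        ]
        * 16
    }
)

training_args = GKDConfig(output_dir="gkd-model", per_device_train_batch_size=1, logging_steps=10)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()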

Exploratory Preference Optimization


Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.

To train models with this method, check out the XPOTrainer.
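
Here is a minimal sketch of an XPO run with a reward model (the model, reward model, and dataset names below are illustrative):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import XPOConfig, XPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = XPOConfig(output_dir="Qwen2-0.5B-XPO", logging_steps=10)
trainer = XPOTrainer(
    model=model, reward_model=reward_model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset
)
trainer.train()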

Nash Learning with Human Feedback


Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models which are conditioned on two inputs, instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium (i.e. a two player game where actions are responses and payoffs are given by the preference model).

To train models with this method, check out the NashMDTrainer.
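
NashMDTrainer exposes the same interface; a minimal sketch (again with illustrative model, reward model, and dataset names):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD", logging_steps=10)
trainer = NashMDTrainer(
    model=model, reward_model=reward_model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset
)
trainer.train()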

New trainer features

  • Online DPO now supports training LoRA adapters with PEFT, which means you can dramatically reduce the amount of VRAM needed to train models with this method (see the sketch after this list). By @qgallouedec in #2041
  • The OrpoTrainer has better integration with PyTorchXLA for faster step time on TPUs ⚡ . By @wenxindongwork in #2001
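
For the Online DPO LoRA support above, a minimal sketch (default LoRA settings shown; in practice you would configure LoraConfig for your model):

from peft import LoraConfig

trainer = OnlineDPOTrainer(
    ...,
    peft_config=LoraConfig(),
)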

Deprecations 🚨

  • The PPOTrainer is marked for deprecation in favour of PPOv2Trainer to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in #2016
  • The RichProgressCallback has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in #2053

Bugfixes and improvements

New Contributors


v0.10.1

29 Aug 14:34

We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:

Online DPO


Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:

  • Eliminates the need for a pre-collected preference dataset (it's generated online)
  • Enables continuous model improvement
  • Yields better results than traditional DPO

To train models with this method, use the OnlineDPOTrainer.

Liger Triton kernels for supercharged SFT


  • We've integrated LinkedIn's Liger Triton kernels into the SFTTrainer for faster throughput and lower memory usage. To use them, set use_liger_kernel in SFTConfig, as sketched below.
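
For example (a minimal sketch, assuming use_liger_kernel is a boolean field on SFTConfig):

training_args = SFTConfig(..., use_liger_kernel=True)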

DPO for VLMs

  • We've added support to align vision-language models with DPO, now covering the LLaVA-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the dpo_visual.py script as follows:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16

WinRate callback for LLM as a judge

  • We've added support for computing win rates over the reference model for methods like DPO. To do so, configure the callback to point to an LLM-as-a-judge API (OpenAI or the Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)

Anchored Preference Optimisation (APO) for fine-grained human/AI feedback

  • Added the APO method, which is an "anchored" version of the alignment objective. There are two variants: apo_zero and apo_down. The apo_zero loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, apo_down decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set loss_type="apo_zero" or loss_type="apo_down" in the DPOConfig
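
For example:

training_args = DPOConfig(..., loss_type="apo_zero")  # or loss_type="apo_down"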

What's Changed
