How can we mitigate reward hacking in RLHF? 🤔 Constrained Generative Policy Optimization (CGPO) is a new RLHF method using a Mixture of Judges (MoJ) from @AIatMeta . CGPO outperforms PPO (single RM) on AlpacaEval, Arena-Hard, and IFEval! 👀
Implementation
- 1️⃣ Select pre-trained LLM (e.g., Llama 3.0 70B) and SFT the model using datasets for various tasks (general chat, math, instruction following).
- 2️⃣ Create a Mixture of Judges (MoJ): combine rule-based and LLM-based judges for constraint evaluation (optionally train separate reward models, e.g., for helpfulness); a minimal MoJ sketch follows this list.
- 3️⃣ Run a warm-up phase using DPO for a few steps on the combined reward data (see the DPO loss sketch below).
- 4️⃣ Run CGPO: sample prompts, generate responses, apply the MoJ for constraint evaluation, and update the policy using a constrained optimizer (CRPG, CODPO, or CRRAFT), as in the training-step sketch after this list.
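
To make step 2️⃣ concrete, here is a minimal Python sketch of a Mixture of Judges: a rule-based judge and an LLM-based judge each return a pass/fail verdict, and a response only counts as constraint-satisfying if every judge passes it. The class names and the `ask_llm` callable are illustrative placeholders, not APIs from the paper or its code.

```python
# Minimal Mixture of Judges (MoJ) sketch: rule-based + LLM-based constraint judges.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    judge: str
    passed: bool


class RuleBasedJudge:
    """Cheap deterministic check, e.g. a length limit or regex constraint."""

    def __init__(self, name: str, check: Callable[[str, str], bool]):
        self.name, self.check = name, check

    def __call__(self, prompt: str, response: str) -> Verdict:
        return Verdict(self.name, self.check(prompt, response))


class LLMJudge:
    """LLM-based constraint judge; `ask_llm` is a placeholder for an API call."""

    def __init__(self, name: str, rubric: str, ask_llm: Callable[[str], str]):
        self.name, self.rubric, self.ask_llm = name, rubric, ask_llm

    def __call__(self, prompt: str, response: str) -> Verdict:
        answer = self.ask_llm(
            f"{self.rubric}\n\nPrompt: {prompt}\nResponse: {response}\nAnswer YES or NO."
        )
        return Verdict(self.name, answer.strip().upper().startswith("YES"))


def mixture_of_judges(judges: List, prompt: str, response: str) -> bool:
    """A response satisfies the constraints only if every judge passes it."""
    return all(judge(prompt, response).passed for judge in judges)
```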
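For the warm-up in step 3️⃣, here is a minimal sketch of the standard DPO loss on (chosen, rejected) preference pairs, assuming sequence-level log-probabilities from the policy and a frozen reference model are already computed; the batching and data pipeline around it are omitted.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```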
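And for step 4️⃣, a schematic update step in the spirit of the CRRAFT optimizer: sample several responses per prompt, filter them through the MoJ, keep the highest-reward feasible response, and fine-tune on it. `policy.generate`, `reward_model.score`, and `policy.train_on` are assumed interfaces for illustration only; the paper's actual optimizers (CRPG, CODPO, CRRAFT) differ in how they combine rewards and constraints.

```python
def cgpo_crraft_step(policy, reward_model, judges, prompts, k: int = 4):
    """One schematic CGPO step: constraint filtering (MoJ) + reward ranking + SFT update."""
    chosen = []
    for prompt in prompts:
        # Sample k candidate responses from the current policy.
        candidates = [policy.generate(prompt) for _ in range(k)]
        # MoJ constraint filter (reuses mixture_of_judges from the sketch above).
        feasible = [r for r in candidates if mixture_of_judges(judges, prompt, r)]
        if not feasible:
            continue  # no constraint-satisfying sample for this prompt
        # Keep the feasible response the reward model scores highest.
        best = max(feasible, key=lambda r: reward_model.score(prompt, r))
        chosen.append((prompt, best))
    # SFT-style update on the retained (prompt, response) pairs.
    policy.train_on(chosen)
```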
Insights
- 💡 MoJ prevents reward hacking and boosts performance by enforcing constraints.
- 📈 Outperforms PPO and DPO in benchmarks like AlpacaEval (+7.4%), Arena-Hard (+12.5%), and others.
- 🧮 2-5% gains in math, coding, and knowledge tasks.
- 🔄 Warm-up phase with DPO significantly boosts final performance.
- ⚙️ Allows tailoring of reward models, judges, and optimizer settings for individual tasks.
- ❌ Requires significantly more compute, as there are multiple reward models/judges.