
LLM-as-a-Judge for AI Systems

  • Mixture-of-Judges
  • LLM-as-a-Judge: A Scalable Solution for Evaluating Language Models Using Language Models
  • RAG Evaluation with LLM-as-a-Judge + Synthetic Dataset Creation
  • Improve factual consistency with LLM Debates
  • Llama-Bench


How can we mitigate reward hacking in RLHF? 🤔 Constrained Generative Policy Optimization (CGPO) is a new RLHF method from @AIatMeta that uses a Mixture of Judges (MoJ). CGPO outperforms PPO (single RM) on AlpacaEval, Arena-Hard, and IFEval! 👀

Implementation

  • 1️⃣ Select a pre-trained LLM (e.g., Llama 3.0 70B) and SFT it using datasets for various tasks (general chat, math, instruction following).
  • 2️⃣ Create the Mixture of Judges (MoJ): combine rule-based and LLM-based judges for constraint evaluation (optionally train separate reward models, e.g., for helpfulness); a minimal sketch follows this list.
  • 3️⃣ Run a warm-up phase using DPO for a few steps on the combined reward data.
  • 4️⃣ Use CGPO: sample prompts, generate responses, apply the MoJ for constraint evaluation, and update the policy with a constrained optimizer (CRPG, CODPO, or CRRAFT).
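
To make steps 2 and 4 concrete, here is a minimal Python sketch of how a mixture of rule-based and LLM-based judges could gate the reward signal before a policy update. The judge functions, the `query_judge_model` helper, and the penalty value are illustrative assumptions, not the paper's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Judgement:
    passed: bool   # True if the response satisfies this judge's constraint
    reason: str    # short label describing which check was applied


def query_judge_model(judge_prompt: str) -> str:
    """Placeholder for a call to a judge LLM (e.g., an inference API endpoint)."""
    raise NotImplementedError("wire this up to your judge model")


def length_judge(prompt: str, response: str) -> Judgement:
    """Rule-based judge: reject empty or excessively long responses."""
    ok = 0 < len(response.split()) <= 1024
    return Judgement(passed=ok, reason="length check")


def refusal_judge(prompt: str, response: str) -> Judgement:
    """Rule-based judge: flag canned refusals (false-refusal constraint)."""
    refused = bool(re.search(r"\b(i cannot|i can't) help\b", response, re.IGNORECASE))
    return Judgement(passed=not refused, reason="false-refusal check")


def factuality_judge(prompt: str, response: str) -> Judgement:
    """LLM-based judge: ask a judge model whether the answer is grounded."""
    verdict = query_judge_model(
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Is the response factually consistent with the prompt? Answer yes or no."
    )
    return Judgement(passed=verdict.strip().lower().startswith("yes"),
                     reason="LLM factuality check")


JUDGES: List[Callable[[str, str], Judgement]] = [length_judge, refusal_judge, factuality_judge]


def moj_passes(prompt: str, response: str) -> bool:
    """A sample is admissible only if every judge in the mixture passes it."""
    return all(judge(prompt, response).passed for judge in JUDGES)


def constrained_reward(prompt: str, response: str, rm_score: float,
                       violation_penalty: float = -1.0) -> float:
    """Keep the reward-model score for admissible samples and replace it with a
    fixed penalty for constraint violators, so the policy cannot improve its
    reward by hacking the RM in ways the judges catch (CRPG-flavoured idea)."""
    return rm_score if moj_passes(prompt, response) else violation_penalty
```

In a loop like step 4, each generated response would be scored with `constrained_reward` before the constrained optimizer update, so constraint-violating samples never contribute positive reward.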

Insights

  • 💡 MoJ prevents reward hacking and boosts performance by enforcing constraints.
  • 📈 Outperforms PPO and DPO on benchmarks like AlpacaEval (+7.4%), Arena-Hard (+12.5%), and others.
  • 🧮 2-5% gains on math, coding, and knowledge tasks.
  • 🔄 The warm-up phase with DPO significantly boosts final performance.
  • ⚙️ Allows tailoring of reward models, judges, and optimizer settings for individual tasks.
  • ❌ Requires significantly more compute, as there are multiple reward models/judges.
