
Does prompt make sense? #243

Open · wants to merge 2 commits into main
Conversation

@vwxyzjn (Collaborator) commented Aug 12, 2024

This PR includes a simple script to judge the quality of the SFT prompts in the dataset.
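
For context, a minimal sketch of what such a judging pass might look like, assuming the openai>=1.0 Python client; the judge instruction, the 1-to-5 scale, and the judge_prompt helper are illustrative assumptions, not the PR's actual code:

    # Sketch of an LLM-as-a-judge pass over SFT prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_TEMPLATE = (
        "Does the following prompt make sense as an instruction to a language "
        "model? Answer with a score from 1 (nonsense) to 5 (clear and "
        "well-formed), followed by a one-sentence reason.\n\nPrompt:\n{prompt}"
    )

    def judge_prompt(prompt: str, model: str = "gpt-3.5-turbo-0125") -> str:
        """Ask the judge model to rate a single SFT prompt."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(prompt=prompt)}],
        )
        return response.choices[0].message.content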


@hamishivi (Collaborator):

As a very basic first pass this makes sense, but I wonder if:
(a) we can be more specific / ask for fine-grained scores (e.g., coherence, length, etc.). I feel like recent llm-as-a-judge work is trending toward more fine-grained scores (e.g., UltraFeedback's fine-grained aspect ratings vs. a single overall score); a sketch of what that could look like follows below.
(b) we can make this more quantitative somehow. Can we have some analysis of false positives or similar? Or maybe a small validation set that we can reason about and quality-check against?
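
A rough sketch of point (a), asking the judge for per-axis scores returned as JSON rather than one overall rating; the axis names and schema here are assumptions for illustration:

    # Sketch of fine-grained judging: per-axis 1-5 scores parsed from a JSON
    # reply. JSON mode is supported by gpt-3.5-turbo-0125 and newer models.
    import json

    from openai import OpenAI

    client = OpenAI()

    FINE_GRAINED_TEMPLATE = (
        "Rate the following prompt from 1 to 5 on each axis and reply with a "
        'JSON object with the keys "coherence", "specificity", and '
        '"answerability":\n\nPrompt:\n{prompt}'
    )

    def judge_fine_grained(prompt: str, model: str = "gpt-3.5-turbo-0125") -> dict:
        """Ask the judge for per-axis scores and parse the JSON reply."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": FINE_GRAINED_TEMPLATE.format(prompt=prompt)}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)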

@hamishivi (Collaborator) left a comment


Maybe we could make some sort of experimental folder for a rough script like this? Unsure; this sort of thing feels much rougher than, e.g., the submit_eval/finetune scripts.

from dataclasses import dataclass

@dataclass
class LLMJudgeConfig:
    n: int = 64  # how many prompts to sample for judging
    model: str = "gpt-3.5-turbo-0125"  # OpenAI model used as the judge
Comment from a collaborator:

should we use 4o?
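
If so, switching the judge would just mean overriding the dataclass default (assuming the account has gpt-4o access):

    config = LLMJudgeConfig(model="gpt-4o")  # n keeps its default of 64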
