initial persona data gen 2 commit #489
base: main
This looks good. Can we just host the jsonl files on huggingface or something so we don't need to add 200k lines of code to the repo?
@natolambert good idea! Removed the big files and moved them to a Hugging Face dataset at:
@@ -0,0 +1,48 @@
To start make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`:
Maybe give it a title like:
Persona Data Generation
This folder contains scripts for generating persona data used for preference tuning in the Tulu 3 recipe [citation, section xxxx]
```
# Generate Instruction Following prompts
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
Can we make the command copy-paste-and-run? E.g., remove `<SAMPLE_SIZE>` and replace it with a sensible default value. Should `--openai_key` possibly be read from the environment? What is this `--org_id`?
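One way to address the points raised in this comment is sketched below: a minimal `argparse` setup in which `--openai_key` and `--org_id` fall back to environment variables and `--end_index` has a default. The flag names come from the commands in the README diff; the environment variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID`, the default of 100 for `--end_index`, and the helper names `build_parser` and `parse_args` are assumptions for illustration, not the script's actual implementation.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults make the command runnable as-is."""
    parser = argparse.ArgumentParser(
        description="Persona-driven instruction-following data generation (sketch)."
    )
    parser.add_argument("--model", default="gpt-4o")
    parser.add_argument("--start_index", type=int, default=0)
    # Hypothetical default sample size; pick whatever is sensible for the project.
    parser.add_argument("--end_index", type=int, default=100)
    parser.add_argument("--output_path", default="if_prompts.jsonl")
    # Assumed environment variable names; explicit CLI values take precedence.
    parser.add_argument("--openai_key", default=os.environ.get("OPENAI_API_KEY"))
    parser.add_argument("--org_id", default=os.environ.get("OPENAI_ORG_ID"))
    parser.add_argument("--dataset", default="ai2-adapt-dev/personahub_personas")
    parser.add_argument("--template", default="instruction_following")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and fail early if no API key is available."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.openai_key:
        parser.error("Set OPENAI_API_KEY or pass --openai_key.")
    return args
```

With defaults like these, the documented command would not need a key on the command line, which also keeps secrets out of shell history. The organization ID is left optional here (it resolves to `None` when unset).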
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution

# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data)
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt
Maybe include some example outputs to give readers an idea of what to expect after running the script, similar to how I included "Here is an example created dataset: https://huggingface.co/datasets/vwxyzjn/rejection_sampling_1727887719" in https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md#scoring-completions
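Until example outputs are documented, a short helper like the following could be used to peek at the first few records of a generated JSONL file. This is a sketch only: `preview_jsonl` is a hypothetical helper, and it assumes nothing about the fields the script writes beyond one JSON object per line.

```python
import json
from pathlib import Path


def preview_jsonl(path, limit=3):
    """Return up to `limit` parsed records from a JSONL file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    return records
```

For example, `preview_jsonl("if_prompts.jsonl")` would return the first three records of the prompts file written by the first command, which could then be pasted into the README as sample output.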
These are the main scripts for persona-driven synthetic data generation for Math, Coding, and Precise Instruction Following.