initial persona data gen 2 commit #489

Open: wants to merge 2 commits into base: main
fabrahman
Contributor

These are the main scripts for persona-driven synthetic data generation for math, coding, and precise instruction following.

fabrahman self-assigned this on Dec 18, 2024
Collaborator

@natolambert natolambert left a comment

This looks good. Can we just host the jsonl files on huggingface or something so we don't need to add 200k lines of code to the repo?

@fabrahman
Contributor Author

@natolambert good idea! removed the big files and moved them to hugging face dataset at: ai2-adapt-dev/personahub_personas
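
For context, the commands in this PR pass either a Hub dataset ID (ai2-adapt-dev/personahub_personas) or a local .jsonl path to --dataset. Below is a minimal sketch of how a script could accept both. It is not the PR's actual implementation: the helper names `parse_jsonl_lines`, `load_rows`, and `select_rows`, the `train` split, and the half-open reading of --start_index/--end_index are all assumptions made for illustration.

```python
import json
from typing import Iterable


def parse_jsonl_lines(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_rows(dataset: str, split: str = "train") -> list[dict]:
    """Load rows from a local .jsonl file or, otherwise, from the Hugging Face Hub."""
    if dataset.endswith(".jsonl"):
        with open(dataset, encoding="utf-8") as f:
            return parse_jsonl_lines(f)
    # Assumption: any other value is a Hub dataset ID that has the given split.
    # Imported lazily so local files work without the `datasets` package.
    from datasets import load_dataset

    return list(load_dataset(dataset, split=split))


def select_rows(rows: list[dict], start_index: int, end_index: int) -> list[dict]:
    """Return rows in the half-open range [start_index, end_index) (assumed semantics)."""
    return rows[start_index:end_index]
```

With this shape, the same entry point can read the hosted personas in the first step and the locally written if_prompts.jsonl in later steps, so no large data files need to live in the repo.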

@@ -0,0 +1,48 @@
To start, make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`:
Collaborator

Maybe give it a title like

Persona Data Generation

This folder contains scripts for generating the persona data used for preference tuning in the Tulu 3 recipe [citation, section xxxx]


```
# Generate Instruction Following prompts
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following
```
Collaborator

Can we make the command copy-paste runnable? E.g., replace <SAMPLE_SIZE> with a sensible default value. Could --openai_key be read from the environment? And what is --org_id?
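
One way the points in this comment could be addressed is sketched below with `argparse`. This is an illustration only, not the script's actual argument handling: the default of 100 for --end_index is a hypothetical value, and reading the key and organization ID from `OPENAI_API_KEY` and `OPENAI_ORG_ID` is an assumption about how the script would be wired up.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults make the bare command runnable."""
    parser = argparse.ArgumentParser(description="Persona-driven data generation (sketch)")
    parser.add_argument("--model", default="gpt-4o")
    parser.add_argument("--start_index", type=int, default=0)
    # Hypothetical default that replaces the <SAMPLE_SIZE> placeholder.
    parser.add_argument("--end_index", type=int, default=100)
    parser.add_argument("--output_path", default="if_prompts.jsonl")
    parser.add_argument("--dataset", default="ai2-adapt-dev/personahub_personas")
    parser.add_argument("--template", default="instruction_following")
    # Assumption: fall back to environment variables so secrets stay out of
    # the command line and shell history.
    parser.add_argument("--openai_key", default=os.environ.get("OPENAI_API_KEY"))
    parser.add_argument("--org_id", default=os.environ.get("OPENAI_ORG_ID"))
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and fail early if no API key is available."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.openai_key:
        parser.error("set OPENAI_API_KEY or pass --openai_key")
    return args
```

Under these assumptions, running the script with no flags would generate instruction-following prompts for the first 100 personas, and an explicit --openai_key would still override the environment.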

python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution

# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data)
python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt
Collaborator

Maybe include some example outputs to give readers an idea of what to expect after running the script, similar to how I included "Here is an example created dataset: https://huggingface.co/datasets/vwxyzjn/rejection_sampling_1727887719" in https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md#scoring-completions

3 participants