initial persona data gen 2 commit #489

fabrahman · 2024-12-18T22:47:46Z

These are the main scripts for persona-driven synthetic data generation for Math, Coding, and Precise Instruction Following.

natolambert

This looks good. Can we just host the jsonl files on huggingface or something so we don't need to add 200k lines of code to the repo?

fabrahman · 2024-12-19T00:26:48Z

@natolambert good idea! I removed the big files and moved them to a Hugging Face dataset: ai2-adapt-dev/personahub_personas

vwxyzjn · 2024-12-19T14:51:05Z

scripts/persona_driven_data_gen/README.md

@@ -0,0 +1,48 @@
+To start make sure you have your OpenAI and Anthropic API keys and have installed the libraries listed in `requirements.txt`:


Maybe give it a title like

Persona Data Generation

This folder contains scripts for generating the persona data used for preference tuning in the Tulu 3 recipe [citation, section xxxx]

vwxyzjn · 2024-12-19T14:52:15Z

scripts/persona_driven_data_gen/README.md

+
+```
+# Generate Instruction Following prompts
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_prompts.jsonl --openai_key Z --org_id YYY --dataset ai2-adapt-dev/personahub_personas --template instruction_following


Can we make the command copy-paste runnable? E.g., remove <SAMPLE_SIZE> and replace it with a sensible default value. --openai_key should possibly be read from the environment. What is --org_id?
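One way the request above could be addressed is sketched below. It assumes the script parses its flags with `argparse`, which the diff does not show. The helper names `build_parser` and `parse_args` and the default of 100 for `--end_index` are hypothetical, chosen only for illustration. `OPENAI_API_KEY` is the environment variable the OpenAI Python SDK reads by convention. `--org_id` is left out because its purpose is not explained in this thread.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI sketch; flag names mirror the README command."""
    parser = argparse.ArgumentParser(
        description="Persona-driven prompt generation (illustrative sketch)"
    )
    parser.add_argument("--model", default="gpt-4o")
    parser.add_argument("--start_index", type=int, default=0)
    # 100 is an arbitrary illustrative default, not a value taken from the PR.
    parser.add_argument("--end_index", type=int, default=100)
    parser.add_argument("--output_path", default="if_prompts.jsonl")
    # Fall back to the OPENAI_API_KEY environment variable when the flag is omitted.
    parser.add_argument("--openai_key", default=os.environ.get("OPENAI_API_KEY"))
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse flags and fail early with a clear message if no key is available."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.openai_key:
        parser.error("no API key: pass --openai_key or set OPENAI_API_KEY")
    return args
```

With defaults like these, the README command would run without placeholders once the environment variable is set, and the key would no longer appear in shell history.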

vwxyzjn · 2024-12-19T14:53:32Z

scripts/persona_driven_data_gen/README.md

+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template instruction_following_solution
+
+# Rewrite prompts to form Rejected Response (used for Persona-IF DPO data)
+python persona_driven_generate_ifdata.py --model "gpt-4o" --start_index 0 --end_index <SAMPLE_SIZE> --output_path if_solutions.jsonl --openai_key Z --org_id YYY --dataset if_prompts.jsonl --template rewrite_if_prompt


Maybe include some example outputs to give readers an idea of what to expect after running the script, similar to the "Here is an example created dataset: https://huggingface.co/datasets/vwxyzjn/rejection_sampling_1727887719" line I included in https://github.com/allenai/open-instruct/blob/main/docs/algorithms/rejection_sampling.md#scoring-completions
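Until example outputs are published, a reader could inspect a generated file directly. The sketch below is a generic JSON Lines preview helper; `preview_jsonl` is a hypothetical name, and no assumption is made about which fields the scripts actually write.

```python
import json
from itertools import islice


def preview_jsonl(path: str, n: int = 3) -> list:
    """Return the first n records of a JSON Lines file as parsed objects."""
    records = []
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not raise a decode error.
        non_empty = (line for line in f if line.strip())
        for line in islice(non_empty, n):
            records.append(json.loads(line))
    return records
```

For example, `preview_jsonl("if_prompts.jsonl")` would return the first three records of the file produced by the first command, which could then be pasted into the README as sample output.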

initial persona data gen 2 commit

545a92d

fabrahman requested review from vwxyzjn and natolambert December 18, 2024 22:47

fabrahman self-assigned this Dec 18, 2024

natolambert requested changes Dec 18, 2024

removed input persona files, moved to hf

ea2aa74

fabrahman requested a review from natolambert December 19, 2024 00:38

vwxyzjn reviewed Dec 19, 2024
