[post training] support more data format #717

SLR722 · 2025-01-03T22:05:32Z

What does this PR do?

Support instruct format (with input and output columns) and chat format (multi turn conversations with conversations / messages column) in post training

We believe this is important for Llama stack post training APIs alpha release and a great enhancement to user experience

Test

verified the alpaca dataset generate exact same data before and after this change

verified the instruct dataset generate exact same data between torchtune OSS and llama stack, the dataset I used for test https://huggingface.co/datasets/vicgalle/alpaca-gpt4

verified the chat dataset generate exact same data between torchtune OSS and llama stack, the dataset I used for test https://huggingface.co/datasets/shibing624/huatuo_medical_qa_sharegpt

yanxi0830 · 2025-01-04T01:20:11Z

llama_stack/providers/inline/post_training/torchtune/common/utils.py

+    "chat_sharegpt": ShareGPTToMessages,
+    "chat_openai": OpenAIToMessages,
+}
+

 EXPECTED_DATASET_SCHEMA = DatasetSchema(


Might be helpful to use DataSchemaValidatorMixin for data schema validation https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/common/data_schema_validator.py#L65

Synced offline: we can keep logic for validation in common utils, but there's no value in having them as mixin. Refactor in #720

yanxi0830 · 2025-01-04T01:21:12Z

llama_stack/providers/inline/post_training/torchtune/common/utils.py


 from torchtune.models.llama3 import llama3_tokenizer, lora_llama3_8b
 from torchtune.models.llama3._tokenizer import Llama3Tokenizer
 from torchtune.models.llama3_2 import lora_llama3_2_3b
+from torchtune.modules.transforms import Transform


 class ColumnName(Enum):


I think this enum class should be merged with

llama-stack/llama_stack/providers/utils/common/data_schema_validator.py

Lines 19 to 25 in 485476c

class ColumnName(Enum):

input_query = "input_query"

expected_answer = "expected_answer"

chat_completion_input = "chat_completion_input"

completion_input = "completion_input"

generated_answer = "generated_answer"

context = "context"

temp commit

346a6c6

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 3, 2025

SLR722 added 3 commits January 3, 2025 14:30

temp commit

82d5758

temp commit

280581a

refine

bbe190a

SLR722 marked this pull request as ready for review January 4, 2025 00:46

SLR722 requested review from ashwinb, yanxi0830, hardikjshah, dltn, raghotham, dineshyv and vladimirivic as code owners January 4, 2025 00:46

SLR722 changed the title ~~add more data format~~ [post training] support more data format Jan 4, 2025

yanxi0830 reviewed Jan 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[post training] support more data format #717

[post training] support more data format #717

SLR722 commented Jan 3, 2025 •

edited

Loading

yanxi0830 Jan 4, 2025

yanxi0830 Jan 4, 2025 •

edited

Loading

yanxi0830 Jan 4, 2025

	class ColumnName(Enum):
	input_query = "input_query"
	expected_answer = "expected_answer"
	chat_completion_input = "chat_completion_input"
	completion_input = "completion_input"
	generated_answer = "generated_answer"
	context = "context"

[post training] support more data format #717

Are you sure you want to change the base?

[post training] support more data format #717

Conversation

SLR722 commented Jan 3, 2025 • edited Loading

What does this PR do?

Test

yanxi0830 Jan 4, 2025

Choose a reason for hiding this comment

yanxi0830 Jan 4, 2025 • edited Loading

Choose a reason for hiding this comment

yanxi0830 Jan 4, 2025

Choose a reason for hiding this comment

SLR722 commented Jan 3, 2025 •

edited

Loading

yanxi0830 Jan 4, 2025 •

edited

Loading