Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[post training] support more data format #717

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

SLR722
Copy link
Contributor

@SLR722 SLR722 commented Jan 3, 2025

What does this PR do?

Support instruct format (with input and output columns) and chat format (multi turn conversations with conversations / messages column) in post training

We believe this is important for Llama stack post training APIs alpha release and a great enhancement to user experience

Test

  1. verified the alpaca dataset generate exact same data before and after this change
Screenshot 2025-01-03 at 4 39 52 PM
  1. verified the instruct dataset generate exact same data between torchtune OSS and llama stack, the dataset I used for test https://huggingface.co/datasets/vicgalle/alpaca-gpt4
Screenshot 2025-01-03 at 4 41 46 PM
  1. verified the chat dataset generate exact same data between torchtune OSS and llama stack, the dataset I used for test https://huggingface.co/datasets/shibing624/huatuo_medical_qa_sharegpt
Screenshot 2025-01-03 at 4 45 52 PM

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 3, 2025
@SLR722 SLR722 marked this pull request as ready for review January 4, 2025 00:46
@SLR722 SLR722 changed the title add more data format [post training] support more data format Jan 4, 2025
"chat_sharegpt": ShareGPTToMessages,
"chat_openai": OpenAIToMessages,
}


EXPECTED_DATASET_SCHEMA = DatasetSchema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be helpful to use DataSchemaValidatorMixin for data schema validation https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/common/data_schema_validator.py#L65

Copy link
Contributor

@yanxi0830 yanxi0830 Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline: we can keep logic for validation in common utils, but there's no value in having them as mixin. Refactor in #720


from torchtune.models.llama3 import llama3_tokenizer, lora_llama3_8b
from torchtune.models.llama3._tokenizer import Llama3Tokenizer
from torchtune.models.llama3_2 import lora_llama3_2_3b
from torchtune.modules.transforms import Transform


class ColumnName(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this enum class should be merged with

class ColumnName(Enum):
input_query = "input_query"
expected_answer = "expected_answer"
chat_completion_input = "chat_completion_input"
completion_input = "completion_input"
generated_answer = "generated_answer"
context = "context"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants