
Dataset used to pre-train #12

Open
agademic opened this issue Apr 19, 2023 · 5 comments

@agademic

Hi there!
First of all, thank you for the amazing work!
The README says the models were trained on "the new dataset based on The Pile", which is 3x the size of The Pile. Can you share more insight into the dataset and its contents?

Thank you!

@gururise

gururise commented Apr 19, 2023

The README indicates they are planning to release a technical report soon; I suspect the details will be in there. I also hope they continue training the 3B and 7B models all the way up to 1.5T tokens!

THANK YOU STABILITY AI
Your contributions to the Open Source community are very much appreciated!

@fche

fche commented Apr 19, 2023

Can we expect the forthcoming dataset declaration to include the inputs that imbue this model with politically correct output (even with a neutral SYSTEM prompt)?

@MarkSchmidty
Contributor

MarkSchmidty commented Apr 20, 2023

Can we expect the forthcoming dataset declaration to include the inputs that imbue this model with politically correct output (even with a neutral SYSTEM prompt)?

Only the "Tuned" model has a SYSTEM prompt and that model's finetuning datasets are where that is coming from. They're the same finetuning data used for llama finetunes like Alpaca and GPT4All, which are outputs of ChatGPT. So ultimately they come from ChatGPT.

The "Base" model does not have a SYSTEM prompt and does not use those datasets or any like them.

@johann-petrak

It would still be interesting and important to share some basic information about the training data in the repo now, especially the kind and size of the training data for each language.
Judging from a few initial trials, the amount of German training data is probably very small, and results in German are accordingly quite poor.

@mcmonkey4eva

More information will be published soon, and it will very likely answer your questions once it's available.
