Feature proposal: Stacking, potentially heterogeneous, datasets #7279
Introduction
Hello there,

I noticed that there are two ways to combine multiple datasets: either through `datasets.concatenate_datasets` or `datasets.interleave_datasets`. However, to my knowledge (please correct me if I am wrong), both approaches require the combined datasets to have the same features. I think it would be a great idea to add support for combining multiple datasets that might not follow the same schema (i.e. have different features), for example an image dataset and a text dataset. That is why I propose a third function in the `datasets.combine` module called `stack_datasets`, which can be used to combine a list of datasets with (potentially) different features.

Motivation
I motivate this by:

A: The fact that PyTorch offers similar functionality under `torch.utils.data.StackDataset` (link).

B: In settings where one would like to e.g. train a vision-language model using an image-text dataset, an image dataset, and a text dataset, this functionality would offer a clean and intuitive way to create multimodal datasets. I am aware that the aforementioned is also feasible without my proposed function, but I believe this offers a nice approach that aligns with existing functionality and is provided directly within the `datasets` package.

API
`stack_datasets` has two arguments: `datasets` and `stopping_strategy`.

`datasets` is a dictionary of either type `Dict[str, Dataset]` or `Dict[str, IterableDataset]`; a mixture is not allowed. It contains the names of the datasets (the keys) and the datasets themselves (the values) that should be stacked. Each item returned is a dictionary with one key-value pair per dataset: the keys are the names of the datasets as provided in the argument `datasets`, and the values are the respective examples from those datasets.

`stopping_strategy` is the same as for `interleave_datasets`. If it is `first_exhausted`, we stop as soon as the smallest dataset runs out of examples; if it is `all_exhausted`, we stop once every dataset has run out of examples at least once. For `all_exhausted`, this means that examples from shorter datasets may be visited multiple times.
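To make the proposed semantics concrete, here is a small pure-Python sketch of the stacking behavior over plain lists. This is only an illustration of the semantics described above, not the actual `datasets` implementation; `stack_examples` is a hypothetical helper name used for this sketch.

```python
from itertools import cycle

def stack_examples(datasets, stopping_strategy="first_exhausted"):
    """Illustrative sketch of the proposed stacking semantics.

    `datasets` maps a dataset name to a list of examples; each yielded
    item is a dict with one key-value pair per dataset.
    """
    if stopping_strategy == "first_exhausted":
        # Stop as soon as the smallest dataset runs out of examples.
        n = min(len(ds) for ds in datasets.values())
        iters = {name: iter(ds) for name, ds in datasets.items()}
    elif stopping_strategy == "all_exhausted":
        # Stop once every dataset has been exhausted at least once;
        # shorter datasets are cycled, so their examples may repeat.
        n = max(len(ds) for ds in datasets.values())
        iters = {name: cycle(ds) for name, ds in datasets.items()}
    else:
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy}")
    for _ in range(n):
        yield {name: next(it) for name, it in iters.items()}

# Example: an "image" dataset of length 3 stacked with a "text" dataset
# of length 2 (placeholder strings stand in for real examples).
items = list(stack_examples({"image": ["img0", "img1", "img2"],
                             "text": ["txt0", "txt1"]}))
# first_exhausted → 2 items, e.g. {"image": "img0", "text": "txt0"}
```

With `stopping_strategy="all_exhausted"`, the same call would yield 3 items, with the shorter `text` dataset wrapping around to `"txt0"` in the last item.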
Docs

I saw that there are multiple documentation pages and guides on the Hugging Face website that introduce `concatenate_datasets` and `interleave_datasets`, for example here. If this request is merged, I would be willing to add the new functionality at the appropriate points in the documentation (if desired).

Tests
I also added some tests to ensure correctness. Some tests I wrote in tests/test_iterable_dataset.py run for both `Dataset` and `IterableDataset`, even though tests for `Dataset` technically do not belong in this script; I found that this was a nice way to cover more cases with mostly the same code.

Additional information
I tried to write the code in a way that is similar to that of `concatenate_datasets` and `interleave_datasets`.

I'm open to feedback and willing to make adjustments based on your suggestions, so feel free to give me your take. :)