
Releases: huggingface/datasets

2.10.0

22 Feb 12:58
cac733f

Important

  • Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
    • Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
  • Skip dataset verifications by default by @mariosasko in #5303
    • introduces multiple verification_mode values that you can pass to load_dataset()
    • the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

  • Single TQDM bar in multi-proc map by @mariosasko in #5455
    • No more stacked TQDM bars when calling .map() in multiprocessing
  • Map-style Dataset to IterableDataset by @lhoestq in #5410
  • Select columns of Dataset or DatasetDict by @daskol in #5480
    • introduces .select_columns() to return a dataset containing only the requested columns
  • Added functionality: sort datasets by multiple keys by @MichlF in #5502
    • introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
  • Add JAX device selection when formatting by @alvarobartt in #5547
    • introduces ds = ds.with_format("jax", device=device)
  • Reload features from Parquet metadata by @MFreidank in #5516
  • Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.9.0...ef

2.9.0

26 Jan 19:33
b5672a9

Datasets Features

  • Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377

    • Pass num_workers= to .to_tf_dataset() to speed up data loading with multiprocessing
  • Distributed support by @lhoestq in #5369

    • Split your dataset for each node for distributed training
    • It supports both Dataset and IterableDataset (e.g. in streaming mode)
    • See the documentation for more details
    import os
    from datasets.distributed import split_dataset_by_node
    
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
  • Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400

  • Tqdm progress bar for to_parquet by @zanussbaum in #5456

  • ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379

  • Support other formats than uint8 for image arrays by @vigsterkr in #5365
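
To give an intuition for what splitting a dataset across nodes means, here is a plain-Python sketch of two common partitioning strategies: contiguous chunks for indexable data and strided selection for streams. It does not call the library and is not its implementation. The helper names contiguous_chunk and strided_examples are hypothetical, and the assumption that split_dataset_by_node behaves along these lines comes from the library documentation rather than from these notes.

```python
from itertools import islice
from typing import Iterable, Iterator


def contiguous_chunk(n_examples: int, rank: int, world_size: int) -> range:
    """Index range owned by `rank` when indices are cut into contiguous chunks.

    The first `n_examples % world_size` ranks receive one extra example.
    """
    base, extra = divmod(n_examples, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def strided_examples(examples: Iterable, rank: int, world_size: int) -> Iterator:
    """Keep one example out of every `world_size`, starting at position `rank`."""
    return islice(examples, rank, None, world_size)


# Hypothetical setup: 10 examples shared by 3 nodes.
chunks = [list(contiguous_chunk(10, rank, 3)) for rank in range(3)]
strided = list(strided_examples(range(10), rank=1, world_size=3))
print(chunks)
print(strided)
```

In both strategies every example is assigned to exactly one rank, which is the property distributed training relies on.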

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.8.0...2.9.0

2.8.0

19 Dec 10:55
037c9b5

Important

  • Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
    • From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
    • The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
    • Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

Datasets Features

  • Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
    • Datasets in streaming mode now update their features after column renaming or removal
  • Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
    • Use multiprocessing to load multiple files in parallel
  • Add features param to IterableDataset.map by @alvarobartt in #5311
  • Sharded save_to_disk + multiprocessing by @lhoestq in #5268
    • Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
    • Pass num_proc to use multiprocessing.
  • Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
  • Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
    • You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    
    ds = load_dataset("c4", "en", streaming=True, split="train")
    dataloader = DataLoader(ds, batch_size=32, num_workers=4)

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.7.0...2.8.0

2.7.1

22 Nov 17:27
5ef1ab1

Bug fixes

Full Changelog: 2.7.0...2.7.1

2.6.2

22 Nov 17:49
a6a5a1c

Bug fixes

Full Changelog: 2.6.1...2.6.2

2.7.0

16 Nov 10:11
edf1902

Dataset Features

  • Multiprocessed dataset builder by @TevenLeScao in #5107
    • Load big datasets faster than before using multiprocessing:
    from datasets import load_dataset
    ds = load_dataset("imagenet-1k", num_proc=4)
  • Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
    • Functions passed to map or filter that use tensors or pipelines can now be cached
  • Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
  • TextConfig: added "errors" by @NightMachinery in #5155

Audio setup

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.6.1...2.7.0

2.6.1

14 Oct 15:45

Bug fixes

  • Fix filter indices when batched by @albertvillanova in #5113
    • fixed a bug where filter could return examples with the wrong indices
  • Fix iter_batches by @lhoestq in #5115
    • fixed a bug where map with batched=True could return a dataset with fewer examples
  • Fix a typo in arrow_dataset.py by @yangky11 in #5108

New Contributors

Full Changelog: 2.6.0...2.6.1

2.6.0

13 Oct 11:00

Important

  • [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
    • all the dataset scripts and dataset cards are now on https://hf.co/datasets
    • we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

  • Add ability to read-write to SQL databases. by @Dref360 in #4928
    • Read from sqlite file:
    from datasets import Dataset
    dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
    • Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091
    from datasets import Dataset
    from sqlite3 import connect
    con = connect(...)
    dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
  • Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
    • return numpy/torch/tf/jax tensors with:
    from datasets import load_dataset
    ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
    ds[0]["image"]
  • Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
  • Fast dataset iter by @mariosasko in #5030
    • speed up by a factor of 2 using the Arrow Table reader
  • Dataset infos in yaml by @lhoestq in #4926
  • Add kwargs to Dataset.from_generator by @mariosasko in #5049
  • Support converters in CsvBuilder by @mariosasko in #5057
  • Restore saved format state in load_from_disk by @asofiaoliveira in #5073

Dataset changes

Dataset cards

General improvements and bug fixes

New Contributors

Full Changelog: 2.5.1...2.6.0

2.5.2

05 Oct 10:17

Bug fixes

  • Revert task removal in folder-based builders (#5051)
  • Support hfh 0.10 implicit auth (#5031)

Full Changelog: 2.5.1...2.5.2

2.5.1

21 Sep 15:17

Bug fixes

Full Changelog: 2.5.0...2.5.1