Releases · huggingface/datasets

22 Feb 12:58

lhoestq

2.10.0

cac733f

2.10.0

Important

Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
- Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
Skip dataset verifications by default by @mariosasko in #5303
- introduces multiple `verification_mode` values you can pass to `load_dataset()`
- the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

Single TQDM bar in multi-proc map by @mariosasko in #5455
- No more stacked TQDM bars when calling .map() in multiprocessing
Map-style Dataset to IterableDataset by @lhoestq in #5410
- introduces .to_iterable_dataset() to get an IterableDataset from a Dataset
- see all the advantages of IterableDataset in the documentation about the differences between Dataset and IterableDataset
Select columns of Dataset or DatasetDict by @daskol in #5480
- introduces .select_columns() to return a dataset only containing the requested columns
Added functionality: sort datasets by multiple keys by @MichlF in #5502
- introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
Add JAX device selection when formatting by @alvarobartt in #5547
- introduces ds = ds.with_format("jax", device=device)
Reload features from Parquet metadata by @MFreidank in #5516
Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

Add section in tutorial for IterableDataset by @stevhliu in #5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
Tutorial for creating a dataset by @stevhliu in #5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
Add JAX-formatting documentation by @alvarobartt in #5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax

General improvements and bug fixes

Pin sqlalchemy by @lhoestq in #5476
Update dataset card creation by @stevhliu in #5470
Add num_test_batches option by @amyeroberts in #5471
Tip for recomputing metadata by @stevhliu in #5478
Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
[MINOR] Typo by @cakiki in #5491
Pin dill lower version by @albertvillanova in #5489
Improved error message for gated/private repos by @osanseviero in #5497
Update docs for nyu_depth_v2 dataset by @awsaf49 in #5484
don't zero copy timestamps by @dwyatte in #5504
Remove unused load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in #5493
Do not add index column by default when exporting to CSV by @albertvillanova in #5490
Fix bug when casting empty array to class labels by @marioga in #5521
Fix benchmarks CI - pin protobuf by @lhoestq in #5527
Remove py.typed by @mariosasko in #5518
Add missing license in NumpyFormatter by @alvarobartt in #5530
Unify load_from_cache_file type and logic by @HallerPatrick in #5515
Format code with ruff by @mariosasko in #5519
Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
Resolve four broken refs in the docs by @tomaarsen in #5550
Use default audio resampling type by @lhoestq in #5556
- resampy is no longer needed to resample audio data
improved message error row formatting by @Plutone11011 in #5553
Make tiktoken tokenizers hashable by @mariosasko in #5552
Suggest scikit-learn instead of sklearn by @osbm in #5551
Add filter desc by @lhoestq in #5557
Fix map suffix_template by @lhoestq in #5559
Ensure last tqdm update in map by @mariosasko in #5560

New Contributors

@amyeroberts made their first contribution in #5471
@awsaf49 made their first contribution in #5484
@dwyatte made their first contribution in #5504
@marioga made their first contribution in #5521
@MFreidank made their first contribution in #5516
@daskol made their first contribution in #5480
@Plutone11011 made their first contribution in #5553
@osbm made their first contribution in #5551
@MichlF made their first contribution in #5502

Full Changelog: 2.9.0...ef

Contributors

dwyatte, cakiki, and 17 other contributors

26 Jan 19:33

lhoestq

2.9.0

b5672a9

2.9.0

Datasets Features

Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377
- Pass num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing

Distributed support by @lhoestq in #5369

- Split your dataset for each node for distributed training
- It supports both Dataset and IterableDataset (e.g. in streaming mode)
- See the documentation for more details

```
import os
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400
Tqdm progress bar for to_parquet by @zanussbaum in #5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379
Support other formats than uint8 for image arrays by @vigsterkr in #5365

Documentation

Depth estimation dataset guide by @sayakpaul in #5379
- see https://huggingface.co/docs/datasets/main/en/depth_estimation
Imagefolder docs: mention support of CSV and ZIP by @lhoestq in #5463
- see https://huggingface.co/docs/datasets/main/en/image_load#imagefolder
Update docs of S3 filesystem with async aiobotocore by @maheshpec in #5411
- see https://huggingface.co/docs/datasets/main/en/filesystems#amazon-s3

General improvements and bug fixes

Raise error if ClassLabel names is not python list by @freddyheppell in #5359
Temporarily pin pydantic test dependency by @albertvillanova in #5395
Unpin pydantic test dependency by @albertvillanova in #5397
Replace one letter import in docs by @MKhalusova in #5403
Fix Colab notebook link by @albertvillanova in #5392
Fix fs.open resource leaks by @tkukurin in #5358
Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in #5409
Fix streaming pandas.read_excel by @albertvillanova in #5372
ci: 🎡 remove two obsolete issue templates by @severo in #5420
Handle 0-dim tensors in cast_to_python_objects by @mariosasko in #5384
Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in #5429
Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in #5432
Revert container image pin in CI benchmarks by @0x2b3bfa0 in #5436
Finish deprecating the fs argument by @dconathan in #5393
Update actions/checkout in CD Conda release by @albertvillanova in #5438
Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in #5416
Fix documentation about batch samplers by @thomasw21 in #5440
Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in #5447
Support fsspec 2023.1.0 in CI by @albertvillanova in #5449
Update share tutorial by @stevhliu in #5443
Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in #5452
Fix base directory while extracting insecure TAR files by @albertvillanova in #5453
Fix link in load_dataset docstring by @mariosasko in #5389
Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in #5460
Concatenate on axis=1 with misaligned blocks by @lhoestq in #5462
Raise from disconnect error in xopen by @lhoestq in #5382
remove pathlib.Path with URIs by @jonny-cyberhaven in #5466
Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in #5469

New Contributors

@freddyheppell made their first contribution in #5359
@MKhalusova made their first contribution in #5403
@tkukurin made their first contribution in #5358
@0x2b3bfa0 made their first contribution in #5436
@maheshpec made their first contribution in #5411
@dconathan made their first contribution in #5393
@zanussbaum made their first contribution in #5456
@jonny-cyberhaven made their first contribution in #5466

Full Changelog: 2.8.0...2.9.0

Contributors

vigsterkr, tkukurin, and 17 other contributors

19 Dec 10:55

lhoestq

2.8.0

037c9b5

2.8.0

Important

Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
- From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
- Datasets in streaming mode now update their features after column renaming or removal
Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
- Use multiprocessing to load multiple files in parallel
Add features param to IterableDataset.map by @alvarobartt in #5311
Sharded save_to_disk + multiprocessing by @lhoestq in #5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset instead of downloading it entirely
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

Docs

Complete doc migration by @mishig25 in #5248

General improvements and bug fixes

typo by @WrRan in #5253
typo by @WrRan in #5254
remove an unused statement by @WrRan in #5257
fix wrong print by @WrRan in #5256
Fix max_shard_size docs by @lhoestq in #5267
Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
Change release procedure to use only pull requests by @albertvillanova in #5250
Warn about checksums by @lhoestq in #5279
Tweak readme by @lhoestq in #5210
Save file name in embed_storage by @lhoestq in #5285
Use correct dataset type in from_generator docs by @mariosasko in #5307
Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
Fix xjoin for Windows pathnames by @albertvillanova in #5297
Fix xopen for Windows pathnames by @albertvillanova in #5299
Ci py3.10 by @lhoestq in #5065
Update Overview.ipynb google colab by @lhoestq in #5211
Support xPath for Windows pathnames by @albertvillanova in #5310
Fix description of streaming in the docs by @polinaeterna in #5313
Fix Text sample_by paragraph by @albertvillanova in #5319
[Extract] Place the lock file next to the destination directory by @lhoestq in #5320
Fix loading from HF GCP cache by @lhoestq in #5321
- This was affecting datasets like wikipedia or natural_questions
Fix docs building for main by @albertvillanova in #5328
Origin/fix missing features error by @eunseojo in #5318
fix: 🐛 pass the token to get the list of config names by @severo in #5333
Clarify imagefolder is for small datasets by @stevhliu in #5329
Close stream in ArrowWriter.finalize before inference error by @mariosasko in #5309
Use same num_proc for dataset download and generation by @mariosasko in #5300
Set IterableDataset.map param batch_size typing as optional by @alvarobartt in #5336
fix: dataset path should be absolute by @vigsterkr in #5234
Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
Clean up docstrings by @stevhliu in #5334
Remove tasks.json by @lhoestq in #5341
Support topdown parameter in xwalk by @mariosasko in #5308
Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in #5302
Clean up Loading methods docstrings by @stevhliu in #5350
Clean up remaining Main Classes docstrings by @stevhliu in #5349
Clean up Dataset and DatasetDict by @stevhliu in #5344
Clean up Table class docstrings by @stevhliu in #5355
Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in #5322
Clean filesystem and logging docstrings by @stevhliu in #5356
ExamplesIterable fixes by @lhoestq in #5366
Simplify skipping by @Muennighoff in #5373
Release: 2.8.0 by @lhoestq in #5375

New Contributors

@WrRan made their first contribution in #5253
@eunseojo made their first contribution in #5318
@vigsterkr made their first contribution in #5234
@Muennighoff made their first contribution in #5373

Full Changelog: 2.7.0...2.8.0

Contributors

vigsterkr, severo, and 10 other contributors

22 Nov 17:27

albertvillanova

2.7.1

5ef1ab1

2.7.1

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in #5277

Full Changelog: 2.7.0...2.7.1

Contributors

albertvillanova

22 Nov 17:49

albertvillanova

2.6.2

a6a5a1c

2.6.2

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in #5277

Full Changelog: 2.6.1...2.6.2

Contributors

albertvillanova

16 Nov 10:11

albertvillanova

2.7.0

edf1902

2.7.0

Dataset Features

Multiprocessed dataset builder by @TevenLeScao in #5107
- Load big datasets faster than before using multiprocessing:
```
from datasets import load_dataset
ds = load_dataset("imagenet-1k", num_proc=4)
```
Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
- Function passed to map or filter that uses tensors or pipelines can now be cached
Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
TextConfig: added "errors" by @NightMachinery in #5155

Audio setup

Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167

Docs

Update create image dataset docs by @stevhliu in #5177
add: segmentation guide. by @sayakpaul in #5188
Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
Add SQL guide by @stevhliu in #5223

General improvements and bug fixes

Add pyproject.toml for black by @mariosasko in #5125
Fix tqdm zip bug by @david1542 in #5120
Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
[TYPO] Update new_dataset_script.py by @cakiki in #5119
Avoid extra cast in class_encode_column by @mariosasko in #5130
Use yaml for issue templates + revamp by @mariosasko in #5116
Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
Delete duplicate issue template file by @albertvillanova in #5146
Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
Raise ImportError instead of OSError by @ayushthe1 in #5141
Fix CI require beam by @albertvillanova in #5168
Make iter_files deterministic by @albertvillanova in #5149
Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
Reduce default max writer_batch_size by @mariosasko in #5163
Support dill 0.3.6 by @albertvillanova in #5166
Make filename matching more robust by @riccardobucco in #5128
Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
Raise ffmpeg warnings only once by @polinaeterna in #5173
Add "ipykernel" to list of co_filenames to remove by @gpucce in #5169
chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
Fix docs about dataset_info in YAML by @albertvillanova in #5194
fsspec lock reset in multiprocessing by @lhoestq in #5159
Add note about the name of a dataset script by @polinaeterna in #5198
Deprecate dummy data generation command by @mariosasko in #5199
Do not sort splits in dataset info by @polinaeterna in #5201
Add missing DownloadConfig.use_auth_token value by @alvarobartt in #5205
Update canonical links to Hub links by @stevhliu in #5203
Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
Update github pr docs actions by @mishig25 in #5214
Use hfh hf_hub_url function by @albertvillanova in #5196
Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
Fix shards in IterableDataset.from_generator by @lhoestq in #5233
Fix class name of symbolic link by @riccardobucco in #5126
Make Version hashable by @mariosasko in #5238
Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
Encode path only for old versions of hfh by @lhoestq in #5237
Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
Support hfh rc version by @lhoestq in #5241
Cleaner error tracebacks for dataset script errors by @mariosasko in #5240

New Contributors

@david1542 made their first contribution in #5120
@ayushthe1 made their first contribution in #5142
@gpucce made their first contribution in #5169
@sayakpaul made their first contribution in #5187
@NightMachinery made their first contribution in #5155

Full Changelog: 2.6.1...2.7.0

Contributors

cakiki, albertvillanova, and 13 other contributors

14 Oct 15:45

lhoestq

2.6.1

1742cf1

2.6.1

Bug fixes

Fix filter indices when batched by @albertvillanova in #5113
- fixed a bug where filter could return examples with the wrong indices
Fix iter_batches by @lhoestq in #5115
- fixed a bug where map with batched=True could return a dataset with fewer examples
Fix a typo in arrow_dataset.py by @yangky11 in #5108

New Contributors

@yangky11 made their first contribution in #5108

Full Changelog: 2.6.0...2.6.1

Contributors

yangky11, albertvillanova, and lhoestq

13 Oct 11:00

lhoestq

2.6.0

dc3f72e

2.6.0

Important

[GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

Add ability to read-write to SQL databases. by @Dref360 in #4928

- Read from sqlite file:
```
from datasets import Dataset
dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```

Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091

```
from datasets import Dataset
from sqlite3 import connect
con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```

Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072

- return numpy/torch/tf/jax tensors with:
```
from datasets import load_dataset
ds = load_dataset("imagenet-1k").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]
```

Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
Add kwargs to Dataset.from_generator by @mariosasko in #5049
Support converters in CsvBuilder by @mariosasko in #5057
Restore saved format state in load_from_disk by @asofiaoliveira in #5073

Dataset changes

Update: hendrycks_test - support streaming by @albertvillanova in #5041
Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
Fix: sbu_captions - fix URLs by @donglixp in #5020
Fix: xcsr - fix string features by @albertvillanova in #5024
Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715

Dataset cards

Add description to hellaswag dataset by @julien-c in #4810
Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
Update languages in aeslc dataset card by @apergo-ai in #3357
Update license to bookcorpus dataset card by @meg-huggingface in #3526
Update paper link in medmcqa dataset card by @monk1337 in #4290
Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054

General improvements and bug fixes

Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
Add some note about running the transformers ci before a release by @lhoestq in #5007
Remove license tag file and validation by @albertvillanova in #5004
Re-apply input columns change by @mariosasko in #5008
patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
Fix typo in error message by @severo in #5027
Fix import in ClassLabel docstring example by @alvarobartt in #5029
Remove redundant code from some dataset module factories by @albertvillanova in #5033
Fix typos in load docstrings and comments by @albertvillanova in #5035
Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
Fix tar extraction vuln by @lhoestq in #5016
Support hfh 0.10 implicit auth by @lhoestq in #5031
Fix flatten_indices with empty indices mapping by @mariosasko in #5043
Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
Revert task removal in folder-based builders by @mariosasko in #5051
Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
Fix typo by @stevhliu in #5059
Fix CI hfh token warning by @albertvillanova in #5062
Mark CI tests as xfail when 502 error by @albertvillanova in #5058
Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
Fix header level in Audio docs by @stevhliu in #5078
Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
Support streaming gzip.open by @albertvillanova in #5066
adding keep in memory by @Mustapha-AJEGHRIR in #5082
refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
Fix filter with empty indices by @Mouhanedg56 in #5087
Fix tutorial (#5093) by @riccardobucco in #5095
Use HTML relative paths for tiles in the docs by @lewtun in #5092
Fix loading how to guide (#5102) by @riccardobucco in #5104
url encode hub url (#5099) by @riccardobucco in #5103
Free the "hf" filesystem protocol for hffs by @lhoestq in #5101
Fix task template reload from dict by @lhoestq in #5106