Releases: huggingface/datasets
2.10.0
Important
- Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
  - Big improvements on the speed of `.flatten_indices()` (x2) + `save/load_from_disk` (x100) on selected/shuffled datasets
- Skip dataset verifications by default by @mariosasko in #5303
  - introduces multiple `verification_mode` values you can pass to `load_dataset()`:
    - the new default verification steps are much faster (no need to compute expensive checksums)
Datasets features
- Single TQDM bar in multi-proc map by @mariosasko in #5455
  - No more stacked TQDM bars when calling `.map()` in multiprocessing
- Map-style Dataset to IterableDataset by @lhoestq in #5410
  - introduces `.to_iterable_dataset()` to get an `IterableDataset` from a `Dataset`
  - see all the advantages of `IterableDataset` in the documentation about the differences between Dataset and IterableDataset
- Select columns of Dataset or DatasetDict by @daskol in #5480
  - introduces `.select_columns()` to return a dataset only containing the requested columns
- Added functionality: sort datasets by multiple keys by @MichlF in #5502
  - introduces `ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])`
- Add JAX device selection when formatting by @alvarobartt in #5547
  - introduces `ds = ds.with_format("jax", device=device)`
- Reload features from Parquet metadata by @MFreidank in #5516
- Speed up batched PyTorch DataLoader by @lhoestq in #5512
Documentation
- Add section in tutorial for IterableDataset by @stevhliu in #5485
- Tutorial for creating a dataset by @stevhliu in #5540
- Add JAX-formatting documentation by @alvarobartt in #5535
General improvements and bug fixes
- Pin sqlalchemy by @lhoestq in #5476
- Update dataset card creation by @stevhliu in #5470
- Add num_test_batches option by @amyeroberts in #5471
- Tip for recomputing metadata by @stevhliu in #5478
- Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
- [MINOR] Typo by @cakiki in #5491
- Pin dill lower version by @albertvillanova in #5489
- Improved error message for gated/private repos by @osanseviero in #5497
- Update docs for `nyu_depth_v2` dataset by @awsaf49 in #5484
- don't zero copy timestamps by @dwyatte in #5504
- Remove unused `load_from_cache_file` arg from `Dataset.shard()` docstring by @polinaeterna in #5493
- Do not add index column by default when exporting to CSV by @albertvillanova in #5490
- Fix bug when casting empty array to class labels by @marioga in #5521
- Fix benchmarks CI - pin protobuf by @lhoestq in #5527
- Remove py.typed by @mariosasko in #5518
- Add missing license in `NumpyFormatter` by @alvarobartt in #5530
- Unify `load_from_cache_file` type and logic by @HallerPatrick in #5515
- Format code with `ruff` by @mariosasko in #5519
- Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
- Resolve four broken refs in the docs by @tomaarsen in #5550
- Use default audio resampling type by @lhoestq in #5556
  - resampy is no longer needed to resample audio data
- improved message error row formatting by @Plutone11011 in #5553
- Make tiktoken tokenizers hashable by @mariosasko in #5552
- Suggest scikit-learn instead of sklearn by @osbm in #5551
- Add filter desc by @lhoestq in #5557
- Fix map suffix_template by @lhoestq in #5559
- Ensure last tqdm update in map by @mariosasko in #5560
New Contributors
- @amyeroberts made their first contribution in #5471
- @awsaf49 made their first contribution in #5484
- @dwyatte made their first contribution in #5504
- @marioga made their first contribution in #5521
- @MFreidank made their first contribution in #5516
- @daskol made their first contribution in #5480
- @Plutone11011 made their first contribution in #5553
- @osbm made their first contribution in #5551
- @MichlF made their first contribution in #5502
Full Changelog: 2.9.0...ef
2.9.0
Datasets Features
- Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377
  - Pass `num_workers=` to `.to_tf_dataset()` to make your dataset faster with multiprocessing
- Distributed support by @lhoestq in #5369
  - Split your dataset for each node for distributed training
  - It supports both `Dataset` and `IterableDataset` (e.g. in streaming mode)
  - See the documentation for more details

        import os
        from datasets.distributed import split_dataset_by_node

        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
- Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400
- Tqdm progress bar for `to_parquet` by @zanussbaum in #5456
- ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379
- Support other formats than uint8 for image arrays by @vigsterkr in #5365
Documentation
- Depth estimation dataset guide by @sayakpaul in #5379
- Imagefolder docs: mention support of CSV and ZIP by @lhoestq in #5463
- Update docs of S3 filesystem with async aiobotocore by @maheshpec in #5411
General improvements and bug fixes
- Raise error if ClassLabel names is not python list by @freddyheppell in #5359
- Temporarily pin pydantic test dependency by @albertvillanova in #5395
- Unpin pydantic test dependency by @albertvillanova in #5397
- Replace one letter import in docs by @MKhalusova in #5403
- Fix Colab notebook link by @albertvillanova in #5392
- Fix `fs.open` resource leaks by @tkukurin in #5358
- Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in #5409
- Fix streaming pandas.read_excel by @albertvillanova in #5372
- ci: 🎡 remove two obsolete issue templates by @severo in #5420
- Handle 0-dim tensors in `cast_to_python_objects` by @mariosasko in #5384
- Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in #5429
- Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in #5432
- Revert container image pin in CI benchmarks by @0x2b3bfa0 in #5436
- Finish deprecating the fs argument by @dconathan in #5393
- Update actions/checkout in CD Conda release by @albertvillanova in #5438
- Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in #5416
- Fix documentation about batch samplers by @thomasw21 in #5440
- Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in #5447
- Support fsspec 2023.1.0 in CI by @albertvillanova in #5449
- Update share tutorial by @stevhliu in #5443
- Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in #5452
- Fix base directory while extracting insecure TAR files by @albertvillanova in #5453
- Fix link in `load_dataset` docstring by @mariosasko in #5389
- Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in #5460
- Concatenate on axis=1 with misaligned blocks by @lhoestq in #5462
- Raise from disconnect error in xopen by @lhoestq in #5382
- remove pathlib.Path with URIs by @jonny-cyberhaven in #5466
- Remove deprecated `shard_size` arg from `.push_to_hub()` by @polinaeterna in #5469
New Contributors
- @freddyheppell made their first contribution in #5359
- @MKhalusova made their first contribution in #5403
- @tkukurin made their first contribution in #5358
- @0x2b3bfa0 made their first contribution in #5436
- @maheshpec made their first contribution in #5411
- @dconathan made their first contribution in #5393
- @zanussbaum made their first contribution in #5456
- @jonny-cyberhaven made their first contribution in #5466
Full Changelog: 2.8.0...2.9.0
2.8.0
Important
- Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
  - From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
  - The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to the Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
  - Old versions of `datasets` are not able to reload datasets pushed with this new model, so we encourage everyone to update.
Datasets Features
- Fix methods using `IterableDataset.map` that lead to `features=None` by @alvarobartt in #5287
  - Datasets in streaming mode now update their `features` after column renaming or removal
- Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
  - Use multiprocessing to load multiple files in parallel
- Add `features` param to `IterableDataset.map` by @alvarobartt in #5311
- Sharded save_to_disk + multiprocessing by @lhoestq in #5268
  - Pass `num_shards` or `max_shard_size` to `ds.save_to_disk()` or `ds.push_to_hub()`
  - Pass `num_proc` to use multiprocessing.
- Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
- Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
  - You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:

        from datasets import load_dataset
        from torch.utils.data import DataLoader

        ds = load_dataset("c4", "en", streaming=True, split="train")
        dataloader = DataLoader(ds, batch_size=32, num_workers=4)
Docs
General improvements and bug fixes
- typo by @WrRan in #5253
- typo by @WrRan in #5254
- remove an unused statement by @WrRan in #5257
- fix wrong print by @WrRan in #5256
- Fix `max_shard_size` docs by @lhoestq in #5267
- Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
- Change release procedure to use only pull requests by @albertvillanova in #5250
- Warn about checksums by @lhoestq in #5279
- Tweak readme by @lhoestq in #5210
- Save file name in embed_storage by @lhoestq in #5285
- Use correct dataset type in `from_generator` docs by @mariosasko in #5307
- Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
- Fix xjoin for Windows pathnames by @albertvillanova in #5297
- Fix xopen for Windows pathnames by @albertvillanova in #5299
- Ci py3.10 by @lhoestq in #5065
- Update Overview.ipynb google colab by @lhoestq in #5211
- Support xPath for Windows pathnames by @albertvillanova in #5310
- Fix description of streaming in the docs by @polinaeterna in #5313
- Fix Text sample_by paragraph by @albertvillanova in #5319
- [Extract] Place the lock file next to the destination directory by @lhoestq in #5320
- Fix loading from HF GCP cache by @lhoestq in #5321
  - This was affecting datasets like `wikipedia` or `natural_questions`
- Fix docs building for main by @albertvillanova in #5328
- Origin/fix missing features error by @eunseojo in #5318
- fix: 🐛 pass the token to get the list of config names by @severo in #5333
- Clarify imagefolder is for small datasets by @stevhliu in #5329
- Close stream in `ArrowWriter.finalize` before inference error by @mariosasko in #5309
- Use same `num_proc` for dataset download and generation by @mariosasko in #5300
- Set `IterableDataset.map` param `batch_size` typing as optional by @alvarobartt in #5336
- fix: dataset path should be absolute by @vigsterkr in #5234
- Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
- Clean up docstrings by @stevhliu in #5334
- Remove tasks.json by @lhoestq in #5341
- Support `topdown` parameter in `xwalk` by @mariosasko in #5308
- Improve `use_auth_token` docstring and deprecate `use_auth_token` in `download_and_prepare` by @mariosasko in #5302
- Clean up Loading methods docstrings by @stevhliu in #5350
- Clean up remaining Main Classes docstrings by @stevhliu in #5349
- Clean up Dataset and DatasetDict by @stevhliu in #5344
- Clean up Table class docstrings by @stevhliu in #5355
- Raise error for `.tar` archives in the same way as for `.tar.gz` and `.tgz` in `_get_extraction_protocol` by @polinaeterna in #5322
- Clean filesystem and logging docstrings by @stevhliu in #5356
- ExamplesIterable fixes by @lhoestq in #5366
- Simplify skipping by @Muennighoff in #5373
- Release: 2.8.0 by @lhoestq in #5375
New Contributors
- @WrRan made their first contribution in #5253
- @eunseojo made their first contribution in #5318
- @vigsterkr made their first contribution in #5234
- @Muennighoff made their first contribution in #5373
Full Changelog: 2.7.0...2.8.0
2.7.1
Bug fixes
- Remove YAML integer keys from class_label metadata by @albertvillanova in #5277
Full Changelog: 2.7.0...2.7.1
2.6.2
Bug fixes
- Remove YAML integer keys from class_label metadata by @albertvillanova in #5277
Full Changelog: 2.6.1...2.6.2
2.7.0
Dataset Features
- Multiprocessed dataset builder by @TevenLeScao in #5107
  - Load big datasets faster than before using multiprocessing:

        from datasets import load_dataset

        ds = load_dataset("imagenet-1k", num_proc=4)
- Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
  - Function passed to `map` or `filter` that uses tensors or pipelines can now be cached
- Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
- TextConfig: added "errors" by @NightMachinery in #5155
Audio setup
- Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167
Docs
- Update create image dataset docs by @stevhliu in #5177
- add: segmentation guide. by @sayakpaul in #5188
- Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
- Add SQL guide by @stevhliu in #5223
General improvements and bug fixes
- Add `pyproject.toml` for `black` by @mariosasko in #5125
- Fix `tqdm` zip bug by @david1542 in #5120
- Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
- [TYPO] Update new_dataset_script.py by @cakiki in #5119
- Avoid extra cast in `class_encode_column` by @mariosasko in #5130
- Use yaml for issue templates + revamp by @mariosasko in #5116
- Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
- Delete duplicate issue template file by @albertvillanova in #5146
- Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
- Raise ImportError instead of OSError by @ayushthe1 in #5141
- Fix CI require beam by @albertvillanova in #5168
- Make iter_files deterministic by @albertvillanova in #5149
- Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
- Reduce default max `writer_batch_size` by @mariosasko in #5163
- Support dill 0.3.6 by @albertvillanova in #5166
- Make filename matching more robust by @riccardobucco in #5128
- Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
- Raise ffmpeg warnings only once by @polinaeterna in #5173
- Add "ipykernel" to list of `co_filename`s to remove by @gpucce in #5169
- chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
- Fix docs about dataset_info in YAML by @albertvillanova in #5194
- fsspec lock reset in multiprocessing by @lhoestq in #5159
- Add note about the name of a dataset script by @polinaeterna in #5198
- Deprecate dummy data generation command by @mariosasko in #5199
- Do not sort splits in dataset info by @polinaeterna in #5201
- Add missing `DownloadConfig.use_auth_token` value by @alvarobartt in #5205
- Update canonical links to Hub links by @stevhliu in #5203
- Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
- Update github pr docs actions by @mishig25 in #5214
- Use hfh hf_hub_url function by @albertvillanova in #5196
- Pin `typer` version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
- Fix shards in IterableDataset.from_generator by @lhoestq in #5233
- Fix class name of symbolic link by @riccardobucco in #5126
- Make `Version` hashable by @mariosasko in #5238
- Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
- Encode path only for old versions of hfh by @lhoestq in #5237
- Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
- Support hfh rc version by @lhoestq in #5241
- Cleaner error tracebacks for dataset script errors by @mariosasko in #5240
New Contributors
- @david1542 made their first contribution in #5120
- @ayushthe1 made their first contribution in #5142
- @gpucce made their first contribution in #5169
- @sayakpaul made their first contribution in #5187
- @NightMachinery made their first contribution in #5155
Full Changelog: 2.6.1...2.7.0
2.6.1
Bug fixes
- Fix filter indices when batched by @albertvillanova in #5113
  - fixed a bug where `filter` could return examples with the wrong indices
- Fix iter_batches by @lhoestq in #5115
  - fixed a bug where `map` with `batched=True` could return a dataset with fewer examples
- Fix a typo in arrow_dataset.py by @yangky11 in #5108
New Contributors
Full Changelog: 2.6.0...2.6.1
2.6.0
Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
  - all the dataset scripts and dataset cards are now on https://hf.co/datasets
  - we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
Datasets features
- Add ability to read-write to SQL databases. by @Dref360 in #4928
  - Read from sqlite file:

        from datasets import Dataset

        dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
- Allow connection objects in `from_sql` + small doc improvement by @mariosasko in #5091

        from datasets import Dataset
        from sqlite3 import connect

        con = connect(...)
        dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
  - return numpy/torch/tf/jax tensors with

        from datasets import load_dataset

        ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
        ds[0]["image"]
- Added `IterableDataset.from_generator` by @hamid-vakilzadeh in #5052
- Fast dataset iter by @mariosasko in #5030
  - speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in #4926
  - you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add `kwargs` to `Dataset.from_generator` by @mariosasko in #5049
- Support `converters` in `CsvBuilder` by @mariosasko in #5057
- Restore saved format state in `load_from_disk` by @asofiaoliveira in #5073
Dataset changes
- Update: hendrycks_test - support streaming by @albertvillanova in #5041
- Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
- Fix: sbu_captions - fix URLs by @donglixp in #5020
- Fix: xcsr - fix string features by @albertvillanova in #5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715
Dataset cards
- Add description to hellaswag dataset by @julien-c in #4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
- Update languages in aeslc dataset card by @apergo-ai in #3357
- Update license to bookcorpus dataset card by @meg-huggingface in #3526
- Update paper link in medmcqa dataset card by @monk1337 in #4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054
General improvements and bug fixes
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
- Add some note about running the transformers ci before a release by @lhoestq in #5007
- Remove license tag file and validation by @albertvillanova in #5004
- Re-apply input columns change by @mariosasko in #5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
- Fix typo in error message by @severo in #5027
- Fix import in `ClassLabel` docstring example by @alvarobartt in #5029
- Remove redundant code from some dataset module factories by @albertvillanova in #5033
- Fix typos in load docstrings and comments by @albertvillanova in #5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
- Fix tar extraction vuln by @lhoestq in #5016
- Support hfh 0.10 implicit auth by @lhoestq in #5031
- Fix `flatten_indices` with empty indices mapping by @mariosasko in #5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
- Revert task removal in folder-based builders by @mariosasko in #5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
- Fix typo by @stevhliu in #5059
- Fix CI hfh token warning by @albertvillanova in #5062
- Mark CI tests as xfail when 502 error by @albertvillanova in #5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
- Fix header level in Audio docs by @stevhliu in #5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
- Support streaming gzip.open by @albertvillanova in #5066
- adding keep in memory by @Mustapha-AJEGHRIR in #5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
- Fix filter with empty indices by @Mouhanedg56 in #5087
- Fix tutorial (#5093) by @riccardobucco in #5095
- Use HTML relative paths for tiles in the docs by @lewtun in #5092
- Fix loading how to guide (#5102) by @riccardobucco in #5104
- url encode hub url (#5099) by @riccardobucco in #5103
- Free the "hf" filesystem protocol for `hffs` by @lhoestq in #5101
- Fix task template reload from dict by @lhoestq in #5106
New Contributors
- @Wauplin made their first contribution in #5026
- @donglixp made their first contribution in #5020
- @Timothyxxx made their first contribution in #3715
- @hamid-vakilzadeh made their first contribution in #5052
- @Mustapha-AJEGHRIR made their first contribution in #5082
- @galbwe made their first contribution in #5079
- @rahulXs made their first contribution in #5076
- @Mouhanedg56 made their first contribution in #5087
- @riccardobucco made their first contribution in #5095
- @asofiaoliveira made their first contribution in #5073
Full Changelog: 2.5.1...2.6.0