2.20.0
Important
- Remove default trust_remote_code=True by @lhoestq in #6954
  - datasets with a Python loading script now require passing trust_remote_code=True to be used
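As a minimal sketch of the opt-in described above, a script-based dataset can be loaded by passing the flag explicitly to `load_dataset`. The helper name `load_script_dataset` and the repository id in the usage note are hypothetical and only serve as illustration.

```python
def load_script_dataset(repo_id: str, split: str = "train"):
    """Load a Hub dataset that ships a Python loading script (hypothetical helper)."""
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    # Explicit opt-in: per the note above, script-based datasets now require
    # trust_remote_code=True. Only enable it for repositories whose code you trust.
    return load_dataset(repo_id, split=split, trust_remote_code=True)
```

For example, `load_script_dataset("username/my_script_dataset")` would run that repository's loading script, where the repository id is a placeholder rather than a real dataset.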
Datasets features
- [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in #6658
  - checkpoint and resume an iterable dataset (e.g. when streaming):

        >>> from datasets import Dataset
        >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
        >>> for idx, example in enumerate(iterable_dataset):
        ...     print(example)
        ...     if idx == 2:
        ...         state_dict = iterable_dataset.state_dict()
        ...         print("checkpoint")
        ...         break
        >>> iterable_dataset.load_state_dict(state_dict)
        >>> print("restart from checkpoint")
        >>> for example in iterable_dataset:
        ...     print(example)

    Returns:

        {'a': 0}
        {'a': 1}
        {'a': 2}
        checkpoint
        restart from checkpoint
        {'a': 3}
        {'a': 4}
        {'a': 5}
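To make the checkpoint-and-resume idea concrete, here is a toy, standard-library-only stand-in. It is not the library's implementation: the class name `ResumableShards` and the `shard_idx` / `example_idx` keys are assumptions made for illustration, and the real state layout may differ.

```python
from typing import Any, Dict, Iterator, List, Optional


class ResumableShards:
    """Toy sharded iterable that can checkpoint and restore its position."""

    def __init__(self, shards: List[List[Dict[str, Any]]]) -> None:
        self.shards = shards
        # Position of the next example to yield in the current pass.
        self._position: Dict[str, int] = {"shard_idx": 0, "example_idx": 0}
        # Checkpoint to resume from on the next pass, if any.
        self._resume_from: Optional[Dict[str, int]] = None

    def state_dict(self) -> Dict[str, int]:
        # Copy so that further iteration does not mutate the saved checkpoint.
        return dict(self._position)

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._resume_from = dict(state)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Start from the loaded checkpoint once, otherwise from the beginning.
        start = self._resume_from or {"shard_idx": 0, "example_idx": 0}
        self._resume_from = None
        self._position = pos = dict(start)
        while pos["shard_idx"] < len(self.shards):
            shard = self.shards[pos["shard_idx"]]
            while pos["example_idx"] < len(shard):
                example = shard[pos["example_idx"]]
                # Advance before yielding so a checkpoint taken inside the loop
                # body points at the next unseen example.
                pos["example_idx"] += 1
                yield example
            pos["shard_idx"] += 1
            pos["example_idx"] = 0


# Six examples split across three shards, as in the release-note example.
toy = ResumableShards([[{"a": 0}, {"a": 1}], [{"a": 2}, {"a": 3}], [{"a": 4}, {"a": 5}]])
for idx, example in enumerate(toy):
    if idx == 2:
        checkpoint = toy.state_dict()
        break
toy.load_state_dict(checkpoint)
resumed = [example["a"] for example in toy]
print(resumed)
```

The sketch stores only a shard index and an offset within that shard, so resuming does not have to replay examples that were already consumed.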
General improvements and bug fixes
- Add docs about the CLI by @albertvillanova in #6831
- Remove token arg from CLI examples by @albertvillanova in #6839
- Allow deleting a subset/config from a no-script dataset by @albertvillanova in #6820
- Fix line-endings in tests on Windows by @albertvillanova in #6857
- Fix CI by temporarily pinning huggingface-hub < 0.23.0 by @albertvillanova in #6861
- Fix dataset name for community Hub script-datasets by @albertvillanova in #6855
- Update tqdm >= 4.66.3 to fix vulnerability by @albertvillanova in #6870
- Fix download for dict of dicts of URLs by @albertvillanova in #6871
- Set dev version by @albertvillanova in #6873
- Shorten long logs by @lhoestq in #6875
- Support jax 0.4.27 in CI tests by @albertvillanova in #6885
- Close gzipped files properly by @lhoestq in #6893
- Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in #6902
- Fix YAML error in README files appearing on GitHub by @albertvillanova in #6898
- Document that to_json defaults to JSON Lines by @albertvillanova in #6895
- Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in #6883
- Create function to convert to parquet by @albertvillanova in #6878
- Update features.py to avoid bfloat16 unsupported error by @skaulintel in #6607
- Fix decoding multi part extension by @lhoestq in #6904
- Use pandas ujson in JSON loader to improve performance by @albertvillanova in #6874
- Update requests >=2.32.1 to fix vulnerability by @albertvillanova in #6909
- Fix wrong type hints in data_files by @albertvillanova in #6910
- Remove dead code for non-dict data_files from packaged modules by @albertvillanova in #6911
- Support fsspec 2024.5.0 by @albertvillanova in #6921
- Remove torchaudio remnants from code by @albertvillanova in #6922
- [WebDataset] Add .pth support for torch tensors by @lhoestq in #6920
- Unpin hfh by @lhoestq in #6876
- Preserve JSON column order and support list of strings field by @albertvillanova in #6914
- [WebDataset] Support compressed files by @lhoestq in #6931
- update ci user by @lhoestq in #6933
- Revert ci user by @lhoestq in #6934
- Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in #6925
- Set dev version by @albertvillanova in #6944
- Update yanked version of minimum requests requirement by @albertvillanova in #6945
- Re-enable import sorting disabled by flake8:noqa directive when using ruff linter by @albertvillanova in #6946
- Update dataset_dict.py by @Arunprakash-A in #6932
- Update process.mdx: Code Listings Fixes by @FadyMorris in #6928
- Fix small typo by @marcenacp in #6955
- update docs on N-dim arrays by @lhoestq in #6956
- Fix typos in docs by @albertvillanova in #6957
- Validate config name and data_files in packaged modules by @albertvillanova in #6915
- Add support for categorical/dictionary types by @EthanSteinberg in #6892
- feat(ci): add trufflehog secrets detection by @McPatate in #6960
- Better error handling in dataset_module_factory by @Wauplin in #6959
- Move info_utils errors to exceptions module by @albertvillanova in #6952
- fix(ci): remove unnecessary permissions by @McPatate in #6962
New Contributors
- @skaulintel made their first contribution in #6607
- @Arunprakash-A made their first contribution in #6932
- @FadyMorris made their first contribution in #6928
- @marcenacp made their first contribution in #6955
- @EthanSteinberg made their first contribution in #6892
- @McPatate made their first contribution in #6960
Full Changelog: 2.19.0...2.20.0