2.19.0
Dataset Features
- Add Polars compatibility by @psmyth94 in #6531
  - Convert to a Polars dataframe using .to_polars():

        import polars as pl
        from datasets import load_dataset

        ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
        ds.to_polars() \
            .group_by("topic") \
            .agg(pl.len(), pl.first()) \
            .sort("len", descending=True)
  - Use Polars formatting to return Polars objects when accessing a dataset:

        ds = ds.with_format("polars")
        ds[:10].group_by("kind").len()
- Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096
  - Save on HF in any file format:

        ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
        ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
        ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
- Add mode parameter to Image feature by @mariosasko in #6735
  - Set images to be read in a certain mode like "RGB":

        from datasets import Image

        dataset = dataset.cast_column("image", Image(mode="RGB"))
- Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
  - Run this command to open a PR in a script-based dataset to convert it to Parquet:

        datasets-cli convert_to_parquet <dataset_id>
- Add Dataset.take and Dataset.skip by @lhoestq in #6813
  - Same as IterableDataset.take and IterableDataset.skip:

        ds = ds.take(10)  # take only the first 10 examples
General improvements and bug fixes
- Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in #6713
- fix CastError pickling by @lhoestq in #6712
- Expand no-code dataset info with datasets-server info by @mariosasko in #6714
- Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in #6715
- Fix concurrent script loading with force_redownload by @lhoestq in #6718
- get_dataset_default_config_name docstring by @lhoestq in #6723
- Deprecate Beam API and download from HF GCS bucket by @mariosasko in #6474
- Deprecate Pandas builder by @mariosasko in #6730
- Using a registry instead of calling globals for fetching feature types by @psmyth94 in #6727
- Update torch_formatter.py by @VarunNSrivastava in #6402
- Improve default patterns resolution by @mariosasko in #6704
- Transpose images with EXIF Orientation tag by @mariosasko in #6739
- Fix missing download_config in get_data_patterns by @lhoestq in #6742
- Allow null values in dict columns by @mariosasko in #6743
- Fix fsspec tqdm callback by @lhoestq in #6749
- chore(deps): bump fsspec by @shcheklein in #6747
- Fix offline mode with single config by @lhoestq in #6741
- Remove deprecated code by @Wauplin in #6761
- fixing the issue 6755(small typo) by @JINO-ROHIT in #6767
- remove_columns/rename_columns doc fixes by @mariosasko in #6772
- Fix CI by @mariosasko in #6780
- rename datasets-server to dataset-viewer by @severo in #6785
- Install dependencies with uv in CI by @mariosasko in #6779
- Fix cache conflict in _check_legacy_cache2 by @lhoestq in #6792
- Fix typo in docs (upload CLI) by @Wauplin in #6802
- fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in #6799
- #6791 Improve type checking around FAISS by @Dref360 in #6803
- Fix --repo-type order in cli upload docs by @lhoestq in #6804
- Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in #6806
- Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in #6754
- Multithreaded downloads by @lhoestq in #6794
- Remove os.path.relpath in resolve_patterns by @mariosasko in #6815
- Extract data on the fly in packaged builders by @mariosasko in #6784
- add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in #6811
- Support indexable objects in Dataset.__getitem__ by @mariosasko in #6817
- Make convert_to_parquet CLI command create script branch by @albertvillanova in #6809
- Fix parquet export infos by @lhoestq in #6822
New Contributors
- @VarunNSrivastava made their first contribution in #6402
- @shcheklein made their first contribution in #6747
- @JINO-ROHIT made their first contribution in #6767
- @JonasLoos made their first contribution in #6799
- @izhx made their first contribution in #6754
- @Modexus made their first contribution in #6811
Full Changelog: 2.18.0...2.19.0