Release 2.19.0 · huggingface/datasets

Dataset Features

Add Polars compatibility by @psmyth94 in #6531

Convert to a Polars DataFrame using .to_polars():

```
import polars as pl
from datasets import load_dataset
ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
# group_by is the current Polars spelling; groupby is the deprecated alias
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)
```

Use Polars formatting to return Polars objects when accessing a dataset:
```
ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```

Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in #6096

Save to the Hugging Face Hub in JSON Lines, CSV, or Parquet format:

```
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```

Add mode parameter to Image feature by @mariosasko in #6735
- Set images to be read in a certain mode like "RGB"
```
from datasets import Image
dataset = dataset.cast_column("image", Image(mode="RGB"))
```
Add CLI function to convert script-dataset to Parquet by @albertvillanova in #6795
- Run this command to open a PR on a script-based dataset that converts it to Parquet:
```
datasets-cli convert_to_parquet <dataset_id>
```
Add Dataset.take and Dataset.skip by @lhoestq in #6813
- Same behavior as IterableDataset.take and IterableDataset.skip:
```
ds = ds.take(10)  # take only the first 10 examples
```

General improvements and bug fixes

Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in #6713
fix CastError pickling by @lhoestq in #6712
Expand no-code dataset info with datasets-server info by @mariosasko in #6714
Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in #6715
Fix concurrent script loading with force_redownload by @lhoestq in #6718
get_dataset_default_config_name docstring by @lhoestq in #6723
Deprecate Beam API and download from HF GCS bucket by @mariosasko in #6474
Deprecate Pandas builder by @mariosasko in #6730
Using a registry instead of calling globals for fetching feature types by @psmyth94 in #6727
Update torch_formatter.py by @VarunNSrivastava in #6402
Improve default patterns resolution by @mariosasko in #6704
Transpose images with EXIF Orientation tag by @mariosasko in #6739
Fix missing download_config in get_data_patterns by @lhoestq in #6742
Allow null values in dict columns by @mariosasko in #6743
Fix fsspec tqdm callback by @lhoestq in #6749
chore(deps): bump fsspec by @shcheklein in #6747
Fix offline mode with single config by @lhoestq in #6741
Remove deprecated code by @Wauplin in #6761
fixing the issue 6755(small typo) by @JINO-ROHIT in #6767
remove_columns/rename_columns doc fixes by @mariosasko in #6772
Fix CI by @mariosasko in #6780
rename datasets-server to dataset-viewer by @severo in #6785
Install dependencies with uv in CI by @mariosasko in #6779
Fix cache conflict in _check_legacy_cache2 by @lhoestq in #6792
Fix typo in docs (upload CLI) by @Wauplin in #6802
fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in #6799
#6791 Improve type checking around FAISS by @Dref360 in #6803
Fix --repo-type order in cli upload docs by @lhoestq in #6804
Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in #6806
Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in #6754
Multithreaded downloads by @lhoestq in #6794
Remove os.path.relpath in resolve_patterns by @mariosasko in #6815
Extract data on the fly in packaged builders by @mariosasko in #6784
add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in #6811
Support indexable objects in Dataset.__getitem__ by @mariosasko in #6817
Make convert_to_parquet CLI command create script branch by @albertvillanova in #6809
Fix parquet export infos by @lhoestq in #6822

New Contributors

@VarunNSrivastava made their first contribution in #6402
@shcheklein made their first contribution in #6747
@JINO-ROHIT made their first contribution in #6767
@JonasLoos made their first contribution in #6799
@izhx made their first contribution in #6754
@Modexus made their first contribution in #6811

Full Changelog: 2.18.0...2.19.0
