Releases · huggingface/datasets

15 Feb 16:54

lhoestq

1.3.0

ef633da

Dataset Features

On-the-fly data transforms (#1795)
Add S3 support for downloading and uploading processed datasets (#1723)
Allow loading dataset in-memory (#1792)
Support future datasets (#1813)
Enable/disable caching (#1703)
Offline dataset loading (#1726)

Datasets Hub Features

Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website (https://huggingface.co/new-dataset) or with the huggingface-cli. More information is available in the dataset sharing section of the documentation.
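
As a rough illustration of the loading step, the sketch below builds a repository identifier of the form namespace/name and passes it to load_dataset. The namespace "my-user" and the dataset name "my-dataset" are placeholders, and hub_repo_id and load_user_dataset are hypothetical helpers written for this example, not part of the library.

```python
def hub_repo_id(namespace: str, name: str) -> str:
    """Build the "<namespace>/<name>" identifier of a dataset repository."""
    if not namespace or not name or "/" in namespace or "/" in name:
        raise ValueError("namespace and name must be non-empty and contain no '/'")
    return f"{namespace}/{name}"


def load_user_dataset(namespace: str, name: str):
    """Load a user dataset repository from the Datasets Hub (needs network access)."""
    # Imported lazily so hub_repo_id stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(hub_repo_id(namespace, name))


# Placeholder identifiers: replace them with a repository that actually exists
# before calling load_user_dataset.
print(hub_repo_id("my-user", "my-dataset"))
```

Calling load_user_dataset("my-user", "my-dataset") only succeeds if such a repository exists and the machine is online.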

Dataset Changes

New: LJ Speech (#1878)
New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
New: cord 19 (#1850)
New: Tweet Eval Dataset (#1829)
New: CIFAR-100 Dataset (#1812)
New: SICK (#1804)
New: BBC Hindi NLI Dataset (#1158)
New: Freebase QA Dataset (#1814)
New: Arabic sarcasm (#1798)
New: Semantic Scholar Open Research Corpus (#1606)
New: DuoRC Dataset (#1800)
New: Aggregated dataset for the GEM benchmark (#1807)
New: CC-News dataset of English language articles (#1323)
New: irc disentangle (#1586)
New: Narrative QA Manual (#1778)
New: Universal Morphologies (#1174)
New: SILICONE (#1761)
New: Librispeech ASR (#1767)
New: OSCAR (#1694, #1868, #1833)
New: CANER Corpus (#1684)
New: Arabic Speech Corpus (#1852)
New: id_liputan6 (#1740)
New: Structured Argument Extraction for Korean dataset (#1748)
New: TurkCorpus (#1732)
New: Hatexplain Dataset (#1716)
New: adversarialQA (#1714)
Update: Doc2dial - reading comprehension update to latest version (#1816)
Update: OPUS Open Subtitles - add with metadata information (#1865)
Update: SWDA - use all metadata features (#1799)
Update: SWDA - add metadata and correct splits (#1749)
Update: CommonGen - update citation information (#1787)
Update: SciFact - update URL (#1780)
Update: BrWaC - update features name (#1736)
Update: TLC - update urls to be github links (#1737)
Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
Fix: multi_woz_v22 - fix checksums (#1880)
Fix: limit - fix url (#1861)
Fix: WebNLG - fix test set + more fields (#1739)
Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
Fix: reuters - add missing "brief" entries (#1744)
Fix: thainer: empty token bug (#1734)
Fix: lst20: empty token bug (#1734)

Metrics Changes

New: Word Error Metric (#1847)
New: COMET (#1577, #1753)
Fix: bert_score - set version dependency (#1851)

Metric Docs

Add metrics usage examples and tests (#1820)

CLI Changes

[BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI

Bug fixes

fix writing GPU Faiss index (#1862)
update pyarrow import warning (#1782)
Ignore definition line number of functions for caching (#1779)
update saving and loading methods for faiss index to accept path-like objects (#1663)
Print error message with filename when malformed CSV (#1826)
Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)

Refactoring

Refactoring: Create config module (#1848)
Use a config id in the cache directory names for custom configs (#1754)

Logging

Enable logging propagation and remove logging handler (#1845)

13 Jan 15:29

lhoestq

1.2.1

a59580b

New Features

Fast start up (#1690): Importing datasets is now significantly faster.

Datasets Changes

New: MNIST (#1730)
New: Korean intonation-aided intention identification dataset (#1715)
New: Switchboard Dialog Act Corpus (#1678)
Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
Update: Scientific papers - Mirror datasets zip (#1721)
Update: Update DBRD dataset card and download URL (#1699)
Fix: Thainer - fix ner_tag bugs (#1695)
Fix: reuters21578 - metadata parsing errors (#1693)
Fix: ade_corpus_v2 - fix config names (#1689)
Fix: DaNE - fix last example (#1688)

Datasets tagging

rename "part-of-speech-tagging" tag in some dataset cards (#1645)

Bug Fixes

Fix column list comparison in transmit format (#1719)
Fix windows path scheme in cached path (#1711)

Docs

Add information about caching and verifications in "Load a Dataset" docs (#1705)

Moreover, many dataset cards of datasets added during the sprint were updated! Thanks to all the contributors :)

04 Jan 18:38

lhoestq

1.2.0

dae6880

Intermediate release before v2.0.0
Includes all the datasets added during the datasets sprint of December 2020 (currently over 610 datasets).

19 Nov 18:33

lhoestq

1.1.3

000b584

Datasets changes

New: NLI-Tr (#787)
New: Amazon Reviews (#791)(#844)(#845)(#799)
New: ASNQ - answer sentence selection (#780)
New: OpenBookCorpus (#856)
New: ASLG-PC12 - sign language translation (#731)
New: Quail - question answering dataset (#747)
Update: SNLI: Created dataset card snli.md (#663)
Update: csv - Use pandas reader in csv (#857)
- Better memory management
- Breaking: the previous read_options, parse_options and convert_options are replaced with plain parameters like pandas.read_csv
Update: conll2000, conll2003, germeval_14, wnut_17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
- Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
Update: XNLI - Add XNLI train set (#781)
Update: XSUM - Use full released xsum dataset (#754)
Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
Update: CLUE - add OCNLI, a new CLUE dataset (#742)
Fix: KOR-NLI - Fix csv reader (#855)
Fix: Discofuse - fix discofuse urls (#793)
Fix: Emotion - fix description (#745)
Fix: TREC - update urls (#740)

Metrics changes

New: accuracy, precision, recall and F1 metrics (#825)
Fix: squad_v2 (#840)
Fix: seqeval (#810)(#738)
Fix: Rouge - fix description (#774)
Fix: GLUE - fix description (#734)
Fix: BertScore - fix custom baseline (#763)

Command line tools

add clear_cache parameter in the test command (#863)

Dependencies

Integrate file_lock inside the lib for better logging control (#859)

Dataset features

Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
pretty print dataset objects (#725)
allow custom split names in text dataset (#776)

Tests

All configs is a slow test now

Bug fixes

Make save function use deterministic global vars order (#819)
fix type hints pickling in python 3.6 (#818)
fix metric deletion when attributes are missing (#782)
Fix custom builder caching (#770)
Fix metric with cache dir (#772)
Fix train_test_split output format (#719)

06 Oct 14:22

lhoestq

1.1.2

2256521

Dataset changes

Fix: text - use python read instead of pandas reader (#715):
- fix delimiter/overflow issues
- better memory handling

Bug fixes

Fix dataset configuration creation using data_files per splits using NamedSplit (#706)
Fix permission issue on windows - don't use tqdm 4.50.0 (#718)

02 Oct 13:12

lhoestq

1.1.0

fe52b67

1.1.0: Windows support, Better Multiprocessing, New Datasets

Windows support

Add Windows support (#644):
- add tests and CI for Windows
- fix numerous windows specific issues
- The library now fully supports Windows

Dataset changes

New: HotpotQA (#703)
New: OpenWebText (#660)
New: Winogrande - add debiased subset (#655)
Update: XNLI - update download link (#695)
Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
Update: csv - add features parameter to CSV (#685)
Fix: GAP - fix wrong computation of boolean features (#680)
Fix: C4 - fix manual instruction function (#681)

Metric changes

Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
Fix: SQuAD - fix kwargs description (#670)

Dataset Features

Use multiprocess from pathos for multiprocessing (#656):
- allow lambda functions in multiprocessed map
- allow local functions in multiprocessed map
- and more, as long as the functions are compatible with dill
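
To see why this matters, the sketch below checks what the standard pickle module, which the standard multiprocessing module relies on to send functions to worker processes, can serialize. It does not use multiprocess or dill; can_pickle and make_scaler are hypothetical helpers written for this example.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_scaler(factor):
    # Returns a local (nested) function that closes over `factor`.
    def scale(x):
        return x * factor

    return scale


double = lambda x: 2 * x

print(can_pickle(len))             # built-in function, pickled by reference
print(can_pickle(double))          # lambda function
print(can_pickle(make_scaler(3)))  # local function
```

The built-in is serializable while the lambda and the local function are not, which is the limitation a dill-based serializer removes.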

Bug fixes

Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
Datasets: fix cast with unordered features - fix column order issue in cast (#684)
Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
Datasets: fix empty dictionaries after multiprocessing - keep new columns in transmit format (#659)
Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
Metrics: fix compute metric with empty input - pass metric features to the reader (#654)

Documentation

Elasticsearch integration documentation (#696)

Tests

Use GitHub instead of AWS in remote dataset tests (#694)

21 Sep 08:45

lhoestq

1.0.2

af7cd94

Dataset changes:

New: CoNLL-2003 (#613)
New: CoNLL-2000 (#634)
New: MATINF (ACL 2020) (#637)
New: Polyglot-NER (#641)
Update: GLUE - update GLUE urls (now hosted on FB) (#626)
Update: GLUE/qqp - update download checksum (#639)
Update: MLQA - feature names update (#627)
Update: LinCE - update feature names - Consistent ner features (#636)
Update: WNUT 17: update feature names - Consistent ner features (#642)
Update: XTREME/PAN-X - update feature names - Consistent ner features (#636)
Update: RACE - update dataset checksum + add new configurations (#540)
Fix: text - fix delimiter (#631)
Fix: Wiki DPR - fix download error in wiki_dpr (f38a871)

Logging:

Set level to warning (previously info) (#635)

Bug fixes:

make shuffle compatible with temp_seed (#640)
don't use take on dataset table (offset overflow error) (#645)
handle connection error when downloading from HF google storage (#652)

11 Sep 10:35

lhoestq

1.0.1

7c9d2b5

Fix:

add multiprocessing to dataset dict (#612)

11 Sep 10:19

lhoestq

1.0.0

322ba0e

1.0.0 Release: New name, Speed-ups, Multimodal, Serialization

Package Changes

Rename: nlp -> datasets

Update now with

pip install datasets

Dataset Features

Keep the dataset format after dataset transforms (#607)
Pickle support (#536)
Save and load datasets to/from disk (#571)
Multiprocessing in map and filter (#552)
Multi-dimensional arrays support for multi-modal datasets (#533, #363)
Speed up Tokenization by optimizing casting to python objects (#523)
Speed up shuffle/shard/select methods - use indices mappings (#513)
Add input_column parameter in map and filter (#475)
Speed up download and processing (#563)
Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
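
The idea behind the indices mappings used to speed up shuffle/shard/select can be sketched in plain Python: instead of copying rows, each operation only builds a new list of row positions over the same underlying data. IndexedView is a hypothetical class written for this illustration; it is not the library's implementation, which differs in its details.

```python
import random


class IndexedView:
    """A read-only view over `rows` that reorders or subsets them via an indices list."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared between views, never copied
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]

    def select(self, positions):
        # Compose the requested positions with the current mapping.
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shuffle(self, seed):
        permuted = self._indices[:]
        random.Random(seed).shuffle(permuted)
        return IndexedView(self._rows, permuted)

    def shard(self, num_shards, index):
        # Keep every num_shards-th row, starting at position `index`.
        return IndexedView(self._rows, self._indices[index::num_shards])


rows = ["a", "b", "c", "d", "e", "f"]
view = IndexedView(rows)
subset = view.select([5, 0, 2])
print([subset[i] for i in range(len(subset))])  # ['f', 'a', 'c']
```

Each operation costs time proportional to the number of indices rather than to the size of the rows, and chained operations keep pointing at the same data.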

Dataset Changes

New: IWSLT 2017 (#470)
New: CommonGen Dataset (#578)
New: CLUE Benchmark (11 datasets) (#572)
New: the KILT knowledge source and tasks (#559)
New: DailyDialog (#556)
New: DoQA dataset (ACL 2020) (#473)
New: reuters21578 (#570)
New: HANS (#551)
New: MLSUM (#529)
New: Guardian authorship (#452)
New: web_questions (#401)
New: MS MARCO (#364)
Update: Germeval14 - update download url (#594)
Update: LinCE - update download url (#550)
Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
Update: Rotten Tomatoes - update download url (#484)
Update: Wiki DPR - Use HNSW faiss index (#500)
Update: Text - Speed up using multi-threaded PyArrow loading (#548)
Fix: GLUE, PAWS-X - skip header (#497)

[Breaking] Update Dataset and DatasetDict API (#459)

Rename the flatten, drop and dictionary_encode_column methods to flatten_, drop_ and dictionary_encode_column_ to indicate that these methods have in-place effects
Remove the dataset.columns property and dataset.nbytes
Add a few more properties and methods to DatasetDict

Metric Features

Disallow the use of positional arguments to avoid predictions vs references mistakes (#466)
Allow to directly feed numpy/pytorch/tensorflow/pandas objects in metrics (#466)
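
The motivation for disallowing positional arguments can be shown with a keyword-only function in plain Python. compute_accuracy is a hypothetical toy metric written for this example, not one of the library's metrics; it only illustrates how forcing keywords prevents predictions and references from being swapped silently.

```python
def compute_accuracy(*, predictions, references):
    """Toy metric: fraction of predictions equal to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


print(compute_accuracy(predictions=[1, 0, 1, 1], references=[1, 1, 1, 1]))

try:
    compute_accuracy([1, 0], [1, 1])  # positional call is rejected
except TypeError as err:
    print("rejected:", err)
```

Accuracy happens to be symmetric in its two inputs, but metrics such as precision or recall are not, so an accidental swap would change the result without raising an error.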

Metric Changes

New: METEOR metric (#479)
Fix: Sacrebleu - fix inputs format (#520)

Loading script Features

Pin the version of the scripts (reproducibility) (#603, #584)
Specify default script_version with the env variable HF_SCRIPTS_VERSION (#584)
Save scripts in a modules cache directory that can be controlled with HF_MODULES_CACHE (#574)

Caching

Better support for tokenizers when caching map results (#601)
Faster caching for text dataset (#573, #502)
Use dataset fingerprints, updated after each transform (#536)
Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
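
A minimal sketch of the fingerprint idea, under the assumption that a fingerprint is a hash chained over the history of transforms: the new value is derived from the previous fingerprint plus a description of the transform, so identical histories produce identical fingerprints and a cached result can be looked up by that value. update_fingerprint and the choice of SHA-256 over JSON are assumptions made for this example; the library's own hashing is more involved.

```python
import hashlib
import json


def update_fingerprint(previous: str, transform_name: str, params: dict) -> str:
    """Derive a new fingerprint from the previous one and a description of the transform."""
    payload = json.dumps([previous, transform_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# "example.csv" is a placeholder path; nothing is read from disk here.
initial = update_fingerprint("", "load", {"path": "example.csv"})
shuffled = update_fingerprint(initial, "shuffle", {"seed": 42})
shuffled_again = update_fingerprint(initial, "shuffle", {"seed": 42})
other_seed = update_fingerprint(initial, "shuffle", {"seed": 7})

print(shuffled == shuffled_again)  # same history, so the cached result could be reused
print(shuffled == other_seed)      # different parameters, so a different fingerprint
```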

Documentation

Metrics documentation (#579)

Miscellaneous

Add centralized logging - Bump-up cache loads to warnings (#538)

Bug fixes

Datasets: [Breaking] fixed typo in "formated_as" method: rename formated to formatted (#516)
Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
Datasets: fixed select method for pyarrow < 1.0.0 (#585)
Datasets: fixed elasticsearch result ids returning as strings (#487)
Datasets: fixed config used for slow test on real dataset (#527)
Datasets: fixed tensorflow-formatted datasets outputs by using ragged tensor by default (#530)
Datasets: fixed batched map for formatted dataset (#515)
Datasets: fixed encodings issues on Windows - apply utf-8 encoding to all datasets (#481)
Datasets: fixed dataset.map for function without outputs (#506)
Datasets: fixed bad type in overflow check (#496)
Datasets: fixed dataset info save - don't use beam fs to save info for local cache dir (#498)
Datasets: fixed arrays outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)

11 Aug 09:20

lhoestq

0.4.0

21e8091

Datasets Features

add from_pandas and from_dict
add shard method
add rename/remove/cast columns methods
faster select method
add concatenate datasets
add support for taking samples using numpy arrays
add export to TFRecords
add features parameter when loading from text/json/pandas/csv or when using the map transform
add support for nested features for json
add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, in order to be able to query passages for Open Domain QA for example
add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return examples ids
- add save_faiss_index/load_faiss_index to save/load a serialized faiss index

Datasets changes

new: PG19
new: ANLI
new: WikiSQL
new: qa_zre
new: MWSC
new: AG news
new: SQuADShifts
new: doc red
new: Wiki DPR
new: fever
new: hyperpartisan news detection
new: pandas
new: text
new: emotion
new: quora
new: BioMRC
new: web questions
new: search QA
new: LinCE
new: TREC
new: Style Change Detection
new: 20newsgroup
new: social bias frames
new: Emo
new: web of science
new: sogou news
new: crd3
update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
update: xtreme - add PAWS-X.es
update: xsum - manual download is no longer required.
new processed: Natural Questions

Metrics Features

add seed parameter for metrics that do sampling, like rouge
better installation messages

Metrics changes

new: bleurt
update seqeval: fix entities extraction

Bug fixes

fix bug in map and select that was causing memory issues
fix pyarrow version check
fix text/json/pandas/csv caching when loading different files in a row
fix metrics caching when they have different config names
fix cache that was not discarded when there's a KeyboardInterrupt during .map
fix sacrebleu tokenizer's parameter
fix docstrings of metrics when multiple instances are created

More Tests

add tests for features handling in dataset transforms
add tests for dataset builders
add tests for metrics loading

Backward compatibility

because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json
