Releases: huggingface/datasets

1.3.0

15 Feb 16:54

Dataset Features

  • On-the-fly data transforms (#1795)
  • Add S3 support for downloading and uploading processed datasets (#1723)
  • Allow loading dataset in-memory (#1792)
  • Support future datasets (#1813)
  • Enable/disable caching (#1703)
  • Offline dataset loading (#1726)

Datasets Hub Features

Dataset Changes

  • New: LJ Speech (#1878)
  • New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
  • New: cord 19 (#1850)
  • New: Tweet Eval Dataset (#1829)
  • New: CIFAR-100 Dataset (#1812)
  • New: SICK (#1804)
  • New: BBC Hindi NLI Dataset (#1158)
  • New: Freebase QA Dataset (#1814)
  • New: Arabic sarcasm (#1798)
  • New: Semantic Scholar Open Research Corpus (#1606)
  • New: DuoRC Dataset (#1800)
  • New: Aggregated dataset for the GEM benchmark (#1807)
  • New: CC-News dataset of English language articles (#1323)
  • New: irc disentangle (#1586)
  • New: Narrative QA Manual (#1778)
  • New: Universal Morphologies (#1174)
  • New: SILICONE (#1761)
  • New: Librispeech ASR (#1767)
  • New: OSCAR (#1694, #1868, #1833)
  • New: CANER Corpus (#1684)
  • New: Arabic Speech Corpus (#1852)
  • New: id_liputan6 (#1740)
  • New: Structured Argument Extraction for Korean dataset (#1748)
  • New: TurkCorpus (#1732)
  • New: Hatexplain Dataset (#1716)
  • New: adversarialQA (#1714)
  • Update: Doc2dial - reading comprehension update to latest version (#1816)
  • Update: OPUS Open Subtitles - add with metadata information (#1865)
  • Update: SWDA - use all metadata features (#1799)
  • Update: SWDA - add metadata and correct splits (#1749)
  • Update: CommonGen - update citation information (#1787)
  • Update: SciFact - update URL (#1780)
  • Update: BrWaC - update features name (#1736)
  • Update: TLC - update urls to be github links (#1737)
  • Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
  • Fix: multi_woz_v22 - fix checksums (#1880)
  • Fix: limit - fix url (#1861)
  • Fix: WebNLG - fix test set + more fields (#1739)
  • Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
  • Fix: reuters - add missing "brief" entries (#1744)
  • Fix: thainer - fix empty token bug (#1734)
  • Fix: lst20 - fix empty token bug (#1734)

Metrics Changes

  • New: Word Error Metric (#1847)
  • New: COMET (#1577, #1753)
  • Fix: bert_score - set version dependency (#1851)

Metric Docs

  • Add metrics usage examples and tests (#1820)

CLI Changes

  • [BREAKING] remove outdated commands (#1869):
    • remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
    • instead, use the huggingface-hub CLI

Bug fixes

  • fix writing GPU Faiss index (#1862)
  • update pyarrow import warning (#1782)
  • Ignore definition line number of functions for caching (#1779)
  • update saving and loading methods for faiss index to accept path-like objects (#1663)
  • Print error message with filename when malformed CSV (#1826)
  • Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
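
To see why the definition line number matters for caching (#1779): if a function's hash depends on where it is defined, moving it within a file invalidates the cache even though its behavior is unchanged. The following is a minimal sketch of that idea, not the library's actual hashing code; `code_fingerprint` is a hypothetical helper.

```python
import hashlib

def code_fingerprint(func):
    """Hash a function's compiled code while ignoring where it was defined."""
    code = func.__code__
    # co_firstlineno and co_filename are deliberately left out of the payload.
    payload = repr((code.co_code, code.co_consts, code.co_names, code.co_varnames))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Compile the same source twice, the second time shifted down by three lines.
source = "def double(x):\n    return x * 2\n"
top, moved = {}, {}
exec(compile(source, "<top>", "exec"), top)
exec(compile("\n\n\n" + source, "<moved>", "exec"), moved)

line_numbers = (
    top["double"].__code__.co_firstlineno,
    moved["double"].__code__.co_firstlineno,
)
same_hash = code_fingerprint(top["double"]) == code_fingerprint(moved["double"])
```

The two functions start on different lines but produce the same fingerprint, so a cache keyed on it would survive the move.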

Refactoring

  • Refactoring: Create config module (#1848)
  • Use a config id in the cache directory names for custom configs (#1754)

Logging

  • Enable logging propagation and remove logging handler (#1845)
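
With propagation enabled and no handler attached by the library (#1845), an application can route library messages through its own logging setup. The sketch below uses only the standard `logging` module; the logger name `datasets.demo` and the message are stand-ins, and the library itself is not imported.

```python
import io
import logging

# Application-side handler attached to the root logger.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
root = logging.getLogger()
root.addHandler(handler)

# Stand-in for a library logger: no handler of its own, propagation on.
lib_logger = logging.getLogger("datasets.demo")
lib_logger.propagate = True
lib_logger.setLevel(logging.WARNING)
lib_logger.warning("reusing cached dataset")

captured = stream.getvalue()
root.removeHandler(handler)  # leave the root logger as we found it
```

The record emitted by the child logger reaches the root handler, so format and destination are decided by the application.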

1.2.1

13 Jan 15:29

New Features

  • Fast start up (#1690): Importing datasets is now significantly faster.

Datasets Changes

  • New: MNIST (#1730)
  • New: Korean intonation-aided intention identification dataset (#1715)
  • New: Switchboard Dialog Act Corpus (#1678)
  • Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
  • Update: Scientific papers - Mirror datasets zip (#1721)
  • Update: Update DBRD dataset card and download URL (#1699)
  • Fix: Thainer - fix ner_tag bugs (#1695)
  • Fix: reuters21578 - metadata parsing errors (#1693)
  • Fix: ade_corpus_v2 - fix config names (#1689)
  • Fix: DaNE - fix last example (#1688)

Datasets tagging

  • rename "part-of-speech-tagging" tag in some dataset cards (#1645)

Bug Fixes

  • Fix column list comparison in transmit format (#1719)
  • Fix windows path scheme in cached path (#1711)

Docs

  • Add information about caching and verifications in "Load a Dataset" docs (#1705)

Moreover, many dataset cards of datasets added during the sprint were updated! Thanks to all the contributors :)

1.2.0

04 Jan 18:38

Intermediate release before v2.0.0
Includes all the datasets added during the datasets sprint of December 2020 (currently over 610 datasets).

1.1.3

19 Nov 18:33

Datasets changes

  • New: NLI-Tr (#787)
  • New: Amazon Reviews (#791)(#844)(#845)(#799)
  • New: ASNQ - answer sentence selection (#780)
  • New: OpenBookCorpus (#856)
  • New: ASLG-PC12 - sign language translation (#731)
  • New: Quail - question answering dataset (#747)
  • Update: SNLI: Created dataset card snli.md (#663)
  • Update: csv - Use pandas reader in csv (#857)
    • Better memory management
    • Breaking: the previous read_options, parse_options and convert_options are replaced with plain parameters like pandas.read_csv
  • Update: conll2000, conll2003, germeval_14, wnut_17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
    • Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
  • Update: XNLI - Add XNLI train set (#781)
  • Update: XSUM - Use full released xsum dataset (#754)
  • Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
  • Update: CLUE - add OCNLI, a new CLUE dataset (#742)
  • Fix: KOR-NLI - Fix csv reader (#855)
  • Fix: Discofuse - fix discofuse urls (#793)
  • Fix: Emotion - fix description (#745)
  • Fix: TREC - update urls (#740)

Metrics changes

  • New: accuracy, precision, recall and F1 metrics (#825)
  • Fix: squad_v2 (#840)
  • Fix: seqeval (#810)(#738)
  • Fix: Rouge - fix description (#774)
  • Fix: GLUE - fix description (#734)
  • Fix: BertScore - fix custom baseline (#763)

Command line tools

  • add clear_cache parameter in the test command (#863)

Dependencies

  • Integrate file_lock inside the lib for better logging control (#859)

Dataset features

  • Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
  • pretty print dataset objects (#725)
  • allow custom split names in text dataset (#776)

Tests

  • All configs is a slow test now

Bug fixes

  • Make save function use deterministic global vars order (#819)
  • fix type hints pickling in python 3.6 (#818)
  • fix metric deletion when attributes are missing (#782)
  • Fix custom builder caching (#770)
  • Fix metric with cache dir (#772)
  • Fix train_test_split output format (#719)

1.1.2

06 Oct 14:22

Dataset changes

  • Fix: text - use python read instead of pandas reader (#715):
    • fix delimiter/overflow issues
    • better memory handling

Bug fixes

  • Fix dataset configuration creation using data_files per splits using NamedSplit (#706)
  • Fix permission issue on windows - don't use tqdm 4.50.0 (#718)

1.1.0: Windows support, Better Multiprocessing, New Datasets

02 Oct 13:12

Windows support

  • Add Windows support (#644):
    • add tests and CI for Windows
    • fix numerous windows specific issues
    • The library now fully supports Windows

Dataset changes

  • New: HotpotQA (#703)
  • New: OpenWebText (#660)
  • New: Winogrande - add debiased subset (#655)
  • Update: XNLI - update download link (#695)
  • Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
  • Update: csv - add features parameter to CSV (#685)
  • Fix: GAP - fix wrong computation of boolean features (#680)
  • Fix: C4 - fix manual instruction function (#681)

Metric changes

  • Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
  • Fix: SQuAD - fix kwargs description (#670)

Dataset Features

  • Use multiprocess from pathos for multiprocessing (#656):
    • allow lambda functions in multiprocessed map
    • allow local functions in multiprocessed map
    • and more, as long as the functions are compatible with dill
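
For context on why a dill-based backend helps, the sketch below shows the limitation of the standard `pickle` module, which the stock `multiprocessing` module relies on to send functions to worker processes. It uses only the standard library; `can_pickle` and `make_adder` are hypothetical helpers.

```python
import pickle

def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False

square = lambda x: x * x

def make_adder(n):
    def add(x):          # local function defined inside another function
        return x + n
    return add

results = {
    "builtin": can_pickle(len),
    "lambda": can_pickle(square),
    "local": can_pickle(make_adder(1)),
}
```

Standard pickle stores functions by reference to an importable name, which lambdas and local functions lack. dill serializes the function body itself, which is what lets such functions be used in a multiprocessed map.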

Bug fixes

  • Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
  • Datasets: fix cast with unordered features - fix column order issue in cast (#684)
  • Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
  • Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
  • Datasets: fix empty dictionaries after multiprocessing - keep new columns in transmit format (#659)
  • Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
  • Metrics: fix compute metric with empty input - pass metric features to the reader (#654)

Documentation

  • Elasticsearch integration documentation (#696)

Tests

  • Use GitHub instead of AWS in remote dataset tests (#694)

1.0.2

21 Sep 08:45

Dataset changes:

  • New: CoNLL-2003 (#613)
  • New: CoNLL-2000 (#634)
  • New: MATINF (ACL 2020) (#637)
  • New: Polyglot-NER (#641)
  • Update: GLUE - update GLUE urls (now hosted on FB) (#626)
  • Update: GLUE/qqp - update download checksum (#639)
  • Update: MLQA - feature names update (#627)
  • Update: LinCE - update feature names - Consistent ner features (#636)
  • Update: WNUT 17: update feature names - Consistent ner features (#642)
  • Update: XTREME/PAN-X - update feature names - Consistent ner features (#636)
  • Update: RACE - update dataset checksum + add new configurations (#540)
  • Fix: text - fix delimiter (#631)
  • Fix: Wiki DPR - fix download error in wiki_dpr (f38a871)

Logging:

  • Set level to warning (previously info) (#635)

Bug fixes:

  • make shuffle compatible with temp_seed (#640)
  • don't use take on dataset table (offset overflow error) (#645)
  • handle connection error when downloading from HF google storage (#652)

1.0.1

11 Sep 10:35

Fix:

  • add multiprocessing to dataset dict (#612)

1.0.0 Release: New name, Speed-ups, Multimodal, Serialization

11 Sep 10:19

Package Changes

  • Rename: nlp -> datasets

Update now with

pip install datasets

Dataset Features

  • Keep the dataset format after dataset transforms (#607)
  • Pickle support (#536)
  • Save and load datasets to/from disk (#571)
  • Multiprocessing in map and filter (#552)
  • Multi-dimensional arrays support for multi-modal datasets (#533, #363)
  • Speed up Tokenization by optimizing casting to python objects (#523)
  • Speed up shuffle/shard/select methods - use indices mappings (#513)
  • Add input_column parameter in map and filter (#475)
  • Speed up download and processing (#563)
  • Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
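
The indices-mapping idea behind the faster shuffle/shard/select (#513) can be sketched in plain Python: rows stay where they are and only a list of indices is rewritten. This is a toy model under that assumption, not the library's Arrow-backed implementation; `IndexedView` is a hypothetical class.

```python
import random

class IndexedView:
    """Toy dataset view: rows are never copied, only an index list changes."""

    def __init__(self, rows, indices=None):
        self._rows = rows
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Translate the view position into a position in the underlying rows.
        return self._rows[self._indices[i]]

    def select(self, positions):
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shuffle(self, seed):
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        return self.select(order)

rows = ["a", "b", "c", "d", "e"]
view = IndexedView(rows)
shuffled = view.shuffle(seed=0)
subset = shuffled.select([0, 1])
```

Each operation costs time proportional to the number of indices, regardless of how large the rows themselves are.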

Dataset Changes

  • New: IWSLT 2017 (#470)
  • New: CommonGen Dataset (#578)
  • New: CLUE Benchmark (11 datasets) (#572)
  • New: the KILT knowledge source and tasks (#559)
  • New: DailyDialog (#556)
  • New: DoQA dataset (ACL 2020) (#473)
  • New: reuters21578 (#570)
  • New: HANS (#551)
  • New: MLSUM (#529)
  • New: Guardian authorship (#452)
  • New: web_questions (#401)
  • New: MS MARCO (#364)
  • Update: Germeval14 - update download url (#594)
  • Update: LinCE - update download url (#550)
  • Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
  • Update: Rotten Tomatoes - update download url (#484)
  • Update: Wiki DPR - Use HNSW faiss index (#500)
  • Update: Text - Speed up using multi-threaded PyArrow loading (#548)
  • Fix: GLUE, PAWS-X - skip header (#497)

[Breaking] Update Dataset and DatasetDict API (#459)

  • Rename the flatten, drop and dictionary_encode_column methods to flatten_, drop_ and dictionary_encode_column_ to indicate that these methods have in-place effects
  • Remove the dataset.columns property and dataset.nbytes
  • Add a few more properties and methods to DatasetDict

Metric Features

  • Disallow the use of positional arguments to avoid predictions vs references mistakes (#466)
  • Allow to directly feed numpy/pytorch/tensorflow/pandas objects in metrics (#466)
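
The point of disallowing positional arguments (#466) is that swapping predictions and references silently gives a wrong score for asymmetric metrics. A minimal sketch of the mechanism using Python keyword-only parameters; `compute_accuracy` is a hypothetical function, not the library's metric API.

```python
def compute_accuracy(*, predictions, references):
    """Toy metric whose arguments must be passed by name."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

score = compute_accuracy(predictions=[1, 0, 1], references=[1, 1, 1])

# A positional call is rejected before any computation happens.
try:
    compute_accuracy([1, 0, 1], [1, 1, 1])
    positional_rejected = False
except TypeError:
    positional_rejected = True
```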

Metric Changes

  • New: METEOR metric (#479)
  • Fix: Sacrebleu - fix inputs format (#520)

Loading script Features

  • Pin the version of the scripts (reproducibility) (#603, #584)
  • Specify default script_version with the env variable HF_SCRIPTS_VERSION (#584)
  • Save scripts in a modules cache directory that can be controlled with HF_MODULES_CACHE (#574)
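
A sketch of setting the two environment variables named above from Python. The values are placeholders chosen for illustration, and setting them before the library is imported is an assumption about when they are read.

```python
import os

# Placeholder values: a version tag and a directory of your choice.
os.environ["HF_SCRIPTS_VERSION"] = "1.0.0"
os.environ["HF_MODULES_CACHE"] = os.path.join(os.path.expanduser("~"), "hf_modules_cache")

# Assumption: the library picks these up at import time, so import it afterwards.
# import datasets
```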

Caching

  • Better support for tokenizers when caching map results (#601)
  • Faster caching for text dataset (#573, #502)
  • Use dataset fingerprints, updated after each transform (#536)
  • Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
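
The fingerprint idea from #536 can be pictured as a hash chain: each transform derives a new fingerprint from the previous one plus a description of the transform, so identical pipelines map to the same cache entry. A simplified sketch under that assumption; `update_fingerprint` and the transform descriptions are hypothetical, and the library hashes the actual function and arguments rather than names.

```python
import hashlib
import json

def update_fingerprint(previous, transform, params):
    """Derive a new fingerprint from the previous one and the transform applied."""
    payload = json.dumps([previous, transform, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

base = update_fingerprint("", "load", {"path": "data.csv"})

# Same transform with the same parameters: same fingerprint, cache can be reused.
first_run = update_fingerprint(base, "map", {"function": "tokenize", "batched": True})
second_run = update_fingerprint(base, "map", {"function": "tokenize", "batched": True})

# Any change in parameters yields a different fingerprint, so results are recomputed.
changed = update_fingerprint(base, "map", {"function": "tokenize", "batched": False})
```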

Documentation

  • Metrics documentation (#579)

Miscellaneous

  • Add centralized logging - Bump-up cache loads to warnings (#538)

Bug fixes

  • Datasets: [Breaking] fixed typo in "formated_as" method: rename formated to formatted (#516)
  • Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
  • Datasets: fixed select method for pyarrow < 1.0.0 (#585)
  • Datasets: fixed elasticsearch result ids returning as strings (#487)
  • Datasets: fixed config used for slow test on real dataset (#527)
  • Datasets: fixed tensorflow-formatted datasets outputs by using ragged tensor by default (#530)
  • Datasets: fixed batched map for formatted dataset (#515)
  • Datasets: fixed encodings issues on Windows - apply utf-8 encoding to all datasets (#481)
  • Datasets: fixed dataset.map for function without outputs (#506)
  • Datasets: fixed bad type in overflow check (#496)
  • Datasets: fixed dataset info save - don't use beam fs to save info for local cache dir (#498)
  • Datasets: fixed arrays outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
  • Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)

0.4.0

11 Aug 09:20

Datasets Features

  • add from_pandas and from_dict
  • add shard method
  • add rename/remove/cast columns methods
  • faster select method
  • add concatenate datasets
  • add support for taking samples using numpy arrays
  • add export to TFRecords
  • add features parameter when loading from text/json/pandas/csv or when using the map transform
  • add support for nested features for json
  • add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
  • add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, in order to be able to query passages for Open Domain QA for example
  • add indexing using FAISS or ElasticSearch:
    • add add_faiss_index and add_elasticsearch_index methods
    • add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
    • add search and search_batch to query the index and return examples ids
    • add save_faiss_index/load_faiss_index to save/load a serialized faiss index

Datasets changes

  • new: PG19
  • new: ANLI
  • new: WikiSQL
  • new: qa_zre
  • new: MWSC
  • new: AG news
  • new: SQuADShifts
  • new: doc red
  • new: Wiki DPR
  • new: fever
  • new: hyperpartisan news detection
  • new: pandas
  • new: text
  • new: emotion
  • new: quora
  • new: BioMRC
  • new: web questions
  • new: search QA
  • new: LinCE
  • new: TREC
  • new: Style Change Detection
  • new: 20newsgroup
  • new: social bias frames
  • new: Emo
  • new: web of science
  • new: sogou news
  • new: crd3
  • update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
  • update: xtreme - add PAWS-X.es
  • update: xsum - manual download is no longer required.
  • new processed: Natural Questions

Metrics Features

  • add seed parameter for metrics that do sampling, like rouge
  • better installation messages

Metrics changes

  • new: bleurt
  • update seqeval: fix entities extraction

Bug fixes

  • fix bug in map and select that was causing memory issues
  • fix pyarrow version check
  • fix text/json/pandas/csv caching when loading different files in a row
  • fix metrics caching when they have different config names
  • fix cache that was not discarded when there's a KeyboardInterrupt during .map
  • fix sacrebleu tokenizer's parameter
  • fix docstrings of metrics when multiple instances are created

More Tests

  • add tests for features handling in dataset transforms
  • add tests for dataset builders
  • add tests for metrics loading

Backward compatibility

  • because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json