Releases: huggingface/transformers
Tokenizer fixes
Fixes bugs introduced by v3.0.0 and v3.0.1 in tokenizers.
Patch v3.0.1: Better backward compatibility for tokenizers
Better backward-compatibility for tokenizers following v3.0.0 refactoring
Version v3.0.0 included a refactoring of the tokenizers' backend to allow a simpler and more flexible user-facing API.
This refactoring was conducted with a particular focus on keeping backward compatibility for the v2.X encoding, truncation and padding API but still led to two breaking changes that could have been avoided.
This patch aims to bring back better backward compatibility, by implementing the following updates:
- the `prepare_for_model` method is now publicly exposed again for both slow and fast tokenizers, with an API compatible with both the v2.X truncation/padding API and the v3.0 recommended API.
- the truncation strategy now defaults again to `longest_first` instead of `only_first` (see the sketch below).
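A minimal sketch of the restored behaviour (the checkpoint name is only an example); the strategy can also be set explicitly through the new `truncation` argument:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# v3.0.1 defaults back to "longest_first" truncation; it can also be set explicitly:
encoded = tokenizer.encode_plus(
    "A fairly long first sentence that may need to be truncated.",
    "A second sentence paired with the first one.",
    max_length=16,
    truncation="longest_first",   # or "only_first" / "only_second"
)
print(len(encoded["input_ids"]))  # <= 16
```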
Bug fixes and improvements:
- Better support for TransfoXL tokenizer when using TextGenerationPipeline #5465 (@TevenLeScao)
- Fix use of mems in Transformer-XL generations #4826 (@tommccoy1)
- Fixing a bug in the NER pipeline which led to discarding the last identified entity #5439 (@mfuntowicz and @enzoampil)
- Better QAPipelines #5429 (@mfuntowicz)
- Add Question-Answering and MLM heads to the Reformer model #5433 (@patrickvonplaten)
- Refactoring the LongFormer #5219 (@patrickvonplaten)
- Various fixes on tokenizers and tests (@sshleifer)
- Many improvements to the doc and tutorials (@sgugger)
- Fix TensorFlow dataset generator in run_glue #4881 (@jplu)
- Update Bertabs example to work again #5355 (@MichaelJanz)
- Move GenerationMixin to separate file #5254 (@yjernite)
New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials
Breaking changes since v2
- In #4874 the language modeling BERT has been split in two: `BertForMaskedLM` and `BertLMHeadModel`. `BertForMaskedLM` therefore cannot do causal language modeling anymore, and cannot accept the `lm_labels` argument.
- The `Trainer` data collator is now a method instead of a class.
- Directly setting a tokenizer's special token attributes (e.g. `tokenizer.mask_token = '<mask>'`) now only associates the token with the attribute of the tokenizer, but doesn't add the token to the vocabulary if it is not in the vocabulary. Tokens are only added by using the `tokenizer.add_special_tokens()` and `tokenizer.add_tokens()` methods (see the sketch after this list).
- The `prepare_for_model` method was removed as part of the new tokenizer API.
- The truncation method is now `only_first` by default.
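A short sketch of the special-token change (checkpoint name and tokens are purely illustrative): assigning the attribute no longer grows the vocabulary, the dedicated methods do.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Only updates the attribute; the token is NOT added to the vocabulary anymore:
tokenizer.mask_token = "<my_mask>"

# To actually register tokens, use the dedicated methods:
tokenizer.add_special_tokens({"mask_token": "<my_mask>"})  # adds it if missing
num_added = tokenizer.add_tokens(["<new_token>"])          # plain added tokens
print(num_added)  # 1 if "<new_token>" was not already in the vocabulary
```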
New Tokenizer API (@n1t0, @thomwolf, @mfuntowicz)
The tokenizers have evolved quickly in version 2, with the addition of Rust tokenizers. They now have a simpler and more flexible API, aligned between the Python (slow) and Rust (fast) tokenizers. The new API lets you control truncation and padding more deeply, allowing things like dynamic padding or padding to a multiple of 8.
The redesigned API is explained in detail here #4510 and here: https://huggingface.co/transformers/master/preprocessing.html
Notable changes:
- it's now possible to truncate to the max input length of a model while padding the longest sequence in a batch
- padding and truncation are decoupled and easier to control
- it's possible to pad to a multiple of a predefined length, e.g. 8, which can give significant speed-ups on recent NVIDIA GPUs (V100)
- a generic wrapper using `tokenizer.__call__` can be used for all cases (single sequence, pair of sequences, batches, etc.); see the sketch after this list
- tokenizers now accept pre-tokenized inputs (when the input is already split into word strings, e.g. for NER)
- All the Rust tokenizers are now fully tested like slow tokenizers
- A new class `AddedToken` can be used to have more fine-grained control on how added tokens behave during tokenization. In particular the user can control (1) whether left and right spaces are removed around the token during tokenization, (2) whether the token will be identified inside another word and (3) whether the token will be recognized in normalized forms (e.g. in lower case if the tokenizer uses lower-casing)
- Serialization issues were fixed
- Possibility to create NumPy tensors when using the `return_tensors` parameter on tokenizers
- Introduced a new enum `TensorType` to map all the possible tensor backends we support: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`
- Tokenizers now accept the `TensorType` enum for the `return_tensors` parameter of the `encode(...)`, `encode_plus(...)` and `batch_encode_plus(...)` tokenizer methods
- `BatchEncoding` has a new property `is_fast` that indicates whether the `BatchEncoding` comes from a Python (slow) tokenizer or a Rust (fast) tokenizer
- Slow and fast tokenizers are now picklable. So is their output, the dict sub-class `BatchEncoding`
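A quick sketch of the unified entry point described above (the checkpoint name is only an example; `use_fast=True` selects the Rust backend):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# One call covers single sequences, pairs and batches:
batch = tokenizer(
    ["Hello world!", "A noticeably longer second example sentence."],
    padding="longest",     # pad to the longest sequence in the batch
    truncation=True,       # truncate to the model's maximum input length
    return_tensors="pt",   # "pt", "tf" or "np"
)
print(batch.is_fast)             # True when backed by a Rust (fast) tokenizer
print(batch["input_ids"].shape)  # both sequences padded to the same length
```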
Several PRs to make the API more stable have been made:
- [tokenizers] Fix #5081 and improve backward compatibility #5125 (@thomwolf)
- Tokenizers API developments #5103 (@thomwolf)
- Clearer error message in the use-case of #5169 (@thomwolf)
- Add more tests on tokenizers serialization - fix bugs #5056 (@thomwolf)
- [Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING #5252 (@thomwolf)
- [tokenizers] Several small improvements and bug fixes #5287
- Add `pad_to_multiple_of` on tokenizers (reimport) #5054 (@mfuntowicz)
- [tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308
TensorFlow improvements (@jplu, @dzorlu, @LysandreJik)
Very big release for TensorFlow!
- TensorFlow models can now compute the loss themselves, using the `TFPretrainedModel.compute_loss` method #4530
- Can now resize token embeddings in TensorFlow #4351
- Cleaning TensorFlow models #5229
Enhanced documentation (@sgugger)
We welcome @sgugger as a team member in New York. He already introduced a lot of very cool documentation changes:
- Added a model summary #4789
- Expose classes used in documentation #4808
- Explain how to preview the docs in a PR #4795
- Clean documentation #4849
- Remove old doc page and add note about cache in installation #5027
- Fix all Sphinx warnings #5068 (@sgugger)
- Update pipeline examples to doctest syntax #5030
- Reorganize documentation #5064
- Update installation page and add contributing to the doc #5084
- Update glossary #5148
- Quick tour #5145
- Switch master/stable doc and add older releases #5193
- Add version control menu #5222
- Don't recreate old docs #5243
- Tokenization tutorial #5257
- Remove links for all docs #5280
- New model sharing tutorial #5323
Training & fine-tuning quickstart
MobileBERT
The MobileBERT from MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, was added to the library for both PyTorch and TensorFlow.
A single checkpoint is added: `mobilebert-uncased`, which is the `uncased_L-24_H-128_B-512_A-4_F-4_OPT` checkpoint converted to our API.
This model was first implemented in PyTorch by @lonePatient, ported to the library by @vshampor, then finalized and implemented in TensorFlow by @LysandreJik.
Eli5 examples (@yjernite) #4968
- The examples/eli5 folder contains training code for the dense retriever and to fine-tune a BART model, the Jupyter notebook for the blog post, and the code for the live demo.
- The RetriBert model implements the dense passage retriever. It's basically a wrapper for two BERT models and projection matrices, but it does gradient checkpointing in a way that is very different from a concurrent PR, and Yacine thought it would be easier to write a separate class for now and see if it can be merged into the BART code later.
Enhanced examples/seq2seq (@sshleifer)
- the `examples/seq2seq` folder is a combination of the old `examples/summarization` and `examples/translation` folders.
- Finetuning works well for summarization; more experiments are needed for translation. Finetuning works on multi-GPU, saves rouge scores during validation, and provides `--freeze_encoder` and `--freeze_embeds` options. These options make finetuning BART 5x faster on the CNN/DailyMail dataset.
- Distilbart code is added in distillation.py. It only supports summarization, for now.
- Evaluation works well for both summarization and translation.
- New weights and biases shared task for collaboration on the XSUM summarization task
Distilbart (@sshleifer)
- Distilbart models are smaller versions of `bart-large-cnn` and `bart-large-xsum`. They can be loaded using `BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6')`, for example. See this tweet for more info on available models and their speed/performance.
- Commands to reproduce are available in the `examples/seq2seq` folder
BERT Loses Patience (@JetRunner)
Add BERT Loses Patience (Patience-based Early Exit) based on the paper https://arxiv.org/abs/2006.04152 and the official implementation https://github.com/JetRunner/PABEE
Unifying label arguments (@sgugger) #4722
- Deprecate any label argument that's not `labels` (like `masked_lm_labels`, `lm_labels`, etc.) in favor of `labels`.
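A minimal sketch of the unified argument, assuming a masked LM head (which previously used `masked_lm_labels`):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer.encode_plus("Paris is the capital of France.", return_tensors="pt")

# Previously: model(**inputs, masked_lm_labels=inputs["input_ids"])
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs[0]
print(loss.item())
```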
NumPy type in tokenizers (@mfuntowicz) #4585
Introduce a new tensor type for `return_tensors` on tokenizers: NumPy.
- As we're introducing more than two tensor backend alternatives, I created an enum `TensorType` listing all the possible tensor types we can create: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`. This might help newcomers who don't know about "tf", "pt". Note: `TensorType` is compatible with the previous "tf", "pt" and now "np" strings to allow backward compatibility (+ unit test).
- NumPy is now a possible target when creating tensors. This is useful for JAX.
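A small sketch of the NumPy target (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# "np" is accepted alongside the existing "tf" and "pt" strings
encoded = tokenizer.encode_plus("Hello world!", return_tensors="np")
print(type(encoded["input_ids"]))  # <class 'numpy.ndarray'>
```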
Community notebooks
- Adding notebooks for Fine Tuning #4732 (@abhimishra91):
- Multi-class classification: Using DistilBert
- Multi-label classification: Using Bert
- Summarization: Using T5 - Model Tracking with WandB
- Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing #5195 (@pommedeterresautee)
- How to use Benchmarks (@patrickvonplaten) #5312
Benchmarks (@patrickvonplaten)
The benchmark script was consolidated and some features were added:
Adds the ability to measure the following for TF and PT (#4912):
- TensorFlow:
  - Inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
- PyTorch:
  - Inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
  - Training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU
- [Benchmark] Add encoder decoder to benchmark and clean labels #4810
- [Benchmark] add tpu and torchscipt for benchmark #4850
- [Benchmark] Extend Benchmark to all model type exte...
Longformer
- Longformer (@ibeltagy)
- Longformer for QA (@patil-suraj + @patrickvonplaten)
- Longformer fast tokenizer (@patil-suraj)
- Longformer for sequence classification (@patil-suraj)
- Longformer for token classification (@patil-suraj)
- Longformer for Multiple Choice (@patrickvonplaten)
- More user-friendly handling of global attention mask vs local attention mask (@patrickvonplaten)
- Fix longformer attention mask type casting when using APEX (@peskotivesgeroff)
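A minimal sketch of the local/global attention handling mentioned above, assuming the separate `global_attention_mask` argument introduced by that change:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer.encode_plus("A long document to encode ...", return_tensors="pt")

# Local (sliding-window) attention everywhere, global attention on the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
)
```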
New community notebooks!
- Long Sequence Modeling with Reformer (@patrickvonplaten)
- Fine-tune BART for summarization (@ohmeow)
- Fine-tune a pre-trained Transformer on anyone's tweets (@borisdayma, @lavanyashukla)
- A step-by-step guide to tracking hugging face model performance with wandb (@jxmorris12, @lavanyashukla)
- Fine-tune Longformer for QA (@patil-suraj)
- Pretrain Longformer (@ibeltagy)
- Fine-tune T5 for sentiment span extraction (@enzoampil)
URLs to model weights are not hardcoded anymore (@julien-c)
Archive maps were dictionaries linking pre-trained models to their S3 URLs. Since the arrival of the model hub, these have become obsolete.
Those models now have to be instantiated with their full model id:
"cl-tohoku/bert-base-japanese"
"cl-tohoku/bert-base-japanese-whole-word-masking"
"cl-tohoku/bert-base-japanese-char"
"cl-tohoku/bert-base-japanese-char-whole-word-masking"
"TurkuNLP/bert-base-finnish-cased-v1"
"TurkuNLP/bert-base-finnish-uncased-v1"
"wietsedv/bert-base-dutch-cased"
"flaubert/flaubert_small_cased"
"flaubert/flaubert_base_uncased"
"flaubert/flaubert_base_cased"
"flaubert/flaubert_large_cased"all variants of "facebook/bart"
Update:
Fixes and improvements
- Fix convert_token_type_ids_from_sequences for fast tokenizers (@n1t0, #4503)
- Fixed the default tokenizer of the summarization pipeline (@sshleifer, #4506)
- The `max_len` attribute is now more robust, and warns the user about deprecation (@mfuntowicz, #4528)
- Added type hints to `modeling_utils.py` (@bglearning, #3911)
- MMBT model now has `nn.Module` as a superclass (@shoarora, #4533)
- Fixing tokenization of extra_id symbols in the T5 tokenizer (@mansimov, #4353)
- Slow GPU tests run daily (@julien-c, #4465)
- Removed PyTorch artifacts in TensorFlow XLNet implementation (@ZhuBaohe, #4410)
- Fixed the T5 Cross Attention Position Bias (@ZhuBaohe, #4499)
- The `transformers-cli` is now cross-platform (@BramVanroy, #4131) + (@patrickvonplaten, #4614)
- GPT-2, CTRL: accept `input_ids` and `past` of variable length (@patrickvonplaten, #4581)
- Added back `--do_lower_case` to SQuAD examples
- Correct framework test requirement for language generation tests (@sshleifer, #4616)
- Fix `add_special_tokens` on fast tokenizers (@n1t0, #4531)
- MNLI & SST-2 bugs were fixed (@stdcoutzyx, #4546)
- Fixed BERT example for NSP and multiple choice (@siboehm, #3953)
- Encoder/decoder fix initialization and save/load bug (@patrickvonplaten, #4680)
- Fix onnx export input names order (@RensDimmendaal, #4641)
- Configuration: ensure that id2label always takes precedence over `num_labels` (@julien-c, direct commit to `master`)
- Make docstring match argument (@sgugger, #4711)
- Specify PyTorch versions for examples (@LysandreJik, #4710)
- Override get_vocab for fast tokenizers (@mfuntowicz, #4717)
- Tokenizer should not add special tokens for text generation (@patrickvonplaten, #4686)
Reformer, ElectraForSequenceClassification, ONNX conversion script
Reformer (@patrickvonplaten)
- Added a new model, Reformer (https://arxiv.org/abs/2001.04451), to the library. The original Trax code (https://github.com/google/trax/tree/master/trax/models/reformer) was translated to PyTorch.
- Reformer uses chunked attention and reversible layers to model sequences as long as 500,000 tokens.
- Reformer is currently available as a causal language model and will soon also be available as an encoder-only ("Bert"-like) model.
- Two pretrained weights are uploaded: https://huggingface.co/models?search=google%2Freformer
- https://huggingface.co/google/reformer-enwik8 is the first character-level language model in the library
Additional architectures
- `ElectraForSequenceClassification` was added by @liuzzi
Trainer Tweaks and fixes (@LysandreJik, @julien-c)
TPU (@LysandreJik):
- Model saving, as well as optimizer and scheduler saving mid-training, were hanging
- Fixed the optimizer weight updates
Trainer (@julien-c)
- Fixed `nn.DataParallel` compatibility for PyTorch `v1.5.0`
- Distributed evaluation: SequentialDistributedSampler + gather all results
- Move model to correct device
- Map optimizer to correct device after loading from checkpoint (@shaoyent)
QOL: Tokenization, Pipelines
- New method for all tokenizers: `tokenizer.decode_batch`, to decode an entire batch (@sshleifer)
- The NER pipeline now returns entity groups (@enzoampil)
ONNX Conversion script (@mfuntowicz)
- Added a conversion script to convert both PyTorch/TensorFlow models to ONNX.
- Added a notebook explaining how it works
Community notebooks
We've started adding community notebooks to the repository. Three notebooks have made their way into our codebase:
Predict stage for GLUE task, easy submit to gluebenchmark.com
- Adds a predict stage for GLUE tasks, and generates result files which can be submitted to gluebenchmark.com (@stdcoutzyx)
Fixes and improvements
- Support flake8 3.8 (@julien-c)
- Tests are now faster thanks to using dummy smaller models (@sshleifer)
- Fixed the eval loss in the trainer (@patil-suraj)
- Fixed the `p_mask` in SQuAD pre-processing (@LysandreJik)
- GitHub Actions PyTorch tests are no longer pinned to `torch==1.4.0` (@mfuntowicz)
- Fixed the multiple-choice script with overflowing tokens (@LysandreJik)
- Allow for `None` values in `GradientAccumulator` (@jarednielsen, improved by @jplu)
- MBart tokenizer saving/loading id was fixed (@Mehrad0711)
- TF generation: fix issue for batch output generation of different output length (@patrickvonplaten)
- Fixed the FP16 support in the T5 model (@patrickvonplaten)
- `run_language_modeling` fix: actually use the `overwrite_cache` argument (@borisdayma)
- Better, version-compatible way to get the learning rate in the trainer (@rakeshchada)
- Fixed the slow tests that were failing on GPU (@sshleifer, @patrickvonplaten, @LysandreJik)
- ONNX conversion tokenizer fix (@RensDimmendaal)
- Correct TF formatting to exclude LayerNorms from weight decay (@oliverastrand)
- Removed warning of deprecation (@colanim)
- fix no grad in second pruning in run_bertology (@TobiasLee)
Marian
Marian (@sshleifer)
- A new model architecture, `MarianMTModel`, with 1,008+ pretrained weights is available for machine translation in PyTorch.
- The corresponding `MarianTokenizer` uses a `prepare_translation_batch` method to prepare model inputs (see the sketch below).
- All pretrained model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`
- See docs for information on pretrained model discovery and naming, or find your language here
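A short sketch of the Marian API described above (the en-ro pair is just one instance of the `{src}-{tgt}` naming scheme):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# prepare_translation_batch builds the model inputs (input_ids, attention_mask)
batch = tokenizer.prepare_translation_batch(["Hello, how are you?"])
translated = model.generate(**batch)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```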
AlbertForPreTraining (@jarednielsen)
A new model architecture has been added: `AlbertForPreTraining`, in both PyTorch and TensorFlow.
TF 2.2 compatibility (@mfuntowicz, @jplu)
Changes have been made to both the TensorFlow scripts and our internals so that we are compatible with TensorFlow 2.2
TFTrainer now supports new tasks
- Multiple choice has been added to the TFTrainer (@ViktorAlm)
- Question Answering has been added to the TFTrainer (@jplu)
Fixes and improvements
- Fixed a bug with the tf generation pipeline (@patrickvonplaten)
- Fixed the XLA spawn (@julien-c)
- The sentiment analysis pipeline tokenizer was cased while the model was uncased (@mfuntowicz)
- Albert was added to the conversion CLI (@fgaim)
- CamemBERT's token type ID generation was removed from its tokenizer, as with RoBERTa, since the model does not use them (@LysandreJik)
- Additional migration documentation was added (@guoquan)
- GPT-2 can now be exported to ONNX (@tianleiwu)
- Simplify cache vars and allow for TRANSFORMERS_CACHE env (@BramVanroy)
- Remove hard-coded pad token id in distilbert and albert (@monologg)
- BART tests were fixed on GPU (@julien-c)
- Better wandb integration (@vanpelt, @borisdayma, @julien-c)
Trainer, TFTrainer, Multilingual BART, Encoder-decoder improvements, Generation Pipeline
Trainer & TFTrainer (@julien-c)
Version 2.9 introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
This let us reorganize the example scripts completely for a cleaner codebase.
The main features of the Trainer are:
- Same user-facing API for PyTorch and TF 2
- Support for CPU, GPU, Multi-GPU, and TPU
- Easier than ever to share your fine-tuned models
The TFTrainer was largely contributed by awesome community member @jplu! 🔥 🔥
A few additional features of the example scripts are:
- Generate argparsers from type hints on dataclasses
- Can load arguments from json files
- Logging through TensorBoard and wandb
Documentation for the Trainer is still work-in-progress, please consider contributing improvements.
TPU Support
- Both the TensorFlow and PyTorch trainers have TPU support (@jplu, @LysandreJik, @julien-c). An additional utility is added so that the TPU scripts may be launched in a similar manner to `torch.distributed`.
- This was built with the support of @jysohn23, member of the Google TPU team
Multilingual BART (@sshleifer)
A new BART checkpoint was converted: this adds the `mbart-en-ro` model, a BART variant finetuned on English-Romanian translation.
Improved support for huggingface/tokenizers
- Additional tests and support have been added for `huggingface/tokenizers` tokenizers (@mfuntowicz, @thomwolf)
- TensorFlow models work out-of-the-box with the new tokenizers (@LysandreJik)
Decoder caching for T5 (@patrickvonplaten)
Auto-regressive decoding for T5 has been greatly sped up by storing past key/value states. Work done on both PyTorch and TensorFlow.
Breaking change
This introduces a breaking change, in that it increases the number of outputs returned by T5Model and T5ForConditionalGeneration from 4 to 5 (now including the past_key_value_states).
Encoder-Decoder enhancements
- Apply Encoder Decoder 1.5GB memory savings to TF as well (@patrickvonplaten, translation of same work on PyTorch models by @sshleifer)
- BART Summarization fine-tuning script now works for T5 as well (@sshleifer)
- Clean Encoder-Decoder models with Bart/T5-like API and add generate possibility (@patrickvonplaten)
Additional model architectures
- Question Answering support for Albert and Roberta in TF, including `TFAlbertForQuestionAnswering` (@Pierrci)
Pipelines
- The question answering pipeline now handles impossible answers (@bryant1410)
- Remove tqdm logging (@mfuntowicz)
- Sentiment analysis pipeline can now handle more than two sequences (@xxbidiao)
- Rewritten batch support in pipelines (@mfuntowicz)
Text Generation pipeline (@enzoampil)
Implements a text generation pipeline, `GenerationPipeline`, which works with any `ModelWithLMHead`.
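A minimal sketch of the pipeline, assuming it is exposed under the "text-generation" task name (as in later versions) and using GPT-2 as an example model:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The Hugging Face library is", max_length=30))
```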
Fixes and improvements
- Clean the generate testing functions (@patrickvonplaten)
- Notebooks updated in the documentation (@LysandreJik)
- Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (@ethanjperez)
- Fixed RoBERTa conversion script (@myleott)
- Speedup torch summarization tests (@sshleifer)
- Optimize causal mask using torch.where (@Akababa)
- Improved benchmarking utils (@patrickvonplaten)
- Fixed edge case for bert tokenization (@patrickvonplaten)
- SummarizationDataset cleanup (@sshleifer)
- BART: Replace config.output_past with use_cache kwarg (@sshleifer)
- Better documentation for Summarization and Translation pipeline (@julien-c)
- Additional documentation for model cards (@julien-c)
- Fix force_download of files on Windows (@calpt)
- Fix shuffling issue for distributed training (@elk-cloner)
- Shift labels internally within TransfoXLLMHeadModel when called with labels (@TevenLeScao)
- Remove `output_past` everywhere and replace by `use_cache` argument (@patrickvonplaten)
- Added unit test for run_bart_sum (@sshleifer)
- Cleaner code by factoring a few methods back into the `PreTrainedModel` (@sshleifer)
- [Bert] remove hard-coded pad token id (@patrickvonplaten)
- Clean pipelines test and remove unnecessary code (@patrickvonplaten)
- JITting is not compatible with PyTorch/XLA or any other framework that requires serialization. The JITted methods were removed (@LysandreJik)
- Change newstest2013 to newstest2014 and clean up (@patrickvonplaten)
- Factor out tensor conversion method in `PretrainedTokenizer` (@sshleifer)
- Remove tanh torch warnings (@aryanshomray)
- Fix token_type_id in BERT question-answering example (@siboehm)
- Add CircleCI workflow to build docs for preview (@harupy)
- Higher tolerance for past testing in T5 and TF T5 (@patrickvonplaten)
- XLM tokenizer should encode with bos token (@LysandreJik)
- XLM tokenizer should encode with bos token (@patrickvonplaten)
- fix summarization do_predict (@sshleifer)
- Encode to max length of input not max length of tokenizer for batch input (@patrickvonplaten)
- Add `qas_id` to SquadResult and SquadExample (@jarednielsen)
- Fix bug in run_*.py scripts: double wrap into DataParallel during eval (@and-kul)
- Fix torchhub integration (@julien-c)
- Fix TFAlbertForSequenceClassification classifier dropout probability (@jarednielsen)
- Change uses of pow(x, 3) to pow(x, 3.0) (@mneilly-et)
- Shuffle train subset for summarization example (@colanim)
- Removed the boto3 dependency (@julien-c)
- Add dialogpt training tips (@patrickvonplaten)
- Generation can now start with an empty prompt (@patrickvonplaten)
- GPT-2 is now traceable (@jazzcook15)
- Add known 3rd party to setup.cfg; removes local/circle ci isort discrepancy. (@sshleifer)
- Allow a more backward compatible behavior of max_len_single_sentence and max_len_sentences_pair (@thomwolf)
- Now using CDN urls for weights (@julien-c)
- [Fix common tests on GPU] send model, ids to torch_device (@sshleifer)
- Fix TF input docstrings to refer to tf.Tensor rather than torch.Float (@jarednielsen)
- Additional metadata to training arguments (@parmarsuraj99)
- [ci] Load pretrained models into the default (long-lived) cache (@julien-c)
- add timeout_decorator to tests (@sshleifer)
- Added XLM-R to the multilingual section in the documentation (@stefan-it)
- Better `num_labels` in configuration objects
- Updated pytorch lightning scripts (@williamFalcon)
- Tests now pass with torch 1.5.0 (@LysandreJik)
- Ensure fast tokenizer can construct single-element tensor without pad token (@mfuntowicz)
ELECTRA, Bad word filters, bugfixes & improvements
ELECTRA Model (@LysandreJik)
ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
This release comes with 6 ELECTRA checkpoints:
google/electra-small-discriminator
google/electra-small-generator
google/electra-base-discriminator
google/electra-base-generator
google/electra-large-discriminator
google/electra-large-generator
Related:
Thanks to the author @clarkkev for his help during the implementation.
Thanks to community members @hfl-rc @stefan-it @shoarora for already sharing more fine-tuned Electra variants!
Bad word filters in generate (@patrickvonplaten)
The `generate` method now has a bad word filter.
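A small sketch of the filter (the `bad_words_ids` keyword argument, with GPT-2 as an example model):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The weather today is", return_tensors="pt")

# Each entry is the token-id sequence of a word that must not be generated
bad_words_ids = [
    tokenizer.encode(word, add_prefix_space=True) for word in ["awful", "terrible"]
]

output = model.generate(input_ids, max_length=20, bad_words_ids=bad_words_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```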
Fixes and improvements
- Decoder input ids are not necessary for T5 training anymore (@patrickvonplaten)
- Update encoder and decoder on set_input_embedding for BART (@sshleifer)
- Using loaded checkpoint with --do_predict (instead of random init) for Pytorch-lightning scripts (@ethanjperez)
- Clean summarization and translation example testing files for T5 and Bart (@patrickvonplaten)
- Cleaner examples (@julien-c)
- Extensive testing for T5 model (@patrickvonplaten)
- Force models outputs to always have batch_size as their first dim (@patrickvonplaten)
- Fix for continuing training in some scripts (@xeb)
- Resizing embedding matrix before sending it to the optimizer (@ngarneau)
- BertJapaneseTokenizer accept options for mecab (@tamuhey)
- Speed up GELU computation with torch.jit (@mryab)
- fix argument order of update_mems fn in TF version (@patrickvonplaten, @dmytyar)
- Split generate test function into beam search, no beam search (@patrickvonplaten)
T5 Model, BART summarization example and reduced memory, translation pipeline
T5 Model (@patrickvonplaten, @thomwolf )
T5 is a powerful encoder-decoder model that formats every NLP problem into a text-to-text format. It achieves state of the art results on a variety of NLP tasks (Summarization, Question-Answering, ...).
Five sets of pre-trained weights (pre-trained on a multi-task mixture of unsupervised and supervised tasks) are released. In ascending order from 60 million parameters to 11 billion parameters:
`t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`
T5 can now be used with the translation and summarization pipeline.
Related:
- paper
- official code
- model available in Hugging Face's community models
- docs
Big thanks to the original authors, especially @craffel who helped answer our questions, reviewed PRs and tested T5 extensively.
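A minimal text-to-text sketch using the smallest checkpoint:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as text-to-text via a task prefix
input_ids = tokenizer.encode(
    "translate English to German: The house is wonderful.", return_tensors="pt"
)
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```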
New BART checkpoint: bart-large-xsum (@sshleifer)
These weights are from BART finetuned on the XSum abstractive summarization challenge, which encourages shorter (more abstractive) summaries. It achieves state-of-the-art results.
BART summarization example with pytorch-lightning (@acarrera94)
New example: BART for summarization, using Pytorch-lightning. Trains on CNN/DM and evaluates.
Translation pipeline (@patrickvonplaten)
A new pipeline is available, leveraging the T5 model. The T5 model was added to the summarization pipeline as well.
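A quick sketch of the pipeline usage (explicitly selecting `t5-base` here; the pipeline also ships with a default model):

```python
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-base")
print(translator("How old are you?"))
```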
Memory improvements with BART (@sshleifer)
In an effort to reduce the memory footprint and computing power necessary to run inference on BART, several improvements have been made to the model:
- Remove the LM head and use the embedding matrix instead (~200MB)
- Call encoder before expanding input_ids (~1GB)
- SelfAttention only returns weights if config.output_attentions (~500MB)
- Two separate, smaller decoder attention masks (~500MB)
- Drop columns that are exclusively pad_token_id from input_ids in the `evaluate_cnn` example.
TensorFlow models may now be serialized (@gthb)
Supports JSON serialization of Keras layers by overriding `get_config`, so that they can be sent to TensorBoard to display a conceptual graph of the model. TensorFlow models may now be saved using `model.save`, like other Keras models.
New model: XLMForTokenClassification (@sakares)
A new head was added to XLM: `XLMForTokenClassification`.
BART, organizations, community notebooks, lightning examples, dropping Python 3.5
New Model: BART (added by @sshleifer)
Bart is one of the first Seq2Seq models in the library, and achieves state of the art results on text generation tasks, like abstractive summarization.
Three sets of pretrained weights are released:
- `bart-large`: the pretrained base model
- `bart-large-cnn`: the base model finetuned on the CNN/Daily Mail abstractive summarization task
- `bart-large-mnli`: the base model finetuned on the MNLI classification task
Related:
- paper
- model pages are at https://huggingface.co/facebook
- docs
- blogpost
Big thanks to the original authors, especially Mike Lewis, Yinhan Liu, Naman Goyal who helped answer our questions.
Model sharing CLI: support for organizations
The Hugging Face API for model upload now supports organizations.
Notebooks (@mfuntowicz)
A few beginner-oriented notebooks were added to the library, aiming at demystifying the two libraries huggingface/transformers and huggingface/tokenizers. Contributors are welcome to contribute links to their notebooks as well.
pytorch-lightning examples (@srush)
Examples leveraging pytorch-lightning were added, led by @srush.
The first example that was added is the NER example.
The second example is a lightning GLUE example, added by @nateraw.
New model architectures: CamembertForQuestionAnswering, AlbertForTokenClassification
- `CamembertForQuestionAnswering` was added to the library and to the SQuAD script (@maximeilluin)
- `AlbertForTokenClassification` was added to the library and to the NER example (@marma)
Multiple fixes were done on the fast tokenizers to make them entirely compatible with the python tokenizers (@mfuntowicz)
Most of these fixes were done in the patch 2.5.1. Fast tokenizers should now have the exact same API as the python ones, with some additional functionalities.
Docker images (@mfuntowicz)
Docker images for transformers were added.
Generation overhaul (@patrickvonplaten)
- Special token ID logic was improved in run_generation and in corresponding tests.
- Slow tests for generation were added for pre-trained LM models
- Greedy generation when doing beam search
- Sampling when doing beam search
- Generate functionality was added to TF2: with beam search, greedy generation and sampling.
- Integration tests were added
- `no_repeat_ngram_size` kwarg to avoid redundant generations (@sshleifer)
Encoding methods now output only model-specific inputs
Models such as DistilBERT and RoBERTa do not make use of token type IDs. These inputs are not returned by the encoding methods anymore, except if explicitly mentioned during the tokenizer initialization.
Pipelines support summarization (@sshleifer)
- The default architecture is `bart-large-cnn`, with the generation parameters published in the paper.
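A minimal sketch of the summarization pipeline (the article string is a placeholder):

```python
from transformers import pipeline

summarizer = pipeline("summarization")  # defaults to bart-large-cnn, per the note above
article = "New York (CNN) ... a long news article to be summarized ..."
print(summarizer(article, max_length=60, min_length=20))
```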
Models may now re-use the cache every time without prompting S3 (@BramVanroy)
Previously, all attempts to load a model from a pre-trained checkpoint would check that the S3 etag corresponds to the one hosted locally. This has been updated so that an argument `local_files_only` prevents this, which can be useful when a firewall is involved.
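A small sketch of the flag (the checkpoint name is only an example; the files must already be in the local cache):

```python
from transformers import AutoModel, AutoTokenizer

# With local_files_only=True the cache is used as-is and no request is made to S3
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)
```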
Usage examples for common tasks (@LysandreJik)
In a continuing effort to onboard new users (new to the lib or new to NLP in general), some usage examples were added to the documentation. These usage examples showcase how to do inference on several tasks:
- NER
- Sequence classification
- Question Answering
- Causal Language Modeling
- Masked Language Modeling
Test suite on GPU (@julien-c)
CI now runs on GPU. PyTorch and TensorFlow.
Padding token ID needs to be set in order to pad (@patrickvonplaten)
Older tokenizers could pad even when no padding token was defined. This has been updated in this version to match the expected behavior, which is also the fast tokenizers' behavior: add a pad token, or an error is raised when trying to batch without one.
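A minimal sketch of the new behaviour with GPT-2 (which ships without a padding token):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Define a pad token before padding a batch; otherwise an error is now raised
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer.batch_encode_plus(
    ["a short sentence", "a noticeably longer second sentence"],
    pad_to_max_length=True,
)
print(len(batch["input_ids"][0]) == len(batch["input_ids"][1]))  # True
```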
Python >= 3.6
We're now dropping Python 3.5 support.
Community additions/bug-fixes/improvements
- Added a warning when using `add_special_tokens` with the fast tokenizer methods of encoding (@LysandreJik)
- `encode_plus` was modified and tested to have the exact same behaviour as `encode`, but batches input
- Cleanup DistilBERT code (@guillaume-be)
- Only use `F.gelu` for torch >= 1.4.0 (@sshleifer)
- Added a `get_vocab` method to tokenizers, which can be used to retrieve the full vocabulary from the tokenizers (@joeddav)
- Correct behaviour of `special_tokens_mask` when `add_special_tokens=False` (@LysandreJik)
- Removed untested `Model2LSTM` and `Model2Model`, which were not working
- kwargs were passed to both model and configuration in AutoModels, which made the model crash (@LysandreJik)
- Correct transfo-xl tokenization regarding punctuation (@patrickvonplaten)
- Better docstrings for XLNet (@patrickvonplaten)
- Better operations for TPU support (@srush)
- XLM-R tokenizer is now tested and bug-free (@LysandreJik)
- XLM-R model and tokenizer now have integration tests (@patrickvonplaten)
- Better documentation for tokenizers and pipelines (@LysandreJik)
- All tests (slow and non-slow) now pass (@julien-c, @LysandreJik, @patrickvonplaten, @sshleifer, @thomwolf)
- Correct attention mask with GPT-2 when using past (@patrickvonplaten)
- fix n_gpu count when no_cuda flag is activated in all examples (@VictorSanh)
- Test TF GPT2 for correct behaviour regarding the past and attn mask variable (@patrickvonplaten)
- Fixed bug where some missing keys would not be identified (@LysandreJik)
- Correct `num_labels` initialization (@LysandreJik)
- Model special tokens were added to the pretrained configurations (@patrickvonplaten)
- QA models for XLNet, XLM and FlauBERT are now set to their "simple" architectures when using the pipeline.
- GPT-2 XL was added to TensorFlow (@patrickvonplaten)
- NER PL example updated (@shubhamagarwal92)
- Improved Error message when loading config/model with .from_pretrained() (@patrickvonplaten, @julien-c)
- Cleaner special token initialization in modeling_xxx.py (@patrickvonplaten)
- Fixed the learning rate scheduler placement in the `run_ner.py` script (@erip)
- Use AutoModels in examples (@julien-c, @lifefeel)