Releases: huggingface/transformers
Patch release: better error message & invalid trainer attribute
Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization
Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization
Breaking changes since v3.x
Version v4.0.0 introduces several breaking changes that were necessary.
1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.
The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens, which differs between the python and rust tokenizers.
How to obtain the same behavior as v3.x in v4.x
- The pipelines now contain additional features out of the box. See the token-classification pipeline with the grouped_entities flag.
- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the use_fast flag by setting it to False:
In version v3.x:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xxx")
to obtain the same in version v4.x:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)
2. SentencePiece is removed from the required dependencies
The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.
This includes the slow versions of:
XLNetTokenizer
AlbertTokenizer
CamembertTokenizer
MBartTokenizer
PegasusTokenizer
T5Tokenizer
ReformerTokenizer
XLMRobertaTokenizer
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should install sentencepiece additionally:
In version v3.x:
pip install transformers
to obtain the same in version v4.x:
pip install transformers[sentencepiece]
or
pip install transformers sentencepiece
3. The architecture of the repo has been updated so that each model resides in its folder
The past and foreseeable addition of new models means that the number of files in the directory src/transformers keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.
This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.
In version v3.x:
from transformers.modeling_bert import BertLayer
to obtain the same in version v4.x:
from transformers.models.bert.modeling_bert import BertLayer
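Note that only the internal module paths moved; classes exposed at the top level of the package keep working unchanged, for example:
from transformers import BertModel, BertTokenizer  # valid in both v3.x and v4.x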
4. Switching the return_dict argument to True by default
The return_dict argument enables the return of named-tuple-like python objects containing the model outputs, instead of the standard tuples. This object is self-documenting, as keys can be used to retrieve values, while it also behaves like a tuple: users may retrieve elements by index or by slice.
This is a breaking change: unlike a plain tuple, the returned object cannot be unpacked directly, so value0, value1 = outputs will no longer work.
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should set the return_dict argument to False, either in the model configuration or during the forward pass.
In version v3.x:
outputs = model(**inputs)
to obtain the same in version v4.x:
outputs = model(**inputs, return_dict=False)
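For a side-by-side view of the two behaviors, here is a short sketch (not from the original notes; any model works, a BERT classifier is used as an example):
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")

# v4.x default: a self-documented output, accessed by attribute, key or index
outputs = model(**inputs)
logits = outputs.logits

# v3.x-style behavior: a plain tuple that can be unpacked positionally
outputs = model(**inputs, return_dict=False)
logits = outputs[0]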
5. Removed some deprecated attributes
Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.
Here is a list of these attributes/methods/arguments and what their replacements should be:
In several models, the labels become consistent with the other models:
- masked_lm_labels becomes labels in AlbertForMaskedLM and AlbertForPreTraining.
- masked_lm_labels becomes labels in BertForMaskedLM and BertForPreTraining.
- masked_lm_labels becomes labels in DistilBertForMaskedLM.
- masked_lm_labels becomes labels in ElectraForMaskedLM.
- masked_lm_labels becomes labels in LongformerForMaskedLM.
- masked_lm_labels becomes labels in MobileBertForMaskedLM.
- masked_lm_labels becomes labels in RobertaForMaskedLM.
- lm_labels becomes labels in BartForConditionalGeneration.
- lm_labels becomes labels in GPT2DoubleHeadsModel.
- lm_labels becomes labels in OpenAIGPTDoubleHeadsModel.
- lm_labels becomes labels in T5ForConditionalGeneration.
In several models, the caching mechanism becomes consistent with the other models:
- decoder_cached_states becomes past_key_values in all BART-like, FSMT and T5 models.
- decoder_past_key_values becomes past_key_values in all BART-like, FSMT and T5 models.
- past becomes past_key_values in all CTRL models.
- past becomes past_key_values in all GPT-2 models.
Regarding the tokenizer classes:
- The tokenizer attribute max_len becomes model_max_length.
- The tokenizer attribute return_lengths becomes return_length.
- The tokenizer encoding argument is_pretokenized becomes is_split_into_words.
Regarding the Trainer class:
- The Trainer argument tb_writer is removed in favor of the callback TensorBoardCallback(tb_writer=...).
- The Trainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
- The Trainer attribute data_collator should be a callable.
- The Trainer method _log is deprecated in favor of log.
- The Trainer method _training_step is deprecated in favor of training_step.
- The Trainer method _prediction_loop is deprecated in favor of prediction_loop.
- The Trainer method is_local_master is deprecated in favor of is_local_process_zero.
- The Trainer method is_world_master is deprecated in favor of is_world_process_zero.
Regarding the TFTrainer class:
- The TFTrainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
- The TFTrainer method _log is deprecated in favor of log.
- The TFTrainer method _prediction_loop is deprecated in favor of prediction_loop.
- The TFTrainer method _setup_wandb is deprecated in favor of setup_wandb.
- The TFTrainer method _run_model is deprecated in favor of run_model.
Regarding the TrainerArgument and TFTrainerArgument classes:
- The TrainerArgument argument evaluate_during_training is deprecated in favor of evaluation_strategy.
- The TFTrainerArgument argument evaluate_during_training is deprecated in favor of evaluation_strategy.
Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute tie_weight becomes tie_words_embeddings.
- The Transfo-XL modeling method reset_length becomes reset_memory_length.
Regarding pipelines:
- The FillMaskPipeline argument topk becomes top_k.
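To make a few of these renames concrete, here is a small sketch (not part of the original notes; the checkpoint and inputs are placeholders) showing the v4.x names in use:
from transformers import BertTokenizerFast, BertForMaskedLM, pipeline

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# masked_lm_labels (v3.x) -> labels (v4.x)
inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

# is_pretokenized (v3.x) -> is_split_into_words (v4.x)
encoded = tokenizer(["Paris", "is", "nice"], is_split_into_words=True)

# FillMaskPipeline topk (v3.x) -> top_k (v4.x)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("Paris is the [MASK] of France.", top_k=5)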
Model Templates
Version 4.0.0 will be the first to include the experimental feature of model templates. These model templates aim to facilitate the addition of new models to the library by doing most of the work: generating the model/configuration/tokenization/test files that fit the API, with respect to the choice the user has made in terms of naming and functionality.
This release includes a model template for the encoder model (similar to the BERT architecture). Generating a model using the template will generate the files, put them at the appropriate location, reference them throughout the code-base, and generate a working test suite. The user should then only modify the files to their liking, rather than creating the model from scratch.
Feedback welcome, get started from the README here.
- Model templates encoder only #8509 (@LysandreJik)
New model additions
mT5 and T5 version 1.1 (@patrickvonplaten )
The T5v1.1 is an improved version of the original T5 model, see here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md
The multilingual T5 model (mT5) was presented in https://arxiv.org/abs/2010.11934 and is based on the T5v1.1 architecture.
Multiple pre-trained checkpoints have been added to the library:
Relevant pull requests:
- T5 & mT5 #8552 (@patrickvonplaten)
- [MT5] More docs #8589 (@patrickvonplaten)
- Fix init for MT5 #8591 (@sgugger)
TF DPR
The DPR model has been added in TensorFlow to match its PyTorch counterpart by @ratthachat
- Add TFDPR #8203 (@ratthachat)
TF Longformer
Additional heads have been added to the TensorFlow Longformer implementation: SequenceClassification, MultipleChoice and TokenClassification
- Tf longformer for sequence classification #8231 (@elk-cloner)
Bug fixes and improvements
- [s2s/distill] hparams.tokenizer_name = hparams.teacher #8382 (@ShichaoSun)
- [examples] better PL version check #8429 (@stas00)
- Question template #8440 (@sgugger)
- [docs] improve bart/marian/mBART/pegasus docs #8421 (@sshleifer)
- Add auto next sentence prediction #8432 (@jplu)
- Windows dev section in the contributing file #8436 (@jplu)
- [testing utils] get_auto_remove_tmp_dir more intuitive behavior #8401 (@stas00)
- Add missing import #8444 (@jplu)
- [T5 Tokenizer] Fix t5 special tokens #8435 (@patrickvonplaten)
- using multi_gpu consistently #8446 (@stas00)
- Add missing tasks to pipeline docstring #8428 (@bryant1410)
- [No merge] TF integration testing #7621 (@LysandreJik)
- [T5Tokenizer] fix t5 token type ids #8437 (@patrickvonplaten)
- Bug fix for apply_chunking_to_forward chunking dimension check #8391 (@pedrocolon93)
- Fix TF Longformer #8460 (@jplu)
- Add next sentence prediction loss computation #8462 (@jplu)
- Fix TF next sentence output #8466 (@jplu)
- Example NER script predicts on tokenized dataset #8468 (@sarnoult)
- Replaced unnecessary iadd operations on lists in tokenization_utils.py with proper list methods #8433 (@bombs-kim)
- Flax/Jax documentation #8331 (@mfuntowicz)
- [s2s] distill t5-large -> t5-small #8376 (@sbhaktha)
- Update deploy-docs dependencies on CI to enable Flax #8475 (@mfuntowicz)
- Fix on "examples/language-modeling" to support more datasets #8474 (@zeyuyun1)
- Fix doc bug #8500 (@mymusise)
- Model sharing doc #8498 (@sgugger)
- Fix SqueezeBERT for masked language model #8479 (@forresti)
- Fix logging in the examples #8458 (@jplu)
- Fix check scripts for Windows #8491 (@jplu)
- Add pretraining loss computation for TF Bert pretraining #8470 (@jplu)
- [T5] Bug correction & Refactor #8518 (@patrickvonplaten)
- Model sharing doc: more tweaks #8520 (@julien-c)
- [T5] Fix load weights function #8528 (@patrickvonplaten)
- Rework some TF tests #8492 (@jplu)
- [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency #8073 (@thomwolf)
- Adding the prepare_seq2seq_batch function to ProphetNet #8515 (@forest1988)
- Update version to v4.0.0-dev #8568 (@sgugger)
- TAPAS tokenizer & tokenizer tests #8482 (@LysandreJik)
- Switch return_dict to True by default. #8530 (@sgugger)
- Fix mixed precision issue for GPT2 #8572 (@jplu)
- Reorganize repo #8580 (@sgugger)
- Tokenizers: ability to load from model subfolder #8586 (@julien-c)
- Fix model templates #8595 (@sgugger)
- [examples tests] tests that are fine on multi-gpu #8582 (@stas00)
- Fix check repo utils #8600 (@sgugger)
- Tokenizers should be framework agnostic #8599 (@LysandreJik)
- Remove deprecated #8604 (@sgugger)
- Fixed link to the wrong paper. #8607 (@cronoik)
- Reset loss to zero on logging in Trainer to avoid bfloat16 issues #8561 (@bminixhofer)
- Fix DataCollatorForLanguageModeling #8621 (@sgugger)
- [s2s] multigpu skip #8613 (@stas00)
- [s2s] fix finetune.py to adjust for #8530 changes #8612 (@stas00)
- tf_bart typo - self.self.activation_dropout #8611 (@ratthachat)
- New TF loading weights #8490 (@jplu)
- Adding PrefixConstrainedLogitsProcessor #8529 (@nicola-decao)
- [Tokenizer Doc] Improve tokenizer summary #8622 (@patrickvonplaten)
- Fixes the training resuming with gradient accumulation #8624 (@sgugger)
- Fix training from scratch in new scripts #8623 (@sgugger)
- [s2s] distillation apex breaks return_dict obj #8631 (@stas00)
- Updated the Extractive Question Answering code snippe...
v3.5.1
v3.5.0: Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method
Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method
Model versioning
We host more and more of the community's models which is awesome ❤️. To scale this sharing, we needed to change the infra to both support more models, and unlock new powerful features.
To that effect, we have rebuilt the storage backend that we use for models (currently S3), to our own git repos (using S3 as a git-lfs endpoint for large files), with one model = one repo.
The benefits of this switch are:
- built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
- access control (will unlock private models, private datasets, etc)
- scalability (our usage of S3 to maintain lists of models was starting to bottleneck)
Let's dive in to the actual changes:
I. On the website
You'll now see a "Browse files and versions" tab or button on each model page. (design is not final, we'll make it more prominent/streamlined in the near future)
This is what this page looks like:
The UX should look familiar and self-explanatory, but we'll add more ML-specific features in the future.
You can:
- see commit histories and diffs of changes made to any text file, like config.json:
- changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you'll be able to opt out from those changes)
- Large binary files are stored using https://git-lfs.github.com/ which is pretty standard now, and interoperable out of the box with git
- Ability to update your text files, like your README.md model card, directly on the website!
- with instant preview 🔥
II. In the transformers library
The PR to enable this new storage mode in the transformers library is available here: #8324
This PR has two parts:
1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront so downloads should be as fast as they are right now.
In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.
For instance:
tokenizer = AutoTokenizer.from_pretrained(
"julien-c/EsperBERTo-small",
revision="v2.0.1" # tag name, or branch name, or commit hash
)
Finally, the networking code is more robust and doesn't gobble up errors anymore, so in case you have trouble downloading a specific file you'll know exactly why.
2. changes to the model upload CLI to create a model repo then be able to git clone and git push to it.
We are intentionally not wrapping git too much because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that is not the case.
To create a repo:
transformers-cli repo create your-model-name
Then you'll get a repo url that you'll be able to clone:
git clone https://huggingface.co/username/your-model-name
# Then commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
By the way, again, every model is its own repo. So you can git clone any public model if you'd like:
git clone https://huggingface.co/gpt2
But you won't be able to push unless it's one of your models (or one of your orgs').
III. Backward compatibility
- Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
⚠️ Model uploads using the current system won't work anymore: you'll need to upgrade your transformers installation to the next release, v3.5.0, or to build from master.
Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.
TFMarian, TFMbart, TFPegasus, TFBlenderbot
- Add tensorflow 2.0 functionality for SOTA seq2seq transformers #7987 (@sshleifer)
New and updated scripts
We're working on giving examples of how to leverage the 🤗 Datasets library and the Trainer API. Those scripts are meant as examples that are easy to customize, with lots of comments explaining the various steps. The following tasks are now covered:
- Text classification : New run glue script #7917 (@sgugger)
- Causal Language Modeling: New run_clm script #8105 (@sgugger)
- Masked Language Modeling: Add line by line option to mlm/plm scripts #8240 (@sgugger)
- Token classification: Add new token classification example #8340 (@sgugger)
Seq2Seq Trainer
A child of Trainer specialized for training seq2seq models, from @patil-suraj, @stas00 and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py. The API is similar to examples/seq2seq/finetune.py, but API support is better. Example scripts are in examples/seq2seq/builtin_trainer.
- [seq2seq testing] multigpu test run via subprocess #7281 (@stas00)
- [s2s trainer] tests to use distributed on multi-gpu machine #7965 (@stas00)
- [Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq #7809 (@patrickvonplaten)
- [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token #8043 (@patrickvonplaten)
- [Seq2SeqTrainer] Move import to init to make file self-contained #8194 (@patrickvonplaten)
- [s2s test] cleanup #8131 (@stas00)
- [Seq2Seq] Correct import in Seq2Seq Trainer #8254 (@patrickvonplaten)
- [Seq2Seq] Make Seq2SeqArguments an independent file #8267 (@patrickvonplaten)
- [Seq2SeqDataCollator] dont pass add_prefix_space=False to all tokenizers #8329 (@sshleifer)
Seq2Seq Testing and Documentation Improvements
- [s2s] create doc for pegasus/fsmt replication #7934 (@stas00)
- [s2s] test_distributed_eval #8315 (@stas00)
- [s2s] test_bash_script.py - actually learn something #8318 (@stas00)
- [s2s examples test] fix data path #8398 (@stas00)
- [s2s test_finetune_trainer] failing multigpu test #8400 (@stas00)
- [s2s/distill] remove run_distiller.sh, fix xsum script #8412 (@sshleifer)
Docs for DistillBART Paper Replication
Re-run experiments from the paper here
- [s2s] distillBART docs for paper replication #8150 (@sshleifer)
Refactoring the generate() function
The generate() method now has a new design so that the user can directly call upon the methods sample(), greedy_search(), beam_search() and beam_sample(). The code was made more readable, and beam search was sped up by ca. 5-10%.
- Refactoring the generate() function #6949 (@patrickvonplaten)
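For reference, a small sketch (not from the original notes; GPT-2 is just an example checkpoint) of how generate() dispatches to the decomposed methods depending on its arguments:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("The generate method", return_tensors="pt").input_ids

greedy = model.generate(input_ids, max_length=20)                             # greedy_search()
beams = model.generate(input_ids, max_length=20, num_beams=4)                 # beam_search()
sampled = model.generate(input_ids, max_length=20, do_sample=True, top_k=50)  # sample()
print(tokenizer.decode(greedy[0], skip_special_tokens=True))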
Notebooks
- added qg evaluation notebook #7958 (@zolekode)
- adding beginner-friendly notebook on text classification with DistilBERT/TF #7964 (@peterbayerle)
- [Notebooks] Add new encoder-decoder notebooks #8246 (@patrickvonplaten)
General improvements and bugfixes
- Respect the 119 line chars #7928 (@LysandreJik)
- PPL guide code snippet minor fix #7938 (@joeddav)
- [ProphetNet] Add Question Generation Model + Test #7942 (@patrickvonplaten)
- [multiple models] skip saving/loading deterministic state_dict keys #7878 (@stas00)
- Add missing comma #7870 (@mrm8488)
- TensorBoard/Wandb/optuna/raytune integration improvements. #7935 (@madlag)
- [ProphetNet] Correct Doc string example #7944 (@patrickvonplaten)
- [GPT2 batch generation] Make test clearer. do_sample=True is not deterministic. #7947 (@patrickvonplaten)
- fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) #7949 (@epwalsh)
- Herbert tokenizer auto load #7968 (@rmroczkowski)
- [testing] slow tests should be marked as slow #7895 (@stas00)
- support relative path for best_model_checkpoint #7973 (@HaebinShin)
- Disable inference API for t5-11b #7978 (@julien-c)
- [fsmt test] basic config test with online model + super tiny model #7860 (@stas00)
- Add whole word mask support for lm fine-tune #7925 (@wlhgtc)
- [PretrainedConfig] Fix save pretrained config for edge case #7943 (@patrickvonplaten)
- GPT2 - Remove else branch adding 0 to the hidden state if token_type_embeds is None. #7977 (@mfuntowicz)
- Fixing the "translation", "translation_XX_to_YY" pipelines. #7975 (@Narsil)
- FillMaskPipeline: support passing top_k on call #7971 (@julien-c)
- Only log total_flos at the end of training #7981 (@sgugger)
- add zero shot pipeline tags & examples #7983 (@joeddav)
- Reload checkpoint #7984 (@sgugger)
- [gh ci] less output ( --durations=50) #7989 (@sshleifer)
- Move NoLayerEmbedTokens #7945 (@sshleifer)
- update zero shot default widget example #7992 (@joeddav)
- [RAG] Handle the case when title is None while loading own datasets #7941 (@lalitpagaria)
- [tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups #7970 (@thomwolf)
- [Reformer] remove reformer pad_token_id #7991 (@patrickvonplaten)
- Fix BatchEncoding.word_to_tokens for removed tokens #7939 (@n1t0)
- Handling longformer model_type #7990 (@ethanjperez)
- [doc prepare_seq2seq_batch] fix docs #8013 (@patil-suraj)
- [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization #8006 (@thomwolf)
- Add mixed...
ProphetNet, Blenderbot, SqueezeBERT, DeBERTa
ProphetNet, Blenderbot, SqueezeBERT, DeBERTa
ProphetNet
Two new models are released as part of the ProphetNet implementation: ProphetNet and XLM-ProphetNet.
ProphetNet is an encoder-decoder model and can predict n-future tokens for “ngram” language modeling instead of just the next token.
XLM-ProphetNet is an encoder-decoder model with an identical architecture to ProphetNet, but the model was trained on the multi-lingual “wiki100” Wikipedia dump.
The ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
It was added to the library in PyTorch with the following checkpoints:
microsoft/xprophetnet-large-wiki100-cased-xglue-ntg
microsoft/prophetnet-large-uncased
microsoft/prophetnet-large-uncased-cnndm
microsoft/xprophetnet-large-wiki100-cased
microsoft/xprophetnet-large-wiki100-cased-xglue-qg
Contributions:
- ProphetNet #7157 (@qiweizhen, @patrickvonplaten)
BlenderBot
Blenderbot is an encoder-decoder model for open-domain chat. It uses a standard seq2seq model transformer-based architecture.
The Blender chatbot model was proposed in Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
It was added to the library in PyTorch with the following checkpoints:
facebook/blenderbot-90M
facebook/blenderbot-3B
Contributions:
- Blenderbot #7418 (@sshleifer)
SqueezeBERT
The SqueezeBERT model was proposed in SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.
It was added to the library in PyTorch with the following checkpoints:
squeezebert/squeezebert-mnli
squeezebert/squeezebert-uncased
squeezebert/squeezebert-mnli-headless
Contributions:
- SqueezeBERT architecture #7083 (@forresti)
- Fix squeezebert docs #7587 (@LysandreJik)
DeBERTa
The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.
It was added to the library in PyTorch with the following checkpoints:
microsoft/deberta-base
microsoft/deberta-large
Contributions:
- Add DeBERTa model #5929 (@BigBird01)
- Fix DeBERTa integration tests #7729 (@LysandreJik)
Both SentencePiece and Tokenizers are now optional libraries
Support for SentencePiece is now part of the tokenizers library! Thanks to this we now have near-full support of fast tokenizers in the library.
With this new feature, we slightly change the paradigm regarding installation:
- SentencePiece is now an optional dependency, paving the way to a fully-featured conda install in the near future
- Tokenizers is now also an optional dependency, making it possible to install and use the library even when rust cannot be compiled on the machine.
- [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies #7659 (@thomwolf)
The main __init__ has been improved to always import the same functions and classes. If someone then tries to use a class that requires an optional dependency, an ImportError will be raised at init (with instructions on how to install the missing dependency) #7537 (@sgugger)
Improvements made to the Trainer
The Trainer API has been improved to work with models requiring several labels or returning several outputs, and to have clearer progress tracking. A new TrainerCallback class has been added to allow the user to easily customize the default training loop; a short sketch follows the list below.
- Remove config assumption in Trainer #7464 (@sgugger)
- Clean the Trainer state #7490 (@sgugger)
- Small QOL improvements to TrainingArguments #7475 (@sgugger)
- Allow nested tensors in predicted logits #7542 (@sgugger)
- Trainer callbacks #7596 (@sgugger)
- Add specific notebook ProgressCalback #7793 (@sgugger)
- Small fixes to NotebookProgressCallback #7813 (@sgugger)
- Add predict step accumulation #7767 (@sgugger)
- Don't use store_xxx on optional bools #7786 (@sgugger)
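Here is the sketch mentioned above: a minimal, made-up callback (not part of the release) showing the shape of a TrainerCallback hook:
from transformers import TrainerCallback

class PrintEpochCallback(TrainerCallback):
    # Hypothetical example callback: report progress at the end of every epoch.
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Finished epoch {state.epoch} after {state.global_step} steps")

# The callback can then be passed to the Trainer, e.g.:
# trainer = Trainer(model=model, args=training_args, callbacks=[PrintEpochCallback()])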
Seq2Seq Trainer
A child of Trainer specialized for training seq2seq models, from @patil-suraj and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py.
- example scripts at examples/seq2seq/builtin_trainer/
- same functionality as examples/seq2seq/finetune.py, but better TPU support.
- [examples/s2s] clean up finetune_trainer #7509 (@patil-suraj)
- [s2s] trainer scripts: Remove --run_name, thanks sylvain! #7521 (@sshleifer)
- [s2s] Adafactor support for builtin trainer #7522 (@sshleifer)
- [s2s] add config params like Dropout in Seq2SeqTrainingArguments #7532 (@patil-suraj)
- Distributed Trainer: 2 little fixes #7461 (@sshleifer)
- [s2sTrainer] test + code cleanup #7467 (@sshleifer)
- Seq2SeqDataset: avoid passing src_lang everywhere #7470 (@amanpreet692)
- [s2strainer] fix eval dataset loading #7477 (@patil-suraj)
- [pseudolabels] cleanup markdown table #7653 (@sshleifer)
Distributed Generation
- You can run model.generate in pytorch on a large dataset and split the work across multiple GPUs, using examples/seq2seq/run_distributed_eval.py
- [s2s] release pseudolabel links and instructions #7639 (@sshleifer)
- [s2s] Fix t5 warning for distributed eval #7487 (@sshleifer)
- [s2s] fix kwargs style #7488 (@sshleifer)
- [s2s] fix lockfile and peg distillation constants #7545 (@sshleifer)
- [s2s] fix nltk pytest race condition with FileLock #7515 (@sshleifer)
Notebooks
- Train T5 in Tensoflow 2 Community Notebook #7428 (@HarrisDePerceptron)
General improvements and bugfixes
- remove codecov PR comments #7400 (@sshleifer)
- Get a better error when check_copies fails #7457 (@sgugger)
- Multi-GPU Testing setup #7453 (@LysandreJik)
- Fix LXMERT with DataParallel #7471 (@LysandreJik)
- Number of GPUs for multi-gpu #7472 (@LysandreJik)
- Make transformers install check positive #7473 (@FremyCompany)
- Alphabetize model lists #7478 (@sgugger)
- Bump isort version. #7484 (@sgugger)
- Add forgotten return_dict argument in the docs #7483 (@sgugger)
- Enable pegasus fp16 by clamping large activations #7243 (@sshleifer)
- Update LayoutLM doc #7388 (@Al31415)
- Report Tune metrics in final evaluation #7507 (@krfricke)
- Fix Ray Tune progress_reporter kwarg #7508 (@krfricke)
- [Seq2Seq] Fix a couple of bugs and clean examples #7474 (@patrickvonplaten)
- [Attention Mask] Fix data type #7513 (@patrickvonplaten)
- Fix seq2seq example test #7518 (@sgugger)
- Remove labels from the RagModel example #7560 (@sgugger)
- added script for fine-tuning roberta for sentiment analysis task #7505 (@DhavalTaunk08)
- LayoutLM: add exception handling for bbox values #7452 (@Al31415)
- Cleanup documentation for BART, Marian, MBART and Pegasus #7523 (@sgugger)
- Add Electra unexpected keys #7569 (@LysandreJik)
- Fix tokenization in SQuAD for RoBERTa, Longformer, BART #7387 (@tholor)
- docs(pretrained_models): fix num parameters #7575 (@amineabdaoui)
- Update Code example according to deprecation of AutoModeWithLMHead #7555 (@jshamg)
- Allow soft dependencies in the namespace with ImportErrors at use #7537 (@sgugger)
- Fix post_init of some TrainingArguments #7525 (@sgugger)
- Check and update model list in index.rst automatically #7527 (@sgugger)
- Expand test to locate flakiness #7580 (@sgugger)
- Custom TF weights loading #7422 (@jplu)
- Documentation fixes #7585 (@sgugger)
- Documentation framework toggle should stick #7586 (@LysandreJik)
- Support T5 Distillation w/hidden state supervision #7599 (@sshleifer)
- [makefile] check only .py files #7588 (@stas00)
- [TF generation] Fix typo #7582 (@SidJain1412)
- change return dicitonary for DataCollatorForNextSentencePrediction from masked_lm_labels to labels #7595 (@gmihaila)
- Docker GPU Images: Add NVIDIA/apex to the cuda images with pytorch #7598 (@AdrienDS)
- typo fix #7611 (@agemagician)
- [bart] fix config.classif_dropout #7593 (@sshleifer)
- [s2s] save first batch to json for debugging purposes #6810 (@sshleifer)
- Add GPT2ForSequenceClassification based on DialogRPT #7501 (@LysandreJik)
- Fix wrong reference name/filename in docstring of SquadProcessor #7616 (@phiyodr)
- Fix tokenizer UnboundLocalError when padding is set to PaddingStrategy.MAX_LENGTH #7610 (@GabrielePicco)
- Add GPT2 to sequence classification auto model #7630 (@LysandreJik)
- Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load #6935 (@w4nderlust)
- Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer #7141 (@thomwolf)
- Green tests: update torch-hub test dependencies (add protobuf and pin tokenizer 0.9.0-RC2) #7658 (@thomwolf)
- Fix RobertaForCausalLM docs #7642 (@LysandreJik)
- [s2s] configure lr_scheduler from command line #7641 (@patil-suraj)
- [pseudo] Switch URLS to CDN #7661 (@sshleifer)
- [s2s] Switch README urls to cdn #7670 (@sshleifer)
- fix nn.DataParallel compatibility with PyTorch 1.5 #7671 (@guhur)
- Update XLM-RoBERTa pretrained model details #7669 (@noahtren)
- Fix dataset cardinality #7678 (@jplu)
- [pegasus] Faster ...
v3.3.1
RAG
RAG
RAG Model
The RAG model is a retrieval-augmented generation model that can be leveraged for question-answering tasks using RagTokenForGeneration or RagSequenceForGeneration, as proposed in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
It was added to the library in PyTorch with the following checkpoints:
facebook/rag-token-nq
facebook/rag-sequence-nq
facebook/rag-token-base
facebook/rag-sequence-base
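A short usage sketch for the rag-token-nq checkpoint (following the general pattern of the RAG documentation at the time; the dummy index is used to keep the example light, and the question is arbitrary):
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# use_dummy_dataset avoids downloading the full wiki_dpr index for this sketch
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])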
Contributions:
- RAG #6813 (@ola13)
- [RAG] Add attention_mask to RAG generate #7373 (@patrickvonplaten)
- [RAG] Add missing doc and attention_mask to rag #7382 (@patrickvonplaten)
- [Rag] Fix wrong usage of num_beams and bos_token_id in Rag Sequence generation #7386 (@patrickvonplaten)
- [RAG] Fix retrieval offset in RAG's HfIndex and better integration tests #7372 (@lhoestq)
- [RAG] Remove dependency on examples/seq2seq from rag #7395 (@ola13)
- [Rag] fix rag retriever save_pretrained method #7399 (@patrickvonplaten)
- [RAG] Clean Rag readme in examples #7413 (@ola13)
- [RAG] Model cards - clean cards #7420 (@patrickvonplaten)
- Document RAG again #7377 (@sgugger)
Bug fixes and improvements
- Mark big downloads slow #7325 (@sgugger)
- [Bug Fix] The actual batch_size is inconsistent with the settings. #7235 (@HuangLianzhe)
- Fixed results of SQuAD-FR evaluation #7313 (@psorianom)
- [s2s] add supported architecures to MD #7252 (@sshleifer)
- Add num workers cli arg #7322 (@chadykamar)
- [s2s] add src_lang kwarg for distributed eval #7300 (@sshleifer)
- [s2s] only save metrics.json from rank zero #7331 (@sshleifer)
- [code quality] fix confused flake8 #7309 (@stas00)
- [testing] skip decorators: docs, tests, bugs #7334 (@stas00)
- Fixed evaluation_strategy on epoch end bug #7340 (@WissamAntoun)
- Models doc #7345 (@sgugger)
- Ensure that integrations are imported before transformers or ml libs #7330 (@dsblank)
- [Benchmarks] Change all args to from no_... to their positive form #7075 (@fmcurti)
- Remove reference to args in XLA check #7344 (@ZeroCool2u)
- wip: Code to add lang tags to marian model cards #6586 (@sshleifer)
- Expand a bit the documentation doc #7350 (@sgugger)
- Check decorator order #7326 (@sgugger)
- Update modeling_tf_longformer.py #7359 (@Line290)
- Updata tokenization_auto.py #6870 (@hjptriplebee)
- Update the TF models to remove their interdependencies #7238 (@jplu)
- Make PyTorch model files independent from each other #7352 (@sgugger)
- Clean RAG docs and template docs #7348 (@sgugger)
- Fixing case in which Trainer hung while saving model in distributed training #7365 (@TevenLeScao)
- Formatter #7368 (@LysandreJik)
- [seq2seq] make it easier to run the scripts #7274 (@stas00)
- Remove mentions of RAG from the docs #7376 (@sgugger)
- [fsmt] build/test scripts #7257 (@stas00)
- [s2s] distributed eval allows num_return_sequences > 1 #7254 (@sshleifer)
- Seq2SeqTrainer #6769 (@patil-suraj)
- modeling_bart: 3 small cleanups that dont change outputs #7381 (@sshleifer)
- Check config type using type instead of isinstance #7363 (@LysandreJik)
- [s2s, examples] minor doc changes #7385 (@patil-suraj)
- Remove unhelpful bart warning #7391 (@sshleifer)
- [code quality] new make target that combines style and quality targets #7310 (@stas00)
- Speedup check_copies script #7394 (@sgugger)
- Fix BartModel output documentation #7390 (@sgugger)
- Fix FP16 and attention masks in FunnelTransformer #7374 (@sgugger)
- [Longformer, Bert, Roberta, ...] Fix multi gpu training #7272 (@patrickvonplaten)
- [s2s] add create student script #7290 (@patil-suraj)
- [s2s] rougeLSum expects \n between sentences #7410 (@sshleifer)
- [T5] allow config.decoder_layers to control decoer size #7409 (@sshleifer)
- Flos fix #7384 (@marrrcin)
- Catch PyTorch warning when saving/loading scheduler #7401 (@sgugger)
- Pull request template #7392 (@LysandreJik)
- Reorganize documentation navbar #7423 (@sgugger)
Bert Seq2Seq models, FSMT, LayoutLM, Funnel Transformer, LXMERT
Bert Seq2Seq models, FSMT, Funnel Transformer, LXMERT
BERT Seq2seq models
The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
It was added to the library in PyTorch with the following checkpoints:
google/roberta2roberta_L-24_bbc
google/roberta2roberta_L-24_gigaword
google/roberta2roberta_L-24_cnn_daily_mail
google/roberta2roberta_L-24_discofuse
google/roberta2roberta_L-24_wikisplit
google/bert2bert_L-24_wmt_de_en
google/bert2bert_L-24_wmt_en_de
Contributions:
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. #6594 (@patrickvonplaten)
FSMT (FairSeq MachineTranslation)
FSMT (FairSeq MachineTranslation) models were introduced in Facebook FAIR’s WMT19 News Translation Task Submission by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
It was added to the library in PyTorch, with the following checkpoints:
facebook/wmt19-en-ru
facebook/wmt19-en-de
facebook/wmt19-ru-en
facebook/wmt19-de-en
Contributions:
- [ported model] FSMT (FairSeq MachineTranslation) #6940 (@stas00)
- build/eval/gen-card scripts for fsmt #7155 (@stas00)
- skip failing FSMT CUDA tests until investigated #7220 (@stas00)
- [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test #7224 (@stas00)
- [s2s] adjust finetune + test to work with fsmt #7263 (@stas00)
- [fsmt] SinusoidalPositionalEmbedding no need to pass device #7292 (@stas00)
- Adds FSMT to LM head AutoModel #7312 (@LysandreJik)
LayoutLM
The LayoutLM model was proposed in LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It’s a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.
It was added to the library in PyTorch with the following checkpoints:
layoutlm-base-uncased
layoutlm-large-uncased
Contributions:
- Add LayoutLM Model #7064 (@liminghao1630)
- Fixes for LayoutLM #7318 (@sgugger)
Funnel Transformer
The Funnel Transformer model was proposed in the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks (CNN) in computer vision.
It was added to the library in both PyTorch and TensorFlow, with the following checkpoints:
funnel-transformer/small
funnel-transformer/small-base
funnel-transformer/medium
funnel-transformer/medium-base
funnel-transformer/intermediate
funnel-transformer/intermediate-base
funnel-transformer/large
funnel-transformer/large-base
funnel-transformer/xlarge
funnel-transformer/xlarge-base
Contributions:
LXMERT
The LXMERT model was proposed in LXMERT: Learning Cross-Modality Encoder Representations from Transformers by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities) pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
It was added to the library in TensorFlow with the following checkpoints:
unc-nlp/lxmert-base-uncased
unc-nlp/lxmert-vqa-uncased
unc-nlp/lxmert-gqa-uncased
Contributions
- Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models #5793 (@eltoto1219)
- [LXMERT] Fix tests on gpu #6946 (@patrickvonplaten)
New pipelines
The following pipeline was added to the library:
- [pipelines] Text2TextGenerationPipeline #6744 (@patil-suraj)
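A quick sketch of the new pipeline (the model and prompt are arbitrary examples, not from the release notes):
from transformers import pipeline

text2text = pipeline("text2text-generation", model="t5-small")
print(text2text("translate English to German: The house is wonderful."))
# e.g. [{'generated_text': 'Das Haus ist wunderbar.'}]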
Notebooks
The following community notebooks were contributed to the library:
- Demoing LXMERT with raw images by incorporating the FRCNN model for roi-pooled extraction and bounding-box prediction on the GQA answer set. #6986 (@eltoto1219)
- [Community notebooks] Add notebook on fine-tuning GPT-2 Model with Trainer Class #7005 (@philschmid)
- Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks #7255 (@NadirEM)
- added multilabel text classification notebook using distilbert to community notebooks #7201 (@DhavalTaunk08)
Encoder-decoder architectures
An additional encoder-decoder architecture was added:
- [EncoderDecoder] Add xlm-roberta to encoder decoder #6878 (@patrickvonplaten)
Bug fixes and improvements
- TF Flaubert w/ pre-norm #6841 (@LysandreJik)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)
- Fix in Adafactor docstrings #6845 (@sgugger)
- Fix resuming training for Windows #6847 (@sgugger)
- Only access loss tensor every logging_steps #6802 (@jysohn23)
- Marian distill scripts + integration test #6799 (@sshleifer)
- Add checkpointing to Ray Tune HPO #6747 (@krfricke)
- Split hp search methods #6857 (@sgugger)
- Update ONNX notebook to include section on quantization. #6831 (@mfuntowicz)
- Fix marian slow test #6854 (@sshleifer)
- [s2s] command line args for faster val steps #6833 (@sshleifer)
- Bart can make decoder_input_ids from labels #6758 (@sshleifer)
- add a final report to all pytest jobs #6861 (@stas00)
- Logging doc #6852 (@sgugger)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. #6875 (@mfuntowicz)
- [Generate] Facilitate PyTorch generate using ModelOutputs #6735 (@patrickvonplaten)
- Add cache_dir to save features TextDataset #6879 (@jysohn23)
- [Docs, Examples] Fix QA example for PT #6890 (@patrickvonplaten)
- Update modeling_bert.py #6897 (@parthe)
- [Electra] fix warning for position ids #6884 (@patrickvonplaten)
- minor docs grammar fixes #6889 (@harrywang)
- Fix error class instantiation #6634 (@tamuhey)
- Output attention takes an s #6903 (@sgugger)
- [testing] fix ambiguous test #6898 (@stas00)
- test_tf_common: remove un_used mixin class parameters #6866 (@PuneethaPai)
- Template updates #6914 (@sgugger)
- Changed link to the correct paper in the second paragraph #6905 (@sengl)
- tweak tar command in readme #6919 (@brettkoonce)
- [s2s]: script to convert pl checkpoints to hf checkpoints #6911 (@sshleifer)
- [s2s] allow task_specific_params=summarization_xsum #6923 (@sshleifer)
- move wandb/comet logger init to train() to allow parallel logging #6850 (@krfricke)
- [s2s] use --eval_beams command line arg #6926 (@sshleifer)
- [s2s] support early stopping based on loss, rather than rouge #6927 (@sshleifer)
- Fix mixed precision issue in TF DistilBert #6915 (@chiapas)
- [docstring] misc arg doc corrections #6932 (@stas00)
- [s2s] distill: --normalize_hidden --supervise_forward #6834 (@sshleifer)
- [s2s] run_eval.py parses generate_kwargs #6948 (@sshleifer)
- [doc] remove the implied defaults to :obj:`None`, s/True/ :obj:`True/, etc. #6956 (@stas00)
- [s2s] warn if --fp16 for torch 1.6 #6977 (@sshleifer)
- feat: allow prefix for any generative model #5885 (@borisdayma)
- Trainer with grad accum #6930 (@sgugger)
- Cannot index None #6984 (@LysandreJik)
- [docstring] missing arg #6933 (@stas00)
- [testing] add dependency: parametrize #6958 (@stas00)
- Fixed the default number of attention heads in Reformer Configuration #6973 (@tznurmin)
- [gen utils] missing else case #6980 (@stas00)
- match CI's version of flake8 #6941 (@stas00)
- Conversion scripts shouldn't have relative imports #6991 (@LysandreJik)
- Add missing arguments for BertWordPieceTokenizer #5810 (@monologg)
- fixed trainer tr_loss memory leak #6999 (@StuartMesham)
- Floating-point operations logging in trainer #6768 (@TevenLeScao)
- Fixing FLOPS merge by checking if torch is available #7013 (@LysandreJik)
- [Longformer] Fix longformer documentation #7016 (@patrickvonplaten)
- pegasus.rst: fix expected output #7017 (@sshleifer)
- adding TRANSFORMERS_VERBOSITY env var #6961 (@stas00)
- [generation] consistently add eos tokens #6982 (@stas00)
- [from_pretrained] Allow tokenizer_type ≠ model_type #6995 (@julien-c)
- replace torch.triu with onnx compatible code #6929 (@HenryDashwood)
- Batch encore plus and overflowing tokens fails when non existing overflowing tokens for a sequence #6677 (@LysandreJik)
- add -y to bypass prompt for transformers-cli upload #7035 (@stas00)
- Fix confusing warnings during TF2 import from PyTorch #6623 (@jcrocholl)
- Albert pretrain datasets/ datacollator #6168 (@yl-to)
- Fix template #7040 (@LysandreJik)
- Small fixes in tf template #7044 (@sgugger)
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. #6594 (@patrickvonplaten)
- fix to ensure that returned tensors after the tokenization is Long #7039 (@GeetDsa)
- [BertGeneration] Correct Doc Title #7048 (@patrickvonplaten)
- [BertGeneration, Docs] Fix another old name in docs #7050 (@patrickvonplaten)
- [xlm tok] config dict: fix str into int to match definition #7034 (@stas00)
- [s2s] --eval_max_generate_length #7018 (@sshleifer)
- Fix CI w...
Pegasus, DPR, self-documented outputs, new pipelines and MT support
Pegasus, mBART, DPR, self-documented outputs and new pipelines
Pegasus
The Pegasus model from PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.
Model implemented as a collaboration between Jingqing Zhang and @sshleifer in #6340
- PegasusForConditionalGeneration (torch version) #6340
- add pegasus finetuning script #6811 (warning: very slow)
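A short usage sketch (assuming the google/pegasus-xsum checkpoint from the model hub; the source text is a placeholder):
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

src_text = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds."]
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])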
DPR
The DPR model from Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.
DeeBERT
The DeeBERT model from DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the examples/ folder alongside its training script, in PyTorch.
Self-documented outputs
As well as returning tuples, PyTorch and TensorFlow models now return a subclass of ModelOutput that is appropriate. A ModelOutput is a dataclass containing all model returns. This allows for easier inspection, and for self-documenting model outputs.
- Change model outputs types to self-document outputs #5438 (@sgugger)
- Tf model outputs #6247 (@sgugger)
Models return tuples by default, and return self-documented outputs if the return_dict configuration flag is set to True or if the return_dict=True keyword argument is passed to the forward/call method.
Summary of the behavior:
# The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`
# Either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
# Or when calling the model
output = model(**inputs, return_dict=True)
# You can access the elements of the outputs with
# (1) named attributes
loss = outputs.loss
logits = outputs.logits
# (2) their names as strings like a dict
loss = outputs["loss"]
logits = outputs["logits"]
# (3) their index as integers or slices in the pre-3.1.0 outputs tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]
# One **breaking behavior** of these new outputs (which is the reason you have to opt-in to use them):
# Iterating on the outputs now returns the names (keys) instead of the values:
print([element for element in outputs])
>>> ['loss', 'logits']
# Thus you cannot unpack the output like pre-3.1.0 (you get the string names instead of the values):
# (But you can query a slice like indicated in (3) above)
loss_keys, logits_key = outputs
Encoder-Decoder framework
The encoder-decoder framework has been enhanced to allow more encoder decoder model combinations, e.g.:
Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, ....
- [EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer #6411 (@patrickvonplaten)
- [EncoderDecoder] Add Cross Attention for GPT2 #6415 (@patrickvonplaten)
- [EncoderDecoder] Add functionality to tie encoder decoder weights #6538 (@patrickvonplaten)
- Multiple combinations of EncoderDecoder models have been fine-tuned and evaluated on CNN/Daily-Mail summarization: https://huggingface.co/models?search=cnn_dailymail-fp16 (@patrickvonplaten)
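A minimal sketch of warm-starting one of these combinations (model names are examples; the resulting model's cross-attention weights are newly initialized, so it still needs seq2seq fine-tuning):
from transformers import EncoderDecoderModel

# Warm-start a Bert2Bert encoder-decoder from two pretrained BERT checkpoints
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.save_pretrained("bert2bert")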
TensorFlow as a first-class citizen
As we continue working towards having TensorFlow be a first-class citizen, we continually improve on our TensorFlow API and models.
- [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile #5395 (@patrickvonplaten)
- [Benchmark] Add benchmarks for TF Training #5594 (@patrickvonplaten)
Machine Translation
MarianMTModel
- en-zh and 357 other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (@sshleifer + @jorgtied). There are now > 1300 supported pairs for machine translation.
- Marian converter updates #6342 (@sshleifer)
- Marian distill scripts + integration test #6799 (@sshleifer)
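For reference, a translation sketch with one of the Helsinki-NLP checkpoints (the checkpoint name and sentence are arbitrary examples):
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"  # one en-zh pair among the Tatoeba checkpoints
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["How are you today?"], return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])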
mBART
The mBART model from Multilingual Denoising Pre-training for Neural Machine Translation can now be accessed through MBartForConditionalGeneration.
- Add mbart-large-cc25, support translation finetuning #5129 (@sshleifer)
- [mbart] prepare_translation_batch passes **kwargs to allow DeprecationWarning #5581 (@sshleifer)
- MBartForConditionalGeneration #6441 (@patil-suraj)
- [fix] mbart_en_ro_generate test now identical to fairseq #5731 (@sshleifer)
- [Doc] explaining romanian postprocessing for MBART BLEU hacking #5943 (@sshleifer)
- [test] partial coverage for train_mbart_enro_cc25.sh #5976 (@sshleifer)
- MbartTokenizer: do not hardcode vocab size #5998 (@sshleifer)
- MBART: support summarization tasks where max_src_len > max_tgt_len #6003 (@sshleifer)
- Fix #6096: MBartTokenizer's mask token #6098 (@sshleifer)
- [s2s] Document better mbart finetuning command #6229 (@sshleifer)
- mBART Conversion script #6230 (@sshleifer)
- [s2s] add BartTranslationDistiller for distilling mBART #6363 (@sshleifer)
- [Doc] add more MBart and other doc #6490 (@patil-suraj)
examples/seq2seq
- examples/seq2seq/finetune.py supports --task translation
- All sequence to sequence tokenizers (T5, Bart, Marian, Pegasus) expose a prepare_seq2seq_batch method that makes batches for sequence to sequence training.
PRs:
- Seq2SeqDataset uses linecache to save memory #5792 (@Pradhy729)
- [examples/seq2seq]: add --label_smoothing option #5919 (@sshleifer)
- seq2seq/run_eval.py can take decoder_start_token_id #5949 (@sshleifer)
- [examples (seq2seq)] fix preparing decoder_input_ids for T5 #5994 (@patil-suraj)
- [s2s] add support for overriding config params #6149 (@stas00)
- s2s: fix LR logging, remove some dead code. #6205 (@sshleifer)
- [s2s] tiny QOL improvement: run_eval prints scores #6341 (@sshleifer)
- [s2s] fix label_smoothed_nll_loss #6344 (@patil-suraj)
- [s2s] fix --gpus clarg collision #6358 (@sshleifer)
- [s2s] Script to save wmt data to disk #6403 (@sshleifer)
- rename prepare_translation_batch -> prepare_seq2seq_batch #6103 (@sshleifer)
- Mult rouge by 100: standard units #6359 (@sshleifer)
- allow spaces in bash args with "$@" #6521 (@sshleifer)
- [seq2seq] MAX_LEN env var for MT commands #5837 (@sshleifer)
- [seq2seq] distillation.py accepts trainer arguments #5865 (@sshleifer)
- [s2s]Use prepare_translation_batch for Marian finetuning #6293 (@sshleifer)
- [BartTokenizer] add prepare s2s batch #6212 (@patil-suraj)
- [T5Tokenizer] add prepare_seq2seq_batch method #6122 (@patil-suraj)
- [s2s] round runtime in run_eval #6798 (@sshleifer)
- [s2s README] Add more dataset download instructions #6737 (@sshleifer)
- [s2s] round bleu, rouge to 4 digits #6704 (@sshleifer)
- [s2s] command line args for faster val steps #6833
New documentation
Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An open in colab button has been added on the tutorial pages.
- Guide to fixed-length model perplexity evaluation #5449 (@joeddav)
- Improvements to PretrainedConfig documentation #5642 (@sgugger)
- Document model outputs #5673 (@sgugger)
- docs(wandb): explain how to use W&B integration #5607 (@borisdayma)
- Model utils doc #6005 (@sgugger)
- ONNX documentation #5992 (@mfuntowicz)
- Tokenizer documentation #6110 (@sgugger)
- Pipeline documentation #6175 (@sgugger)
- Encoder decoder config docs #6195 (@afcruzs)
- Colab button #6389 (@sgugger)
- Generation documentation #6470 (@sgugger)
- Add custom datasets tutorial #6466 (@joeddav)
- Logging documentation #6852 (@sgugger)
Trainer updates
New additions to the Trainer
- Added data collator for permutation (XLNet) language modeling and related calls #5522 (@shngt)
- Trainer support for iterabledataset #5834 (@Pradhy729)
- Adding PaddingDataCollator #6442 (@sgugger)
- Add hyperparameter search to Trainer #6576 (@sgugger)
- [examples] Add trainer support for question-answering #4829 (@patil-suraj)
- Adds comet_ml to the list of auto-experiment loggers #6176 (@dsblank)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)
New models & model architectures
The following model architectures have been added to the library
- FlaubertForTokenClassification #5644 (@stas00)
- TFXLMForTokenClassification #5614 (@LysandreJik)
- TFXLMForMultipleChoice #5614 (@LysandreJik)
- TFFlaubertForTokenClassification #5614 (@LysandreJik)
- TFFlaubertForMultipleChoice #5614 (@LysandreJik)
- TFElectraForSequenceClassification #6227 (@jplu)
- TFElectraForMultipleChoice #6227 (@jplu)
- TF Longformer #5764 (@patrickvonplaten)
- CamembertForCausalLM #6577 (@patil-suraj)
Regression testing on TPU & TPU CI
Thanks to @zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the Trainer, and offers very simple regression testing on model training performance.
- Test XLA examples #5583
- Add setup for TPU CI to run every hour. #6219 (@zcain117)
- Add missing docker arg for TPU CI. #6393 (@zcain117)
- Get GKE logs via kubectl logs instead of gcloud logging read. #6446 (@zcain117)
New pipelines
New pipe...