Releases: huggingface/transformers
Patch release: better error message & invalid trainer attribute
Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization
Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization
Breaking changes since v3.x
Version v4.0.0 introduces several breaking changes that were necessary.
1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.
The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens, which differs between the python and rust tokenizers.
How to obtain the same behavior as v3.x in v4.x
- The pipelines now contain additional features out of the box. See the token-classification pipeline with the grouped_entities flag.
- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the use_fast flag by setting it to False:
In version v3.x:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xxx")
to obtain the same in version v4.x:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)
2. SentencePiece is removed from the required dependencies
The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.
This includes the slow versions of:
XLNetTokenizer
AlbertTokenizer
CamembertTokenizer
MBartTokenizer
PegasusTokenizer
T5Tokenizer
ReformerTokenizer
XLMRobertaTokenizer
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should install sentencepiece additionally:
In version v3.x:
pip install transformers
to obtain the same in version v4.x:
pip install transformers[sentencepiece]
or
pip install transformers sentencepiece
3. The architecture of the repo has been updated so that each model resides in its folder
The past and foreseeable addition of new models means that the number of files in the directory src/transformers keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.
This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.
In version v3.x:
from transformers.modeling_bert import BertLayer
to obtain the same in version v4.x:
from transformers.models.bert.modeling_bert import BertLayer
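Note that only the internal module paths moved; classes exposed at the top level of the package keep working unchanged, for example:
from transformers import BertModel, BertTokenizer  # valid in both v3.x and v4.x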
4. Switching the return_dict argument to True by default
The return_dict argument enables the return of named-tuple-like python objects containing the model outputs, instead of the standard tuples. This object is self-documenting, as keys can be used to retrieve values, while it also behaves like a tuple: users may retrieve elements by index or by slice.
This is a breaking change: unlike a plain tuple, the returned object cannot be unpacked directly, so value0, value1 = outputs will no longer work.
How to obtain the same behavior as v3.x in v4.x
In order to obtain the same behavior as version v3.x, you should set the return_dict argument to False, either in the model configuration or during the forward pass.
In version v3.x:
outputs = model(**inputs)
to obtain the same in version v4.x:
outputs = model(**inputs, return_dict=False)
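For a side-by-side view of the two behaviors, here is a short sketch (not from the original notes; any model works, a BERT classifier is used as an example):
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")

# v4.x default: a self-documented output, accessed by attribute, key or index
outputs = model(**inputs)
logits = outputs.logits

# v3.x-style behavior: a plain tuple that can be unpacked positionally
outputs = model(**inputs, return_dict=False)
logits = outputs[0]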
5. Removed some deprecated attributes
Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.
Here is a list of these attributes/methods/arguments and what their replacements should be:
In several models, the labels become consistent with the other models:
- masked_lm_labels becomes labels in AlbertForMaskedLM and AlbertForPreTraining.
- masked_lm_labels becomes labels in BertForMaskedLM and BertForPreTraining.
- masked_lm_labels becomes labels in DistilBertForMaskedLM.
- masked_lm_labels becomes labels in ElectraForMaskedLM.
- masked_lm_labels becomes labels in LongformerForMaskedLM.
- masked_lm_labels becomes labels in MobileBertForMaskedLM.
- masked_lm_labels becomes labels in RobertaForMaskedLM.
- lm_labels becomes labels in BartForConditionalGeneration.
- lm_labels becomes labels in GPT2DoubleHeadsModel.
- lm_labels becomes labels in OpenAIGPTDoubleHeadsModel.
- lm_labels becomes labels in T5ForConditionalGeneration.
In several models, the caching mechanism becomes consistent with the other models:
- decoder_cached_states becomes past_key_values in all BART-like, FSMT and T5 models.
- decoder_past_key_values becomes past_key_values in all BART-like, FSMT and T5 models.
- past becomes past_key_values in all CTRL models.
- past becomes past_key_values in all GPT-2 models.
Regarding the tokenizer classes:
- The tokenizer attribute max_len becomes model_max_length.
- The tokenizer attribute return_lengths becomes return_length.
- The tokenizer encoding argument is_pretokenized becomes is_split_into_words.
Regarding the Trainer class:
- The Trainer argument tb_writer is removed in favor of the callback TensorBoardCallback(tb_writer=...).
- The Trainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
- The Trainer attribute data_collator should be a callable.
- The Trainer method _log is deprecated in favor of log.
- The Trainer method _training_step is deprecated in favor of training_step.
- The Trainer method _prediction_loop is deprecated in favor of prediction_loop.
- The Trainer method is_local_master is deprecated in favor of is_local_process_zero.
- The Trainer method is_world_master is deprecated in favor of is_world_process_zero.
Regarding the TFTrainer class:
- The TFTrainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
- The TFTrainer method _log is deprecated in favor of log.
- The TFTrainer method _prediction_loop is deprecated in favor of prediction_loop.
- The TFTrainer method _setup_wandb is deprecated in favor of setup_wandb.
- The TFTrainer method _run_model is deprecated in favor of run_model.
Regarding the TrainerArgument and TFTrainerArgument classes:
- The TrainerArgument argument evaluate_during_training is deprecated in favor of evaluation_strategy.
- The TFTrainerArgument argument evaluate_during_training is deprecated in favor of evaluation_strategy.
Regarding the Transfo-XL model:
- The Transfo-XL configuration attribute tie_weight becomes tie_words_embeddings.
- The Transfo-XL modeling method reset_length becomes reset_memory_length.
Regarding pipelines:
- The FillMaskPipeline argument topk becomes top_k.
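To make a few of these renames concrete, here is a small sketch (not part of the original notes; the checkpoint and inputs are placeholders) showing the v4.x names in use:
from transformers import BertTokenizerFast, BertForMaskedLM, pipeline

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# masked_lm_labels (v3.x) -> labels (v4.x)
inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

# is_pretokenized (v3.x) -> is_split_into_words (v4.x)
encoded = tokenizer(["Paris", "is", "nice"], is_split_into_words=True)

# FillMaskPipeline topk (v3.x) -> top_k (v4.x)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("Paris is the [MASK] of France.", top_k=5)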
Model Templates
Version 4.0.0 will be the first to include the experimental feature of model templates. These model templates aim to facilitate the addition of new models to the library by doing most of the work: generating the model/configuration/tokenization/test files that fit the API, with respect to the choice the user has made in terms of naming and functionality.
This release includes a model template for the encoder model (similar to the BERT architecture). Generating a model using the template will generate the files, put them at the appropriate location, reference them throughout the code-base, and generate a working test suite. The user should then only modify the files to their liking, rather than creating the model from scratch.
Feedback welcome, get started from the README here.
- Model templates encoder only #8509 (@LysandreJik)
New model additions
mT5 and T5 version 1.1 (@patrickvonplaten )
The T5v1.1 is an improved version of the original T5 model, see here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md
The multilingual T5 model (mT5) was presented in https://arxiv.org/abs/2010.11934 and is based on the T5v1.1 architecture.
Multiple pre-trained checkpoints have been added to the library:
Relevant pull requests:
- T5 & mT5 #8552 (@patrickvonplaten)
- [MT5] More docs #8589 (@patrickvonplaten)
- Fix init for MT5 #8591 (@sgugger)
TF DPR
The DPR model has been added in TensorFlow to match its PyTorch counterpart by @ratthachat
- Add TFDPR #8203 (@ratthachat)
TF Longformer
Additional heads have been added to the TensorFlow Longformer implementation: SequenceClassification, MultipleChoice and TokenClassification
- Tf longformer for sequence classification #8231 (@elk-cloner)
Bug fixes and improvements
- [s2s/distill] hparams.tokenizer_name = hparams.teacher #8382 (@ShichaoSun)
- [examples] better PL version check #8429 (@stas00)
- Question template #8440 (@sgugger)
- [docs] improve bart/marian/mBART/pegasus docs #8421 (@sshleifer)
- Add auto next sentence prediction #8432 (@jplu)
- Windows dev section in the contributing file #8436 (@jplu)
- [testing utils] get_auto_remove_tmp_dir more intuitive behavior #8401 (@stas00)
- Add missing import #8444 (@jplu)
- [T5 Tokenizer] Fix t5 special tokens #8435 (@patrickvonplaten)
- using multi_gpu consistently #8446 (@stas00)
- Add missing tasks to pipeline docstring #8428 (@bryant1410)
- [No merge] TF integration testing #7621 (@LysandreJik)
- [T5Tokenizer] fix t5 token type ids #8437 (@patrickvonplaten)
- Bug fix for apply_chunking_to_forward chunking dimension check #8391 (@pedrocolon93)
- Fix TF Longformer #8460 (@jplu)
- Add next sentence prediction loss computation #8462 (@jplu)
- Fix TF next sentence output #8466 (@jplu)
- Example NER script predicts on tokenized dataset #8468 (@sarnoult)
- Replaced unnecessary iadd operations on lists in tokenization_utils.py with proper list methods #8433 (@bombs-kim)
- Flax/Jax documentation #8331 (@mfuntowicz)
- [s2s] distill t5-large -> t5-small #8376 (@sbhaktha)
- Update deploy-docs dependencies on CI to enable Flax #8475 (@mfuntowicz)
- Fix on "examples/language-modeling" to support more datasets #8474 (@zeyuyun1)
- Fix doc bug #8500 (@mymusise)
- Model sharing doc #8498 (@sgugger)
- Fix SqueezeBERT for masked language model #8479 (@forresti)
- Fix logging in the examples #8458 (@jplu)
- Fix check scripts for Windows #8491 (@jplu)
- Add pretraining loss computation for TF Bert pretraining #8470 (@jplu)
- [T5] Bug correction & Refactor #8518 (@patrickvonplaten)
- Model sharing doc: more tweaks #8520 (@julien-c)
- [T5] Fix load weights function #8528 (@patrickvonplaten)
- Rework some TF tests #8492 (@jplu)
- [breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency #8073 (@thomwolf)
- Adding the prepare_seq2seq_batch function to ProphetNet #8515 (@forest1988)
- Update version to v4.0.0-dev #8568 (@sgugger)
- TAPAS tokenizer & tokenizer tests #8482 (@LysandreJik)
- Switch return_dict to True by default. #8530 (@sgugger)
- Fix mixed precision issue for GPT2 #8572 (@jplu)
- Reorganize repo #8580 (@sgugger)
- Tokenizers: ability to load from model subfolder #8586 (@julien-c)
- Fix model templates #8595 (@sgugger)
- [examples tests] tests that are fine on multi-gpu #8582 (@stas00)
- Fix check repo utils #8600 (@sgugger)
- Tokenizers should be framework agnostic #8599 (@LysandreJik)
- Remove deprecated #8604 (@sgugger)
- Fixed link to the wrong paper. #8607 (@cronoik)
- Reset loss to zero on logging in Trainer to avoid bfloat16 issues #8561 (@bminixhofer)
- Fix DataCollatorForLanguageModeling #8621 (@sgugger)
- [s2s] multigpu skip #8613 (@stas00)
- [s2s] fix finetune.py to adjust for #8530 changes #8612 (@stas00)
- tf_bart typo - self.self.activation_dropout #8611 (@ratthachat)
- New TF loading weights #8490 (@jplu)
- Adding PrefixConstrainedLogitsProcessor #8529 (@nicola-decao)
- [Tokenizer Doc] Improve tokenizer summary #8622 (@patrickvonplaten)
- Fixes the training resuming with gradient accumulation #8624 (@sgugger)
- Fix training from scratch in new scripts #8623 (@sgugger)
- [s2s] distillation apex breaks return_dict obj #8631 (@stas00)
- Updated the Extractive Question Answering code snippe...
v3.5.1
v3.5.0: Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method
Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method
Model versioning
We host more and more of the community's models which is awesome ❤️. To scale this sharing, we needed to change the infra to both support more models, and unlock new powerful features.
To that effect, we have rebuilt the storage backend that we use for models (currently S3), to our own git repos (using S3 as a git-lfs endpoint for large files), with one model = one repo.
The benefits of this switch are:
- built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
- access control (will unlock private models, private datasets, etc)
- scalability (our usage of S3 to maintain lists of models was starting to bottleneck)
Let's dive in to the actual changes:
I. On the website
You'll now see a "Browse files and versions" tab or button on each model page. (design is not final, we'll make it more prominent/streamlined in the near future)
This is what this page looks like:
The UX should look familiar and self-explanatory, but we'll add more ML-specific features in the future.
You can:
- see commit histories and diffs of changes made to any text file, like config.json:
- changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you'll be able to opt out from those changes)
- Large binary files are stored using https://git-lfs.github.com/ which is pretty standard now, and interoperable out of the box with git
- Ability to update your text files, like your README.md model card, directly on the website!
- with instant preview 🔥
II. In the transformers library
The PR to enable this new storage mode in the transformers library is available here: #8324
This PR has two parts:
1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront so downloads should be as fast as they are right now.
In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.
For instance:
tokenizer = AutoTokenizer.from_pretrained(
"julien-c/EsperBERTo-small",
revision="v2.0.1" # tag name, or branch name, or commit hash
)
Finally, the networking code is more robust and doesn't gobble up errors anymore, so in case you have trouble downloading a specific file you'll know exactly why.
2. changes to the model upload CLI to create a model repo then be able to git clone and git push to it.
We are intentionally not wrapping git too much because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that is not the case.
To create a repo:
transformers-cli repo create your-model-name
Then you'll get a repo url that you'll be able to clone:
git clone https://huggingface.co/username/your-model-name
# Then commit as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.
By the way, again, every model is its own repo. So you can git clone any public model if you'd like:
git clone https://huggingface.co/gpt2
But you won't be able to push unless it's one of your models (or one of your orgs').
III. Backward compatibility
- Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
⚠️ Model uploads using the current system won't work anymore: you'll need to upgrade your transformers installation to the next release, v3.5.0, or to build from master.
Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.
TFMarian, TFMbart, TFPegasus, TFBlenderbot
- Add tensorflow 2.0 functionality for SOTA seq2seq transformers #7987 (@sshleifer)
New and updated scripts
We're working on giving examples of how to leverage the 🤗 Datasets library and the Trainer API. Those scripts are meant as examples that are easy to customize, with lots of comments explaining the various steps. The following tasks are now covered:
- Text classification : New run glue script #7917 (@sgugger)
- Causal Language Modeling: New run_clm script #8105 (@sgugger)
- Masked Language Modeling: Add line by line option to mlm/plm scripts #8240 (@sgugger)
- Token classification: Add new token classification example #8340 (@sgugger)
Seq2Seq Trainer
A child of Trainer specialized for training seq2seq models, from @patil-suraj, @stas00 and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py. The API is similar to examples/seq2seq/finetune.py, but API support is better. Example scripts are in examples/seq2seq/builtin_trainer.
- [seq2seq testing] multigpu test run via subprocess #7281 (@stas00)
- [s2s trainer] tests to use distributed on multi-gpu machine #7965 (@stas00)
- [Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq #7809 (@patrickvonplaten)
- [Seq2Seq Trainer] Make sure padding is implemented for models without pad_token #8043 (@patrickvonplaten)
- [Seq2SeqTrainer] Move import to init to make file self-contained #8194 (@patrickvonplaten)
- [s2s test] cleanup #8131 (@stas00)
- [Seq2Seq] Correct import in Seq2Seq Trainer #8254 (@patrickvonplaten)
- [Seq2Seq] Make Seq2SeqArguments an independent file #8267 (@patrickvonplaten)
- [Seq2SeqDataCollator] dont pass add_prefix_space=False to all tokenizers #8329 (@sshleifer)
Seq2Seq Testing and Documentation Improvements
- [s2s] create doc for pegasus/fsmt replication #7934 (@stas00)
- [s2s] test_distributed_eval #8315 (@stas00)
- [s2s] test_bash_script.py - actually learn something #8318 (@stas00)
- [s2s examples test] fix data path #8398 (@stas00)
- [s2s test_finetune_trainer] failing multigpu test #8400 (@stas00)
- [s2s/distill] remove run_distiller.sh, fix xsum script #8412 (@sshleifer)
Docs for DistillBART Paper Replication
Re-run experiments from the paper here
- [s2s] distillBART docs for paper replication #8150 (@sshleifer)
Refactoring the generate() function
The generate() method now has a new design so that the user can directly call upon the methods sample(), greedy_search(), beam_search() and beam_sample(). The code was made more readable, and beam search was sped up by ca. 5-10%.
- Refactoring the generate() function #6949 (@patrickvonplaten)
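For reference, a small sketch (not from the original notes; GPT-2 is just an example checkpoint) of how generate() dispatches to the decomposed methods depending on its arguments:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("The generate method", return_tensors="pt").input_ids

greedy = model.generate(input_ids, max_length=20)                             # greedy_search()
beams = model.generate(input_ids, max_length=20, num_beams=4)                 # beam_search()
sampled = model.generate(input_ids, max_length=20, do_sample=True, top_k=50)  # sample()
print(tokenizer.decode(greedy[0], skip_special_tokens=True))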
Notebooks
- added qg evaluation notebook #7958 (@zolekode)
- adding beginner-friendly notebook on text classification with DistilBERT/TF #7964 (@peterbayerle)
- [Notebooks] Add new encoder-decoder notebooks #8246 (@patrickvonplaten)
General improvements and bugfixes
- Respect the 119 line chars #7928 (@LysandreJik)
- PPL guide code snippet minor fix #7938 (@joeddav)
- [ProphetNet] Add Question Generation Model + Test #7942 (@patrickvonplaten)
- [multiple models] skip saving/loading deterministic state_dict keys #7878 (@stas00)
- Add missing comma #7870 (@mrm8488)
- TensorBoard/Wandb/optuna/raytune integration improvements. #7935 (@madlag)
- [ProphetNet] Correct Doc string example #7944 (@patrickvonplaten)
- [GPT2 batch generation] Make test clearer. do_sample=True is not deterministic. #7947 (@patrickvonplaten)
- fix 'encode_plus' docstring for 'special_tokens_mask' (0s and 1s were reversed) #7949 (@epwalsh)
- Herbert tokenizer auto load #7968 (@rmroczkowski)
- [testing] slow tests should be marked as slow #7895 (@stas00)
- support relative path for best_model_checkpoint #7973 (@HaebinShin)
- Disable inference API for t5-11b #7978 (@julien-c)
- [fsmt test] basic config test with online model + super tiny model #7860 (@stas00)
- Add whole word mask support for lm fine-tune #7925 (@wlhgtc)
- [PretrainedConfig] Fix save pretrained config for edge case #7943 (@patrickvonplaten)
- GPT2 - Remove else branch adding 0 to the hidden state if token_type_embeds is None. #7977 (@mfuntowicz)
- Fixing the "translation", "translation_XX_to_YY" pipelines. #7975 (@Narsil)
- FillMaskPipeline: support passing top_k on call #7971 (@julien-c)
- Only log total_flos at the end of training #7981 (@sgugger)
- add zero shot pipeline tags & examples #7983 (@joeddav)
- Reload checkpoint #7984 (@sgugger)
- [gh ci] less output ( --durations=50) #7989 (@sshleifer)
- Move NoLayerEmbedTokens #7945 (@sshleifer)
- update zero shot default widget example #7992 (@joeddav)
- [RAG] Handle the case when title is None while loading own datasets #7941 (@lalitpagaria)
- [tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups #7970 (@thomwolf)
- [Reformer] remove reformer pad_token_id #7991 (@patrickvonplaten)
- Fix BatchEncoding.word_to_tokens for removed tokens #7939 (@n1t0)
- Handling longformer model_type #7990 (@ethanjperez)
- [doc prepare_seq2seq_batch] fix docs #8013 (@patil-suraj)
- [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization #8006 (@thomwolf)
- Add mixed...
ProphetNet, Blenderbot, SqueezeBERT, DeBERTa
ProphetNet, Blenderbot, SqueezeBERT, DeBERTa
ProphetNet
Two new models are released as part of the ProphetNet implementation: ProphetNet and XLM-ProphetNet.
ProphetNet is an encoder-decoder model and can predict n-future tokens for “ngram” language modeling instead of just the next token.
XLM-ProphetNet is an encoder-decoder model with an identical architecture to ProphetNet, but the model was trained on the multi-lingual “wiki100” Wikipedia dump.
The ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
It was added to the library in PyTorch with the following checkpoints:
microsoft/xprophetnet-large-wiki100-cased-xglue-ntg
microsoft/prophetnet-large-uncased
microsoft/prophetnet-large-uncased-cnndm
microsoft/xprophetnet-large-wiki100-cased
microsoft/xprophetnet-large-wiki100-cased-xglue-qg
Contributions:
- ProphetNet #7157 (@qiweizhen, @patrickvonplaten)
BlenderBot
Blenderbot is an encoder-decoder model for open-domain chat. It uses a standard seq2seq model transformer-based architecture.
The Blender chatbot model was proposed in Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
It was added to the library in PyTorch with the following checkpoints:
facebook/blenderbot-90M
facebook/blenderbot-3B
Contributions:
- Blenderbot #7418 (@sshleifer)
SqueezeBERT
The SqueezeBERT model was proposed in SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.
It was added to the library in PyTorch with the following checkpoints:
squeezebert/squeezebert-mnli
squeezebert/squeezebert-uncased
squeezebert/squeezebert-mnli-headless
Contributions:
- SqueezeBERT architecture #7083 (@forresti)
- Fix squeezebert docs #7587 (@LysandreJik)
DeBERTa
The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.
It was added to the library in PyTorch with the following checkpoints:
microsoft/deberta-base
microsoft/deberta-large
Contributions:
- Add DeBERTa model #5929 (@BigBird01)
- Fix DeBERTa integration tests #7729 (@LysandreJik)
Both SentencePiece and Tokenizers are now optional libraries
Support for SentencePiece is now part of the tokenizers library! Thanks to this we now have near-full support of fast tokenizers in the library.
With this new feature, we slightly change the paradigm regarding installation:
- SentencePiece is now an optional dependency, paving the way to a fully-featured conda install in the near future
- Tokenizers is now also an optional dependency, making it possible to install and use the library even when rust cannot be compiled on the machine.
- [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies #7659 (@thomwolf)
The main __init__ has been improved to always import the same functions and classes. If someone then tries to use a class that requires an optional dependency, an ImportError will be raised at init (with instructions on how to install the missing dependency) #7537 (@sgugger)
Improvements made to the Trainer
The Trainer API has been improved to work with models requiring several labels or returning several outputs, and to have clearer progress tracking. A new TrainerCallback class has been added to allow the user to easily customize the default training loop; a short sketch follows the list below.
- Remove config assumption in Trainer #7464 (@sgugger)
- Clean the Trainer state #7490 (@sgugger)
- Small QOL improvements to TrainingArguments #7475 (@sgugger)
- Allow nested tensors in predicted logits #7542 (@sgugger)
- Trainer callbacks #7596 (@sgugger)
- Add specific notebook ProgressCalback #7793 (@sgugger)
- Small fixes to NotebookProgressCallback #7813 (@sgugger)
- Add predict step accumulation #7767 (@sgugger)
- Don't use store_xxx on optional bools #7786 (@sgugger)
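Here is the sketch mentioned above: a minimal, made-up callback (not part of the release) showing the shape of a TrainerCallback hook:
from transformers import TrainerCallback

class PrintEpochCallback(TrainerCallback):
    # Hypothetical example callback: report progress at the end of every epoch.
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Finished epoch {state.epoch} after {state.global_step} steps")

# The callback can then be passed to the Trainer, e.g.:
# trainer = Trainer(model=model, args=training_args, callbacks=[PrintEpochCallback()])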
Seq2Seq Trainer
A child of Trainer specialized for training seq2seq models, from @patil-suraj and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py.
- example scripts at examples/seq2seq/builtin_trainer/
- same functionality as examples/seq2seq/finetune.py, but better TPU support.
- [examples/s2s] clean up finetune_trainer #7509 (@patil-suraj)
- [s2s] trainer scripts: Remove --run_name, thanks sylvain! #7521 (@sshleifer)
- [s2s] Adafactor support for builtin trainer #7522 (@sshleifer)
- [s2s] add config params like Dropout in Seq2SeqTrainingArguments #7532 (@patil-suraj)
- Distributed Trainer: 2 little fixes #7461 (@sshleifer)
- [s2sTrainer] test + code cleanup #7467 (@sshleifer)
- Seq2SeqDataset: avoid passing src_lang everywhere #7470 (@amanpreet692)
- [s2strainer] fix eval dataset loading #7477 (@patil-suraj)
- [pseudolabels] cleanup markdown table #7653 (@sshleifer)
Distributed Generation
- You can run model.generate in pytorch on a large dataset and split the work across multiple GPUs, using examples/seq2seq/run_distributed_eval.py
- [s2s] release pseudolabel links and instructions #7639 (@sshleifer)
- [s2s] Fix t5 warning for distributed eval #7487 (@sshleifer)
- [s2s] fix kwargs style #7488 (@sshleifer)
- [s2s] fix lockfile and peg distillation constants #7545 (@sshleifer)
- [s2s] fix nltk pytest race condition with FileLock #7515 (@sshleifer)
Notebooks
- Train T5 in Tensoflow 2 Community Notebook #7428 (@HarrisDePerceptron)
General improvements and bugfixes
- remove codecov PR comments #7400 (@sshleifer)
- Get a better error when check_copies fails #7457 (@sgugger)
- Multi-GPU Testing setup #7453 (@LysandreJik)
- Fix LXMERT with DataParallel #7471 (@LysandreJik)
- Number of GPUs for multi-gpu #7472 (@LysandreJik)
- Make transformers install check positive #7473 (@FremyCompany)
- Alphabetize model lists #7478 (@sgugger)
- Bump isort version. #7484 (@sgugger)
- Add forgotten return_dict argument in the docs #7483 (@sgugger)
- Enable pegasus fp16 by clamping large activations #7243 (@sshleifer)
- Update LayoutLM doc #7388 (@Al31415)
- Report Tune metrics in final evaluation #7507 (@krfricke)
- Fix Ray Tune progress_reporter kwarg #7508 (@krfricke)
- [Seq2Seq] Fix a couple of bugs and clean examples #7474 (@patrickvonplaten)
- [Attention Mask] Fix data type #7513 (@patrickvonplaten)
- Fix seq2seq example test #7518 (@sgugger)
- Remove labels from the RagModel example #7560 (@sgugger)
- added script for fine-tuning roberta for sentiment analysis task #7505 (@DhavalTaunk08)
- LayoutLM: add exception handling for bbox values #7452 (@Al31415)
- Cleanup documentation for BART, Marian, MBART and Pegasus #7523 (@sgugger)
- Add Electra unexpected keys #7569 (@LysandreJik)
- Fix tokenization in SQuAD for RoBERTa, Longformer, BART #7387 (@tholor)
- docs(pretrained_models): fix num parameters #7575 (@amineabdaoui)
- Update Code example according to deprecation of AutoModeWithLMHead #7555 (@jshamg)
- Allow soft dependencies in the namespace with ImportErrors at use #7537 (@sgugger)
- Fix post_init of some TrainingArguments #7525 (@sgugger)
- Check and update model list in index.rst automatically #7527 (@sgugger)
- Expand test to locate flakiness #7580 (@sgugger)
- Custom TF weights loading #7422 (@jplu)
- Documentation fixes #7585 (@sgugger)
- Documentation framework toggle should stick #7586 (@LysandreJik)
- Support T5 Distillation w/hidden state supervision #7599 (@sshleifer)
- [makefile] check only .py files #7588 (@stas00)
- [TF generation] Fix typo #7582 (@SidJain1412)
- change return dicitonary for DataCollatorForNextSentencePrediction from masked_lm_labels to labels #7595 (@gmihaila)
- Docker GPU Images: Add NVIDIA/apex to the cuda images with pytorch #7598 (@AdrienDS)
- typo fix #7611 (@agemagician)
- [bart] fix config.classif_dropout #7593 (@sshleifer)
- [s2s] save first batch to json for debugging purposes #6810 (@sshleifer)
- Add GPT2ForSequenceClassification based on DialogRPT #7501 (@LysandreJik)
- Fix wrong reference name/filename in docstring of SquadProcessor #7616 (@phiyodr)
- Fix tokenizer UnboundLocalError when padding is set to PaddingStrategy.MAX_LENGTH #7610 (@GabrielePicco)
- Add GPT2 to sequence classification auto model #7630 (@LysandreJik)
- Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load #6935 (@w4nderlust)
- Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer #7141 (@thomwolf)
- Green tests: update torch-hub test dependencies (add protobuf and pin tokenizer 0.9.0-RC2) #7658 (@thomwolf)
- Fix RobertaForCausalLM docs #7642 (@LysandreJik)
- [s2s] configure lr_scheduler from command line #7641 (@patil-suraj)
- [pseudo] Switch URLS to CDN #7661 (@sshleifer)
- [s2s] Switch README urls to cdn #7670 (@sshleifer)
- fix nn.DataParallel compatibility with PyTorch 1.5 #7671 (@guhur)
- Update XLM-RoBERTa pretrained model details #7669 (@noahtren)
- Fix dataset cardinality #7678 (@jplu)
- [pegasus] Faster ...
v3.3.1
RAG
RAG
RAG Model
The RAG model is a retrieval-augmented generation model that can be leveraged for question-answering tasks using RagTokenForGeneration or RagSequenceForGeneration, as proposed in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
It was added to the library in PyTorch with the following checkpoints:
facebook/rag-token-nq
facebook/rag-sequence-nq
facebook/rag-token-base
facebook/rag-sequence-base
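A short usage sketch for the rag-token-nq checkpoint (following the general pattern of the RAG documentation at the time; the dummy index is used to keep the example light, and the question is arbitrary):
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# use_dummy_dataset avoids downloading the full wiki_dpr index for this sketch
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])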
Contributions:
- RAG #6813 (@ola13)
- [RAG] Add attention_mask to RAG generate #7373 (@patrickvonplaten)
- [RAG] Add missing doc and attention_mask to rag #7382 (@patrickvonplaten)
- [Rag] Fix wrong usage of num_beams and bos_token_id in Rag Sequence generation #7386 (@patrickvonplaten)
- [RAG] Fix retrieval offset in RAG's HfIndex and better integration tests #7372 (@lhoestq)
- [RAG] Remove dependency on examples/seq2seq from rag #7395 (@ola13)
- [Rag] fix rag retriever save_pretrained method #7399 (@patrickvonplaten)
- [RAG] Clean Rag readme in examples #7413 (@ola13)
- [RAG] Model cards - clean cards #7420 (@patrickvonplaten)
- Document RAG again #7377 (@sgugger)
Bug fixes and improvements
- Mark big downloads slow #7325 (@sgugger)
- [Bug Fix] The actual batch_size is inconsistent with the settings. #7235 (@HuangLianzhe)
- Fixed results of SQuAD-FR evaluation #7313 (@psorianom)
- [s2s] add supported architecures to MD #7252 (@sshleifer)
- Add num workers cli arg #7322 (@chadykamar)
- [s2s] add src_lang kwarg for distributed eval #7300 (@sshleifer)
- [s2s] only save metrics.json from rank zero #7331 (@sshleifer)
- [code quality] fix confused flake8 #7309 (@stas00)
- [testing] skip decorators: docs, tests, bugs #7334 (@stas00)
- Fixed evaluation_strategy on epoch end bug #7340 (@WissamAntoun)
- Models doc #7345 (@sgugger)
- Ensure that integrations are imported before transformers or ml libs #7330 (@dsblank)
- [Benchmarks] Change all args to from no_... to their positive form #7075 (@fmcurti)
- Remove reference to args in XLA check #7344 (@ZeroCool2u)
- wip: Code to add lang tags to marian model cards #6586 (@sshleifer)
- Expand a bit the documentation doc #7350 (@sgugger)
- Check decorator order #7326 (@sgugger)
- Update modeling_tf_longformer.py #7359 (@Line290)
- Updata tokenization_auto.py #6870 (@hjptriplebee)
- Update the TF models to remove their interdependencies #7238 (@jplu)
- Make PyTorch model files independent from each other #7352 (@sgugger)
- Clean RAG docs and template docs #7348 (@sgugger)
- Fixing case in which Trainer hung while saving model in distributed training #7365 (@TevenLeScao)
- Formatter #7368 (@LysandreJik)
- [seq2seq] make it easier to run the scripts #7274 (@stas00)
- Remove mentions of RAG from the docs #7376 (@sgugger)
- [fsmt] build/test scripts #7257 (@stas00)
- [s2s] distributed eval allows num_return_sequences > 1 #7254 (@sshleifer)
- Seq2SeqTrainer #6769 (@patil-suraj)
- modeling_bart: 3 small cleanups that dont change outputs #7381 (@sshleifer)
- Check config type using type instead of isinstance #7363 (@LysandreJik)
- [s2s, examples] minor doc changes #7385 (@patil-suraj)
- Remove unhelpful bart warning #7391 (@sshleifer)
- [code quality] new make target that combines style and quality targets #7310 (@stas00)
- Speedup check_copies script #7394 (@sgugger)
- Fix BartModel output documentation #7390 (@sgugger)
- Fix FP16 and attention masks in FunnelTransformer #7374 (@sgugger)
- [Longformer, Bert, Roberta, ...] Fix multi gpu training #7272 (@patrickvonplaten)
- [s2s] add create student script #7290 (@patil-suraj)
- [s2s] rougeLSum expects \n between sentences #7410 (@sshleifer)
- [T5] allow config.decoder_layers to control decoer size #7409 (@sshleifer)
- Flos fix #7384 (@marrrcin)
- Catch PyTorch warning when saving/loading scheduler #7401 (@sgugger)
- Pull request template #7392 (@LysandreJik)
- Reorganize documentation navbar #7423 (@sgugger)
Bert Seq2Seq models, FSMT, LayoutLM, Funnel Transformer, LXMERT
Bert Seq2Seq models, FSMT, Funnel Transformer, LXMERT
BERT Seq2seq models
The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
It was added to the library in PyTorch with the following checkpoints:
google/roberta2roberta_L-24_bbc
google/roberta2roberta_L-24_gigaword
google/roberta2roberta_L-24_cnn_daily_mail
google/roberta2roberta_L-24_discofuse
google/roberta2roberta_L-24_wikisplit
google/bert2bert_L-24_wmt_de_en
google/bert2bert_L-24_wmt_en_de
Contributions:
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. #6594 (@patrickvonplaten)
FSMT (FairSeq MachineTranslation)
FSMT (FairSeq MachineTranslation) models were introduced in Facebook FAIR’s WMT19 News Translation Task Submission by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
It was added to the library in PyTorch, with the following checkpoints:
facebook/wmt19-en-ru
facebook/wmt19-en-de
facebook/wmt19-ru-en
facebook/wmt19-de-en
Contributions:
- [ported model] FSMT (FairSeq MachineTranslation) #6940 (@stas00)
- build/eval/gen-card scripts for fsmt #7155 (@stas00)
- skip failing FSMT CUDA tests until investigated #7220 (@stas00)
- [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test #7224 (@stas00)
- [s2s] adjust finetune + test to work with fsmt #7263 (@stas00)
- [fsmt] SinusoidalPositionalEmbedding no need to pass device #7292 (@stas00)
- Adds FSMT to LM head AutoModel #7312 (@LysandreJik)
LayoutLM
The LayoutLM model was proposed in LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It’s a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.
It was added to the library in PyTorch with the following checkpoints:
layoutlm-base-uncased
layoutlm-large-uncased
Contributions:
- Add LayoutLM Model #7064 (@liminghao1630)
- Fixes for LayoutLM #7318 (@sgugger)
Funnel Transformer
The Funnel Transformer model was proposed in the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks (CNN) in computer vision.
It was added to the library in both PyTorch and TensorFlow, with the following checkpoints:
funnel-transformer/small
funnel-transformer/small-base
funnel-transformer/medium
funnel-transformer/medium-base
funnel-transformer/intermediate
funnel-transformer/intermediate-base
funnel-transformer/large
funnel-transformer/large-base
funnel-transformer/xlarge
funnel-transformer/xlarge-base
Contributions:
LXMERT
The LXMERT model was proposed in LXMERT: Learning Cross-Modality Encoder Representations from Transformers by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities) pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
It was added to the library in TensorFlow with the following checkpoints:
unc-nlp/lxmert-base-uncased
unc-nlp/lxmert-vqa-uncased
unc-nlp/lxmert-gqa-uncased
Contributions
- Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models #5793 (@eltoto1219)
- [LXMERT] Fix tests on gpu #6946 (@patrickvonplaten)
New pipelines
The following pipeline was added to the library:
- [pipelines] Text2TextGenerationPipeline #6744 (@patil-suraj)
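A quick sketch of the new pipeline (the model and prompt are arbitrary examples, not from the release notes):
from transformers import pipeline

text2text = pipeline("text2text-generation", model="t5-small")
print(text2text("translate English to German: The house is wonderful."))
# e.g. [{'generated_text': 'Das Haus ist wunderbar.'}]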
Notebooks
The following community notebooks were contributed to the library:
- Demoing LXMERT with raw images by incorporating the FRCNN model for roi-pooled extraction and bounding-box prediction on the GQA answer set. #6986 (@eltoto1219)
- [Community notebooks] Add notebook on fine-tuning GPT-2 Model with Trainer Class #7005 (@philschmid)
- Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks #7255 (@NadirEM)
- added multilabel text classification notebook using distilbert to community notebooks #7201 (@DhavalTaunk08)
Encoder-decoder architectures
An additional encoder-decoder architecture was added:
- [EncoderDecoder] Add xlm-roberta to encoder decoder #6878 (@patrickvonplaten)
Bug fixes and improvements
- TF Flaubert w/ pre-norm #6841 (@LysandreJik)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)
- Fix in Adafactor docstrings #6845 (@sgugger)
- Fix resuming training for Windows #6847 (@sgugger)
- Only access loss tensor every logging_steps #6802 (@jysohn23)
- Marian distill scripts + integration test #6799 (@sshleifer)
- Add checkpointing to Ray Tune HPO #6747 (@krfricke)
- Split hp search methods #6857 (@sgugger)
- Update ONNX notebook to include section on quantization. #6831 (@mfuntowicz)
- Fix marian slow test #6854 (@sshleifer)
- [s2s] command line args for faster val steps #6833 (@sshleifer)
- Bart can make decoder_input_ids from labels #6758 (@sshleifer)
- add a final report to all pytest jobs #6861 (@stas00)
- Logging doc #6852 (@sgugger)
- Restore PaddingStrategy.MAX_LENGTH on QAPipeline while no v2. #6875 (@mfuntowicz)
- [Generate] Facilitate PyTorch generate using ModelOutputs #6735 (@patrickvonplaten)
- Add cache_dir to save features TextDataset #6879 (@jysohn23)
- [Docs, Examples] Fix QA example for PT #6890 (@patrickvonplaten)
- Update modeling_bert.py #6897 (@parthe)
- [Electra] fix warning for position ids #6884 (@patrickvonplaten)
- minor docs grammar fixes #6889 (@harrywang)
- Fix error class instantiation #6634 (@tamuhey)
- Output attention takes an s #6903 (@sgugger)
- [testing] fix ambiguous test #6898 (@stas00)
- test_tf_common: remove un_used mixin class parameters #6866 (@PuneethaPai)
- Template updates #6914 (@sgugger)
- Changed link to the correct paper in the second paragraph #6905 (@sengl)
- tweak tar command in readme #6919 (@brettkoonce)
- [s2s]: script to convert pl checkpoints to hf checkpoints #6911 (@sshleifer)
- [s2s] allow task_specific_params=summarization_xsum #6923 (@sshleifer)
- move wandb/comet logger init to train() to allow parallel logging #6850 (@krfricke)
- [s2s] use --eval_beams command line arg #6926 (@sshleifer)
- [s2s] support early stopping based on loss, rather than rouge #6927 (@sshleifer)
- Fix mixed precision issue in TF DistilBert #6915 (@chiapas)
- [docstring] misc arg doc corrections #6932 (@stas00)
- [s2s] distill: --normalize_hidden --supervise_forward #6834 (@sshleifer)
- [s2s] run_eval.py parses generate_kwargs #6948 (@sshleifer)
- [doc] remove the implied defaults to :obj:`None`, s/True/ :obj:`True/, etc. #6956 (@stas00)
- [s2s] warn if --fp16 for torch 1.6 #6977 (@sshleifer)
- feat: allow prefix for any generative model #5885 (@borisdayma)
- Trainer with grad accum #6930 (@sgugger)
- Cannot index None #6984 (@LysandreJik)
- [docstring] missing arg #6933 (@stas00)
- [testing] add dependency: parametrize #6958 (@stas00)
- Fixed the default number of attention heads in Reformer Configuration #6973 (@tznurmin)
- [gen utils] missing else case #6980 (@stas00)
- match CI's version of flake8 #6941 (@stas00)
- Conversion scripts shouldn't have relative imports #6991 (@LysandreJik)
- Add missing arguments for BertWordPieceTokenizer #5810 (@monologg)
- fixed trainer tr_loss memory leak #6999 (@StuartMesham)
- Floating-point operations logging in trainer #6768 (@TevenLeScao)
- Fixing FLOPS merge by checking if torch is available #7013 (@LysandreJik)
- [Longformer] Fix longformer documentation #7016 (@patrickvonplaten)
- pegasus.rst: fix expected output #7017 (@sshleifer)
- adding TRANSFORMERS_VERBOSITY env var #6961 (@stas00)
- [generation] consistently add eos tokens #6982 (@stas00)
- [from_pretrained] Allow tokenizer_type ≠ model_type #6995 (@julien-c)
- replace torch.triu with onnx compatible code #6929 (@HenryDashwood)
- Batch encore plus and overflowing tokens fails when non existing overflowing tokens for a sequence #6677 (@LysandreJik)
- add -y to bypass prompt for transformers-cli upload #7035 (@stas00)
- Fix confusing warnings during TF2 import from PyTorch #6623 (@jcrocholl)
- Albert pretrain datasets/ datacollator #6168 (@yl-to)
- Fix template #7040 (@LysandreJik)
- Small fixes in tf template #7044 (@sgugger)
- Add "Leveraging Pretrained Checkpoints for Generation" Seq2Seq models. #6594 (@patrickvonplaten)
- fix to ensure that returned tensors after the tokenization is Long #7039 (@GeetDsa)
- [BertGeneration] Correct Doc Title #7048 (@patrickvonplaten)
- [BertGeneration, Docs] Fix another old name in docs #7050 (@patrickvonplaten)
- [xlm tok] config dict: fix str into int to match definition #7034 (@stas00)
- [s2s] --eval_max_generate_length #7018 (@sshleifer)
- Fix CI w...
Pegasus, DPR, self-documented outputs, new pipelines and MT support
Pegasus, mBART, DPR, self-documented outputs and new pipelines
Pegasus
The Pegasus model from PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.
Model implemented as a collaboration between Jingqing Zhang and @sshleifer in #6340
- PegasusForConditionalGeneration (torch version) #6340
- add pegasus finetuning script #6811 (warning: very slow)
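A short usage sketch (assuming the google/pegasus-xsum checkpoint from the model hub; the source text is a placeholder):
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

src_text = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds."]
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])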
DPR
The DPR model from Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.
DeeBERT
The DeeBERT model from DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the examples/ folder alongside its training script, in PyTorch.
Self-documented outputs
As well as returning tuples, PyTorch and TensorFlow models now return a subclass of ModelOutput that is appropriate. A ModelOutput is a dataclass containing all model returns. This allows for easier inspection, and for self-documenting model outputs.
- Change model outputs types to self-document outputs #5438 (@sgugger)
- Tf model outputs #6247 (@sgugger)
Models return tuples by default, and return self-documented outputs if the return_dict configuration flag is set to True or if the return_dict=True keyword argument is passed to the forward/call method.
Summary of the behavior:
# The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`
# Either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
# Or when calling the model
output = model(**inputs, return_dict=True)
# You can access the elements of the outputs with
# (1) named attributes
loss = outputs.loss
logits = outputs.logits
# (2) their names as strings like a dict
loss = outputs["loss"]
logits = outputs["logits"]
# (3) their index as integers or slices in the pre-3.1.0 outputs tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]
# One **breaking behavior** of these new outputs (which is the reason you have to opt-in to use them):
# Iterating on the outputs now returns the names (keys) instead of the values:
print([element for element in outputs])
>>> ['loss', 'logits']
# Thus you cannot unpack the output like pre-3.1.0 (you get the string names instead of the values):
# (But you can query a slice like indicated in (3) above)
loss_keys, logits_key = outputs
Encoder-Decoder framework
The encoder-decoder framework has been enhanced to allow more encoder decoder model combinations, e.g.:
Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, ....
- [EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer #6411 (@patrickvonplaten)
- [EncoderDecoder] Add Cross Attention for GPT2 #6415 (@patrickvonplaten)
- [EncoderDecoder] Add functionality to tie encoder decoder weights #6538 (@patrickvonplaten)
- Multiple combinations of EncoderDecoder models have been fine-tuned and evaluated on CNN/Daily-Mail summarization: https://huggingface.co/models?search=cnn_dailymail-fp16 (@patrickvonplaten)
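A minimal sketch of warm-starting one of these combinations (model names are examples; the resulting model's cross-attention weights are newly initialized, so it still needs seq2seq fine-tuning):
from transformers import EncoderDecoderModel

# Warm-start a Bert2Bert encoder-decoder from two pretrained BERT checkpoints
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.save_pretrained("bert2bert")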
TensorFlow as a first-class citizen
As we continue working towards having TensorFlow be a first-class citizen, we continually improve on our TensorFlow API and models.
- [Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile #5395 (@patrickvonplaten)
- [Benchmark] Add benchmarks for TF Training #5594 (@patrickvonplaten)
Machine Translation
MarianMTModel
- en-zh and 357 other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (@sshleifer + @jorgtied). There are now > 1300 supported pairs for machine translation.
- Marian converter updates #6342 (@sshleifer)
- Marian distill scripts + integration test #6799 (@sshleifer)
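For reference, a translation sketch with one of the Helsinki-NLP checkpoints (the checkpoint name and sentence are arbitrary examples):
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"  # one en-zh pair among the Tatoeba checkpoints
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["How are you today?"], return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])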
mBART
The mBART model from Multilingual Denoising Pre-training for Neural Machine Translation can now be accessed through MBartForConditionalGeneration.
- Add mbart-large-cc25, support translation finetuning #5129 (@sshleifer)
- [mbart] prepare_translation_batch passes **kwargs to allow DeprecationWarning #5581 (@sshleifer)
- MBartForConditionalGeneration #6441 (@patil-suraj)
- [fix] mbart_en_ro_generate test now identical to fairseq #5731 (@sshleifer)
- [Doc] explaining romanian postprocessing for MBART BLEU hacking #5943 (@sshleifer)
- [test] partial coverage for train_mbart_enro_cc25.sh #5976 (@sshleifer)
- MbartTokenizer: do not hardcode vocab size #5998 (@sshleifer)
- MBART: support summarization tasks where max_src_len > max_tgt_len #6003 (@sshleifer)
- Fix #6096: MBartTokenizer's mask token #6098 (@sshleifer)
- [s2s] Document better mbart finetuning command #6229 (@sshleifer)
- mBART Conversion script #6230 (@sshleifer)
- [s2s] add BartTranslationDistiller for distilling mBART #6363 (@sshleifer)
- [Doc] add more MBart and other doc #6490 (@patil-suraj)
examples/seq2seq
- examples/seq2seq/finetune.py supports --task translation
- All sequence to sequence tokenizers (T5, Bart, Marian, Pegasus) expose a prepare_seq2seq_batch method that makes batches for sequence to sequence training.
PRs:
- Seq2SeqDataset uses linecache to save memory #5792 (@Pradhy729)
- [examples/seq2seq]: add --label_smoothing option #5919 (@sshleifer)
- seq2seq/run_eval.py can take decoder_start_token_id #5949 (@sshleifer)
- [examples (seq2seq)] fix preparing decoder_input_ids for T5 #5994 (@patil-suraj)
- [s2s] add support for overriding config params #6149 (@stas00)
- s2s: fix LR logging, remove some dead code. #6205 (@sshleifer)
- [s2s] tiny QOL improvement: run_eval prints scores #6341 (@sshleifer)
- [s2s] fix label_smoothed_nll_loss #6344 (@patil-suraj)
- [s2s] fix --gpus clarg collision #6358 (@sshleifer)
- [s2s] Script to save wmt data to disk #6403 (@sshleifer)
- rename prepare_translation_batch -> prepare_seq2seq_batch #6103 (@sshleifer)
- Mult rouge by 100: standard units #6359 (@sshleifer)
- allow spaces in bash args with "$@" #6521 (@sshleifer)
- [seq2seq] MAX_LEN env var for MT commands #5837 (@sshleifer)
- [seq2seq] distillation.py accepts trainer arguments #5865 (@sshleifer)
- [s2s]Use prepare_translation_batch for Marian finetuning #6293 (@sshleifer)
- [BartTokenizer] add prepare s2s batch #6212 (@patil-suraj)
- [T5Tokenizer] add prepare_seq2seq_batch method #6122 (@patil-suraj)
- [s2s] round runtime in run_eval #6798 (@sshleifer)
- [s2s README] Add more dataset download instructions #6737 (@sshleifer)
- [s2s] round bleu, rouge to 4 digits #6704 (@sshleifer)
- [s2s] command line args for faster val steps #6833
New documentation
Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An open in colab button has been added on the tutorial pages.
- Guide to fixed-length model perplexity evaluation #5449 (@joeddav)
- Improvements to PretrainedConfig documentation #5642 (@sgugger)
- Document model outputs #5673 (@sgugger)
- docs(wandb): explain how to use W&B integration #5607 (@borisdayma)
- Model utils doc #6005 (@sgugger)
- ONNX documentation #5992 (@mfuntowicz)
- Tokenizer documentation #6110 (@sgugger)
- Pipeline documentation #6175 (@sgugger)
- Encoder decoder config docs #6195 (@afcruzs)
- Colab button #6389 (@sgugger)
- Generation documentation #6470 (@sgugger)
- Add custom datasets tutorial #6466 (@joeddav)
- Logging documentation #6852 (@sgugger)
Trainer updates
New additions to the Trainer
- Added data collator for permutation (XLNet) language modeling and related calls #5522 (@shngt)
- Trainer support for iterabledataset #5834 (@Pradhy729)
- Adding PaddingDataCollator #6442 (@sgugger)
- Add hyperparameter search to Trainer #6576 (@sgugger)
- [examples] Add trainer support for question-answering #4829 (@patil-suraj)
- Adds comet_ml to the list of auto-experiment loggers #6176 (@dsblank)
- Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)
New models & model architectures
The following model architectures have been added to the library
- FlaubertForTokenClassification #5644 (@stas00)
- TFXLMForTokenClassification #5614 (@LysandreJik)
- TFXLMForMultipleChoice #5614 (@LysandreJik)
- TFFlaubertForTokenClassification #5614 (@LysandreJik)
- TFFlaubertForMultipleChoice #5614 (@LysandreJik)
- TFElectraForSequenceClassification #6227 (@jplu)
- TFElectraForMultipleChoice #6227 (@jplu)
- TF Longformer #5764 (@patrickvonplaten)
- CamembertForCausalLM #6577 (@patil-suraj)
Regression testing on TPU & TPU CI
Thanks to @zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the Trainer, and offers very simple regression testing on model training performance.
- Test XLA examples #5583
- Add setup for TPU CI to run every hour. #6219 (@zcain117)
- Add missing docker arg for TPU CI. #6393 (@zcain117)
- Get GKE logs via kubectl logs instead of gcloud logging read. #6446 (@zcain117)
New pipelines
New pipe...