
Releases: huggingface/transformers

Patch release: better error message & invalid trainer attribute

09 Dec 17:07

This patch release introduces:

  • A better error message when trying to instantiate a SentencePiece-based tokenizer without having SentencePiece installed. #8881
  • A fix for an incorrect attribute in the Trainer. #8996

Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

30 Nov 17:01

Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

Breaking changes since v3.x

Version v4.0.0 introduces several breaking changes that were necessary.

1. AutoTokenizers and pipelines now use fast (Rust) tokenizers by default.

The Python and Rust tokenizers have roughly the same API, but the Rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens, which differs between the Python and Rust tokenizers.

How to obtain the same behavior as v3.x in v4.x

In version v3.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xxx")

to obtain the same in version v4.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

2. SentencePiece is removed from the required dependencies

The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

This includes the slow versions of:

  • XLNetTokenizer
  • AlbertTokenizer
  • CamembertTokenizer
  • MBartTokenizer
  • PegasusTokenizer
  • T5Tokenizer
  • ReformerTokenizer
  • XLMRobertaTokenizer

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should install sentencepiece additionally:

In version v3.x:

pip install transformers

to obtain the same in version v4.x:

pip install transformers[sentencepiece]

or

pip install transformers sentencepiece
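As a convenience, here is a small, hypothetical sketch of how a script could check for the optional dependency before loading one of the slow tokenizers listed above; the importlib check is only an illustration, not a mechanism of the library itself:

import importlib.util

from transformers import AutoTokenizer

# The slow T5 tokenizer relies on the sentencepiece package.
if importlib.util.find_spec("sentencepiece") is None:
    raise RuntimeError(
        "Install the extra with `pip install transformers[sentencepiece]` "
        "to use the slow SentencePiece-based tokenizers."
    )

tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=False)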

3. The architecture of the repo has been updated so that each model resides in its folder

The past and foreseeable addition of new models means that the number of files in the directory src/transformers keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.

This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.

In version v3.x:

from transformers.modeling_bert import BertLayer

to obtain the same in version v4.x:

from transformers.models.bert.modeling_bert import BertLayer
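Note that only direct module paths are affected; the public top-level imports keep working. A quick sketch:

# Public top-level imports are unchanged by the reorganization...
from transformers import BertConfig, BertModel

# ...only direct access to a model's internal module uses the new per-model path.
from transformers.models.bert.modeling_bert import BertLayer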

4. Switching the return_dict argument to True by default

The return_dict argument enables the return of namedtuple-like Python objects containing the model outputs, instead of the standard tuples. This object is self-documenting: keys can be used to retrieve values, while it also behaves like a tuple, so users can retrieve elements by index or by slice.

This is a breaking change, as the returned object cannot be unpacked like the previous tuples: value0, value1 = outputs will not work as before.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should set the return_dict argument to False, either in the model configuration or during the forward pass.

In version v3.x:

outputs = model(**inputs)

to obtain the same in version v4.x:

outputs = model(**inputs, return_dict=False)
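The flag can also be set once at loading time instead of on every forward pass; a minimal sketch, assuming a BERT checkpoint:

from transformers import AutoModelForSequenceClassification

# Option 1: store return_dict=False in the model configuration when loading
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", return_dict=False
)

# Option 2: override it for a single forward pass, as shown above
# outputs = model(**inputs, return_dict=False)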

5. Removed some deprecated attributes

Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.

Here is a list of these attributes/methods/arguments and what their replacements should be (a short before/after sketch follows the lists below):

In several models, the labels become consistent with the other models:

  • masked_lm_labels becomes labels in AlbertForMaskedLM and AlbertForPreTraining.
  • masked_lm_labels becomes labels in BertForMaskedLM and BertForPreTraining.
  • masked_lm_labels becomes labels in DistilBertForMaskedLM.
  • masked_lm_labels becomes labels in ElectraForMaskedLM.
  • masked_lm_labels becomes labels in LongformerForMaskedLM.
  • masked_lm_labels becomes labels in MobileBertForMaskedLM.
  • masked_lm_labels becomes labels in RobertaForMaskedLM.
  • lm_labels becomes labels in BartForConditionalGeneration.
  • lm_labels becomes labels in GPT2DoubleHeadsModel.
  • lm_labels becomes labels in OpenAIGPTDoubleHeadsModel.
  • lm_labels becomes labels in T5ForConditionalGeneration.

In several models, the caching mechanism becomes consistent with the other models:

  • decoder_cached_states becomes past_key_values in all BART-like, FSMT and T5 models.
  • decoder_past_key_values becomes past_key_values in all BART-like, FSMT and T5 models.
  • past becomes past_key_values in all CTRL models.
  • past becomes past_key_values in all GPT-2 models.

Regarding the tokenizer classes:

  • The tokenizer attribute max_len becomes model_max_length.
  • The tokenizer attribute return_lengths becomes return_length.
  • The tokenizer encoding argument is_pretokenized becomes is_split_into_words.

Regarding the Trainer class:

  • The Trainer argument tb_writer is removed in favor of the callback TensorBoardCallback(tb_writer=...).
  • The Trainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
  • The Trainer attribute data_collator should be a callable.
  • The Trainer method _log is deprecated in favor of log.
  • The Trainer method _training_step is deprecated in favor of training_step.
  • The Trainer method _prediction_loop is deprecated in favor of prediction_loop.
  • The Trainer method is_local_master is deprecated in favor of is_local_process_zero.
  • The Trainer method is_world_master is deprecated in favor of is_world_process_zero.

Regarding the TFTrainer class:

  • The TFTrainer argument prediction_loss_only is removed in favor of the class argument args.prediction_loss_only.
  • The TFTrainer method _log is deprecated in favor of log.
  • The TFTrainer method _prediction_loop is deprecated in favor of prediction_loop.
  • The TFTrainer method _setup_wandb is deprecated in favor of setup_wandb.
  • The TFTrainer method _run_model is deprecated in favor of run_model.

Regarding the TrainingArguments and TFTrainingArguments classes:

  • The TrainingArguments argument evaluate_during_training is deprecated in favor of evaluation_strategy.
  • The TFTrainingArguments argument evaluate_during_training is deprecated in favor of evaluation_strategy.

Regarding the Transfo-XL model:

  • The Transfo-XL configuration attribute tie_weight becomes tie_words_embeddings.
  • The Transfo-XL modeling method reset_length becomes reset_memory_length.

Regarding pipelines:

  • The FillMaskPipeline argument topk becomes top_k.
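Below is a short before/after sketch for a few of the renames above, assuming a BERT masked-LM checkpoint; only the keyword names change, not the behavior.

from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")

# v3.x: outputs = model(**inputs, masked_lm_labels=inputs["input_ids"])
# v4.x: the argument is simply called labels
outputs = model(**inputs, labels=inputs["input_ids"])

# Tokenizer attribute renames follow the same pattern, e.g.:
print(tokenizer.model_max_length)  # formerly max_len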

Model Templates

Version 4.0.0 will be the first to include the experimental feature of model templates. These model templates aim to facilitate the addition of new models to the library by doing most of the work: generating the model/configuration/tokenization/test files that fit the API, respecting the choices the user has made in terms of naming and functionality.

This release includes a model template for the encoder model (similar to the BERT architecture). Generating a model using the template will generate the files, put them at the appropriate location, reference them throughout the code-base, and generate a working test suite. The user should then only modify the files to their liking, rather than creating the model from scratch.

Feedback is welcome; to get started, see the model templates README in the repository.

New model additions

mT5 and T5 version 1.1 (@patrickvonplaten)

T5v1.1 is an improved version of the original T5 model; see: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md

The multilingual T5 model (mT5) was presented in https://arxiv.org/abs/2010.11934 and is based on the T5v1.1 architecture.
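A minimal loading sketch, assuming the google/mt5-small checkpoint name (the SentencePiece extra is required for the slow tokenizer):

from transformers import MT5ForConditionalGeneration, T5Tokenizer

# mT5 reuses the (SentencePiece-based) T5 tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")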

Multiple pre-trained checkpoints have been added to the library:

Relevant pull requests:

TF DPR

The DPR model has been added in TensorFlow to match its PyTorch counterpart by @ratthachat

TF Longformer

Additional heads have been added to the TensorFlow Longformer implementation: SequenceClassification, MultipleChoice and TokenClassification
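A minimal sketch of one of the new heads, assuming the allenai/longformer-base-4096 checkpoint (pass from_pt=True if only PyTorch weights are available for the checkpoint you pick):

from transformers import LongformerTokenizerFast, TFLongformerForSequenceClassification

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = TFLongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Long documents go here.", return_tensors="tf")
outputs = model(inputs)
print(outputs.logits.shape)  # (batch_size, num_labels)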

Bug fixes and improvements


Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

19 Nov 17:12

Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

Breaking changes since v3.x

Version v4.0.0 introduces several breaking changes that were necessary.

1. AutoTokenizers and pipelines now use fast (Rust) tokenizers by default.

The Python and Rust tokenizers have roughly the same API, but the Rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens, which differs between the Python and Rust tokenizers.

How to obtain the same behavior as v3.x in v4.x

In version v3.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xxx")

to obtain the same in version v4.x:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

2. SentencePiece is removed from the required dependencies

The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

This includes the slow versions of:

  • XLNetTokenizer
  • AlbertTokenizer
  • CamembertTokenizer
  • MBartTokenizer
  • PegasusTokenizer
  • T5Tokenizer
  • ReformerTokenizer
  • XLMRobertaTokenizer

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should install sentencepiece additionally:

In version v3.x:

pip install transformers

to obtain the same in version v4.x:

pip install transformers[sentencepiece]

or

pip install transformers sentencepiece

3. The architecture of the repo has been updated so that each model resides in its folder

The past and foreseeable addition of new models means that the number of files in the directory src/transformers keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories.

This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should update the path used to access the layers.

In version v3.x:

from transformers.modeling_bert import BertLayer

to obtain the same in version v4.x:

from transformers.models.bert.modeling_bert import BertLayer

4. Switching the return_dict argument to True by default

The return_dict argument enables the return of namedtuple-like Python objects containing the model outputs, instead of the standard tuples. This object is self-documenting: keys can be used to retrieve values, while it also behaves like a tuple, so users can retrieve elements by index or by slice.

This is a breaking change, as the returned object cannot be unpacked like the previous tuples: value0, value1 = outputs will not work as before.

How to obtain the same behavior as v3.x in v4.x

In order to obtain the same behavior as version v3.x, you should set the return_dict argument to False, either in the model configuration or during the forward pass.

In version v3.x:

outputs = model(**inputs)

to obtain the same in version v4.x:

outputs = model(**inputs, return_dict=False)

5. Removed some deprecated attributes

Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in #8604.

Model Templates

Version 4.0.0 will be the first to include the experimental feature of model templates. These model templates aim to facilitate the addition of new models to the library by doing most of the work: generating the model/configuration/tokenization/test files that fit the API, respecting the choices the user has made in terms of naming and functionality.

This release includes a model template for the encoder model (similar to the BERT architecture). Generating a model using the template will generate the files, put them at the appropriate location, reference them throughout the code-base, and generate a working test suite. The user should then only modify the files to their liking, rather than creating the model from scratch.

Feedback is welcome; to get started, see the model templates README in the repository.

New model additions

mT5 and T5 version 1.1 (@patrickvonplaten)

T5v1.1 is an improved version of the original T5 model; see: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md

The multilingual T5 model (mT5) was presented in https://arxiv.org/abs/2010.11934 and is based on the T5v1.1 architecture.

Multiple pre-trained checkpoints have been added to the library:

Relevant pull requests:

TF DPR

The DPR model has been added in TensorFlow to match its PyTorch counterpart by @ratthachat

TF Longformer

Additional heads have been added to the TensorFlow Longformer implementation: SequenceClassification, MultipleChoice and TokenClassification

Bug fixes and improvements


v3.5.1

13 Nov 15:28

Fix a typo that raised an error instead of a deprecation warning.

v3.5.0: Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the `generate` method

10 Nov 13:57

Model versioning, TensorFlow encoder-decoder models, new scripts, refactor of the generate method

Model versioning

We host more and more of the community's models, which is awesome ❤️. To scale this sharing, we needed to change the infra to both support more models and unlock new powerful features.

To that end, we have rebuilt the storage backend that we use for models, moving from S3 to our own git repos (with S3 as a git-lfs endpoint for large files), with one model = one repo.

The benefits of this switch are:

  • built-in versioning (I mean… it’s git. It’s pretty much what you use for versioning. Versioning in S3 has a ton of limitations)
  • access control (will unlock private models, private datasets, etc)
  • scalability (our usage of S3 to maintain lists of models was starting to bottleneck)

Let's dive into the actual changes:

I. On the website


You'll now see a "Browse files and versions" tab or button on each model page. (design is not final, we'll make it more prominent/streamlined in the near future)

This is what this page looks like:

[Screenshot: the "Browse files and versions" page]

The UX should look familiar and self-explanatory, but we'll add more ML-specific features in the future.

You can:

  • see commit histories and diffs of changes made to any text file, like config.json:
    • changes made by the HuggingFace team will be way clearer – we can perform updates to the models to ensure they work well with the library(ies) (you'll be able to opt out from those changes)
  • Large binary files are stored using https://git-lfs.github.com/ which is pretty standard now, and interoperable out of the box with git
  • Ability to update your text files, like your README.md model card, directly on the website!
    • with instant preview 🔥

II. In the transformers library


The PR to enable this new storage mode in the transformers library is available here: #8324

This PR has two parts:

1. changes to the file downloading code used in from_pretrained() methods to use the new file URLs.
Large files are stored in an S3 bucket and served by Cloudfront so downloads should be as fast as they are right now.

In addition, you now have a way to pin a specific version of a model, to a commit hash, tag or branch.

For instance:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
  "julien-c/EsperBERTo-small",
  revision="v2.0.1"  # tag name, or branch name, or commit hash
)

Finally, the networking code is more robust and doesn't gobble up errors anymore, so in case you have trouble downloading a specific file you'll know exactly why.

2. changes to the model upload CLI to create a model repo, then be able to git clone and git push to it.
We are intentionally not wrapping git too much because we expect most model authors to be familiar with git (and possibly git-lfs); let us know if that's not the case.

To create a repo:

transformers-cli repo create your-model-name

Then you'll get a repo url that you'll be able to clone:

git clone https://huggingface.co/username/your-model-name

# Then commit and push as usual
cd your-model-name
echo "hello" >> README.md
git add . && git commit -m "Update from $USER"
git push

A nice side effect of the new system on the upload side is that file uploading should be more robust for very large files (hello T5!) as git-lfs handles the networking code.

By the way, again, every model is its own repo. So you can git clone any public model if you'd like:

git clone https://huggingface.co/gpt2

But you won't be able to push unless it's one of your models (or one of your orgs').

III. Backward compatibility


  • Backward compatibility on model downloads is expected, because even though the new models will be stored in huggingface.co-hosted git repos, we will backport all file changes to S3 automatically.
  • ⚠️ Model uploads using the current system won't work anymore: you'll need to upgrade your transformers installation to the next release, v3.5.0, or to build from master.
    Alternatively, in the next week or so we'll add the ability to create a repo from the website directly so you'll be able to push even without the transformers library.

TFMarian, TFMbart, TFPegasus, TFBlenderbot

  • Add tensorflow 2.0 functionality for SOTA seq2seq transformers #7987 (@sshleifer)

New and updated scripts

We're working on giving examples of how to leverage the 🤗 Datasets library and the Trainer API. Those scripts are meant as easy-to-customize examples, with lots of comments explaining the various steps.

  • Text classification: New run_glue script #7917 (@sgugger)
  • Causal Language Modeling: New run_clm script #8105 (@sgugger)
  • Masked Language Modeling: Add line by line option to mlm/plm scripts #8240 (@sgugger)
  • Token classification: Add new token classification example #8340 (@sgugger)

Seq2Seq Trainer

A child of Trainer specialized for training seq2seq models, from @patil-suraj, @stas00 and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py. Its API is similar to that of examples/seq2seq/finetune.py, but with better API support. Example scripts are in examples/seq2seq/builtin_trainer.

Seq2Seq Testing and Documentation Improvements

Docs for DistilBART Paper Replication

Re-run experiments from the paper here

Refactoring the generate() function

The generate() method now has a new design so that the user can directly call the underlying methods
sample(), greedy_search(), beam_search() and beam_sample(). The code was made more readable, and beam search was sped up by roughly 5-10%.

Refactoring the generate() function #6949 (@patrickvonplaten)
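The usual generate() keyword arguments still select between these sub-methods; a minimal sketch, assuming the gpt2 checkpoint:

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("The generate method now", return_tensors="pt").input_ids

# greedy_search(): no sampling, a single beam
greedy = model.generate(input_ids, max_length=30)

# sample(): do_sample=True, a single beam
sampled = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)

# beam_search(): num_beams > 1, no sampling
beamed = model.generate(input_ids, max_length=30, num_beams=5, early_stopping=True)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))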

Notebooks

General improvements and bugfixes


ProphetNet, Blenderbot, SqueezeBERT, DeBERTa

20 Oct 14:30

ProphetNet, Blenderbot, SqueezeBERT, DeBERTa

ProphetNet

Two new models are released as part of the ProphetNet implementation: ProphetNet and XLM-ProphetNet.

ProphetNet is an encoder-decoder model and can predict n future tokens for “ngram” language modeling instead of just the next token.

XLM-ProphetNet is an encoder-decoder model with an architecture identical to ProphetNet, but the model was trained on the multilingual “wiki100” Wikipedia dump.

The ProphetNet model was proposed in ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.

It was added to the library in PyTorch with the following checkpoints:

  • microsoft/xprophetnet-large-wiki100-cased-xglue-ntg
  • microsoft/prophetnet-large-uncased
  • microsoft/prophetnet-large-uncased-cnndm
  • microsoft/xprophetnet-large-wiki100-cased
  • microsoft/xprophetnet-large-wiki100-cased-xglue-qg
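A minimal summarization sketch using one of the checkpoints above; the generation settings are illustrative only.

from transformers import ProphetNetForConditionalGeneration, ProphetNetTokenizer

tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased-cnndm")
model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased-cnndm")

inputs = tokenizer("The article to summarize goes here.", return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))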

Contributions:

BlenderBot

Blenderbot is an encoder-decoder model for open-domain chat. It uses a standard seq2seq model transformer-based architecture.

The Blender chatbot model was proposed in Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.

It was added to the library in PyTorch with the following checkpoints:

  • facebook/blenderbot-90M
  • facebook/blenderbot-3B

Contributions:

SqueezeBERT

The SqueezeBERT model was proposed in SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model. The key difference between the BERT architecture and the SqueezeBERT architecture is that SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.

It was added to the library in PyTorch with the following checkpoints:

  • squeezebert/squeezebert-mnli
  • squeezebert/squeezebert-uncased
  • squeezebert/squeezebert-mnli-headless

Contributions:

DeBERTa

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

It was added to the library in PyTorch with the following checkpoints:

  • microsoft/deberta-base
  • microsoft/deberta-large

Contributions:

Both SentencePiece and Tokenizers are now optional libraries

Support for SentencePiece is now part of the tokenizers library! Thanks to this we now have near-full support of fast tokenizers in the library.

With this new feature, we slightly change the paradigm regarding installation:

  • SentencePiece is now an optional dependency, paving the way to a fully-featured conda install in the near future

  • Tokenizers is now also an optional dependency, making it possible to install and use the library even when rust cannot be compiled on the machine.

  • [Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies #7659 (@thomwolf)

The main __init__ has been improved to always import the same functions and classes. If someone then tries to use a class that requires an optional dependency, an ImportError will be raised at init (with instructions on how to install the missing dependency) #7537 (@sgugger)

Improvements made to the Trainer

The Trainer API has been improved to work with models requiring several labels or returning several outputs, and to have clearer progress tracking. A new TrainerCallback class has been added to allow the user to easily customize the default training loop.
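As an illustration of the new TrainerCallback class, here is a hypothetical callback that prints the loss whenever the Trainer logs metrics; the hook signature (args, state, control plus keyword extras) follows the TrainerCallback API.

from transformers import TrainerCallback

class LossPrinterCallback(TrainerCallback):
    """Print the training loss every time the Trainer logs metrics."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            print(f"step {state.global_step}: loss = {logs['loss']}")

# Hypothetical usage with an already-configured Trainer:
# trainer = Trainer(model=model, args=training_args, callbacks=[LossPrinterCallback()])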

Seq2Seq Trainer

A child of Trainer specialized for training seq2seq models, from @patil-suraj and @sshleifer. Accessible through examples/seq2seq/finetune_trainer.py.

Distributed Generation

  • You can run model.generate in PyTorch on a large dataset and split the work across multiple GPUs, using examples/seq2seq/run_distributed_eval.py
  • [s2s] release pseudolabel links and instructions #7639 (@sshleifer)
  • [s2s] Fix t5 warning for distributed eval #7487 (@sshleifer)
  • [s2s] fix kwargs style #7488 (@sshleifer)
  • [s2s] fix lockfile and peg distillation constants #7545 (@sshleifer)
  • [s2s] fix nltk pytest race condition with FileLock #7515 (@sshleifer)

Notebooks

General improvements and bugfixes


v3.3.1

29 Sep 18:30

Fixes errors due to name conflicts between the datasets library and local folders or modules named datasets.

RAG

28 Sep 14:32

RAG

RAG Model

The RAG model is a retrieval-augmented generation model that can be leveraged for question-answering tasks using RagTokenForGeneration or RagSequenceForGeneration as proposed in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.

It was added to the library in PyTorch with the following checkpoints:

  • facebook/rag-token-nq
  • facebook/rag-sequence-nq
  • facebook/rag-token-base
  • facebook/rag-sequence-base
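A minimal question-answering sketch with the rag-token-nq checkpoint; it follows the documented pattern and uses the dummy retrieval index to keep the download small (the datasets and faiss packages are required for the retriever).

from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_dict = tokenizer.prepare_seq2seq_batch(
    "who holds the record in 100m freestyle", return_tensors="pt"
)
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])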

Contributions:

Bug fixes and improvements

Bert Seq2Seq models, FSMT, LayoutLM, Funnel Transformer, LXMERT

22 Sep 15:58

Bert Seq2Seq models, FSMT, LayoutLM, Funnel Transformer, LXMERT

BERT Seq2seq models

The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using EncoderDecoderModel as proposed in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

It was added to the library in PyTorch with the following checkpoints:

  • google/roberta2roberta_L-24_bbc
  • google/roberta2roberta_L-24_gigaword
  • google/roberta2roberta_L-24_cnn_daily_mail
  • google/roberta2roberta_L-24_discofuse
  • google/roberta2roberta_L-24_wikisplit
  • google/bert2bert_L-24_wmt_de_en
  • google/bert2bert_L-24_wmt_en_de

Contributions:

FSMT (FairSeq MachineTranslation)

FSMT (FairSeq MachineTranslation) models were introduced in Facebook FAIR’s WMT19 News Translation Task Submission by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.

It was added to the library in PyTorch, with the following checkpoints:

  • facebook/wmt19-en-ru
  • facebook/wmt19-en-de
  • facebook/wmt19-ru-en
  • facebook/wmt19-de-en
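A minimal translation sketch with one of the checkpoints above; the beam size is illustrative.

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")
model = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-en-de")

input_ids = tokenizer("Machine translation is fun.", return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))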

Contributions:

  • [ported model] FSMT (FairSeq MachineTranslation) #6940 (@stas00)
  • build/eval/gen-card scripts for fsmt #7155 (@stas00)
  • skip failing FSMT CUDA tests until investigated #7220 (@stas00)
  • [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test #7224 (@stas00)
  • [s2s] adjust finetune + test to work with fsmt #7263 (@stas00)
  • [fsmt] SinusoidalPositionalEmbedding no need to pass device #7292 (@stas00)
  • Adds FSMT to LM head AutoModel #7312 (@LysandreJik)

LayoutLM

The LayoutLM model was proposed in LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It’s a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.

It was added to the library in PyTorch with the following checkpoints:

  • layoutlm-base-uncased
  • layoutlm-large-uncased

Contributions:

Funnel Transformer

The Funnel Transformer model was proposed in the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks (CNN) in computer vision.

It was added to the library in both PyTorch and TensorFlow, with the following checkpoints:

  • funnel-transformer/small
  • funnel-transformer/small-base
  • funnel-transformer/medium
  • funnel-transformer/medium-base
  • funnel-transformer/intermediate
  • funnel-transformer/intermediate-base
  • funnel-transformer/large
  • funnel-transformer/large-base
  • funnel-transformer/xlarge
  • funnel-transformer/xlarge-base

Contributions:

LXMERT

The LXMERT model was proposed in LXMERT: Learning Cross-Modality Encoder Representations from Transformers by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities) pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.

It was added to the library in TensorFlow with the following checkpoints:

  • unc-nlp/lxmert-base-uncased
  • unc-nlp/lxmert-vqa-uncased
  • unc-nlp/lxmert-gqa-uncased

Contributions

New pipelines

The following pipeline was added to the library:

Notebooks

The following community notebooks were contributed to the library:

  • Demoing LXMERT with raw images by incorporating the FRCNN model for ROI-pooled extraction and bounding-box prediction on the GQA answer set. #6986 (@eltoto1219)
  • [Community notebooks] Add notebook on fine-tuning GPT-2 Model with Trainer Class #7005 (@philschmid)
  • Add "Fine-tune ALBERT for sentence-pair classification" notebook to the community notebooks #7255 (@NadirEM)
  • added multilabel text classification notebook using distilbert to community notebooks #7201 (@DhavalTaunk08)

Encoder-decoder architectures

An additional encoder-decoder architecture was added:

Bug fixes and improvements


Pegasus, DPR, self-documented outputs, new pipelines and MT support

01 Sep 12:36

Pegasus, mBART, DPR, self-documented outputs and new pipelines

Pegasus

The Pegasus model from PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.

Model implemented as a collaboration between Jingqing Zhang and @sshleifer in #6340

  • PegasusForConditionalGeneration (torch version) #6340
  • add pegasus finetuning script #6811 (warning: very slow)

DPR

The DPR model from Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.

DeeBERT

The DeeBERT model from DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the examples/ folder alongside its training script, in PyTorch.

  • Add DeeBERT (entropy-based early exiting for *BERT) #5477 (@ji-xin)

Self-documented outputs

In addition to returning tuples, PyTorch and TensorFlow models can now return an appropriate subclass of ModelOutput. A ModelOutput is a dataclass containing all model returns. This allows for easier inspection and for self-documenting model outputs.

Models return tuples by default, and return self-documented outputs if the return_dict configuration flag is set to True or if the return_dict=True keyword argument is passed to the forward/call method.

Summary of the behavior:

# The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`
from transformers import BertForSequenceClassification

# Either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
# Or when calling the model (`inputs` being a dict of tensors produced by a tokenizer)
outputs = model(**inputs, return_dict=True)

# You can access the elements of the outputs with
# (1) named attributes
loss = outputs.loss
logits = outputs.logits

# (2) their names as strings, like a dict
loss = outputs["loss"]
logits = outputs["logits"]

# (3) their index as integers or slices, as in the pre-3.1.0 output tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]

# One **breaking behavior** of these new outputs (which is the reason you have to opt in to use them):
# iterating over the outputs now returns the names (keys) instead of the values:
print([element for element in outputs])
# >>> ['loss', 'logits']
# Thus you cannot unpack the outputs like pre-3.1.0 (you would get the string names instead of the values):
# (but you can query a slice, as shown in (3) above)
loss_key, logits_key = outputs

Encoder-Decoder framework

The encoder-decoder framework has been enhanced to allow more encoder-decoder model combinations, e.g. Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, and more.
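A minimal sketch of warm-starting one such combination with from_encoder_decoder_pretrained; the checkpoint names are just examples.

from transformers import EncoderDecoderModel

# Warm-start a Bert2Bert model from two pre-trained BERT checkpoints;
# the decoder gets cross-attention layers and a causal mask added.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)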

TensorFlow as a first-class citizen

As we continue working towards having TensorFlow be a first-class citizen, we continually improve on our TensorFlow API and models.

Machine Translation

MarianMTModel

  • en-zh and 357 other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (@sshleifer + @jorgtied). There are now > 1300 supported pairs for machine translation.
  • Marian converter updates #6342 (@sshleifer)
  • Marian distill scripts + integration test #6799 (@sshleifer)

mBART

The mBART model from Multilingual Denoising Pre-training for Neural Machine Translation can now be accessed through MBartForConditionalGeneration.

examples/seq2seq

  • examples/seq2seq/finetune.py supports --task translation
  • All sequence-to-sequence tokenizers (T5, Bart, Marian, Pegasus) expose a prepare_seq2seq_batch method that makes batches for sequence-to-sequence training.

PRs:

New documentation

Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An "Open in Colab" button has been added to the tutorial pages.

Trainer updates

New additions to the Trainer

  • Added data collator for permutation (XLNet) language modeling and related calls #5522 (@shngt)
  • Trainer support for iterabledataset #5834 (@Pradhy729)
  • Adding PaddingDataCollator #6442 (@sgugger)
  • Add hyperparameter search to Trainer #6576 (@sgugger)
  • [examples] Add trainer support for question-answering #4829 (@patil-suraj)
  • Adds comet_ml to the list of auto-experiment loggers #6176 (@dsblank)
  • Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)

New models & model architectures

The following model architectures have been added to the library

Regression testing on TPU & TPU CI

Thanks to @zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the Trainer, and offers very simple regression testing on model training performance.

New pipelines

New pipe...
