New model: RoBERTa, tokenizer sequence pair handling for sequence classification models.
New model: RoBERTa
RoBERTa (from Facebook), presented in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
Thanks to Myle Ott from Facebook for his help.
Tokenizer sequence pair handling
Tokenizers get two new methods: tokenizer.add_special_tokens_single_sentence(token_ids) and tokenizer.add_special_tokens_sentences_pair(token_ids_0, token_ids_1).
These methods add the model-specific special tokens to sequences. The sentence-pair method builds a list of tokens with the cls and sep tokens placed according to the way the model was trained.
Sequence pair examples:
For BERT:
[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
For RoBERTa:
<s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>
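As a minimal sketch of how these methods can be combined (assuming the pytorch_transformers package and the bert-base-uncased checkpoint are available; the example sentences are illustrative):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode without special tokens (the default), then add them explicitly.
ids_0 = tokenizer.encode("The cat sat on the mat.")
ids_1 = tokenizer.encode("It was very comfortable.")

# Single sentence: [CLS] SEQUENCE_0 [SEP]
single_input = tokenizer.add_special_tokens_single_sentence(ids_0)

# Sentence pair: [CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
pair_input = tokenizer.add_special_tokens_sentences_pair(ids_0, ids_1)

print(tokenizer.convert_ids_to_tokens(pair_input))
```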
Tokenizer encoding function
The tokenizer encode function gets two new arguments: tokenizer.encode(text, text_pair=None, add_special_tokens=False).
If text_pair is specified, encode will return a tuple of encoded sequences. If add_special_tokens is set to True, the sequences will be built with the models' respective special tokens using the previously described methods.
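A short sketch of the new arguments (assuming the pytorch_transformers package and the roberta-base checkpoint; the sentences are placeholders):

```python
from pytorch_transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# No text_pair: a single list of token ids, without special tokens by default.
ids = tokenizer.encode("The cat sat on the mat.")

# With text_pair and the default add_special_tokens=False:
# a tuple of two encoded sequences.
ids_0, ids_1 = tokenizer.encode("The cat sat on the mat.",
                                text_pair="It was very comfortable.")

# With add_special_tokens=True the pair is built with the RoBERTa
# special tokens, i.e. <s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>.
pair_with_special = tokenizer.encode("The cat sat on the mat.",
                                     text_pair="It was very comfortable.",
                                     add_special_tokens=True)
```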
AutoConfig, AutoModel and AutoTokenizer
There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: AutoConfig, AutoModel, and AutoTokenizer.
Those classes take as input a pre-trained model name or path and instantiate the corresponding class. The input string indicates to the class which architecture should be instantiated: if the string contains "bert", AutoConfig instantiates a BertConfig, AutoModel instantiates a BertModel, and AutoTokenizer instantiates a BertTokenizer.
The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
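A minimal sketch of the Auto classes (assuming the pytorch_transformers package and the bert-base-uncased and roberta-base checkpoints are reachable):

```python
from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# The "bert" substring in the name selects the BERT classes.
config = AutoConfig.from_pretrained("bert-base-uncased")        # BertConfig
model = AutoModel.from_pretrained("bert-base-uncased")          # BertModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # BertTokenizer

# The "roberta" substring selects the RoBERTa classes instead.
roberta_model = AutoModel.from_pretrained("roberta-base")       # RobertaModel
```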
Examples
Some examples have been refactored to better reflect the current library. Those are: simple_lm_finetuning.py, finetune_on_pregenerated.py, as well as run_glue.py, which has been adapted to the RoBERTa model. The run_squad.py and run_glue.py examples have better dataset processing with caching.
Bug fixes and improvements to the library modules
- Fixed multi-gpu training when using FP16 (@zijunsun)
- Re-added the possibility to import BertPretrainedModel (@thomwolf)
- Improvements to the TensorFlow -> PyTorch checkpoint conversion (@dhpollack)
- Fixed save_pretrained to save the correct added tokens (@joelgrus)
- Fixed version issues in run_openai_gpt (@rabeehk)
- Fixed an issue with line returns in Chinese BERT (@Yiqing-Zhou)
- Added more flexibility regarding the PretrainedModel.from_pretrained method (@xanlsh)
- Fixed issues regarding backward compatibility with PyTorch 1.0.0 (@thomwolf)
- Added the unknown token to GPT-2 (@thomwolf)