New model: RoBERTa, tokenizer sequence pair handling for sequence classification models.
New model: RoBERTa
RoBERTa (from Facebook), presented in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
Thanks to Myle Ott from Facebook for his help.
Tokenizer sequence pair handling
Tokenizers get two new methods: tokenizer.add_special_tokens_single_sentence(token_ids) and tokenizer.add_special_tokens_sentences_pair(token_ids_0, token_ids_1).
These methods add the model-specific special tokens to sequences. The sentence-pair method builds a list of tokens with the cls and sep tokens placed according to the way the model was trained.
Sequence pair examples:
For BERT:
[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
For RoBERTa:
<s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>
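As a minimal sketch of how these methods can be combined (assuming the pytorch_transformers package and the bert-base-uncased checkpoint are available; the example sentences are illustrative):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode without special tokens (the default), then add them explicitly.
ids_0 = tokenizer.encode("The cat sat on the mat.")
ids_1 = tokenizer.encode("It was very comfortable.")

# Single sentence: [CLS] SEQUENCE_0 [SEP]
single_input = tokenizer.add_special_tokens_single_sentence(ids_0)

# Sentence pair: [CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
pair_input = tokenizer.add_special_tokens_sentences_pair(ids_0, ids_1)

print(tokenizer.convert_ids_to_tokens(pair_input))
```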
Tokenizer encoding function
The tokenizer encode function gets two new arguments: tokenizer.encode(text, text_pair=None, add_special_tokens=False).
If text_pair is specified, encode will return a tuple of encoded sequences. If add_special_tokens is set to True, the sequences will be built with the models' respective special tokens using the previously described methods.
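A short sketch of the new arguments (assuming the pytorch_transformers package and the roberta-base checkpoint; the sentences are placeholders):

```python
from pytorch_transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# No text_pair: a single list of token ids, without special tokens by default.
ids = tokenizer.encode("The cat sat on the mat.")

# With text_pair and the default add_special_tokens=False:
# a tuple of two encoded sequences.
ids_0, ids_1 = tokenizer.encode("The cat sat on the mat.",
                                text_pair="It was very comfortable.")

# With add_special_tokens=True the pair is built with the RoBERTa
# special tokens, i.e. <s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>.
pair_with_special = tokenizer.encode("The cat sat on the mat.",
                                     text_pair="It was very comfortable.",
                                     add_special_tokens=True)
```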
AutoConfig, AutoModel and AutoTokenizer
There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: AutoConfig, AutoModel, and AutoTokenizer.
Those classes take as input a pre-trained model name or path and instantiate the corresponding class. The input string indicates to the class which architecture should be instantiated: if the string contains "bert", AutoConfig instantiates a BertConfig, AutoModel instantiates a BertModel, and AutoTokenizer instantiates a BertTokenizer.
The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
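A minimal sketch of the Auto classes (assuming the pytorch_transformers package and the bert-base-uncased and roberta-base checkpoints are reachable):

```python
from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# The "bert" substring in the name selects the BERT classes.
config = AutoConfig.from_pretrained("bert-base-uncased")        # BertConfig
model = AutoModel.from_pretrained("bert-base-uncased")          # BertModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # BertTokenizer

# The "roberta" substring selects the RoBERTa classes instead.
roberta_model = AutoModel.from_pretrained("roberta-base")       # RobertaModel
```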
Examples
Some examples have been refactored to better reflect the current library. Those are: simple_lm_finetuning.py, finetune_on_pregenerated.py, as well as run_glue.py, which has been adapted to the RoBERTa model. The run_squad.py and run_glue.py examples have better dataset processing with caching.
Bug fixes and improvements to the library modules
- Fixed multi-gpu training when using FP16 (@zijunsun)
- Re-added the possibility to import BertPretrainedModel (@thomwolf)
- Improvements to the TensorFlow -> PyTorch checkpoint conversion (@dhpollack)
- Fixed save_pretrained to save the correct added tokens (@joelgrus)
- Fixed version issues in run_openai_gpt (@rabeehk)
- Fixed an issue with line returns in Chinese BERT (@Yiqing-Zhou)
- Added more flexibility regarding the PretrainedModel.from_pretrained method (@xanlsh)
- Fixed issues regarding backward compatibility with PyTorch 1.0.0 (@thomwolf)
- Added the unknown token to GPT-2 (@thomwolf)