Release Adding OpenAI GPT and Transformer-XL pretrained models, python2 support, pre-training script for BERT, SQuAD 2.0 example · huggingface/transformers

New pretrained models:

OpenAI GPT pretrained on the Toronto Book Corpus ("Improving Language Understanding by Generative Pre-Training" by Alec Radford et al.).
- This is a slightly modified version of our previous PyTorch implementation that improves performance by splitting word and position embeddings into separate embedding matrices.
- Performance checked to be on par with the TF implementation on ROCStories: single-run evaluation accuracy of 86.4% vs. authors reporting a median accuracy of 85.8% with the TensorFlow code (see details in the example section of the readme).
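The embedding split mentioned above can be illustrated with a minimal pure-Python sketch. This is not the library's code: it only shows the idea of keeping one lookup table for tokens and a separate one for positions, then summing the two rows at each position. The helper names (`make_table`, `embed`) and the tiny sizes are hypothetical and chosen only for illustration.

```python
import random

def make_table(num_rows, dim, seed):
    """Build a small random lookup table (a list of row vectors)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed(token_ids, word_table, position_table):
    """Sum the word embedding and the position embedding at each position."""
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the position table")
    return [
        [w + p for w, p in zip(word_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical sizes: 10 tokens in the vocabulary, 4 positions, 3 dimensions.
word_table = make_table(num_rows=10, dim=3, seed=0)
position_table = make_table(num_rows=4, dim=3, seed=1)

# Token 2 appears twice; its two vectors differ only through the position rows.
hidden = embed([7, 2, 2], word_table, position_table)
```

With two tables, the vocabulary size and the maximum sequence length can be changed independently of each other.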
Transformer-XL pretrained on WikiText 103 ("Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" by Zihang Dai, Zhilin Yang et al.). This is a slightly modified version of Google/CMU's PyTorch implementation to match the performance of the TensorFlow version by:
- untying relative positioning embeddings across layers,
- changing memory cells initialization to keep sinusoidal positions identical,
- adding full logits output to the adaptive softmax so that it can be used in a generative setting.
- Performance checked to be on par with the TF implementation on WikiText 103: evaluation perplexity of 18.213 vs. authors reporting a perplexity of 18.3 on this dataset with the TensorFlow code (see details in the example section of the readme).
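For readers unfamiliar with the metric quoted above, here is a small sketch of how perplexity is usually defined: the exponential of the mean negative log-likelihood per token. It assumes natural-log probabilities and is not taken from the evaluation script; the function name `perplexity` is hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

Lower is better: a perplexity of about 18 means the model is, on average, as uncertain as a uniform choice among roughly 18 tokens.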

Updated the SQuAD fine-tuning script to also work on SQuAD 2.0, by @abeljim and @Liangtaiwan
run_lm_finetuning.py lets you pretrain a BERT language model or fine-tune it with masked-language-modeling and next-sentence-prediction losses, by @deepset-ai, @tholor and @nhatchan (Python 3.5 compatibility)
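As background for the masked-language-modeling loss mentioned above, the sketch below follows the masking recipe described in the BERT paper: about 15% of tokens are selected for prediction, and of those 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged. It is an illustration under those assumptions, not the code of run_lm_finetuning.py; the function `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); a label is None where nothing is predicted."""
    masked, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append("[MASK]")
        elif roll < 0.9:
            masked.append(rng.choice(vocab))
        else:
            masked.append(tok)
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
```

Only positions with a non-`None` label would contribute to the masked-language-modeling loss.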

add a never_split option and arguments to the tokenizers (@WrRan)
better handle errors when BERT is fed inputs that are too long (@patrick-s-h-lewis)
better layer normalization initialization and bug fix in example scripts: args.do_lower_case was always True (@donglixp)
fix learning rate schedule issue in example scripts (@matej-svejda)
readme fixes (@danyaljj, @nhatchan, @davidefiocco, @girishponkiya)
importing unofficial TF models in BERT (@nhatchan)
only keep the active part of the loss for token classification (@Iwontbecreative)
fix argparse type error in example scripts (@ksurya)
docstring fixes (@rodgzilla, @wlhgtc)
improving run_classifier.py loading of saved models (@SinghJasdeep)
in example scripts: allow do_eval to be used without do_train and to use the pretrained model in the output folder (@jaderabbit, @likejazz and @JoeDumoulin)
in run_squad.py: fix error when bert_model param is path or url (@likejazz)
add license to source distribution and use entry-points instead of scripts (@sodre)
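
The token-classification change listed above ("only keep the active part of the loss") can be illustrated with a short sketch. It assumes that "active" means positions where the attention mask is 1, so that padding positions do not contribute to the averaged loss. The function names are hypothetical and the code is pure Python rather than the library's PyTorch implementation.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of `label` for one token."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]

def active_token_loss(logits_seq, labels, attention_mask):
    """Mean cross-entropy over positions whose attention mask is 1."""
    losses = [
        cross_entropy(logits, label)
        for logits, label, keep in zip(logits_seq, labels, attention_mask)
        if keep == 1
    ]
    if not losses:
        raise ValueError("no active positions in the attention mask")
    return sum(losses) / len(losses)

logits_seq = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]  # last position is padding
labels = [0, 1, 1]
attention_mask = [1, 1, 0]

loss = active_token_loss(logits_seq, labels, attention_mask)
unmasked_loss = active_token_loss(logits_seq, labels, [1, 1, 1])
```

In this toy example the padded position carries a badly wrong prediction, so including it (`unmasked_loss`) inflates the average, while the masked version reflects only the real tokens.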