Releases: huggingface/optimum
v1.2.0: pipeline and AutoModelForXxx classes to run ONNX Runtime inference
ORTModel
ORTModelForXXX classes such as ORTModelForSequenceClassification were integrated with the Hugging Face Hub in order to easily export models through the ONNX format, load ONNX models, save the resulting model and push it to the 🤗 Hub, using the save_pretrained and push_to_hub methods respectively. An already optimized and / or quantized ONNX model can also be loaded with the ORTModelForXXX classes through the from_pretrained method.
Below is an example that downloads a DistilBERT model from the Hub, exports it through the ONNX format and saves it:
from optimum.onnxruntime import ORTModelForSequenceClassification
# Load model from hub and export it through the ONNX format
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True
)
# Save the exported model
model.save_pretrained("a_local_path_for_convert_onnx_model")
Pipelines
Built-in support for transformers pipelines was added. This allows us to leverage the same API used in Transformers, with the power of accelerated runtimes such as ONNX Runtime.
The currently supported tasks, with the default model for each, are the following:
- Text Classification (DistilBERT model fine-tuned on SST-2)
- Question Answering (DistilBERT model fine-tuned on SQuAD v1.1)
- Token Classification (BERT large fine-tuned on CoNLL-2003)
- Feature Extraction (DistilBERT)
- Zero Shot Classification (BART model fine-tuned on MNLI)
- Text Generation (DistilGPT2)
Below is an example that downloads a RoBERTa model from the Hub, exports it through the ONNX format and loads it with a transformers pipeline for question answering.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering
# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2",from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
# test the model with using transformers pipeline, with handle_impossible_answer for squad_v2
optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(
question="What's my name?", context="My name is Philipp and I live in Nuremberg."
)
print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}
Improvements
- Add the loss when performing the evaluation step using an instance of ORTTrainer, previously not enabled when inference was performed with ONNX Runtime, in #152
v1.1.1: Patch release
Habana
- Installation details added for Optimum-Habana, which provides an optimized transformers integration for Intel's Habana Gaudi Processor (HPU).
ONNX Runtime
- Add the possibility to specify the execution provider in ORTModel.
- Add the IncludeFullyConnectedNodes class to find the nodes composing the fully connected layers, in order to only target the latter for quantization and limit the accuracy drop.
- Update QuantizationPreprocessor so that the intersection of the two sets representing the nodes to quantize and the nodes to exclude from quantization is an empty set.
- Rename Seq2SeqORTTrainer to ORTSeq2SeqTrainer for clarity and consistency.
- Add ORTOptimizer support for ELECTRA models.
- Fix the loading of a pretrained ORTConfig containing optimization and quantization configs.
v1.1.0: ORTTrainer, Seq2SeqORTTrainer, ONNX Runtime optimization and quantization API improvements
ORTTrainer and Seq2SeqORTTrainer
The ORTTrainer and Seq2SeqORTTrainer are two new experimental classes.
- Both ORTTrainer and Seq2SeqORTTrainer were created to have a user-facing API similar to the Trainer and Seq2SeqTrainer of the Transformers library (a minimal usage sketch follows this list).
- ORTTrainer allows the usage of the ONNX Runtime backend to train a given PyTorch model in order to accelerate training. ONNX Runtime will run the forward and backward passes using an optimized automatically-exported ONNX computation graph, while the rest of the training loop is executed by native PyTorch.
- ORTTrainer also allows the usage of ONNX Runtime inferencing during both the evaluation and the prediction steps.
- For Seq2SeqORTTrainer, ONNX Runtime inferencing is incompatible with --predict_with_generate, as the generate method is not supported yet.
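Below is a minimal training sketch, assuming the ORTTrainer constructor mirrors transformers.Trainer as described above; any ORT-specific arguments of this release are omitted, and ONNX Runtime training additionally requires an ONNX Runtime training build (e.g. the torch-ort package) to be installed:
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from optimum.onnxruntime import ORTTrainer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Small SST-2 subsets to keep the sketch quick to run
dataset = load_dataset("glue", "sst2")
def tokenize(batch):
    return tokenizer(batch["sentence"], padding="max_length", truncation=True, max_length=128)
train_dataset = dataset["train"].select(range(256)).map(tokenize, batched=True)
eval_dataset = dataset["validation"].select(range(64)).map(tokenize, batched=True)

training_args = TrainingArguments(output_dir="ort_trainer_output", num_train_epochs=1, per_device_train_batch_size=8)

# Same user-facing API as transformers.Trainer
trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()               # forward/backward passes run through ONNX Runtime
metrics = trainer.evaluate()  # evaluation can also use ONNX Runtime inferencing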
ONNX Runtime optimization and quantization API improvements
The ORTQuantizer and ORTOptimizer classes underwent a massive refactoring that allows a simpler and more flexible user-facing API.
- Addition of the possibility to iteratively compute the quantization activation ranges when applying static quantization, by using the ORTQuantizer method partial_fit. This is especially useful when using memory-hungry calibration methods such as the Entropy and Percentile methods.
- When using the MinMax calibration method, it is now possible to compute the moving average of the minimum and maximum values representing the activation quantization ranges, instead of the global minimum and maximum (feature available with onnxruntime v1.11.0 or higher).
- The OptimizationConfig, QuantizationConfig and CalibrationConfig classes were added in order to better segment the different ONNX Runtime related parameters, instead of having one unique ORTConfig configuration (see the sketch after this list).
- The QuantizationPreprocessor class was added in order to find the nodes to include and / or exclude from quantization, by finding the nodes following a given pattern (such as the nodes forming LayerNorm, for example). This is particularly useful in the context of static quantization, where the quantization of modules such as LayerNorm or GELU is responsible for an important drop in accuracy.
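As an illustration of the refactored API, the sketch below applies dynamic quantization. It is only a sketch under assumptions: the AutoQuantizationConfig helper and the ORTQuantizer.from_pretrained / export entry points are assumed here, and the exact signatures may differ in this release.
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Assumed entry point: build a quantizer for a model hosted on the Hub
quantizer = ORTQuantizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    feature="sequence-classification",
)

# Dynamic quantization configuration (no calibration dataset required)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Export the ONNX model together with its quantized counterpart
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)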
v1.0.0: ONNX Runtime optimization and quantization support
ONNX Runtime support
- An ORTConfig class was introduced, allowing the user to define the desired export, optimization and quantization strategies.
- The ORTOptimizer class takes care of the model's ONNX export as well as the graph optimization provided by ONNX Runtime. In order to create an instance of ORTOptimizer, the user needs to provide an ORTConfig object defining the export and graph-level transformation information. Optimization can then be performed by calling the ORTOptimizer.fit method.
- ONNX Runtime static and dynamic quantization can also be applied to a model by using the newly added ORTQuantizer class. In order to create an instance of ORTQuantizer, the user needs to provide an ORTConfig object defining the export and quantization information, such as the quantization approach to use or the activation and weight data types. Quantization can then be applied by calling the ORTQuantizer.fit method (a hedged sketch follows this list).
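For illustration, here is a hedged sketch of the dynamic quantization flow described above; the argument names of ORTConfig and ORTQuantizer.fit are assumptions based on the description and may differ from the actual signatures of this release.
from optimum.onnxruntime import ORTConfig, ORTQuantizer

# The desired quantization strategy is declared in the configuration object
# (the quantization_approach argument name is an assumption)
ort_config = ORTConfig(quantization_approach="dynamic")

quantizer = ORTQuantizer(ort_config)
# Export the model to ONNX and apply dynamic quantization
quantizer.fit(
    "distilbert-base-uncased-finetuned-sst-2-english",
    output_dir="quantized_model",
    feature="sequence-classification",
)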
Additional features for Intel Neural Compressor
We have also added a new class called IncOptimizer, which takes care of combining the pruning and the quantization processes.
v0.1.2: Intel Neural Compressor's pruning support
With this release, we enable Intel Neural Compressor v1.8 magnitude pruning for a variety of NLP tasks, with the introduction of IncTrainer, which handles the pruning process.
v0.1.1: Intel Neural Compressor's dynamic, post-training and quantization-aware training support
With this release, we enable Intel Neural Compressor v1.7 PyTorch dynamic, post-training and quantization-aware training quantization for a variety of NLP tasks. This support covers the overall process, from applying quantization to loading the resulting quantized model, the latter being enabled by the introduction of the IncQuantizedModel class.
Optimum v0.0.1 - EAP
Initial release for early access to the Optimum library, featuring Intel's LPOT quantization and pruning support.