Releases: huggingface/optimum
v1.6.0: Optimum CLI, Stable Diffusion ONNX export, BetterTransformer & ONNX support for more architectures
Optimum CLI
The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:
optimum-cli --help
optimum-cli export onnx --help
optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/
Stable Diffusion ONNX export
Optimum now supports the ONNX export of stable diffusion models from the diffusers library:
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/
- Add Stable Diffusion ONNX export by @echarlaix in #570
BetterTransformer support for more architectures
BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT
The complete list of supported models is available in the documentation.
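As a minimal sketch (the checkpoint name is only an illustration, not taken from this release note), any of the newly supported architectures is transformed the same way:
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

# CLIP is among the architectures newly supported by BetterTransformer
model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
model = BetterTransformer.transform(model)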
- [BT] Add BetterTransformer support for FSMT by @Sumanth077 in #494
- [BT] Add BetterTransformer support for ViLT architecture by @ka00ri in #508
- Add MBart support for BetterTransformer by @ravenouse in #516
- Add CLIP BetterTransformer by @fxmarty in #534
- Add BetterTransformer support for RemBERT by @hchings in #545
ONNX export for more architectures
The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.
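For example, a sketch of exporting a Swin checkpoint (the model name and task are illustrative, not taken from this release note):
optimum-cli export onnx --model microsoft/swin-tiny-patch4-window7-224 --task image-classification swin_onnx/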
- Add Swin support in exporters.onnx by @fxmarty in #528
- [ONNX] add mobilenet support by @younesbelkada in #633
Extended ONNX export for encoder-decoder and decoder models
Encoder-decoder and decoder-only models that normally make use of the generate() method in transformers can now be exported to several ONNX files using the --for-ort argument:
optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx
yielding:
.
└── t5_small_onnx
├── config.json
├── decoder_model.onnx
├── decoder_with_past_model.onnx
├── encoder_model.onnx
├── special_tokens_map.json
├── spiece.model
├── tokenizer_config.json
└── tokenizer.json
When passing --for-ort, the exported models are expected to be loadable directly into ORTModel.
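For instance, the t5_small_onnx directory above should be loadable end-to-end; a minimal sketch:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Loads the encoder, decoder and decoder-with-past ONNX files exported with --for-ort
tokenizer = AutoTokenizer.from_pretrained("t5_small_onnx")
model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")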
- Add ort export in exporters for encoder-decoder models by @mht-sharma in #497
- Support decoder generated with --for-ort from optimum.exporters.onnx in ORTDecoder by @fxmarty in #554
Support for ONNX models with external data at export, optimization, quantization
The ONNX export from PyTorch normally creates external data in case the exported model is larger than 2 GB. This release introduces better support for the export and use of large models, writing all external data into a .onnx_data file if necessary.
- Handling ONNX models with external data by @NouamaneTazi in #586
- Improve the compatibility dealing with large ONNX proto in ORTOptimizer and ORTQuantizer by @JingyaHuang in #332
ONNX Runtime API improvement
Various improvements to allow for a better user experience in the ONNX Runtime integration:
- ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model files regardless of their names, allowing optimized and quantized models to be loaded without having to specify a file name argument.
- ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.
- ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.
- ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.
- ONNX Runtime integration API improvement by @michaelbenayoun in #515
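For example, an optimized or quantized model saved under an arbitrary file name can now be loaded without a file name argument; a minimal sketch, assuming a hypothetical local directory:
from optimum.onnxruntime import ORTModelForSequenceClassification

# The directory may contain e.g. model_optimized.onnx or model_quantized.onnx;
# the file name no longer needs to be specified explicitly
model = ORTModelForSequenceClassification.from_pretrained("path/to/onnx_model_dir")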
Custom shapes support at ONNX export
The shape of the dummy inputs used for the ONNX export can be overridden, in case the validity of the exported ONNX model is sensitive to the shapes used during the export.
Read more: optimum-cli export onnx --help
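For example, a sketch overriding the dummy input shapes used during the export (the exact argument names are listed in the help output above):
optimum-cli export onnx --model bert-base-uncased --task sequence-classification --batch_size 4 --sequence_length 128 bert_onnx/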
- Support custom shapes for dummy inputs by @fxmarty in #522
- Support for custom input shapes in exporters onnx by @fxmarty in #575
Enable use_cache=True for ORTModelForCausalLM
Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible with use_cache=True, avoiding recomputing them at each iteration of the decoding:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# use_cache=True exports the model with support for reusing past key values
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)

inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs)
tokenizer.batch_decode(gen_tokens)
- Enable past_key_values for ORTModelForCausalLM by @echarlaix in #326
IO binding support for ORTModelForCustomTasks
ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.
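A minimal sketch, assuming a hypothetical directory containing an ONNX model exported for a custom task:
from optimum.onnxruntime import ORTModelForCustomTasks

# With CUDAExecutionProvider, IO binding is used to avoid host/device data copies
model = ORTModelForCustomTasks.from_pretrained("path/to/custom_task_onnx_model", provider="CUDAExecutionProvider")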
- Add IO binding support for custom ORTModel by @JingyaHuang in #447
Experimental support to merge ONNX decoder with/without past key values
Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export produces two models: one not using the previously computed keys/values, and one using them.
Experimental support is introduced to merge the two models into one. Example:
optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/
import onnx
from optimum.onnx import merge_decoders
decoder = onnx.load("t5_onnx/decoder_model.onnx")
decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")
merged_model = merge_decoders(decoder, decoder_with_past)
onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")
- Merge ONNX decoder models by @JingyaHuang in #587
Major bugs fixed
- Fix BetterTransformer with padding="max_length" by @fxmarty in #543
- Fix non-nesting bug in BetterTransformer integration by @younesbelkada in #637
Other changes, bugfixes and improvements
- Fix doc-builder permission error by @mishig25 in #482
- Fix doc build PR permissions by @mishig25 in #484
- Re-order the task manager doc by @michaelbenayoun in #483
- Fix whisper device for gpu test by @fxmarty in #486
- Fix tensorflow CI by @fxmarty in #489
- Fix PR doc generation by @regisss in #495
- Fix broken links in the doc by @fxmarty in #499
- Update iobinding ORT encoder whisper by @mht-sharma in #498
- fix NormalizedConfig init error message by @PaulQbFeng in #500
- Change import structure for ORTModel by @fxmarty in #456
- [BT] Fix failing CI tests by @younesbelkada in #501
- Remove redundant condition statement in ORTDecoder(Seq2seq) by @JingyaHuang in #504
- [BT] put decorator on the correct place by @younesbelkada in #509
- [BT] clearer error message for norm_first by @younesbelkada in #510
- Deprecate PyTorch 1.12 for BetterTransformer by @fxmarty in #513
- Fix ORTModelForSeq2SeqLM test by @fxmarty in #455
- Clearer error messages when initializing the requested ONNX Runtime execution provider fails by @fxmarty in #514
- [BT] Fix doc bugs by @younesbelkada in #517
- Replace sklearn by scikit-learn by @lesteve in #502
- ORTModel uses optimum.exporters.onnx by @michaelbenayoun in #490
- Cleanup deprecated ONNX Runtime training docker files by @JingyaHuang in #523
- Added support for Tapas Model by @juheon...
v1.5.2: Patch release
Temporarily constrain numpy<1.24.0 (#614)
v1.5.1: Patch release
Deprecate PyTorch 1.12 for BetterTransformer with a better error message (#513)
v1.5.0: BetterTransformer Integration, IOBinding, Optimum Exporters, and Whisper with ONNX Runtime
BetterTransformer
Convert your model into its PyTorch BetterTransformer format using a one-liner with the new BetterTransformer integration, for faster inference on CPU and GPU!
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
Check the full list of supported models in the documentation, and check out the Google Colab demo.
ONNX Runtime IOBinding support
ORT models (except for ORTModelForCustomTasks) now support IOBinding to avoid data copying overheads between the host and device, bringing a significant inference speedup during the decoding process on GPU.
By default, use_io_binding is set to True when using CUDA. You can turn off IOBinding in case of any memory issue:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", use_io_binding=False)
Contributions
- Add IOBinding support to ONNX Runtime module (#421)
Optimum Exporters
optimum.exporters is a new module that handles the export of PyTorch and TensorFlow models to several backends. Only ONNX is supported for now, and more than 50 architectures can already be exported, including BERT, GPT-Neo, Bloom, T5, ViT, Whisper and CLIP.
The export can be done via the CLI:
python -m optimum.exporters.onnx --model openai/whisper-tiny.en whisper_onnx/
For more information, check the documentation.
Whisper
- Whisper can be exported to ONNX using optimum.exporters.
- Whisper can also be exported and run using optimum.onnxruntime; IO binding is also supported.
Note: For now, the export from optimum.exporters will not be usable by ORTModelForSpeechSeq2Seq. To be able to run inference, export Whisper directly using ORTModelForSpeechSeq2Seq, as shown below. This will be solved in the next release.
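A minimal sketch of this direct export path (the checkpoint name is illustrative):
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# from_transformers=True exports the PyTorch model to ONNX before loading it for inference
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", from_transformers=True)
model.save_pretrained("whisper_onnx/")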
Contributions
- Whisper support with optimum.onnxruntime and optimum.exporters (#420)
Other contributions
- ONNX Runtime training now supports ORT 1.13.1 and transformers 4.23.1 (#434)
- ORTModel can load models from subfolders in a similar fashion as in transformers (#443)
- ORTOptimizer has been refactored, and a factory class has been added to create common OptimizationConfigs (#457)
- Fixes and updates in the documentation (#411, #432, #437, #441)
- Fixes IOBinding (#454, #461)
v1.4.1: Patch release
- Add inference with ORTModel to ORTTrainer and ORTSeq2SeqTrainer #189
- Add InferenceSession options and provider to ORTModel #271
- Add mT5 (#341) and Marian (#393) support to ORTOptimizer
- Add batchnorm folding torch.fx transformations #348
- The torch.fx transformations now use the marking methods mark_as_transformed, mark_as_restored, get_transformed_nodes #385
- Update BaseConfig for transformers 4.22.0 release #386
- Update ORTTrainer for transformers 4.22.1 release #388
- Add extra ONNX Runtime quantization options #398
- Add possibility to pass provider_options to ORTModel #401
- Add support to pass a specific device for ORTModel, as transformers does for pipelines #427
- Fixes to support onnxruntime 1.13.1 #430
v1.4.0: ORTQuantizer and ORTOptimizer refactorization
ONNX Runtime
- Refactorization of ORTQuantizer (#270) and ORTOptimizer (#294)
- Add ONNX Runtime fused Adam Optimizer (#295)
- Add ORTModelForCustomTasks allowing ONNX Runtime inference support for custom tasks (#303)
- Add ORTModelForMultipleChoice allowing ONNX Runtime inference for models with a multiple choice classification head (#358)
Torch FX
- Add FuseBiasInLinear, a transformation that fuses the weight and the bias of linear modules (#253)
Improvements and bugfixes
- Enable the possibility to disregard the precomputed past_key_values during ONNX Runtime inference of Seq2Seq models (#241)
- Enable node exclusion from quantization for benchmark suite (#284)
- Enable possibility to use a token authentication when loading a calibration dataset (#289)
- Fix optimum pipeline when no model is given (#301)
v1.3.0: Torch FX transformations, ORTModelForSeq2SeqLM and ORTModelForImageClassification
Torch FX
The optimum.fx.optimization module (#232) provides a set of torch.fx graph transformations, along with classes and functions to write your own transformations and compose them.
- The Transformation and ReversibleTransformation classes represent non-reversible and reversible transformations, and it is possible to write such transformations by inheriting from those classes.
- The compose utility function enables transformation composition (see the usage sketch after this list).
- Two reversible transformations were added:
  - MergeLinears: merges linear layers that have the same input
  - ChangeTrueDivToMulByInverse: changes a division by a static value to a multiplication by its inverse
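A minimal usage sketch of composing the two new transformations (the checkpoint and traced input names are assumptions, not taken from the release note):
from transformers import AutoModel
from transformers.utils.fx import symbolic_trace
from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose

# Trace the model to a torch.fx GraphModule before applying transformations
model = AutoModel.from_pretrained("bert-base-uncased")
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

# Compose the two reversible transformations and apply them in a single call
transformation = compose(MergeLinears(), ChangeTrueDivToMulByInverse())
transformed_model = transformation(traced)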
ORTModelForSeq2SeqLM
ORTModelForSeq2SeqLM (#199) allows ONNX export and ONNX Runtime inference for Seq2Seq models.
- When exported, Seq2Seq models are decomposed into three parts: the encoder, the decoder (actually consisting of the decoder with the language modeling head), and the decoder with pre-computed key/values as additional inputs.
- This specific export comes from the fact that during the first pass, the decoder has no pre-computed key/values hidden-states, while during the rest of the generation past key/values will be used to speed up sequential decoding.
Below is an example that downloads a T5 model from the Hugging Face Hub, exports it through the ONNX format and saves it:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
# Load model from hub and export it through the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)
# Save the exported model in the given directory
model.save_pretrained(output_dir)
ORTModelForImageClassification
ORTModelForImageClassification
(#226) allows ONNX Runtime inference for models with an image classification head.
Below is an example that downloads a ViT model from the Hugging Face Hub, exports it through the ONNX format and saves it:
from optimum.onnxruntime import ORTModelForImageClassification
# Load model from hub and export it through the ONNX format
model = ORTModelForImageClassification.from_pretrained("google/vit-base-patch16-224", from_transformers=True)
# Save the exported model in the given directory
model.save_pretrained(output_dir)
ORTOptimizer
Adds support for converting model weights from fp32 to fp16 by adding a new optimization parameter (fp16) to OptimizationConfig (#273).
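A minimal sketch of enabling the new parameter (the resulting config is then passed to the ORTOptimizer as usual):
from optimum.onnxruntime.configuration import OptimizationConfig

# fp16=True converts the model weights from fp32 to fp16 during optimization
optimization_config = OptimizationConfig(optimization_level=1, fp16=True)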
Pipelines
Additional pipeline tasks are now supported; here is a list of the supported tasks along with the default model for each:
- Image Classification (ViT)
- Text-to-Text Generation (T5 small)
- Summarization (T5 base)
- Translation (T5 base)
Below is an example that downloads a T5 small model from the Hub and loads it with the transformers pipeline for translation:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
text = "What a beautiful day !"
pred = onnx_translation(text)
# [{'translation_text': "C'est une belle journée !"}]
Breaking change
The ORTModelForXXX execution provider default value is now set to CPUExecutionProvider (#203). Before, if no execution provider was provided, it was set to CUDAExecutionProvider if a GPU was detected, or to CPUExecutionProvider otherwise.
v1.2.3: Patch release
- Remove intel sub-package, migrating to optimum-intel (#212)
- Fix the loading and saving of ORTModel optimized and quantized models (#214)
v1.2.2: Patch release
v1.2.1: Patch release
Add support for Python version 3.7 (#176)