Releases · openvinotoolkit/nncf
v2.14.1
v2.14.0
Post-training Quantization:
Features:
- Introduced the `backup_mode` optional parameter in `nncf.compress_weights()` to specify the data type for embeddings, convolutions and last linear layers during 4-bit weights compression (see the sketch after this list). Available options are INT8_ASYM (the default), INT8_SYM, and NONE, which retains the original floating-point precision of the model weights.
- Added the `quantizer_propagation_rule` parameter, providing fine-grained control over quantizer propagation. This advanced option is designed to improve accuracy for models where quantizers with different granularity could be merged to per-tensor, potentially affecting model accuracy.
- Introduced the `nncf.data.generate_text_data` API method that utilizes an LLM to generate data for further data-aware optimization. See the example for details.
- (OpenVINO) Extended support of data-free and data-aware weight compression methods for `nncf.compress_weights()` with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU.
- (OpenVINO) Introduced a new option, `statistics_path`, to cache and reuse statistics for `nncf.compress_weights()`, reducing the time required to find optimal compression configurations. See the TinyLlama example for details.
- (TorchFX, Experimental) Added support for quantization and weight compression of Torch FX models. The compressed models can be directly executed via `torch.compile(compressed_model, backend="openvino")` (see details here). Added an INT8 quantization example. The list of supported features:
  - INT8 quantization with SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via `nncf.quantize()`.
  - Data-free INT8, INT4, and mixed-precision weights compression with `nncf.compress_weights()`.
- (PyTorch, Experimental) Added model tracing and execution pre-post hooks based on TorchFunctionMode.
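A minimal sketch of how the new `backup_mode` and `statistics_path` options might be combined; `ov_model`, `data_samples`, and `transform_fn` are placeholders, and the exact placement of `statistics_path` should be checked against the NNCF documentation for your version:

```python
import nncf

# Placeholder calibration data wrapped for NNCF.
calibration_dataset = nncf.Dataset(data_samples, transform_fn)

compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration_dataset,
    # Keep embeddings, convolutions, and last linear layers in the
    # original floating-point precision instead of a backup INT8 type.
    backup_mode=nncf.BackupMode.NONE,
    # Cache collected statistics on disk and reuse them across runs.
    advanced_parameters=nncf.AdvancedCompressionParameters(
        statistics_path="compression_statistics"
    ),
)
```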
Fixes:
- Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
- Fixed a type mismatch issue for `nncf.quantize_with_accuracy_control()`.
- Fixed the BiasCorrection algorithm for specific branching cases.
- (OpenVINO) Fixed the GPTQ weight compression method for Stable Diffusion models.
- (OpenVINO) Fixed an issue with the variational statistics processing for `nncf.compress_weights()`.
- (PyTorch, ONNX) Aligned the scaled dot product attention pattern quantization setup with OpenVINO.
Improvements:
- Reduced peak memory by 30-50% for data-aware `nncf.compress_weights()` with the AWQ, Scale Estimation, LoRA and mixed-precision algorithms.
- Reduced compression time by 10-20% for `nncf.compress_weights()` with the AWQ algorithm.
- Aligned behavior for ignored subgraphs between different `networkx` versions.
- Extended ignored patterns with the RoPE block for the `nncf.ModelType.TRANSFORMER` scheme.
- (OpenVINO) Extended the ignored scope for the `nncf.ModelType.TRANSFORMER` scheme with the GroupNorm metatype.
- (ONNX) Extended the SE-block ignored pattern variant for `torchvision` mobilenet_v3.
Tutorials:
- Post-Training Optimization of Llama-3.2-11B-Vision Model
- Post-Training Optimization of YOLOv11 Model
- Post-Training Optimization of Whisper in Automatic Speech Recognition with OpenVINO Generate API
- Post-Training Optimization of Pixtral Model
- Post-Training Optimization of LLM ReAct Agent Model
- Post-Training Optimization of CatVTON Model
- Post-Training Optimization of Stable Diffusion v3 Model in Torch FX Representation
Known issues:
- (ONNX) The `nncf.quantize()` method can generate inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm.
Deprecations/Removals:
- Migrated from `setup.py` to `pyproject.toml` for the build and package configuration, in line with the Python packaging standards outlined in PEP 517 and PEP 518. Installation through `setup.py` no longer works. There is no impact on installation from PyPI and Conda.
- Removed support for Python 3.8.
- (PyTorch) The `nncf.torch.create_compressed_model()` function has been deprecated.
Requirements:
- Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
- Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
- Updated NumPy (<2.2.0) version support.
- Updated Ultralytics (8.3.22) version.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@rk119
@zina-cs
v2.13.0
Post-training Quantization:
Features:
- (OpenVINO) Added support for combining GPTQ with the AWQ and Scale Estimation (SE) algorithms in `nncf.compress_weights()` for more accurate weight compression of LLMs. The following combinations with GPTQ are now supported: AWQ+GPTQ+SE, AWQ+GPTQ, GPTQ+SE, GPTQ (see the sketch after this list).
- (OpenVINO) Added the LoRA Correction algorithm to further improve the accuracy of INT4-compressed models on top of other algorithms, such as AWQ and Scale Estimation. It can be enabled via the optional `lora_correction` parameter of the `nncf.compress_weights()` API. The algorithm increases compression time and incurs a negligible model size overhead. Refer to the accuracy/footprint trade-off for different INT4 compression methods.
- (PyTorch) Added an implementation of the experimental Post-training Activation Pruning algorithm. Refer to Activation Sparsity for details.
- Added a memory monitoring tool for logging the memory that a piece of Python code or a script allocates. Refer to NNCF tools for details.
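A sketch of one supported combination (AWQ+GPTQ+SE), using the parameter names documented above; `ov_model`, `data_samples`, and `transform_fn` are placeholders:

```python
import nncf

calibration_dataset = nncf.Dataset(data_samples, transform_fn)

compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration_dataset,  # required by the data-aware algorithms
    awq=True,                     # AWQ
    gptq=True,                    # GPTQ
    scale_estimation=True,        # Scale Estimation (SE)
)

# Alternatively, LoRA Correction can be enabled on top of AWQ and
# Scale Estimation (instead of GPTQ) via lora_correction=True.
```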
Fixes:
- (OpenVINO) Fixed the quantization of Convolution and LSTMSequence operations in cases where some inputs are part of a ShapeOf subgraph.
- (OpenVINO) Fixed issue with the FakeConvert duplication for FP8.
- Fixed a Smooth Quant algorithm issue in the case of incorrect shapes.
- Fixed non-deterministic layer-wise scheduling.
Improvements:
- (OpenVINO) Increased hardware-fused pattern coverage.
- Improved progress bar logic during weights compression for more accurate remaining time estimation.
- Extended Scale Estimation bitness range support for `nncf.compress_weights()`.
- Removed extra logging for the algorithm-generated ignored scope.
Tutorials:
- Post-Training Optimization of Flux.1 Model
- Post-Training Optimization of PixArt-α Model
- Post-Training Optimization of InternVL2 Model
- Post-Training Optimization of Qwen2Audio Model
- Post-Training Optimization of NuExtract Model
- Post-Training Optimization of MiniCPM-V2 Model
Compression-aware training:
Fixes:
- (PyTorch) Fixed some scenarios of NNCF patching interfering with `torch.compile`.
Requirements:
- Updated PyTorch (2.4.0) and Torchvision (0.19.0) versions.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@rk119
v2.12.0
Post-training Quantization:
Features:
- (OpenVINO, PyTorch, ONNX) Excluded comparison operators from the quantization scope for `nncf.ModelType.TRANSFORMER` (see the sketch after this list).
- (OpenVINO, PyTorch) Changed the representation of symmetrically quantized weights from an unsigned integer with a fixed zero-point to a signed data type without a zero-point in the `nncf.compress_weights()` method.
- (OpenVINO) Extended pattern support of the AWQ algorithm as part of `nncf.compress_weights()`. This allows applying AWQ to a wider range of models.
- (OpenVINO) Introduced the `nncf.CompressWeightsMode.E2M1` mode option of `nncf.compress_weights()` as the new MXFP4 precision (Experimental).
- (OpenVINO) Added support for models with BF16 precision in the `nncf.quantize()` method.
- (PyTorch) Added quantization support for `torch.addmm`.
- (PyTorch) Added quantization support for `torch.nn.functional.scaled_dot_product_attention`.
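For reference, a minimal sketch of quantizing a transformer model so that the exclusions above (comparison operators, extended ignored patterns) apply automatically; `model`, `data_samples`, and `transform_fn` are placeholders:

```python
import nncf

calibration_dataset = nncf.Dataset(data_samples, transform_fn)

# With model_type=TRANSFORMER, comparison operators are now excluded
# from the quantization scope automatically.
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
)
```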
Fixes:
- (OpenVINO, PyTorch, ONNX) Fixed Fast-/BiasCorrection algorithms with correct support of transposed MatMul layers.
- (OpenVINO) Fixed `nncf.IgnoredScope()` functionality for models with the If operation.
- (OpenVINO) Fixed patterns with PReLU operations.
- Fixed a runtime error when importing NNCF without the Matplotlib package.
Improvements:
- Reduced the amount of memory required for applying `nncf.compress_weights()` to OpenVINO models.
- Improved logging in the case of a non-empty `nncf.IgnoredScope()`.
Tutorials:
- Post-Training Optimization of Stable Audio Open Model
- Post-Training Optimization of Phi3-Vision Model
- Post-Training Optimization of MiniCPM-V2 Model
- Post-Training Optimization of Jina CLIP Model
- Post-Training Optimization of Stable Diffusion v3 Model
- Post-Training Optimization of HunyuanDIT Model
- Post-Training Optimization of DDColor Model
- Post-Training Optimization of DynamiCrafter Model
- Post-Training Optimization of DepthAnythingV2 Model
- Post-Training Optimization of Kosmos-2 Model
Compression-aware training:
Fixes:
- (PyTorch) Fixed an issue with wrapping for operators without a patched state.
Requirements:
- Updated TensorFlow (2.15) version. This version requires Python 3.9-3.11.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@Lars-Codes
v2.11.0
Post-training Quantization:
Features:
- (OpenVINO) Added the Scale Estimation algorithm for 4-bit data-aware weights compression. The optional scale_estimation parameter was introduced to nncf.compress_weights() and can be used to minimize accuracy degradation of compressed models; note that this algorithm increases the compression time. See the sketch after this list.
- (OpenVINO) Added GPTQ algorithm for 8/4-bit data-aware weights compression, supporting INT8, INT4, and NF4 data types. The optional gptq parameter was introduced to nncf.compress_weights() to enable the GPTQ algorithm.
- (OpenVINO) Added support for models with BF16 weights in the weights compression method, nncf.compress_weights().
- (PyTorch) Added support for quantization and weight compression of custom modules.
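A sketch of data-aware compression with the new scale_estimation option (gptq=True could be passed analogously); `ov_model`, `data_samples`, and `transform_fn` are placeholders:

```python
import nncf

calibration_dataset = nncf.Dataset(data_samples, transform_fn)

compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=calibration_dataset,
    scale_estimation=True,  # improves accuracy, increases compression time
)
```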
Fixes:
- (OpenVINO) Fixed incorrect determination of nodes with bias in the Fast-/BiasCorrection and ChannelAlignment algorithms.
- (OpenVINO, PyTorch) Fixed incorrect behaviour of nncf.compress_weights() when an already compressed model is passed as input.
- (OpenVINO, PyTorch) Fixed SmoothQuant algorithm to work with Split ports correctly.
Improvements:
- (OpenVINO) Aligned the resulting compression subgraphs of nncf.compress_weights() across different FP precisions.
- Aligned the 8-bit scheme for the NPU target device with the CPU one.
Examples:
- (OpenVINO, ONNX) Updated ignored scope for YOLOv8 examples utilizing a subgraphs approach.
Tutorials:
- Post-Training Optimization of Stable Video Diffusion Model
- Post-Training Optimization of YOLOv10 Model
- Post-Training Optimization of LLaVA Next Model
- Post-Training Optimization of S3D MIL-NCE Model
- Post-Training Optimization of Stable Cascade Model
Compression-aware training:
Features:
- (PyTorch) The nncf.quantize method is now the recommended path for quantization initialization in Quantization-Aware Training.
- (PyTorch) The placement of compression modules in the model can now be serialized and restored with new API functions: compressed_model.nncf.get_config() and nncf.torch.load_from_config (see the sketch after this list). Documentation for saving/loading of a quantized model is available, and the ResNet18 example was updated to use the new API.
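A sketch of the new save/load flow, assuming a PyTorch `model`, a fresh uncompressed `fresh_model` instance, a `calibration_dataset`, and an `example_input` suitable for tracing; the exact load_from_config signature should be checked against the documentation:

```python
import torch
import nncf
import nncf.torch

# Quantize, fine-tune, then persist the compression module placement
# alongside the weights.
quantized_model = nncf.quantize(model, calibration_dataset)
checkpoint = {
    "nncf_config": quantized_model.nncf.get_config(),
    "state_dict": quantized_model.state_dict(),
}
torch.save(checkpoint, "qat_checkpoint.pth")

# Later: rebuild the compressed model on a fresh model instance.
checkpoint = torch.load("qat_checkpoint.pth")
restored_model = nncf.torch.load_from_config(
    fresh_model, checkpoint["nncf_config"], example_input
)
restored_model.load_state_dict(checkpoint["state_dict"])
```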
Fixes:
- (PyTorch) Fixed compatibility with torch.compile.
Improvements:
- (PyTorch) Base parameters were extended for the EvolutionOptimizer (LeGR algorithm part).
- (PyTorch) Improved wrapping for parameters which are not tensors.
Examples:
- (PyTorch) Added an example for STFPM model from Anomalib.
Tutorials:
Deprecations/Removals:
- Removed extra dependencies for installing backends from setup.py (like [torch], [tf], [onnx] and [openvino]).
- Removed openvino-dev dependency.
Requirements:
- Updated PyTorch (2.3.0) and Torchvision (0.18.0) versions.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@DaniAffCH
@UsingtcNower
@anzr299
@AdiKsOnDev
@Viditagarwal7479
@truhinnm
v2.10.0
Post-training Quantization:
Features:
- Introduced subgraph-defining functionality for the nncf.IgnoredScope() option (see the sketch after this list).
- Introduced limited support for batch sizes greater than 1. The MobileNetV2 PyTorch example was updated with batch support.
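A sketch of the new subgraph-based ignored scope; the node names are hypothetical and `model`/`calibration_dataset` are placeholders:

```python
import nncf

# Everything between the listed input and output nodes is excluded
# from quantization as a single subgraph.
ignored_scope = nncf.IgnoredScope(
    subgraphs=[
        nncf.Subgraph(
            inputs=["up_sampling/concat_0"],     # hypothetical node name
            outputs=["post_process/reshape_1"],  # hypothetical node name
        )
    ]
)
quantized_model = nncf.quantize(
    model, calibration_dataset, ignored_scope=ignored_scope
)
```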
Fixes:
- Fixed an issue with the absence of the nncf.OverflowFix parameter in some scenarios.
- Aligned the list of correctable layers for the FastBiasCorrection algorithm between the PyTorch, OpenVINO and ONNX backends.
- Fixed an issue with combining nncf.QuantizationMode parameters.
- Fixed the MobileNetV2 (PyTorch, ONNX, OpenVINO) examples for the Windows platform.
- (OpenVINO) Fixed Anomaly Classification example for the Windows platform.
- (PyTorch) Fixed bias shift magnitude calculation for fused layers.
- (OpenVINO) Fixed removing the ShapeOf graph which led to an error in the nncf.quantize_with_accuracy_control() method.
Improvements:
- OverflowFix, AdvancedSmoothQuantParameters and AdvancedBiasCorrectionParameters were exposed in the nncf.* namespace.
- (OpenVINO, PyTorch) Introduced scale compression to FP16 for weights in nncf.compress_weights() method, regardless of model weights precision.
- (PyTorch) Modules that NNCF inserted were excluded from parameter tracing.
- (OpenVINO) Extended the list of correctable layers for the BiasCorrection algorithm.
- (ONNX) Aligned BiasCorrection algorithm behaviour with OpenVINO in specific cases.
Tutorials:
- Post-Training Optimization of PhotoMaker Model
- Post-Training Optimization of Stable Diffusion XL Model
- Post-Training Optimization of KerasCV Stable Diffusion Model
- Post-Training Optimization of Paint By Example Model
- Post-Training Optimization of aMUSEd Model
- Post-Training Optimization of InstantID Model
- Post-Training Optimization of LLaVA Next Model
- Post-Training Optimization of AnimateAnyone Model
- Post-Training Optimization of YOLOv8-OBB Model
- Post-Training Optimization of LLM Agent
Compression-aware training:
Features:
- (PyTorch) The nncf.quantize method may now be used as quantization initialization for Quantization-Aware Training (see the sketch after this list). Added a ResNet18-based example demonstrating the transition from Post-Training Quantization to Quantization-Aware Training.
- (PyTorch) Introduced extractors for the fused Convolution, Batch-/GroupNorm, and Linear functions.
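A sketch of the PTQ-to-QAT transition described above, with a standard PyTorch fine-tuning loop; `model`, `data_samples`, `transform_fn`, and `train_loader` are placeholders:

```python
import torch
import nncf

calibration_dataset = nncf.Dataset(data_samples, transform_fn)

# Post-Training Quantization serves as the initialization for QAT.
quantized_model = nncf.quantize(model, calibration_dataset)

# Standard fine-tuning loop to recover accuracy.
optimizer = torch.optim.SGD(quantized_model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
for inputs, targets in train_loader:  # placeholder DataLoader
    optimizer.zero_grad()
    loss = criterion(quantized_model(inputs), targets)
    loss.backward()
    optimizer.step()
```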
Fixes:
- (PyTorch) Fixed an issue with the apply_args_defaults function.
- (PyTorch) Fixed dtype handling for the compressed torch.nn.Parameter.
- (PyTorch) Fixed is_shared parameter propagation.
Improvements:
- (PyTorch) Updated command creation behaviour to reduce the number of adapters.
- (PyTorch) Added the option to insert points for models wrapped with replace_modules=False.
Deprecations/Removals:
- (PyTorch) Removed the binarization algorithm.
- NNCF installation via pip install nncf[] option is now deprecated.
Requirements:
- Updated PyTorch (2.2.1) and CUDA (12.1) versions.
- Updated ONNX (1.16.0) and ONNXRuntime (1.17.1) versions.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@Candyzorua
@clinty
@UsingtcNower
@DaniAffCH
v2.9.0
Post-training Quantization:
Features:
- (OpenVINO) Added a modified AWQ algorithm for 4-bit data-aware weights compression. This algorithm is applied only to the pattern `MatMul->Multiply->MatMul`. The `awq` optional parameter has been added to `nncf.compress_weights()` and can be used to minimize accuracy degradation of compressed models (note that this option increases the compression time).
- (ONNX) Introduced support for the ONNX backend in the `nncf.quantize_with_accuracy_control()` method. Users can now perform quantization with accuracy control for `onnx.ModelProto` (see the sketch after this list). By leveraging this feature, users can enhance the accuracy of quantized models while minimizing performance impact.
- (ONNX) Added an example based on the YOLOv8n-seg model demonstrating the usage of quantization with accuracy control for the ONNX backend.
- (PyTorch) Added the SmoothQuant algorithm for the PyTorch backend in `nncf.quantize()`.
- (OpenVINO) Added an example with hyperparameter tuning for the TinyLLama model.
- Introduced `nncf.AdvancedAccuracyRestorerParameters`.
- Introduced the `subset_size` option for `nncf.compress_weights()`.
- Introduced `TargetDevice.NPU` as the replacement for `TargetDevice.VPU`.
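A sketch of accuracy-aware quantization for an ONNX model; the validation function contract, dataset placeholders, and max_drop value are illustrative, so consult the NNCF documentation for the exact signature:

```python
import onnx
import nncf

model = onnx.load("model.onnx")  # placeholder path
calibration_dataset = nncf.Dataset(calibration_samples, transform_fn)
validation_dataset = nncf.Dataset(validation_samples, transform_fn)

def validate_fn(model_for_inference, dataset):
    # Placeholder: run the model on the dataset and return the metric.
    ...

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate_fn,
    max_drop=0.01,  # tolerate at most this metric drop (illustrative)
)
```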
Fixes:
- Fixed an API Enums serialization/deserialization issue.
- Fixed an issue with required arguments for the `revert_operations_to_floating_point_precision` method.
Improvements:
- (ONNX) Aligned statistics collection with OpenVINO and PyTorch backends.
- Extended `nncf.compress_weights()` with Convolution & Embeddings compression in order to reduce memory footprint.
Deprecations/Removals:
- (OpenVINO) Removed outdated examples with `nncf.quantize()` for BERT and YOLOv5 models.
- (OpenVINO) Removed the outdated example with `nncf.quantize_with_accuracy_control()` for the SSD MobileNetV1 FPN model.
- (PyTorch) Deprecated the `binarization` algorithm.
- Removed the Post-training Optimization Tool as an OpenVINO backend.
- Removed Dockerfiles.
- `TargetDevice.VPU` was replaced by `TargetDevice.NPU`.
Tutorials:
- Post-Training Optimization of Stable Diffusion v2 Model
- Post-Training Optimization of DeciDiffusion Model
- Post-Training Optimization of DepthAnything Model
- Post-Training Optimization of Stable Diffusion ControlNet Model
Compression-aware training:
Fixes
- (PyTorch) Fixed issue with
NNCFNetworkInterface.get_clean_shallow_copy
missed arguments.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@AishwaryaDekhane
@UsingtcNower
@Om-Doiphode
v2.8.1
Post-training Quantization:
Bugfixes:
- (Common) Fixed an issue with `nncf.compress_weights()` to avoid overflows on 32-bit Windows systems.
- (Common) Fixed a performance issue with `nncf.compress_weights()` on LLama models.
- (Common) Fixed the `nncf.quantize_with_accuracy_control` pipeline with the `tune_hyperparams=True` option enabled.
- (OpenVINO) Fixed an issue for stateful LLM models and added state restoring after inference for them.
- (PyTorch) Fixed an issue with `nncf.compress_weights()` for LLM models when executing `is_floating_point` with tracing.
v2.8.0
Post-training Quantization:
Breaking changes:
- The `nncf.quantize` signature has been changed to add `mode: Optional[nncf.QuantizationMode] = None` as its 3rd argument, between the original `calibration_dataset` and `preset` arguments.
- (Common) `nncf.common.quantization.structs.QuantizationMode` has been renamed to `nncf.common.quantization.structs.QuantizationScheme`.
General:
- (OpenVINO) Changed default OpenVINO opset from 9 to 13.
Features:
- (OpenVINO) Added 4-bit data-aware weights compression. For that, the `dataset` optional parameter has been added to `nncf.compress_weights()` and can be used to minimize accuracy degradation of compressed models (note that this option increases the compression time).
- (PyTorch) Added support for PyTorch models with shared weights and custom PyTorch modules in `nncf.compress_weights()`. The weights compression algorithm for PyTorch models is now based on tracing the model graph. The `dataset` parameter is now required in `nncf.compress_weights()` for the compression of PyTorch models.
- (Common) Renamed `nncf.CompressWeightsMode.INT8` to `nncf.CompressWeightsMode.INT8_ASYM` and introduced `nncf.CompressWeightsMode.INT8_SYM`, which can be efficiently used with dynamic 8-bit quantization of activations. The original `nncf.CompressWeightsMode.INT8` enum value is now deprecated.
- (OpenVINO) Added support for quantizing the ScaledDotProductAttention operation from OpenVINO opset 13.
- (OpenVINO) Added FP8 quantization support via the `nncf.QuantizationMode.FP8_E4M3` and `nncf.QuantizationMode.FP8_E5M2` enum values, invoked by passing one of these values as the optional `mode` argument to `nncf.quantize` (see the sketch after this list). Currently, OpenVINO supports inference of FP8-quantized models in reference mode with no performance benefits, which can be used for accuracy projections.
- (Common) Post-training Quantization with Accuracy Control - `nncf.quantize_with_accuracy_control()` - has been extended with the `restore_mode` optional parameter to revert weights to INT8 instead of the original precision. This parameter helps to reduce the size of the quantized model and improves its performance. By default, it is disabled and model weights are reverted to the original precision in `nncf.quantize_with_accuracy_control()`.
- (Common) Added an `all_layers: Optional[bool] = None` argument to `nncf.compress_weights` to indicate whether embeddings and last layers of the model should be compressed to the primary precision. This is relevant to 4-bit quantization only.
- (Common) Added a `sensitivity_metric: Optional[nncf.parameters.SensitivityMetric] = None` argument to `nncf.compress_weights` for finer control over the sensitivity metric used to assign quantization precision to layers. It defaults to the weight quantization error if a dataset is not provided for weight compression, and to the maximum variance of the layers' inputs multiplied by the inverted 8-bit quantization noise if a dataset is provided. By default, the backup precision is assigned to the embeddings and last layers.
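A sketch of the new FP8 mode, useful for accuracy projections since OpenVINO currently runs FP8 models in reference mode; `model`, `data_samples`, and `transform_fn` are placeholders:

```python
import nncf

calibration_dataset = nncf.Dataset(data_samples, transform_fn)

quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    mode=nncf.QuantizationMode.FP8_E4M3,  # or nncf.QuantizationMode.FP8_E5M2
)
```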
Fixes:
- (OpenVINO) Models with embeddings (e.g. `gpt-2`, `stable-diffusion-v1-5`, `stable-diffusion-v2-1`, `opt-6.7b`, `falcon-7b`, `bloomz-7b1`) are now more accurately quantized.
- (PyTorch) `nncf.strip(..., do_copy=True)` now actually returns a deepcopy (stripped) of the model object.
- (PyTorch) Post-hooks can now be set up on operations that return `torch.return_type` (such as `torch.max`).
- (PyTorch) Improved dynamic graph tracing for various tensor operations from the `torch` namespace.
- (PyTorch) More robust handling of models with disjoint traced graphs when applying PTQ.
Improvements:
- Reformatted the tutorials section in the top-level `README.md` for better readability.
Deprecations/Removals:
- (Common) The original `nncf.CompressWeightsMode.INT8` enum value is now deprecated.
- (PyTorch) The Git patch for integration with the HuggingFace `transformers` repository is marked as deprecated and will be removed in a future release. Developers are advised to use optimum-intel instead.
- Dockerfiles in the NNCF Git repository are deprecated and will be removed in a future release.
v2.7.0
Post-training Quantization:
Features:
- (OpenVINO) Added support for data-free 4-bit weights compression through the NF4 and INT4 data types (`compress_weights(…)` pipeline); see the sketch after this list.
- (OpenVINO) Added support for IF operation quantization.
- (OpenVINO) Added `dump_intermediate_model` parameter support for AccuracyAwareAlgorithm (`quantize_with_accuracy_control(…)` pipeline).
- (OpenVINO) Added support for the SmoothQuant and ChannelAlignment algorithms in the HyperparameterTuner algorithm (`quantize_with_tune_hyperparams(…)` pipeline).
- (PyTorch) Post-training Quantization is now supported with the `quantize(…)` pipeline and the common implementation of quantization algorithms. Deprecated the `create_compressed_model()` method for Post-training Quantization.
- Added new types (AvgPool, GroupNorm, LayerNorm) to the ignored scope for the `ModelType.Transformer` scheme.
- `QuantizationPreset.Mixed` was set as the default for the `ModelType.Transformer` scheme.
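A sketch of data-free NF4 compression; no calibration dataset is needed, and the ratio and group_size values are illustrative rather than recommendations:

```python
import nncf

compressed_model = nncf.compress_weights(
    ov_model,  # placeholder OpenVINO model
    mode=nncf.CompressWeightsMode.NF4,
    ratio=0.9,       # compress 90% of the weights to NF4, the rest to INT8
    group_size=128,  # group-wise quantization granularity
)
```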
Fixes:
- (OpenVINO, ONNX, PyTorch) Aligned/added patterns between backends (SE block, MVN layer, multiple activations, etc.) to restore performance/metrics.
- Fixed patterns for `ModelType.Transformer` to align with the quantization scheme.
Improvements:
- Improved UX with the new progress bar for pipeline, new exceptions, and .dot graph visualization updates.
- (OpenVINO) Optimized the WeightsCompression algorithm (`compress_weights(…)` pipeline) execution time for LLM quantization; added ignored scope support.
- (OpenVINO) Optimized the AccuracyAwareQuantization algorithm execution time with a multi-threaded approach to calculating the ranking score (`quantize_with_accuracy_control(…)` pipeline).
- (OpenVINO) Added the extract_ov_subgraph tool for large IR subgraph extraction.
- (ONNX) Optimized quantization pipeline (up to 1.15x speed up).
Tutorials:
- Post-Training Optimization of BLIP Model
- Post-Training Optimization of DeepFloyd IF Model
- Post-Training Optimization of Grammatical Error Correction Model
- Post-Training Optimization of Dolly 2.0 Model
- Post-Training Optimization of Massively Multilingual Speech Model
- Post-Training Optimization of OneFormer Model
- Post-Training Optimization of InstructPix2Pix Model
- Post-Training Optimization of LLaVA Model
- Post-Training Optimization of Latent Consistency Model
- Post-Training Optimization of Distil-Whisper Model
- Post-Training Optimization of FastSAM Model
Known issues:
- (ONNX) The `quantize(...)` method can generate inaccurate INT8 results for models with a BatchNormalization layer that contains biases. To get the best accuracy, use the `do_constant_folding=True` option during export from PyTorch to ONNX.
Compression-aware training:
Fixes:
- (PyTorch) Fixed Hessian trace calculation to solve issue #2155.
Requirements:
- Updated PyTorch version (2.1.0).
- Updated NumPy version (<1.27).
Deprecations/Removals:
- (PyTorch) Removed legacy external quantizer storage names.
- (PyTorch) Removed torch < 2.0 version support.