Releases: openvinotoolkit/nncf
v2.6.0
Post-training Quantization:
Features:
- Added `CPU_SPR` device type support.
- Added quantizer scales unification.
- Added quantization scheme for the ReduceSum operation.
- Added new types (ReduceL2, ReduceSum, Maximum) to the ignored scope for `ModelType.Transformer`.
- (OpenVINO) Added SmoothQuant algorithm.
- (OpenVINO) Added ChannelAlignment algorithm.
- (OpenVINO) Added HyperparameterTuner algorithm.
- (PyTorch) Added FastBiasCorrection algorithm support.
- (OpenVINO, ONNX) Added embedding weights quantization.
- (OpenVINO, PyTorch) Added new `compress_weights` method that provides data-free INT8 weights compression (see the sketch after this list).
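A minimal sketch of the data-free weight compression flow, assuming an OpenVINO IR model on disk; the file paths are placeholders and the PyTorch entry point is analogous:

```python
import openvino.runtime as ov
import nncf

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# Data-free INT8 weight compression: no calibration dataset is required.
compressed_model = nncf.compress_weights(model)

ov.serialize(compressed_model, "model_int8_weights.xml")  # placeholder output path
```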
Fixes:
- Fixed detection of decomposed post-processing in models.
- Multiple fixes (new patterns, bugfixes, etc.) to address issue #1936.
- Fixed model reshaping during quantization to keep the original model shape.
- (OpenVINO) Added support for sequential model quantization.
- (OpenVINO) Fixed in-place statistics cast to support empty dimensions.
- (OpenVINO, ONNX) Fixed quantization of the MatMul operation with weights rank > 2.
- (OpenVINO, ONNX) Fixed BiasCorrection algorithm to enable CLIP model quantization.
Improvements:
- Optimized the `quantize(...)` pipeline (up to 4.3x speed-up in total).
- Optimized the `quantize_with_accuracy_control(...)` pipeline (up to 8x speed-up for the 122-quantizing-model-with-accuracy-control notebook).
- Optimized general statistics collection (up to 1.2x speed-up for the ONNX backend).
- Separated ignored patterns from the fused patterns scheme (and added multiple new patterns).
Tutorials:
- Post-Training Optimization of Segment Anything Model.
- Post-Training Optimization of CLIP Model.
- Post-Training Optimization of ImageBind Model.
- Post-Training Optimization of Whisper Model.
- Post-Training Optimization with accuracy control.
Compression-aware training:
Features:
- Added a shape pruning processor for the BootstrapNAS algorithm.
- Added a KD loss for the BootstrapNAS algorithm.
- Added a `validate_scopes` parameter to the NNCF configuration.
- (PyTorch) Added PyTorch 2.0 support.
- (PyTorch) Added a `.strip()` option to the API.
- (PyTorch) Enabled the bfloat16 data type for quantization kernels.
- (PyTorch) Quantized models can now be `torch.jit.trace`d without calling `.strip()` (see the sketch after this list).
- (PyTorch) Added support for an overridden `forward` instance attribute on model objects passed into `create_compressed_model`.
- (TensorFlow) Added TensorFlow 2.12 support.
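A minimal sketch of the new tracing behaviour, assuming a torchvision ResNet-18 and a quantization-only NNCF config (both illustrative); in practice, calibration data is usually registered as well for quantizer range initialization:

```python
import torch
import torchvision
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

# Minimal quantization config; register_default_init_args(...) is typically used too,
# to supply calibration data for quantizer range initialization.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},
})

compression_ctrl, quantized_model = create_compressed_model(
    torchvision.models.resnet18(), nncf_config
)

# As of v2.6.0 the quantized model can be traced directly, without calling .strip() first.
traced = torch.jit.trace(quantized_model, torch.randn(1, 3, 224, 224))
```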
Fixes:
- (PyTorch) Fixed padding adjustment issue in the elastic kernel to work with the different active kernel sizes.
- (PyTorch) Fixed torch graph tracing for the case when tensors belonging to parallel edges are interleaved in the order of the tensor arguments.
- (PyTorch) Fixed recurrent node matching (LSTM, GRU cells) with a stricter rule to avoid adding unnecessary nodes to the ignored scope.
- (PyTorch) Fixed the `torch.jit.script` wrapper so that user-side exception handling during `torch.jit.script` invocation does not cause NNCF to be permanently disabled.
- (PyTorch, TensorFlow) Adjusted the quantizer propagation algorithm to check whether quantizer propagation will result in output quantization.
- (PyTorch) Added a redefined `__class__` method for ProxyModule that avoids errors when calling `super()` in the forward method.
Deprecations/Removals:
- (PyTorch) Removed the deprecated `NNCFNetwork.__getattr__` and `NNCFNetwork.get_nncf_wrapped_model` methods.
Requirements:
- Updated PyTorch version (2.0.1).
- Updated Tensorflow version (2.12.0).
v2.5.0
Post-training Quantization:
Features:
- Official release of OpenVINO framework support.
- Ported NNCF OpenVINO backend to use the nGraph representation of OpenVINO models.
- Changed dependencies of the NNCF OpenVINO backend. It now depends on the `openvino` package and not on the `openvino-dev` package.
- Added GRU/LSTM quantization support.
- Added quantizer scales unification.
- Added support for models with 3D and 5D Depthwise convolution.
- Added FP16 OpenVINO models support.
- Added the `"overflow_fix"` parameter (for the `quantize(...)` & `quantize_with_accuracy_control(...)` methods). It improves accuracy of the optimized model on affected devices. More details in the Quantization section.
- (OpenVINO) Added support for in-place statistics collection (reduces the memory footprint during optimization).
- (OpenVINO) Added the Quantization with accuracy control algorithm (see the sketch after this list).
- (OpenVINO) Added YOLOv8 examples for the `quantize(...)` & `quantize_with_accuracy_control(...)` methods.
- (PyTorch) Added the min-max quantization algorithm as experimental.
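A minimal sketch of quantization with accuracy control on an OpenVINO model; the dataset wiring, validation metric, and `max_drop` value are illustrative placeholders, not part of the release notes:

```python
import openvino.runtime as ov
import nncf

core = ov.Core()
model = core.read_model("model.xml")      # placeholder path

calibration_items = [...]                 # framework-native data samples (placeholder)
transform_fn = lambda item: item          # adapt one sample to the model inputs
calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
validation_dataset = nncf.Dataset(calibration_items, transform_fn)

def validate(model, validation_data) -> float:
    # Placeholder: compile/run the model on validation_data and return the metric value.
    return 0.0

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most a 0.01 absolute metric drop
)
```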
Fixes:
- Fixed `ignored_scope` attribute behaviour for weights. Weighted layers are now correctly excluded from the optimization scope.
- (ONNX) The ONNX opset version is now checked in `nncf.quantize(...)`. Models with opset < 13 are now optimized correctly with per-tensor quantization.
Improvements:
- Improved the statistics collection process (weight statistics are collected only once).
- (PyTorch, OpenVINO, ONNX) Introduced unified quantizer parameters calculation.
Known issues:
- The `quantize(...)` method can generate inaccurate INT8 results for models with a DenseNet-like architecture. Use `quantize_with_accuracy_control(...)` in such cases.
- The `quantize(...)` method can hang on models with a transformer architecture when the optional `fast_bias_correction` parameter is set to False. Do not set it to False, or use `quantize_with_accuracy_control(...)` in such cases.
- The `quantize(...)` method can generate inaccurate INT8 results for models with a MobileNet-like architecture on non-VNNI machines.
Compression-aware training:
New Features:
- Introduced automated structured pruning algorithm for JPQD with support for BERT, Wave2VecV2, Swin, ViT, DistilBERT, CLIP, and MobileBERT models.
- Added `nncf.common.utils.patcher.Patcher` - this class can be used to patch methods on live PyTorch model objects with wrappers such as `nncf.torch.dynamic_graph.context.no_nncf_trace` when doing so in the model code is not possible (e.g. if the model comes from an external library package).
- Compression controllers of the `nncf.api.compression.CompressionAlgorithmController` class now have a `.strip()` method that returns the compressed model object with as many custom NNCF additions removed as possible, while preserving the functioning of the model object as a compressed model (see the sketch after this list).
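A minimal sketch of the controller-level `.strip()` usage, assuming a torchvision model and a sparsity config chosen purely for illustration:

```python
import torchvision
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "magnitude_sparsity"},
})
compression_ctrl, compressed_model = create_compressed_model(
    torchvision.models.resnet18(), nncf_config
)

# ... fine-tune compressed_model here ...

# Returns the model with as many NNCF-specific additions removed as possible,
# while keeping its behaviour as a compressed model.
stripped_model = compression_ctrl.strip()
```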
Fixes:
- Fixed statistics computation for pruned layers.
- (PyTorch) Fixed traced tensors to support the YOLOv8 model from Ultralytics.
Improvements:
- Extended attributes (`transpose`/`permute`/`getitem`) for the pruning node selector.
- NNCFNetwork was refactored from a wrapper approach to a mixin-like approach.
- Added average-pool-3d-like ops to the pruning mask.
- Added Conv3d to the overflow fix.
- `nncf.set_log_file(...)` can now be used to set the location of the NNCF log file (see the sketch after this list).
- (PyTorch) Added support for pruning of the `torch.nn.functional.pad` operation.
- (PyTorch) Added `torch.baddbmm` as an alias for the matmul metatype for quantization purposes.
- (PyTorch) Added a config file for ResNet18 accuracy-aware pruning + quantization on CIFAR10.
- (PyTorch) Fixed JIT-traceable PyTorch models with internal patching.
- (PyTorch) Added `__matmul__` magic functions to the list of patched ops (for SwinTransformer by Microsoft).
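A one-line sketch of the new logging helper; the log path is a placeholder:

```python
import nncf

# Redirect NNCF's log output to a file of your choice.
nncf.set_log_file("/tmp/nncf.log")
```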
Requirements:
- Updated ONNX version (1.13)
- Updated Tensorflow version (2.11)
General changes:
- Added Windows support for NNCF.
v2.4.0
Target version updates:
- Bump target framework versions to PyTorch 1.13.1, TensorFlow 2.8.x, ONNX 1.12, ONNXRuntime 1.13.1
- Increased target HuggingFace transformers version for the integration patch to 4.23.1
Features:
- Official release of the ONNX framework support. NNCF may now be used for post-training quantization (PTQ) on ONNX models. Added an example script demonstrating ONNX post-training quantization on MobileNetV2.
- Preview release of OpenVINO framework support. NNCF may now be used for post-training quantization on OpenVINO models. Added an example script demonstrating OpenVINO post-training quantization on MobileNetV2. `pip install nncf[openvino]` will install NNCF with the required OV framework dependencies.
- Common post-training quantization API across the supported framework model formats (PyTorch, TensorFlow, ONNX, OpenVINO IR) via the `nncf.quantize(...)` function. The parameter set of the function is the same for all frameworks - the actual framework-specific implementation is dispatched based on the type of the model object argument (see the sketch after this list).
- (PyTorch, TensorFlow) Improved the adaptive compression training functionality to reduce effective training time.
- (ONNX) Post-processing nodes are now automatically excluded from quantization.
- (PyTorch - Experimental) Joint Pruning, Quantization and Distillation for Transformers enabled for certain models from the HuggingFace `transformers` repo. See the description of the movement pruning involved in JPQD for details.
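A minimal sketch of the common `nncf.quantize(...)` entry point, shown here with an ONNX model; the model path, data items, and transform are illustrative placeholders:

```python
import onnx
import nncf

model = onnx.load("mobilenet_v2.onnx")  # could equally be a PyTorch, TensorFlow, or OpenVINO model

calibration_items = [...]               # a few hundred representative samples (placeholder)

def transform_fn(item):
    # Adapt one sample to the model's input format, e.g. a dict of input-name -> array.
    return item

calibration_dataset = nncf.Dataset(calibration_items, transform_fn)

# The same call works for all supported model formats; NNCF dispatches to the
# matching backend based on the type of the model object.
quantized_model = nncf.quantize(model, calibration_dataset)
```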
Bugfixes:
- Fixed a division by zero if every operation is added to ignored scope
- Improved logging output, cutting down on the number of messages being output at the standard `logging.INFO` log level.
- Fixed FLOPS calculation for linear filters - this impacts existing models that were pruned with a FLOPS target.
- "chunk" and "split" ops are correctly handled during pruning.
- Linear layers may now be pruned by input and output independently.
- Matmul-like operations and subsequent arithmetic operations are now treated as a fused pattern.
- (PyTorch) Fixed a rare condition with accumulator overflow in CUDA quantization kernels, which led to CUDA runtime errors and NaN values appearing in quantized tensors.
- (PyTorch) The `transformers` integration patch now allows exporting to ONNX during training, and not only at the end of it.
- (PyTorch) `torch.nn.utils.weight_norm` weights are now detected correctly.
- (PyTorch) Exporting a model with sparsity or pruning no longer leads to the weights of the original in-memory model object being hard-set to 0.
- (PyTorch - Experimental) Improved the automatic search of blocks to skip within the NAS algorithm - overlapping blocks are now correctly filtered.
- (PyTorch, TensorFlow) Various bugs and issues with compression training were fixed.
- (TensorFlow) Fixed an error with `"num_bn_adaptation_samples": 0` in the config leading to a `TypeError` during quantization algorithm initialization.
- (ONNX) The temporary model file is no longer saved on disk.
- (ONNX) Depthwise convolutions are now quantizable in per-channel mode.
- (ONNX) Improved the working time of PTQ by optimizing the calls to ONNX shape inferencing.
Breaking changes:
- Fused patterns will be excluded from quantization via `ignored_scopes` only if the top-most node in data-flow order matches against `ignored_scopes`.
- NNCF config's `"ignored_scopes"` and `"target_scopes"` are now strictly checked to match at least one node in the model graph instead of silently ignoring unmatched entries.
- Calling `setup.py` directly to install NNCF is deprecated and no longer guaranteed to work.
- Importing the NNCF logger as `from nncf.common.utils.logger import logger as nncf_logger` is deprecated - use `from nncf import nncf_logger` instead (see the sketch after this list).
- `pruning_rate` is renamed to `pruning_level` in pruning compression controllers.
- (ONNX) Removed CompressionBuilder. Excluded examples of NNCF for ONNX with the CompressionBuilder API.
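A minimal sketch of the logger import change described above; the logged message is only illustrative:

```python
# Deprecated as of v2.4.0:
#   from nncf.common.utils.logger import logger as nncf_logger
# Recommended:
from nncf import nncf_logger

nncf_logger.info("NNCF logging now goes through the package-level logger.")
```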
v2.3.0
New features
- (ONNX) PTQ API support for ONNX.
- (ONNX) Added PTQ examples for ONNX in image classification, object detection, and semantic segmentation.
- (PyTorch) Added `BootstrapNAS` to find high-performing sub-networks via super-network optimization.
Bugfixes
- (PyTorch) The initial quantized model is now returned when retraining fails to find the best checkpoint.
- (Experimental) Fixed weight initialization for `ONNXGraph` and `MinMaxQuantization`.
v2.2.0
New features
- Pre-production quality
- (TensorFlow) Added TensorFlow 2.5.x support.
- (TensorFlow) The `SubclassedConverter` class was added to create `NNCFGraph` for `tf.Graph` Keras models.
- (TensorFlow) Added `TFOpLambda` layer support with `TFModelConverter`, `TFModelTransformer`, and `TFOpLambdaMetatype`.
- (TensorFlow) Added patterns from `MatMul` and `Conv2D` to `BiasAdd`, and `Metatypes` of TensorFlow operations with weights (`TFOpWithWeightsMetatype`).
- (PyTorch, TensorFlow) Added pruning for `Reshape` and `Linear` as `ReshapePruningOp` and `LinearPruningOp`.
- (PyTorch) Added a mixed-precision quantization config with HAWQ for `Resnet50` and `Mobilenet_v2` for the latest VPU.
- (PyTorch) Split `NNCFBatchNorm` into `NNCFBatchNorm1d`, `NNCFBatchNorm2d`, `NNCFBatchNorm3d`.
- (PyTorch - Experimental) Added the `BNASTrainingController` and `BNASTrainingAlgorithm` for BootstrapNAS to search the model's architecture.
- (Experimental) ONNX `ModelProto` is now converted to `NNCFGraph` through `GraphConverter`.
- (Experimental) `ONNXOpMetatype` and extended patterns for the fusing HW config are now available.
- (Experimental) Added `ONNXPostTrainingQuantization` and `MinMaxQuantization` support for ONNX.
Bugfixes
- (PyTorch, TensorFlow) Added exception handling of BN adaptation for zero sample values.
- (PyTorch, TensorFlow) Fixed the learning rate after the validation step for `EarlyExitCompressionTrainingLoop`.
- (PyTorch) Fixed `FakeQuantizer` to produce exact zeros.
- (PyTorch) Fixed quantizer misplacements during ONNX export.
- (PyTorch) Restored device information during ONNX export.
- (PyTorch) Fixed the statistics collection from the pruned model.
v2.1.0
New features
- (PyTorch) All PyTorch operations are now NNCF-wrapped automatically.
- (TensorFlow) Scales for concat-affecting quantizers are now unified
- (PyTorch) The pruned filters are now set to 0 in the exported ONNX file instead of removing them from the ONNX definition.
- (PyTorch, TensorFlow) Extended the accuracy-aware training pipeline with the `early_exit` mode.
- (PyTorch, TensorFlow) Added support for quantization presets to be specified in the NNCF config.
- (PyTorch, TensorFlow) Extended pruning statistics displayed to the user.
- (PyTorch, TensorFlow) Users may now register a `dump_checkpoints_fn` callback to control the location of checkpoint saving during accuracy-aware training.
- (PyTorch, TensorFlow) The default pruning schedule is now exponential.
- (PyTorch) SILU activation now supported.
- (PyTorch) Dynamic graph no longer traced during compressed model execution, which improves training performance of models compressed with NNCF.
- (PyTorch) Added BERT-MRPC quantization results and integration instructions to the HuggingFace Transformers integration patch.
- (PyTorch) Knowledge distillation extended with the option to specify the temperature for the `softmax` mode.
- (TensorFlow) Added a `mixed_min_max` option for quantizer range initialization.
- (PyTorch, TensorFlow) ReLU6-based HSwish and HSigmoid activations are now properly fused.
- (PyTorch - Experimental) Added an algorithm to search the model's architecture for basic building blocks.
Bugfixes:
- (TensorFlow) Fixed a bug where an operation with int32 inputs (following a Cast op) was attempted to be quantized.
- (PyTorch, TensorFlow) LeakyReLU now properly handled during pruning
- (PyTorch) Fixed errors with custom modules failing at the `determine_subtype` stage of metatype assignment.
- (PyTorch) Fixed handling of modules with `torch.nn.utils.weight_norm.WeightNorm` applied.
v2.0.2
v2.0.1
Target version updates:
- Bump target framework versions to PyTorch 1.9.1 and TensorFlow 2.4.3
- Increased target HuggingFace transformers version for the integration patch to 4.9.1
Bugfixes:
- Fixed statistic collection for the algo mixing scenario
- Increased pruning algorithm robustness in cases of a disconnected NNCF graph
- NNCF graph PNG rendering failures are no longer fatal
- Fixed README command lines
- (PyTorch) Fixed a bug with quantizing shared weights multiple times
- (PyTorch) Fixed knowledge distillation failures in CPU-only and DataParallel scenarios
- (PyTorch) Fixed sparsity application for torch.nn.Embedding and EmbeddingBag modules
- (PyTorch) Added GroupNorm + ReLU as a fusable pattern
- (TensorFlow) Fixed gamma fusion handling for pruning TF BatchNorm
- (PyTorch) Fixed pruning for models where operations have multiple convolution predecessors
- (PyTorch) Fixed the NNCFNetwork wrapper so that `self` in the calls to the wrapped model refers to the wrapper NNCFNetwork object and not to the wrapped model
- (PyTorch) Fixed tracing of `view` operations to handle shape arguments of the `torch.Tensor` type
- (PyTorch) Added matmul ops to be considered for fusing
- (PyTorch, TensorFlow) Fixed tensorboard logging for accuracy-aware scenarios
- (PyTorch, TensorFlow) Fixed FLOPS calculation for grouped convolutions
- (PyTorch) Fixed knowledge distillation failures for tensors of unsupported shapes - will ignore output tensors with unsupported shapes now instead of crashing.
v2.0.0
New features:
- Added TensorFlow 2.4.2 support - NNCF can now be used to apply the compression algorithms to models originally trained in TensorFlow.
  NNCF with the TensorFlow backend supports the following features:
  - Compression algorithms:
    - Quantization (with HW-specific targeting aligned with PyTorch)
    - Sparsity:
      - Magnitude Sparsity
      - RB Sparsity
    - Filter pruning
  - Support only for Keras models consisting of standard Keras layers and created by:
    - Keras Sequential API
    - Keras Functional API
  - Automatic, configurable model graph transformation to obtain the compressed model.
  - Distributed training on multiple GPUs on one machine is supported using `tf.distribute.MirroredStrategy`.
  - Exporting compressed models to SavedModel or Frozen Graph format, ready to use with the OpenVINO™ toolkit.
- Added model compression samples for NNCF with the TensorFlow backend:
  - Classification
    - Keras training loop.
    - Models from the tf.keras.applications module (ResNets, MobileNets, Inception, etc.) are supported.
    - TensorFlow Datasets (TFDS) and TFRecords (ImageNet2012, Cifar100, Cifar10) are supported.
    - Compression results are claimed for MobileNet V2, MobileNet V3 small, MobileNet V3 large, ResNet50, Inception V3.
  - Object Detection
    - Custom training loop.
    - TensorFlow Datasets (TFDS) and TFRecords for COCO2017 are supported.
    - Compression results are claimed for RetinaNet, YOLOv4.
  - Instance Segmentation
    - Custom training loop.
    - TFRecords for COCO2017 are supported.
    - Compression results are claimed for MaskRCNN.
- Accuracy-aware training is available for filter pruning and sparsity in order to achieve the best compression results within a given accuracy drop threshold in a fully automated fashion.
- Framework-specific checkpoints produced with NNCF now have NNCF-specific compression state information included, so that the exact compressed model state can be restored/loaded without having to provide the same NNCF config file that was used during the creation of the NNCF-compressed checkpoint.
- Common interface for compression methods for both PyTorch and TensorFlow backends (https://github.com/openvinotoolkit/nncf/tree/develop/nncf/api).
- (PyTorch) Added an option to specify an effective learning rate multiplier for the trainable parameters of the compression algorithms via the NNCF config, for finer control over which should tune faster - the underlying FP32 model weights or the compression parameters.
- (PyTorch) Unified scales for concat operations - the per-tensor quantizers that affect the concat operations will now have identical scales so that the resulting concatenated tensor can be represented without loss of accuracy w.r.t. the concatenated subcomponents.
- (TensorFlow) Algo-mixing: Added configuration files and reference checkpoints for filter-pruned + quantized models: ResNet50@ImageNet2012 (40% of filters pruned + INT8), RetinaNet@COCO2017 (40% of filters pruned + INT8).
- (Experimental, PyTorch) Learned Global Ranking filter pruning mechanism for better pruning ratios with less accuracy drop for a broad range of models has been implemented.
- (Experimental, PyTorch) Knowledge distillation supported, ready to be used with any compression algorithm to produce an additional loss source of the compressed model against the uncompressed version.
Breaking changes:
- `CompressionLevel` has been renamed to `CompressionStage`.
- `"ignored_scopes"` and `"target_scopes"` no longer allow prefix matching - use the full-fledged regular expression approach via `{re}` if anything more than an exact match is desired (see the sketch after this list).
- (PyTorch) Removed version-agnostic name mapping for ReLU operations, i.e. NNCF configs that referenced "RELU" (all caps) as an operation name will now have to reference an exact ReLU PyTorch function name such as "relu" or "relu_".
- (PyTorch) Removed the example of code modifications (Git patches and base commit IDs are provided) for the mmdetection repository.
- The batchnorm adaptation "forgetting" step has been removed since it has been observed to introduce accuracy degradation; the "num_bn_forget_steps" parameter in the corresponding NNCF config section has been removed.
- Framework-specific requirements are no longer installed during `pip install nncf` or `python setup.py install` and are assumed to be present in the user's environment; pip's "extras" syntax must be used to install the BKC requirements, e.g. by executing `pip install nncf[tf]`, `pip install nncf[torch]` or `pip install nncf[tf,torch]`.
- The `"quantizable_subgraph_patterns"` option was removed from the NNCF config.
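A minimal sketch of the `{re}` usage in `"ignored_scopes"` now that plain prefix matching is gone; the scope names below are illustrative placeholders:

```python
from nncf import NNCFConfig

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {
        "algorithm": "quantization",
        "ignored_scopes": [
            # Exact scope match still works as before:
            "MyModel/Linear[classifier]",
            # Anything broader than an exact match must now be an explicit regex:
            "{re}.*attention.*",
        ],
    },
})
```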
Bugfixes:
- (PyTorch) Fixed a hang with batchnorm adaptation being applied in DDP mode
- (PyTorch) Fixed tracing of the operations that return NotImplemented