LLM optimization documentation fixes and updates. #28212

Open · wants to merge 3 commits into master
@@ -5,8 +5,9 @@ LLM Weight Compression
:maxdepth: 1
:hidden:

weight-compression/microscaling-quantization
weight-compression/4-bit-weight-quantization
weight-compression/microscaling-quantization



Weight compression enhances the efficiency of models by reducing their memory footprint,
@@ -16,14 +17,13 @@ Unlike full model quantization, where both weights and activations are quantized, it
only targets weights, keeping activations as floating-point numbers. This means preserving most
of the model's accuracy while improving its
speed and reducing its size. The reduction in size is especially noticeable with larger models.
For instance the 7 billion parameter Llama 2 model can be reduced
from about 25GB to 4GB using 4-bit weight compression.
For instance, the 8 billion parameter Llama 3 model can be reduced
from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of the bfloat16 model.
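
As a rough, illustrative cross-check of these figures (assuming 8 billion parameters; the published
sizes also include embeddings, per-group quantization scales, and layers kept at higher precision,
so the estimate is approximate):

.. code-block:: python

   # Back-of-the-envelope footprint estimate for an 8B-parameter model.
   params = 8_000_000_000

   bf16_gb = params * 2 / 1e9    # bfloat16 stores 2 bytes per weight -> ~16 GB
   int4_gb = params * 0.5 / 1e9  # 4-bit stores 0.5 bytes per weight  -> ~4 GB
   # The real 4.8 GB figure is larger because scales/zero points are stored
   # per group and some sensitive layers usually stay in 8-bit precision.

   print(f"bfloat16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.0f} GB")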

.. note::

With smaller language models (i.e. less than 1B parameters), weight
With smaller language models (i.e., less than 1B parameters), low-bit weight
compression may result in more accuracy reduction than with larger models.
Therefore, weight compression is recommended for use with LLMs only.

LLMs and other GenAI models that require
extensive memory to store the weights during inference can benefit
@@ -36,7 +36,7 @@ from weight compression as it:
* improves inference speed by reducing the latency of memory access when computing the
operations with weights, for example, Linear layers. The weights are smaller and thus
faster to load from memory;
* unlike quantization, does not require sample data to calibrate the range of
* unlike full static quantization, does not require sample data to calibrate the range of
activation values.

Currently, `NNCF <https://github.com/openvinotoolkit/nncf>`__
@@ -64,7 +64,7 @@ by running the following command:
pip install optimum[openvino]

**8-bit weight quantization** offers a good balance between reducing the size and lowering the
accuracy of a model. It usually results in significant improvements for transformer-based models
accuracy of a model. It usually results in significant improvements for Transformer-based models
and guarantees good model performance for a vast majority of supported CPU and GPU platforms.
By default, weights are compressed asymmetrically to "INT8_ASYM" mode.
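
As a minimal, illustrative sketch of this default flow with NNCF (the IR paths are hypothetical):

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical path to an OpenVINO IR model

   # Without extra arguments, weights are compressed to 8-bit asymmetric (INT8_ASYM).
   compressed_model = nncf.compress_weights(model)

   ov.save_model(compressed_model, "model_int8.xml")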

@@ -223,17 +223,6 @@ depending on the model.
For more details, refer to the article on how to
:doc:`infer LLMs using Optimum Intel <../../../learn-openvino/llm_inference_guide/llm-inference-hf>`.
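
As a minimal, illustrative sketch of 4-bit weight quantization through Optimum Intel, assuming
the ``OVWeightQuantizationConfig`` and ``OVModelForCausalLM`` APIs of a recent ``optimum-intel``
release (the model ID and output path are examples only):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # Export the model to OpenVINO IR and quantize its weights to 4 bit in one step.
   quantization_config = OVWeightQuantizationConfig(bits=4)
   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Meta-Llama-3-8B",  # example model ID
       export=True,
       quantization_config=quantization_config,
   )
   model.save_pretrained("llama-3-8b-int4-ov")  # example output directory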

The code snippet below shows how to do 4-bit quantization of the model weights represented
in OpenVINO IR using NNCF:

.. tab-set::

.. tab-item:: OpenVINO
:sync: openvino

.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
:language: python
:fragment: [compression_4bit]

Refer to the article about
:doc:`4-bit weight quantization <./weight-compression/4-bit-weight-quantization>`
@@ -133,7 +133,12 @@ trade-offs after optimization:
There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains
the original floating-point precision of the model weights (``INT8_ASYM`` is the default value).
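
For illustration, a minimal sketch of selecting the backup precision, assuming the ``backup_mode``
parameter and ``nncf.BackupMode`` enum available in recent NNCF releases (the IR path is hypothetical):

.. code-block:: python

   import nncf
   import openvino as ov

   model = ov.Core().read_model("model.xml")  # hypothetical path to an OpenVINO IR model

   # Layers not quantized to 4 bit fall back to the backup precision:
   # INT8_ASYM (default), INT8_SYM, or NONE (keep original floating point).
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       backup_mode=nncf.BackupMode.INT8_SYM,
   )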

|


.. tip::

NNCF allows stacking the supported optimization methods. For example, the AWQ, Scale Estimation,
and GPTQ methods can all be enabled together to achieve better accuracy, as shown in the sketch below.
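
A minimal sketch of such stacking with ``nncf.compress_weights``; the model and the calibration
dataset are assumed to be prepared beforehand, and the parameter values are only examples:

.. code-block:: python

   import nncf

   # `model` is an OpenVINO model and `calibration_dataset` an nncf.Dataset,
   # both assumed to be prepared beforehand.
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,                    # example share of layers quantized to 4 bit
       dataset=calibration_dataset,  # data-aware methods below require a dataset
       awq=True,                     # AWQ ...
       scale_estimation=True,        # ... stacked with Scale Estimation ...
       gptq=True,                    # ... and GPTQ in a single call
   )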

4-bit Weight Quantization with GPTQ
###################################
16 changes: 5 additions & 11 deletions docs/articles_en/openvino-workflow/model-optimization.rst
@@ -21,24 +21,24 @@ In OpenVINO, the default optimization tool is NNCF (Neural Network Compression F
It is a `set of compression algorithms <https://github.com/openvinotoolkit/nncf/blob/develop/README.md>`__,
organized as a Python package, that make your models smaller and faster. Note that NNCF
is **not part of the OpenVINO package**, so it needs to be installed separately. It supports
models in **PyTorch**, **TensorFlow** , **ONNX**, and **OpenVINO IR** formats, offering
models in **OpenVINO IR**, **PyTorch**, **ONNX**, and **TensorFlow** formats, offering
the following main optimizations:

.. image:: ../assets/images/WHAT_TO_USE.svg


| :doc:`Weight Compression <model-optimization-guide/weight-compression>`:
| an easy-to-use method for Large Language Model footprint reduction and inference
| An easy-to-use method for Large Language Model footprint reduction and inference
acceleration.

| :doc:`Post-training Quantization <model-optimization-guide/quantizing-models-post-training>`:
| designed to optimize deep learning models by applying 8-bit integer quantization. Being
| Designed to optimize deep learning models by applying 8-bit integer quantization. Being
the easiest way to optimize a model, it does not require retraining or fine-tuning
but may result in a drop in accuracy. If the accuracy-performance tradeoff is not
acceptable, Training-time Optimization may be a better option (see the sketch after this list).

| :doc:`Training-time Optimization <model-optimization-guide/compressing-models-during-training>`:
| involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
| Involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
as Quantization-aware Training. This kind of optimization requires the use of the model's
original framework; for NNCF, this is either PyTorch or TensorFlow.
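
A minimal post-training quantization sketch with NNCF; the IR path, calibration items, and
transform function are placeholders for illustration:

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical path to a floating-point IR model

   # `calibration_items` is a small set of representative inputs and `transform_fn`
   # converts each item into the model's input format; both are assumed here.
   calibration_dataset = nncf.Dataset(calibration_items, transform_fn)

   # 8-bit post-training quantization of both weights and activations.
   quantized_model = nncf.quantize(model, calibration_dataset)
   ov.save_model(quantized_model, "model_int8.xml")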

@@ -54,13 +54,7 @@ Recommended workflows
3. If the accuracy drop is unacceptable, use quantization-aware training instead. It will give
you the same level of performance boost, with a smaller impact on accuracy.

* **Weight compression** works **only with LLMs**. Do not try to use it with other models.
* For **visual-multimodal** use cases, the encoder / decoder split approach may be recommended.





* **Weight compression** works with **LLMs**, **VLMs**, and other Transformer-based models.


