LLM optimization documentation fixes and updates. #28212

Open · wants to merge 3 commits into master
@@ -5,8 +5,9 @@ LLM Weight Compression
:maxdepth: 1
:hidden:

weight-compression/microscaling-quantization
weight-compression/4-bit-weight-quantization
weight-compression/microscaling-quantization



Weight compression enhances the efficiency of models by reducing their memory footprint,
@@ -16,14 +17,13 @@ Unlike full model quantization, where both weights and activations are quantized, it
only targets weights, keeping activations as floating-point numbers. This means preserving most
of the model's accuracy while improving its
speed and reducing its size. The reduction in size is especially noticeable with larger models.
For instance the 7 billion parameter Llama 2 model can be reduced
from about 25GB to 4GB using 4-bit weight compression.
For instance, the 8 billion parameter Llama 3 model can be reduced
from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of the bfloat16 model.
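
As a rough, illustrative cross-check of these figures (assuming 8 billion parameters; the published
sizes also include embeddings, per-group quantization scales, and layers kept at higher precision,
so the estimate is approximate):

.. code-block:: python

   # Back-of-the-envelope footprint estimate for an 8B-parameter model.
   params = 8_000_000_000

   bf16_gb = params * 2 / 1e9    # bfloat16 stores 2 bytes per weight -> ~16 GB
   int4_gb = params * 0.5 / 1e9  # 4-bit stores 0.5 bytes per weight  -> ~4 GB
   # The real 4.8 GB figure is larger because scales/zero points are stored
   # per group and some sensitive layers usually stay in 8-bit precision.

   print(f"bfloat16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.0f} GB")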

.. note::

With smaller language models (i.e. less than 1B parameters), weight
With smaller language models (i.e., less than 1B parameters), low-bit weight
compression may result in more accuracy reduction than with larger models.
Therefore, weight compression is recommended for use with LLMs only.

LLMs and other GenAI models that require
extensive memory to store the weights during inference can benefit
@@ -36,7 +36,7 @@ from weight compression as it:
* improves inference speed by reducing the latency of memory access when computing the
operations with weights, for example, Linear layers. The weights are smaller and thus
faster to load from memory;
* unlike quantization, does not require sample data to calibrate the range of
* unlike full static quantization, does not require sample data to calibrate the range of
activation values.

Currently, `NNCF <https://github.com/openvinotoolkit/nncf>`__
@@ -64,7 +64,7 @@ by running the following command:
pip install optimum[openvino]

**8-bit weight quantization** offers a good balance between reducing the size and lowering the
accuracy of a model. It usually results in significant improvements for transformer-based models
accuracy of a model. It usually results in significant improvements for Transformer-based models
and guarantees good model performance for a vast majority of supported CPU and GPU platforms.
By default, weights are compressed asymmetrically to "INT8_ASYM" mode.
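
As a minimal, illustrative sketch of this default flow with NNCF (the IR paths are hypothetical):

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical path to an OpenVINO IR model

   # Without extra arguments, weights are compressed to 8-bit asymmetric (INT8_ASYM).
   compressed_model = nncf.compress_weights(model)

   ov.save_model(compressed_model, "model_int8.xml")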

@@ -223,17 +223,6 @@ depending on the model.
For more details, refer to the article on how to
:doc:`infer LLMs using Optimum Intel <../../../learn-openvino/llm_inference_guide/llm-inference-hf>`.
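
As a minimal, illustrative sketch of 4-bit weight quantization through Optimum Intel, assuming
the ``OVWeightQuantizationConfig`` and ``OVModelForCausalLM`` APIs of a recent ``optimum-intel``
release (the model ID and output path are examples only):

.. code-block:: python

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # Export the model to OpenVINO IR and quantize its weights to 4 bit in one step.
   quantization_config = OVWeightQuantizationConfig(bits=4)
   model = OVModelForCausalLM.from_pretrained(
       "meta-llama/Meta-Llama-3-8B",  # example model ID
       export=True,
       quantization_config=quantization_config,
   )
   model.save_pretrained("llama-3-8b-int4-ov")  # example output directory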

The code snippet below shows how to do 4-bit quantization of the model weights represented
in OpenVINO IR using NNCF:

.. tab-set::

.. tab-item:: OpenVINO
:sync: openvino

.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
:language: python
:fragment: [compression_4bit]

Refer to the article about
:doc:`4-bit weight quantization <./weight-compression/4-bit-weight-quantization>`
@@ -133,7 +133,12 @@ trade-offs after optimization:
There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains
the original floating-point precision of the model weights (``INT8_ASYM`` is the default value).
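
For illustration, a minimal sketch of selecting the backup precision, assuming the ``backup_mode``
parameter and ``nncf.BackupMode`` enum available in recent NNCF releases (the IR path is hypothetical):

.. code-block:: python

   import nncf
   import openvino as ov

   model = ov.Core().read_model("model.xml")  # hypothetical path to an OpenVINO IR model

   # Layers not quantized to 4 bit fall back to the backup precision:
   # INT8_ASYM (default), INT8_SYM, or NONE (keep original floating point).
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       backup_mode=nncf.BackupMode.INT8_SYM,
   )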

|


.. tip::

NNCF allows stacking the supported optimization methods. For example, the AWQ, Scale Estimation,
and GPTQ methods can all be enabled together to achieve better accuracy, as shown in the sketch below.
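
A minimal sketch of such stacking with ``nncf.compress_weights``; the model and the calibration
dataset are assumed to be prepared beforehand, and the parameter values are only examples:

.. code-block:: python

   import nncf

   # `model` is an OpenVINO model and `calibration_dataset` an nncf.Dataset,
   # both assumed to be prepared beforehand.
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,                    # example share of layers quantized to 4 bit
       dataset=calibration_dataset,  # data-aware methods below require a dataset
       awq=True,                     # AWQ ...
       scale_estimation=True,        # ... stacked with Scale Estimation ...
       gptq=True,                    # ... and GPTQ in a single call
   )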

4-bit Weight Quantization with GPTQ
###################################
16 changes: 5 additions & 11 deletions docs/articles_en/openvino-workflow/model-optimization.rst
@@ -21,24 +21,24 @@ In OpenVINO, the default optimization tool is NNCF (Neural Network Compression F
It is a `set of compression algorithms <https://github.com/openvinotoolkit/nncf/blob/develop/README.md>`__,
organized as a Python package, that make your models smaller and faster. Note that NNCF
is **not part of the OpenVINO package**, so it needs to be installed separately. It supports
models in **PyTorch**, **TensorFlow** , **ONNX**, and **OpenVINO IR** formats, offering
models in **OpenVINO IR**, **PyTorch**, **ONNX**, and **TensorFlow** formats, offering
the following main optimizations:

.. image:: ../assets/images/WHAT_TO_USE.svg


| :doc:`Weight Compression <model-optimization-guide/weight-compression>`:
| an easy-to-use method for Large Language Model footprint reduction and inference
| An easy-to-use method for Large Language Model footprint reduction and inference
acceleration.

| :doc:`Post-training Quantization <model-optimization-guide/quantizing-models-post-training>`:
| designed to optimize deep learning models by applying 8-bit integer quantization. Being
| Designed to optimize deep learning models by applying 8-bit integer quantization. Being
the easiest way to optimize a model, it does not require retraining or fine-tuning
but may result in a drop in accuracy. If the accuracy-performance tradeoff is not
acceptable, Training-time Optimization may be a better option (see the sketch after this list).

| :doc:`Training-time Optimization <model-optimization-guide/compressing-models-during-training>`:
| involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
| Involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
as Quantization-aware Training. This kind of optimization requires the use of the model's
original framework; for NNCF, this is either PyTorch or TensorFlow.
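
A minimal post-training quantization sketch with NNCF; the IR path, calibration items, and
transform function are placeholders for illustration:

.. code-block:: python

   import nncf
   import openvino as ov

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical path to a floating-point IR model

   # `calibration_items` is a small set of representative inputs and `transform_fn`
   # converts each item into the model's input format; both are assumed here.
   calibration_dataset = nncf.Dataset(calibration_items, transform_fn)

   # 8-bit post-training quantization of both weights and activations.
   quantized_model = nncf.quantize(model, calibration_dataset)
   ov.save_model(quantized_model, "model_int8.xml")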

@@ -54,13 +54,7 @@ Recommended workflows
3. If the accuracy drop is unacceptable, use quantization-aware training instead. It will give
you the same level of performance boost, with a smaller impact on accuracy.

* **Weight compression** works **only with LLMs**. Do not try to use it with other models.
* For **visual-multimodal** use cases, the encoder / decoder split approach may be recommended.





* **Weight compression** works with **LLMs**, **VLMs**, and other Transformer-based models.


