This guide outlines a systematic approach for quantizing your models using iterative refinement of the importance matrix (I-matrix) and evaluating quantization quality through KL-divergence metrics. By following these steps, you can optimize your model quantization to achieve minimal performance loss across diverse data subsets.
Before quantizing your model, you need a dataset to generate the I-matrix and evaluate the quantized models. The `imatrix_dataset.py` tool helps collect and preprocess this data. Thanks to the efficient caching mechanism of the `datasets` library, you can start with a smaller dataset for initial testing and gradually expand it as needed for I-matrix generation and model evaluation. The tool's skip functionality allows you to add more data later, giving your workflow flexibility without requiring large upfront downloads.
To optimize the process of collecting data for I-matrix generation and model evaluation:
- Start Small: Begin with a smaller dataset for initial testing. The `imatrix_dataset.py` tool lets you specify the number of samples to collect with the `--num_samples` argument.
- Incremental Expansion: As you need more data, you can expand your dataset incrementally. The tool supports skipping samples with the `--skip_samples` argument, allowing you to add new data without duplicating existing samples.
- Efficient Caching: The `datasets` library automatically caches downloaded data locally, so subsequent accesses to the same dataset are much faster and don't require re-downloading.
- Flexible Data Management: The tool writes individual language samples to separate JSON files (e.g., `raw_transactions_{lang}.json`). It also creates a combined dataset file, which can be shuffled and chunked if desired.
- Automated Sample Counting: Use the `--count-only` flag to check how many samples you already have without downloading more data.
- Smart Overwrite Control: The `--overwrite` flag controls whether existing data is replaced or appended to.
This approach leverages the built-in efficiencies of the datasets library and the flexibility of the imatrix_dataset.py tool. It allows you to start small, expand as needed, and maintain an efficient workflow without unnecessary downloads or data redundancy.
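If you want to see the underlying pattern in code, the sketch below shows how the `datasets` library's streaming mode can skip previously collected samples and take only new ones. This is a minimal illustration of the mechanism, not a substitute for `imatrix_dataset.py`; the dataset id, text field, and output filename are placeholders.

```python
# Minimal sketch of the incremental-collection pattern: skip samples gathered
# in earlier runs, take only new ones. Dataset id and output file are placeholders.
import json
from datasets import load_dataset

ALREADY_COLLECTED = 10_000   # samples gathered in earlier runs
NEW_SAMPLES = 5_000          # additional samples to collect now

stream = load_dataset("your/dataset-id", split="train", streaming=True)

# Skip what was already collected, then take only the new slice,
# so no samples are duplicated across runs.
new_slice = stream.skip(ALREADY_COLLECTED).take(NEW_SAMPLES)

with open("raw_samples_en.json", "a", encoding="utf-8") as fh:
    for record in new_slice:
        fh.write(json.dumps(record) + "\n")
```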
The `imatrix_dataset.py` script allows you to create a dataset tailored to your needs. It supports data from a variety of sources, such as Hugging Face datasets, the OSCAR corpus, or your own custom sources, through a flexible plugin system (details on the plugin system are provided later in this guide).
Example command:
```bash
❯ uv run src/imatrix_dataset.py \
  --datasource_plugin src/imatrix_dataset/hf_dataset.py \
  --plugin_class HFDatasetPlugin \
  --langs en \
  --num_samples 10000 \
  --output /path/to/dataset.txt
```
- The `--datasource_plugin` parameter specifies the path to the data source plugin script.
- The `--plugin_class` parameter specifies the class name within the plugin script.
- Adjust the `--langs` parameter to include the languages relevant to your dataset (e.g., `en`, `es`, `de`).
- The `--num_samples` parameter specifies the number of samples to use.
- The `--output` parameter specifies the path where the generated dataset will be saved.
After preparing your dataset and generating the I-matrix, you may want to create multiple quantized versions of your model simultaneously. The `quantize.py` script facilitates this by allowing you to specify multiple quantization types in a single command, ensuring consistency across all quantized models.
- Consistency: Ensures that all quantizations are created using the same settings and parameters.
- Efficiency: Saves time by processing multiple quantizations in a single run rather than executing separate commands for each.
- Organization: Centralizes the quantization process, making it easier to manage and track different quantized versions.
You can specify multiple quantization types using a comma-separated list in the `--quantizations` parameter. The script will process each specified quantization sequentially, saving each quantized model to the designated output directory.
Example Command:
```bash
❯ uv run src/quantize.py quantize \
  --model-name my_model \
  --base-model /path/to/baseline_model.gguf \
  --quantizations Q4_0,Q6_K,Q8_0 \
  --imatrix-path /path/to/imatrix.bin \
  --output-dir /path/to/output_dir
```
- `--model-name`: A base name for your quantized models. The script will append the quantization type to this name for each output model (e.g., `my_model-Q4_0.gguf`).
- `--base-model`: Path to your unquantized (baseline) model in GGUF format.
- `--quantizations`: A comma-separated list of quantization types you wish to generate (e.g., `q4_0,q4_1,q5_0`).
- `--imatrix-path`: Path to the I-matrix file generated in the previous step.
- `--output-dir`: Directory where the quantized models will be saved.
- Run `generate_logits.py` on the Baseline Model

  Begin by generating logits from your unquantized (baseline) model over the dataset you prepared. Use the `generate_logits.py` script to create a reusable HDF5 file containing these logits. Generating the baseline logits once saves time and storage in subsequent comparisons.

  Example command:

  ```bash
  ❯ uv run src/generate_logits.py \
    --model /path/to/baseline_model.gguf \
    --context-size <size, e.g. 2048> \
    --dataset /path/to/dataset.txt \
    --output baseline_logits.hdf5
  ```
  - Replace `/path/to/baseline_model.gguf` with the path to your unquantized model in GGUF format, and `<size>` with your model's context size.
  - The `--dataset` parameter uses the dataset generated by `imatrix_dataset.py` in the previous step.
  - The `--output` parameter specifies the name of the output HDF5 file.

  Notes:

  - If you are working with a large dataset, you can use the `--from` and/or `--to` arguments to process it in resumable chunks, allowing you to generate logits gradually without starting over each time (see the driver sketch after these notes).
  - At this time, the llama-cpp-python library defaults to a context size of 512 rather than reading it from the underlying model, so `--context-size` was made a required argument.
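  If you prefer to script the chunked generation, a small driver loop along these lines can invoke `generate_logits.py` repeatedly with the `--from`/`--to` arguments mentioned above. This is a hypothetical sketch: it assumes `--from`/`--to` take chunk indices; verify the exact semantics against your setup.

  ```python
  # Hypothetical driver sketch: generate logits in resumable chunks by calling
  # generate_logits.py with --from/--to (assumed here to be chunk indices).
  import subprocess

  CHUNKS_PER_RUN = 50   # chunks to process per invocation
  TOTAL_CHUNKS = 500    # adjust to your dataset size

  for start in range(0, TOTAL_CHUNKS, CHUNKS_PER_RUN):
      end = min(start + CHUNKS_PER_RUN, TOTAL_CHUNKS)
      subprocess.run(
          [
              "uv", "run", "src/generate_logits.py",
              "--model", "/path/to/baseline_model.gguf",
              "--context-size", "2048",
              "--dataset", "/path/to/dataset.txt",
              "--output", "baseline_logits.hdf5",
              "--from", str(start),
              "--to", str(end),
          ],
          check=True,  # abort the loop on failure; rerun later to resume
      )
  ```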
- Save Baseline Logits for Reference

  Store the generated `baseline_logits.hdf5` file in a secure and accessible location. This file will serve as the reference point for comparing outputs from quantized models in later steps.
- Quantize the Model without an I-Matrix using `quantize.py`

  Create an initial quantized version of your model without using an I-matrix. Use the `quantize.py` script to perform the quantization, which streamlines the process and allows for consistent quantization settings.

  Example command:

  ```bash
  ❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --output-dir /path/to/output_dir
  ```

  - Replace `my_model` with a name for your model.
  - Replace `/path/to/baseline_model.gguf` with the path to your unquantized model.
  - The `--quantizations` parameter specifies the quantization type(s); here, `q4_0` is used.
  - The `--output-dir` parameter specifies where the quantized model will be saved.
- Run `kl_d_bench.py` for Initial Comparison

  Use the `kl_d_bench.py` script to compare the logits of the quantized model against the baseline logits. This script processes the stored baseline logits and computes KL-divergence metrics efficiently.

  Example command:

  ```bash
  ❯ uv run src/kl_d_bench.py \
    --baseline-logits baseline_logits.hdf5 \
    --target-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/dataset.txt \
    --output-file initial_comparison.hdf5
  ```

  - Replace `/path/to/output_dir/my_model-q4_0.gguf` with the path to your quantized model.
  - The `--output-file` parameter specifies where to save the comparison results.

  The `--early-stopping` flag enables the early stopping mechanism, allowing the script to terminate the comparison once sufficient statistical evidence is gathered. In subsequent tests, limiting all runs with `--to-chunk` to the number of chunks at which early stopping triggered can greatly reduce the amount of data written to disk.

  Example command with Early Stopping:

  ```bash
  ❯ uv run src/kl_d_bench.py \
    --baseline-logits baseline_logits.hdf5 \
    --target-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/dataset.txt \
    --output-file initial_comparison.hdf5 \
    --early-stopping
  ```
- Collect and Evaluate Metrics

  After running `kl_d_bench.py`, review the KL-divergence metrics, including the median and the 90th, 95th, and 99th percentiles for each data chunk. This initial assessment serves as a reference point for evaluating improvements from I-matrix calibration in subsequent iterations.
- Generate I-Matrices with Incrementally Larger Dataset Subsets

  Begin refining the I-matrix by generating it from a small subset of your dataset. If you haven't already, use the `imatrix_dataset.py` script to create the I-matrix tailored to your data. As you iterate, you will increase the dataset size to improve the I-matrix.

  Example command:

  ```bash
  ❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/hf_dataset.py \
    --plugin_class HFDatasetPlugin \
    --langs en \
    --num_samples 50000 \
    --output imatrix_en_50k.bin
  ```

  - Increase `--num_samples` in each iteration (e.g., 50,000, 100,000, 200,000).
  - The `--output` parameter specifies the name of the I-matrix file for each iteration.

  Note: Remember that `imatrix_dataset.py` uses a plugin system to handle various data sources; the details are covered later in this guide.
- Quantize the Model with the New I-Matrix using `quantize.py`

  Use the updated I-matrix in the quantization process.

  Example command:

  ```bash
  ❯ uv run src/quantize.py quantize \
    --model-name my_model \
    --base-model /path/to/baseline_model.gguf \
    --quantizations q4_0 \
    --imatrix-path imatrix_en_50k.bin \
    --output-dir /path/to/output_dir
  ```

  - The `--imatrix-path` parameter specifies the path to the I-matrix file generated in the current iteration.
- Use `kl_d_bench.py` for Comparison

  For each quantized model with an updated I-matrix, use `kl_d_bench.py` to compare against the baseline logits. Utilize the `--to-chunk` flag to potentially reduce computation time by halting the comparison at the point previously found with `--early-stopping`.

  Run the comparison script:

  ```bash
  ❯ uv run src/kl_d_bench.py \
    --baseline-logits baseline_logits.hdf5 \
    --target-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/dataset.txt \
    --output-file comparison_iteration_1.hdf5 \
    --to-chunk 42
  ```

  - Update `comparison_iteration_1.hdf5` to reflect the iteration number (e.g., `comparison_iteration_2.hdf5`).
- Analyze KL-Divergence Metrics

  Focus on key metrics, especially the high-percentile KL-divergence values, to assess the effectiveness of the quantization with each updated I-matrix.

- Evaluate Metrics to Determine When to Stop Adding Data

  Monitor the KL-divergence metrics across iterations. Pay special attention to the 90th, 95th, and 99th percentiles. When successive iterations show only marginal improvements (diminishing returns), you can consider the I-matrix sufficiently refined for your application.
To balance overall performance with outlier minimization, we suggest using a composite metric that combines the median KL-divergence and higher percentile values.
Suggested Composite Metric:

$$
\text{Score} = \text{Median}^{1/3} \times \left(5\,\text{KLD}_{90} + 4\,\text{KLD}_{95} + 1\,\text{KLD}_{99}\right)^{2/3}
$$
Explanation:

- Median KL-Divergence ($\text{Median}$): Represents typical performance across the dataset.
- High Percentile KL-Divergence ($\text{KLD}_{90}$, $\text{KLD}_{95}$, $\text{KLD}_{99}$): Captures the worst-case divergences, indicating how well the model handles outlier cases.
- Weighting Factors: The weights (1 for $\text{KLD}_{99}$, 4 for $\text{KLD}_{95}$, 5 for $\text{KLD}_{90}$) emphasize reducing higher divergences, with greater weight on percentiles covering more data.
- Exponents: The exponents (1/3 for the median, 2/3 for the weighted sum) balance the influence of average performance and outlier cases in the overall score.
By minimizing this composite score, you ensure that the quantized model maintains strong overall performance while mitigating significant divergences in less common scenarios.
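As a concrete illustration, the sketch below computes this composite score from an array of KL-divergence values. It assumes you have already extracted those values from a comparison file (for instance with `read_kl_d_benchmarks.py` from the extras folder) and uses the unnormalized weighted-sum form implied by the weights and exponents described above.

```python
# Sketch: compute the suggested composite KL-divergence score from per-token
# or per-chunk KL-divergence values extracted from a comparison file.
import numpy as np

def composite_score(kld_values: np.ndarray) -> float:
    median = np.median(kld_values)
    kld_90, kld_95, kld_99 = np.percentile(kld_values, [90, 95, 99])
    # Weighted sum of the tail percentiles (weights 5, 4, 1 as described above).
    weighted_tail = 5 * kld_90 + 4 * kld_95 + 1 * kld_99
    # Exponents 1/3 and 2/3 balance typical performance against outliers.
    return float(median ** (1 / 3) * weighted_tail ** (2 / 3))

# Example with synthetic values; lower scores indicate better quantizations.
rng = np.random.default_rng(0)
print(f"composite score: {composite_score(rng.gamma(0.5, 0.02, 10_000)):.5f}")
```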
Selecting an appropriate dataset size and coverage is crucial for effective I-matrix calibration. We recommend:
- Starting Small: Use a representative subset that includes key languages or domains relevant to your application.
- Gradual Expansion: Increase the dataset size in each iteration to include more diversity and complexity.
- Balancing Diversity and Size: Ensure that the dataset remains manageable while covering the necessary range of tokens and contexts.
For detailed insights into dataset selection and initial sizing, refer to on_quantization.md. This document provides guidance on balancing dataset diversity and size to optimize the I-matrix calibration process.
- Measuring Perplexity with `quantize.py`

  After quantization, you can measure the perplexity of your quantized model to assess its performance.

  Example command:

  ```bash
  ❯ uv run src/quantize.py perplexity \
    --model-name my_model \
    --base-model /path/to/output_dir/my_model-q4_0.gguf \
    --dataset /path/to/perplexity_dataset.txt
  ```

  - Replace `/path/to/perplexity_dataset.txt` with a dataset suitable for perplexity measurement.
In addition to the core scripts, the `src/extras/` folder contains supplementary tools that enhance your workflows. These scripts provide functionalities for optimization, visualization, data extraction, and file maintenance.
- Optimizing Batch Sizes with `best_bub.py`

  Before generating logits or quantizing large models, you may want to optimize batch (`--batch`) and micro-batch (`--ubatch`) sizes to maximize performance given your hardware constraints.

  Example command:

  ```bash
  ❯ uv run src/extras/best_bub.py --model /path/to/baseline_model.gguf --context-size 2048
  ```

  - Adjust the `--context-size` to match your model's maximum context size.
  - This script will suggest optimal `--batch-size` and `--ubatch-size` settings.
These scripts help analyze and visualize data generated during quantization, logit generation, and KL-divergence comparisons. They are useful for evaluating model performance, exploring patterns, and generating insights.
- `analyze_comparison_progress_from_logs.py`:
  Visualizes early stopping factors and tracks progress during `compare_logits.py` runs. Projects remaining runtime and statistical trends. Outputs live updates or exports raw data for later analysis.

  Example Command:

  ```bash
  ❯ uv run src/extras/analyze_comparison_progress_from_logs.py --logfile <log_file_path>
  ```

- `composite_comparison.py`:
  Evaluates multiple comparisons using a composite metric. Provides overall KL-divergence curves, chunk-by-chunk scores, and 3D performance manifolds to identify quantization trends.

  Example Command:

  ```bash
  ❯ uv run src/extras/composite_comparison.py --input-files <file1> <file2> --output <summary_file>
  ```

- `visualize_results.py`:
  Creates detailed visualizations of KL-divergence outputs, including chunk-by-chunk graphs and 3D manifolds. Helps identify patterns or anomalies in quantization results.

  Example Command:

  ```bash
  ❯ uv run src/extras/visualize_results.py --input <comparison_file> --output <graph_file>
  ```

- `read_kl_d_benchmarks.py`:
  Extracts and displays KL-divergence statistics from comparison HDF5 files. Filters metrics by chunk range or includes overall statistics for a concise summary.

  Example Command:

  ```bash
  ❯ uv run src/extras/read_kl_d_benchmarks.py --input <comparison_file> --start-chunk 0 --end-chunk 10
  ```
These scripts are designed to repair, repurpose, or modify HDF5 files used during logit generation or KL-divergence comparisons. They ensure data integrity and enable reuse in interrupted or experimental workflows.
- `append_overall.py`:
  Computes and appends "overall" KL-divergence metrics to a comparison file, particularly for interrupted runs. Ensures a complete summary of results.

  Example Command:

  ```bash
  ❯ uv run src/extras/append_overall.py --input <comparison_file> --output <updated_file>
  ```

- `reshape_logits.py`:
  Reshapes large logit files into smaller, evenly divided chunks, useful for exploring settings across multiple smaller chunks.

  Example Command:

  ```bash
  ❯ uv run src/extras/reshape_logits.py --input <logits_file> --output <reshaped_file> --chunk-size <size>
  ```

- `unfree.py`:
  Resets the `freed_chunks` dataset in HDF5 logit files. This is helpful for resuming interrupted logit generation processes without losing progress.

  Example Command:

  ```bash
  ❯ uv run src/extras/unfree.py --input <hdf5_file>
  ```
As mentioned earlier, the `imatrix_dataset.py` script uses a plugin system to support various data sources flexibly. Here's how you can utilize this system:
- Selecting a Data Source Plugin

  Choose an existing plugin or create a new one, depending on your data source.

  - Existing Plugins: `oscar_plugin.py` is an example Hugging Face dataset plugin, for the OSCAR corpus.
  - Custom Plugins: Create a plugin tailored to your data source.
- Specifying the Plugin in the Command

  When running `imatrix_dataset.py`, use the following arguments to specify the plugin:

  - `--datasource_plugin`: Path to the plugin script.
  - `--plugin_class`: The class name of the plugin within the script.

  Example command with an existing plugin:

  ```bash
  ❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en es de \
    --num_samples 100000 \
    --output /path/to/dataset.txt
  ```
- Creating a Custom Data Source Plugin

  If your data source isn't covered by existing plugins, you can create your own.

  Steps to Create a Custom Plugin:

  - Create a New Python File: Save it with a descriptive name, e.g., `my_custom_plugin.py`.
  - Import the Base Class:

    ```python
    from plugin_base import DataSourcePluginBase
    ```

  - Define Your Plugin Class: Inherit from `DataSourcePluginBase`.

    ```python
    class MyCustomPlugin(DataSourcePluginBase):
        def __init__(self, name="my_dataset", **kwargs):
            super().__init__(name, **kwargs)
            self.schema = {'content': 'path.to.text.field'}
    ```

  - Implement the `load_data` Method:

    ```python
    def load_data(self, lang, num_samples=200, skip_samples=0):
        # Implement your data loading logic here
        data_samples = []
        # Fetch data samples based on lang, num_samples, skip_samples
        return data_samples
    ```

  - Optionally Override the `get_content` Method: If your data records have a different structure, override this method to extract the text content.

    ```python
    def get_content(self, record):
        # Extract and return the text content from the record
        return record['desired_text_field']
    ```

  Using Your Custom Plugin:

  ```bash
  ❯ uv run src/imatrix_dataset.py \
    --datasource_plugin /path/to/my_custom_plugin.py \
    --plugin_class MyCustomPlugin \
    --langs en \
    --num_samples 100000 \
    --output /path/to/dataset.txt
  ```
- Understanding the Plugin Base Class

  The `DataSourcePluginBase` class defines the interface that all plugins must implement. It requires:

  - Initialization: Set up any necessary configurations or parameters.
  - `load_data` Method: Must return a list of data records for the specified language and sample counts.
  - `get_content` Method: Extracts the textual content from a data record, used in building the combined dataset.
- Example of Using the OSCAR Plugin

  The OSCAR dataset is a large multilingual corpus. Here's how to use `oscar_plugin.py`:

  ```bash
  ❯ uv run src/imatrix_dataset.py \
    --datasource_plugin src/imatrix_dataset/oscar_plugin.py \
    --plugin_class OscarDataSource \
    --langs en fr es \
    --num_samples 50000 \
    --output /path/to/dataset.txt
  ```

  - This command generates a dataset using 50,000 samples from each of the specified languages.
With the introduction of the `--early-stopping` flag in both `compare_logits.py` and `kl_d_bench.py`, you can optimize the comparison process by allowing the script to terminate early once sufficient statistical evidence is gathered. This feature leverages statistical tests to determine when additional data processing is unlikely to yield significant new insights, thereby saving computational resources and time. See the `compare_logits` specification file for details.
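To make the idea concrete, here is a purely conceptual sketch of one way such early stopping can work: keep processing chunks until a running summary statistic stabilizes within a tolerance. It is not the statistical test `compare_logits.py` actually implements; consult the specification file for that.

```python
# Conceptual sketch only: stop once the running mean KL-divergence stabilizes.
# This is NOT the statistical test used by compare_logits.py.
import numpy as np

def should_stop(chunk_means: list[float], window: int = 5, tol: float = 1e-3) -> bool:
    """Stop when the last `window` running means agree within `tol`."""
    if len(chunk_means) < window:
        return False
    running = np.cumsum(chunk_means) / np.arange(1, len(chunk_means) + 1)
    recent = running[-window:]
    return float(recent.max() - recent.min()) < tol

# Example: feed in per-chunk mean KL-divergences as they are produced.
rng = np.random.default_rng(1)
chunk_means: list[float] = []
for chunk in range(1000):
    chunk_means.append(float(rng.normal(0.05, 0.01)))  # stand-in for a real chunk
    if should_stop(chunk_means):
        print(f"early stop after {chunk + 1} chunks")
        break
```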
This set of tools is designed to simplify each stage of model quantization, from setting up datasets and generating I-matrices to quantizing models and evaluating performance. By following these steps, you can make targeted, data-driven adjustments at every stage, helping you achieve quantization results that preserve model quality while accommodating diverse data requirements.