Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for these popular LLMs, helping users quickly obtain an optimized LLM with less than 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs here in the future.
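To make the SmoothQuant idea concrete, here is a minimal pure-Python sketch (not the Intel® Neural Compressor API): activation outliers are migrated into the weights through a per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α), so that (X / s) · (s · W) equals X · W mathematically while both factors become easier to quantize. All names below are illustrative.

```python
# Sketch of SmoothQuant's difficulty migration (alpha balances how much
# of the activation outlier magnitude is moved into the weights).

def smooth(X, W, alpha=0.5):
    """X: activations [rows][channels]; W: weights [channels][out]."""
    n_ch = len(W)
    # Per-input-channel absolute maxima of activations and weights.
    x_max = [max(abs(X[r][c]) for r in range(len(X))) for c in range(n_ch)]
    w_max = [max(abs(w) for w in W[c]) for c in range(n_ch)]
    # Smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    s = [(x_max[c] ** alpha) / (w_max[c] ** (1 - alpha)) for c in range(n_ch)]
    # Scale activations down and weights up by the same factor per channel.
    X_s = [[X[r][c] / s[c] for c in range(n_ch)] for r in range(len(X))]
    W_s = [[W[c][o] * s[c] for o in range(len(W[0]))] for c in range(n_ch)]
    return X_s, W_s

def matmul(A, B):
    return [[sum(A[r][k] * B[k][o] for k in range(len(B)))
             for o in range(len(B[0]))] for r in range(len(A))]

# Channel 0 carries an activation outlier; its weights are small.
X = [[100.0, 0.5], [-80.0, 0.3]]
W = [[0.02, -0.01], [1.5, 2.0]]
X_s, W_s = smooth(X, W, alpha=0.5)

# The matrix product is preserved up to floating-point error.
ref, out = matmul(X, W), matmul(X_s, W_s)
assert all(abs(ref[r][o] - out[r][o]) < 1e-9 for r in range(2) for o in range(2))
```

In practice Intel® Neural Compressor tunes α per model (which is exactly what the recipes below capture); this sketch only shows why the transform is accuracy-neutral before quantization.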
Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | ✔ | ✔ | ✔ |
THUDM/chatglm3-6b | WIP | ✔ | WIP |
Detailed recipes can be found HERE.
Notes:
- This model list comes from IPEX.
- The WIP recipes will be published soon.
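For intuition on the WOQ INT4 columns, here is a minimal sketch of symmetric 4-bit round-to-nearest weight quantization in pure Python. This is a simplified illustration of the basic scheme only; the actual recipes use per-group scales plus algorithms such as GPTQ or AutoRound on top of it.

```python
# Symmetric INT4 round-to-nearest quantization of one weight group.

def quant_int4(weights):
    """Map a group of float weights to signed 4-bit ints in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

group = [0.12, -0.53, 0.07, 0.91, -0.88, 0.35, -0.02, 0.6]
q, scale = quant_int4(group)
restored = dequant(q, scale)

# Round-to-nearest bounds the reconstruction error by half a step.
max_err = max(abs(a - b) for a, b in zip(group, restored))
assert max_err <= scale / 2 + 1e-12
```

Smaller groups give finer-grained scales and lower error at the cost of more metadata, which is one of the knobs the published recipes tune.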
Accuracy (ACC) is measured on the lambada_openai task; Ratio is the quantized accuracy divided by the FP32 baseline.

Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 69.07% | 1.0222 | 67.55% | 0.9997 | 68.12% | 1.0081 | 66.93% | 0.9905 |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.57% | 1.0568 | 71.57% | 1.0008 | 70.81% | 0.9902 | N/A | N/A |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 68.06% | 1.0058 | 67.61% | 0.9991 | 67.90% | 1.0034 | N/A | N/A |
bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.21% | 0.9972 | 46.90% | 1.0121 | N/A | N/A |
databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
EleutherAI/gpt-j-6b | 68.31% | 68.27% | 0.9994 | 68.27% | 0.9994 | 68.35% | 1.0006 | 68.02% | 0.9958 |
EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 71.74% | 0.9918 | N/A | N/A |
facebook/opt-1.3b | 57.89% | 57.68% | 0.9964 | 58.12% | 1.0040 | 58.26% | 1.0064 | N/A | N/A |
facebook/opt-30b | 71.49% | 71.78% | 1.0041 | 71.53% | 1.0006 | 71.59% | 1.0014 | 71.80% | 1.0043 |
meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.89% | 1.0016 | 77.66% | 1.0116 | 76.60% | 0.9978 |
meta-llama/Llama-2-70b-hf | 79.64% | 79.14% | 0.9937 | 79.62% | 0.9997 | 80.09% | 1.0057 | 79.68% | 1.0005 |
meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.90% | 0.9997 | 73.84% | 0.9989 | N/A | N/A |
mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 76.25% | 1.0046 | 75.74% | 0.9979 |
THUDM/chatglm2-6b | 53.23% | 52.86% | 0.9930 | 53.00% | 0.9957 | 52.90% | 0.9938 | 52.92% | 0.9942 |
THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | N/A | N/A |
tiiuae/falcon-40b | 77.22% | 76.95% | 0.9965 | 77.18% | 0.9995 | 77.55% | 1.0043 | 77.82% | 1.0078 |
tiiuae/falcon-7b | 74.67% | 76.63% | 1.0262 | 74.73% | 1.0008 | 75.06% | 1.0052 | 74.00% | 0.9910 |
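The Ratio columns above are simply the quantized accuracy divided by the FP32 baseline, which makes the 1% accuracy-loss target easy to check (a ratio ≥ 0.99 passes). For example, for EleutherAI/gpt-j-6b with SQ INT8:

```python
# Ratio = quantized ACC / FP32 ACC, using the gpt-j-6b row from the table.
fp32_acc = 0.6831
sq_int8_acc = 0.6827
ratio = sq_int8_acc / fp32_acc
print(round(ratio, 4))  # 0.9994, matching the table; >= 0.99 means < 1% loss
```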