Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for these popular LLMs, helping users quickly obtain an optimized LLM with less than 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs here in the future.
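To make the SmoothQuant idea concrete, here is a minimal pure-Python sketch (not the Intel® Neural Compressor API): activation outliers are migrated into the weights through a per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α), so that (X / s) · (s · W) equals X · W mathematically while both factors become easier to quantize. All names below are illustrative.

```python
# Sketch of SmoothQuant's difficulty migration (alpha balances how much
# of the activation outlier magnitude is moved into the weights).

def smooth(X, W, alpha=0.5):
    """X: activations [rows][channels]; W: weights [channels][out]."""
    n_ch = len(W)
    # Per-input-channel absolute maxima of activations and weights.
    x_max = [max(abs(X[r][c]) for r in range(len(X))) for c in range(n_ch)]
    w_max = [max(abs(w) for w in W[c]) for c in range(n_ch)]
    # Smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    s = [(x_max[c] ** alpha) / (w_max[c] ** (1 - alpha)) for c in range(n_ch)]
    # Scale activations down and weights up by the same factor per channel.
    X_s = [[X[r][c] / s[c] for c in range(n_ch)] for r in range(len(X))]
    W_s = [[W[c][o] * s[c] for o in range(len(W[0]))] for c in range(n_ch)]
    return X_s, W_s

def matmul(A, B):
    return [[sum(A[r][k] * B[k][o] for k in range(len(B)))
             for o in range(len(B[0]))] for r in range(len(A))]

# Channel 0 carries an activation outlier; its weights are small.
X = [[100.0, 0.5], [-80.0, 0.3]]
W = [[0.02, -0.01], [1.5, 2.0]]
X_s, W_s = smooth(X, W, alpha=0.5)

# The matrix product is preserved up to floating-point error.
ref, out = matmul(X, W), matmul(X_s, W_s)
assert all(abs(ref[r][o] - out[r][o]) < 1e-9 for r in range(2) for o in range(2))
```

In practice Intel® Neural Compressor tunes α per model (which is exactly what the recipes below capture); this sketch only shows why the transform is accuracy-neutral before quantization.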
Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | ✔ | ✔ | ✔ |
THUDM/chatglm3-6b | WIP | ✔ | WIP |
Detailed recipes can be found HERE.
Notes:
- This model list comes from IPEX.
- The WIP recipes will be published soon.
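For intuition on the WOQ INT4 columns, here is a minimal sketch of symmetric 4-bit round-to-nearest weight quantization in pure Python. This is a simplified illustration of the basic scheme only; the actual recipes use per-group scales plus algorithms such as GPTQ or AutoRound on top of it.

```python
# Symmetric INT4 round-to-nearest quantization of one weight group.

def quant_int4(weights):
    """Map a group of float weights to signed 4-bit ints in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

group = [0.12, -0.53, 0.07, 0.91, -0.88, 0.35, -0.02, 0.6]
q, scale = quant_int4(group)
restored = dequant(q, scale)

# Round-to-nearest bounds the reconstruction error by half a step.
max_err = max(abs(a - b) for a, b in zip(group, restored))
assert max_err <= scale / 2 + 1e-12
```

Smaller groups give finer-grained scales and lower error at the cost of more metadata, which is one of the knobs the published recipes tune.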
Accuracy (ACC) is measured on the lambada_openai task; Ratio is the quantized accuracy divided by the FP32 baseline.

Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 69.07% | 1.0222 | 67.55% | 0.9997 | 68.12% | 1.0081 | 66.93% | 0.9905 |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.57% | 1.0568 | 71.57% | 1.0008 | 70.81% | 0.9902 | N/A | N/A |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 68.06% | 1.0058 | 67.61% | 0.9991 | 67.90% | 1.0034 | N/A | N/A |
bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.21% | 0.9972 | 46.90% | 1.0121 | N/A | N/A |
databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
EleutherAI/gpt-j-6b | 68.31% | 68.27% | 0.9994 | 68.27% | 0.9994 | 68.35% | 1.0006 | 68.02% | 0.9958 |
EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 71.74% | 0.9918 | N/A | N/A |
facebook/opt-1.3b | 57.89% | 57.68% | 0.9964 | 58.12% | 1.0040 | 58.26% | 1.0064 | N/A | N/A |
facebook/opt-30b | 71.49% | 71.78% | 1.0041 | 71.53% | 1.0006 | 71.59% | 1.0014 | 71.80% | 1.0043 |
meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.89% | 1.0016 | 77.66% | 1.0116 | 76.60% | 0.9978 |
meta-llama/Llama-2-70b-hf | 79.64% | 79.14% | 0.9937 | 79.62% | 0.9997 | 80.09% | 1.0057 | 79.68% | 1.0005 |
meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.90% | 0.9997 | 73.84% | 0.9989 | N/A | N/A |
mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 76.25% | 1.0046 | 75.74% | 0.9979 |
THUDM/chatglm2-6b | 53.23% | 52.86% | 0.9930 | 53.00% | 0.9957 | 52.90% | 0.9938 | 52.92% | 0.9942 |
THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | N/A | N/A |
tiiuae/falcon-40b | 77.22% | 76.95% | 0.9965 | 77.18% | 0.9995 | 77.55% | 1.0043 | 77.82% | 1.0078 |
tiiuae/falcon-7b | 74.67% | 76.63% | 1.0262 | 74.73% | 1.0008 | 75.06% | 1.0052 | 74.00% | 0.9910 |
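The Ratio columns above are simply the quantized accuracy divided by the FP32 baseline, which makes the 1% accuracy-loss target easy to check (a ratio ≥ 0.99 passes). For example, for EleutherAI/gpt-j-6b with SQ INT8:

```python
# Ratio = quantized ACC / FP32 ACC, using the gpt-j-6b row from the table.
fp32_acc = 0.6831
sq_int8_acc = 0.6827
ratio = sq_int8_acc / fp32_acc
print(round(ratio, 4))  # 0.9994, matching the table; >= 0.99 means < 1% loss
```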