Added gemma specific fp8 quantization file (#1445)

huggingface · Oct 22, 2024 · 058e91c
1 parent fc54347
commit 058e91c

Showing 2 changed files with 37 additions and 0 deletions.
diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
@@ -451,6 +451,31 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_phi.json python run_generation.p
 --reuse_cache
 ```
 
+Here is an example of measuring the tensor quantization statistics for Gemma on 1 card:
+
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
+--model_name_or_path google/gemma-7b \
+--use_hpu_graphs \
+--use_kv_cache \
+--max_new_tokens 100 \
+--batch_size 1 \
+--reuse_cache \
+--bf16
+```
+
+Here is an example of quantizing Gemma on 1 card based on the previous measurements:
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_quant_gemma.json python run_generation.py \
+--model_name_or_path google/gemma-7b \
+--use_hpu_graphs \
+--use_kv_cache \
+--max_new_tokens 100 \
+--batch_size 1 \
+--reuse_cache \
+--bf16
+```
+
 
 ### Running FP8 models on single device
 

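The two README commands added above differ only in `QUANT_CONFIG`, and the second depends on the first having run. A minimal sketch of chaining them might look like the following. The `run_pass` helper and the `DRY_RUN` switch are hypothetical names introduced for illustration and are not part of `run_generation.py`; the flags are copied from the examples above.

```bash
# Sketch only: chains the measurement and quantization passes for Gemma on 1 card.
set -eu

# Flags shared by both passes, copied from the README examples.
COMMON_ARGS="--model_name_or_path google/gemma-7b --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --batch_size 1 --reuse_cache --bf16"

# DRY_RUN is a hypothetical switch for this sketch, not a run_generation.py option.
# It defaults to 1, so the commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run_pass() {
  config="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "QUANT_CONFIG=$config python run_generation.py $COMMON_ARGS"
  else
    # $COMMON_ARGS is left unquoted on purpose so it splits into separate flags.
    QUANT_CONFIG="$config" python run_generation.py $COMMON_ARGS
  fi
}

# Pass 1 collects maxabs statistics; pass 2 quantizes using those statistics.
run_pass ./quantization_config/maxabs_measure.json
run_pass ./quantization_config/maxabs_quant_gemma.json
```

Because of `set -eu`, a failed measurement pass stops the script before the quantization pass starts. Setting `DRY_RUN=0` in the environment would execute both commands instead of printing them.
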
diff --git a/examples/text-generation/quantization_config/maxabs_quant_gemma.json b/examples/text-generation/quantization_config/maxabs_quant_gemma.json
@@ -0,0 +1,12 @@
+{
+    "method": "HOOKS",
+    "mode": "QUANTIZE",
+    "observer": "maxabs",
+    "scale_method": "maxabs_hw",
+    "blocklist": {"types": [], "names":  [
+        "matmul_qk",
+        "matmul_av",
+        "lm_head"
+    ]},
+    "dump_stats_path": "./hqt_output/measure"
+}
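
The new config runs in `QUANTIZE` mode and points `dump_stats_path` at `./hqt_output/measure`, so it presumably expects statistics from an earlier measurement pass to be on disk already. The sketch below is a pre-flight check under one loudly stated assumption: that the measurement pass writes files whose paths begin with the `dump_stats_path` value. The `has_measurements` helper is a hypothetical name, not part of the repository.

```bash
# Sketch only: check for measurement output before running the quantization pass.
# Assumption: measurement files have paths that start with dump_stats_path.

# Returns 0 if at least one existing file starts with the given path prefix.
has_measurements() {
  prefix="$1"
  for f in "$prefix"*; do
    # If the glob matches nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Value taken from "dump_stats_path" in maxabs_quant_gemma.json.
STATS_PREFIX="./hqt_output/measure"

if has_measurements "$STATS_PREFIX"; then
  echo "measurements found under ${STATS_PREFIX}*"
else
  echo "no measurements found; run the maxabs_measure.json pass first"
fi
```

The check is run from `examples/text-generation`, since the path in the config is relative.
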