docs: update readme and docs (#757)
AlpinDale authored Sep 22, 2024
1 parent d730945 commit 5878e88
Showing 2 changed files with 34 additions and 6 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -13,14 +13,16 @@ The compute necessary for Aphrodite's development is provided by [Arc Compute](h


## 🔥 News
(09/2024) v0.6.1 is here. You can now load FP16 models in FP2 through FP7 quant formats to achieve extremely high throughput and save on memory.

(09/2024) v0.6.0 is released, with huge throughput improvements, many new quant formats (including fp8 and llm-compressor), asymmetric tensor parallel, pipeline parallel and more! Please check out the [exhaustive documentation](https://aphrodite.pygmalion.chat) for the User and Developer guides.

## Features

- Continuous Batching
- Efficient K/V management with [PagedAttention](./aphrodite/modeling/layers/attention.py) from vLLM
- Optimized CUDA kernels for improved inference
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, Smoothquant+, SqueezeLLM, Marlin, FP4, FP6, FP8, FP12
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, Smoothquant+, SqueezeLLM, Marlin, FP2-FP12
- Distributed inference
- 8-bit KV Cache for higher context lengths and throughput, in both FP8 E5M2 and E4M3 formats.

@@ -29,7 +31,7 @@ The compute necessary for Aphrodite's development is provided by [Arc Compute](h

Install the engine:
```sh
pip install -U aphrodite-engine==0.6.0
pip install -U aphrodite-engine
```

Then launch a model:
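A minimal launch sketch (the model name below is only a placeholder; substitute your own local path or Hugging Face model ID):

```sh
# placeholder model ID -- swap in your own model path or Hugging Face repo
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
```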
34 changes: 30 additions & 4 deletions docs/pages/quantization/quantization-methods.md
Expand Up @@ -51,7 +51,7 @@ python main.py $MODEL_PATH $DATASET_PATH \
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model $SAVE_PATH
aphrodite run $SAVE_PATH
```


@@ -94,7 +94,7 @@ model.save_pretrained(f"{model_id}-AWQ")
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/model-AWQ
aphrodite run /path/to/model-AWQ
```

:::tip
@@ -167,7 +167,7 @@ model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
Then, you can load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/quantized/model
aphrodite run /path/to/quantized/model
```

## FBGEMM_FP8
@@ -239,7 +239,7 @@ model.save_pretrained(f"{model_id}-GPTQ")
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/model-GPTQ
aphrodite run /path/to/model-GPTQ
```
:::tip
By default, Aphrodite will load GPTQ models using the Marlin kernels for high throughput. If this is undesirable, you can use the `-q gptq` flag to load the model using the GPTQ library instead.
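For example, a hypothetical invocation that forces the GPTQ kernels instead of the default Marlin path might look like this (the path is a placeholder):

```sh
# placeholder path; -q gptq bypasses the default Marlin kernels
aphrodite run /path/to/model-GPTQ -q gptq
```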
@@ -253,6 +253,32 @@ Reference:

Aphrodite supports quants produced by LLM Compressor. Please refer to their repo for instructions on generating these quants.
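Loading should follow the same pattern as the other methods; a sketch, assuming a local directory produced by LLM Compressor:

```sh
# placeholder path to an LLM Compressor output directory
aphrodite run /path/to/llm-compressor-model
```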

## Quant-LLM
Reference:
- [GitHub](https://github.com/usyd-fsalab/fp6_llm)
- [Paper](https://arxiv.org/abs/2401.14112)

Aphrodite supports loading FP16 models and quantizing them at runtime to FP2, FP3, FP4, FP5, FP6, or FP7 using the Quant-LLM method, achieving extremely high throughput.

To load a model with Quant-LLM quantization, you can simply run:
```sh
aphrodite run <fp16 model> -q fpX
```

Here, `X` is the desired weight bit width: 2, 3, 4, 5, 6, or 7 (although 2- and 3-bit quantization are not recommended due to significant accuracy loss).

We also provide fine-grained control over the exact exponent-mantissa combination, in case you want to experiment with other formats:

```sh
aphrodite run <fp16 model> -q quant_llm --quant-llm-exp-bits 4
```

The valid values for `--quant-llm-exp-bits` are 1, 2, 3, 4, and 5. The heuristic we use to determine the mantissa bits is `weight_bits - exp_bits - 1`, so make sure the value you provide does not result in a negative mantissa.
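As a quick sanity check of that heuristic (plain shell arithmetic, not an Aphrodite command; the values below are just an example):

```sh
# example values: fp7 weights with --quant-llm-exp-bits 4
weight_bits=7
exp_bits=4
mantissa_bits=$(( weight_bits - exp_bits - 1 ))   # 7 - 4 - 1 = 2, i.e. an E4M2 format
echo "fp${weight_bits} with E${exp_bits} -> M${mantissa_bits}"
```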

See [here](https://gist.github.com/AlpinDale/17babab5be16f522d4d3b134e171001a) for a list of all valid combinations.

For accuracy and throughput benchmarks, see [here](https://github.com/PygmalionAI/aphrodite-engine/pull/755).

## The other methods

Aphrodite also supports the following quantization methods: