docs: update readme and docs (#757)
AlpinDale authored Sep 22, 2024
1 parent d730945 commit 5878e88
Showing 2 changed files with 34 additions and 6 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -13,14 +13,16 @@ The compute necessary for Aphrodite's development is provided by [Arc Compute](h


## 🔥 News
(09/2024) v0.6.1 is here. You can now load FP16 models in FP2 through FP7 quant formats to achieve extremely high throughput and save on memory.

(09/2024) v0.6.0 is released, with huge throughput improvements, many new quant formats (including fp8 and llm-compressor), asymmetric tensor parallel, pipeline parallel and more! Please check out the [exhaustive documentation](https://aphrodite.pygmalion.chat) for the User and Developer guides.

## Features

- Continuous Batching
- Efficient K/V management with [PagedAttention](./aphrodite/modeling/layers/attention.py) from vLLM
- Optimized CUDA kernels for improved inference
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, Smoothquant+, SqueezeLLM, Marlin, FP4, FP6, FP8, FP12
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, Smoothquant+, SqueezeLLM, Marlin, FP2-FP12
- Distributed inference
- 8-bit KV Cache for higher context lengths and throughput, in both FP8 E5M2 and E4M3 formats.

@@ -29,7 +31,7 @@ The compute necessary for Aphrodite's development is provided by [Arc Compute](h

Install the engine:
```sh
pip install -U aphrodite-engine==0.6.0
pip install -U aphrodite-engine
```

Then launch a model:
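A minimal launch sketch (the model name below is only a placeholder; substitute your own local path or Hugging Face model ID):

```sh
# placeholder model ID -- swap in your own model path or Hugging Face repo
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
```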
34 changes: 30 additions & 4 deletions docs/pages/quantization/quantization-methods.md
Expand Up @@ -51,7 +51,7 @@ python main.py $MODEL_PATH $DATASET_PATH \
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model $SAVE_PATH
aphrodite run $SAVE_PATH
```


@@ -94,7 +94,7 @@ model.save_pretrained(f"{model_id}-AWQ")
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/model-AWQ
aphrodite run /path/to/model-AWQ
```

:::tip
@@ -167,7 +167,7 @@ model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
Then, you can load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/quantized/model
aphrodite run /path/to/quantized/model
```

## FBGEMM_FP8
@@ -239,7 +239,7 @@ model.save_pretrained(f"{model_id}-GPTQ")
You can then load the quantized model for inference using Aphrodite:

```sh
aphrodite run --model /path/to/model-GPTQ
aphrodite run /path/to/model-GPTQ
```
:::tip
By default, Aphrodite will load GPTQ models using the Marlin kernels for high throughput. If this is undesirable, you can use the `-q gptq` flag to load the model using the GPTQ library instead.
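For example, a hypothetical invocation that forces the GPTQ kernels instead of the default Marlin path might look like this (the path is a placeholder):

```sh
# placeholder path; -q gptq bypasses the default Marlin kernels
aphrodite run /path/to/model-GPTQ -q gptq
```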
@@ -253,6 +253,32 @@ Reference:

Aphrodite supports quants produced by LLM Compressor. Please refer to their repo for instructions on generating these quants.
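Loading should follow the same pattern as the other methods; a sketch, assuming a local directory produced by LLM Compressor:

```sh
# placeholder path to an LLM Compressor output directory
aphrodite run /path/to/llm-compressor-model
```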

## Quant-LLM
Reference:
- [GitHub](https://github.com/usyd-fsalab/fp6_llm)
- [Paper](https://arxiv.org/abs/2401.14112)

Aphrodite supports loading FP16 models and quantizing them at runtime to FP2, FP3, FP4, FP5, FP6, or FP7 using the Quant-LLM method, achieving extremely high throughput.

To load a model with Quant-LLM quantization, you can simply run:
```sh
aphrodite run <fp16 model> -q fpX
```

Here, `X` is the desired weight bit width: 2, 3, 4, 5, 6, or 7 (although 2- and 3-bit quantization are not recommended due to significant accuracy loss).

We also provide fine-grained control over the exact exponent-mantissa combination, in case you want to experiment with other formats:

```sh
aphrodite run <fp16 model> -q quant_llm --quant-llm-exp-bits 4
```

The valid values for `--quant-llm-exp-bits` are 1, 2, 3, 4, and 5. The heuristic we use to determine the mantissa bits is `weight_bits - exp_bits - 1`, so make sure the value you provide does not result in a negative mantissa.
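As a quick sanity check of that heuristic (plain shell arithmetic, not an Aphrodite command; the values below are just an example):

```sh
# example values: fp7 weights with --quant-llm-exp-bits 4
weight_bits=7
exp_bits=4
mantissa_bits=$(( weight_bits - exp_bits - 1 ))   # 7 - 4 - 1 = 2, i.e. an E4M2 format
echo "fp${weight_bits} with E${exp_bits} -> M${mantissa_bits}"
```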

See [here](https://gist.github.com/AlpinDale/17babab5be16f522d4d3b134e171001a) for a list of all valid combinations.

For accuracy and throughput benchmarks, see [here](https://github.com/PygmalionAI/aphrodite-engine/pull/755).

## The other methods

Aphrodite also supports the following quantization methods: