Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added guided decoding support with the XGrammar backend. A hedged usage sketch is included after this list.
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added Ulysses context parallel support. Refer to an example of building LLaMA 7B with 2-way tensor parallelism and 2-way context parallelism in `examples/llama/README.md`.
- Added W4A8 quantization support for BF16 models on Ada (SM89).
- Added PDL support for the FP8 GEMM plugins.
- Added a runtime `max_num_tokens` dynamic tuning feature, which can be enabled by passing `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added typical acceptance support for EAGLE.
- Supported enabling chunked context and sliding window attention together.
- Added head size 64 support for the XQA kernel.
- Added the following features to the LLM API:
  - Lookahead decoding.
  - DeepSeek V1 support.
  - Medusa support.
  - `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters.
  - `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added LogN scaling support for Qwen models.
- Added `AutoAWQ` checkpoints support for Qwen. Refer to the “INT4-AWQ” section in `examples/qwen/README.md`.
- Added `AutoAWQ` and `AutoGPTQ` Hugging Face checkpoints support for LLaMA. (#2458)
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout.
- [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
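For reference, here is a minimal sketch of JSON-constrained generation through the LLM API with the XGrammar backend. The model name is a placeholder, and `GuidedDecodingParams`, the `guided_decoding` sampling parameter, and the `guided_decoding_backend` argument are assumptions about how the feature is exposed; check the guided decoding example in the repository for the exact entry points in this release.

```python
# Hedged sketch: guided (JSON-constrained) decoding with the XGrammar backend.
# GuidedDecodingParams, guided_decoding, and guided_decoding_backend are assumed
# names; verify them against the LLM API docs/examples for this release.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams  # assumed import path


def main():
    llm = LLM(
        model="Qwen/Qwen2.5-1.5B-Instruct",   # placeholder model
        guided_decoding_backend="xgrammar",   # assumed backend selector
    )
    schema = '{"type": "object", "properties": {"answer": {"type": "string"}}}'
    params = SamplingParams(
        max_tokens=64,
        guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to the schema
    )
    for output in llm.generate(["Reply with a JSON object."], params):
        print(output.outputs[0].text)


if __name__ == "__main__":  # required entry point for the LLM API (see API Changes)
    main()
```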
API Changes
- [BREAKING CHANGE] Removed the `enable_xqa` argument from `trtllm-build`.
- [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN-based models.
- [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the `--use_embedding_sharing` flag from the convert checkpoint scripts.
- [BREAKING CHANGE] The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API. See the sketch after this list.
- [BREAKING CHANGE] Cancelled requests now return empty results.
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API.
- Integrated BERT and RoBERTa models into the `trtllm-build` command.
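As a reminder of the new entry-point requirement, below is a minimal sketch of an LLM API script. The model name and argument values are placeholders; `enable_chunked_prefill`, `max_batch_size`, and `max_num_tokens` are passed as keyword arguments on the assumption that `LlmArgs` accepts them directly, so confirm the exact names against the API reference for this release.

```python
# Minimal sketch of an LLM API script with the required __main__ guard.
# Keyword argument names/values below are placeholders or assumptions; see LlmArgs.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        enable_chunked_prefill=True,                 # new LlmArgs flag in this release
        max_batch_size=8,                            # runtime parameter (see Key Features)
        max_num_tokens=2048,                         # runtime parameter (see Key Features)
    )
    for output in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
        print(output.outputs[0].text)


# The guard is mandatory for both single-GPU and multi-GPU runs because the
# LLM API may spawn worker processes that re-import this module.
if __name__ == "__main__":
    main()
```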
Model Updates
- Added Qwen2-VL support. Refer to the “Qwen2-VL” section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples. Refer to `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in #1514.
Fixed Issues
- Fixed unnecessary batch logits post processor calls. (#2439)
- Fixed a typo in the error message. (#2473)
- Fixed the in-place clamp operation usage in smooth quant. Thanks for the contribution from @StarrickLiu in #2485.
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API. Thanks for the contribution from @mfuntowicz in #2573.
Infrastructure Changes
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.11-py3`.
- Updated the base Docker image for the TensorRT-LLM Backend to `nvcr.io/nvidia/tritonserver:24.11-py3`.
- Updated to TensorRT v10.7.
- Updated to CUDA v12.6.3.
- Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
- Updated to ModelOpt v0.21 on the Linux platform, while v0.17 is still used on the Windows platform.
Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team