Hi,
We are very pleased to announce the 0.16.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added guided decoding support with the XGrammar backend. A hedged usage sketch is included after this list.
- Added quantization support for RecurrentGemma. Refer to `examples/recurrentgemma/README.md`.
- Added Ulysses context parallel support. Refer to an example of building LLaMA 7B with 2-way tensor parallelism and 2-way context parallelism in `examples/llama/README.md`.
- Added W4A8 quantization support for BF16 models on Ada (SM89).
- Added PDL support for the FP8 GEMM plugins.
- Added a runtime `max_num_tokens` dynamic tuning feature, which can be enabled by passing `--enable_max_num_tokens_tuning` to `gptManagerBenchmark`.
- Added typical acceptance support for EAGLE.
- Supported enabling chunked context and sliding window attention together.
- Added head size 64 support for the XQA kernel.
- Added the following features to the LLM API:
  - Lookahead decoding.
  - DeepSeek V1 support.
  - Medusa support.
  - `max_num_tokens` and `max_batch_size` arguments to control the runtime parameters.
  - `extended_runtime_perf_knob_config` to enable various performance configurations.
- Added LogN scaling support for Qwen models.
- Added `AutoAWQ` checkpoints support for Qwen. Refer to the “INT4-AWQ” section in `examples/qwen/README.md`.
- Added `AutoAWQ` and `AutoGPTQ` Hugging Face checkpoints support for LLaMA. (#2458)
- Added `allottedTimeMs` to the C++ `Request` class to support per-request timeout.
- [BREAKING CHANGE] Removed NVIDIA V100 GPU support.
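For reference, here is a minimal sketch of JSON-constrained generation through the LLM API with the XGrammar backend. The model name is a placeholder, and `GuidedDecodingParams`, the `guided_decoding` sampling parameter, and the `guided_decoding_backend` argument are assumptions about how the feature is exposed; check the guided decoding example in the repository for the exact entry points in this release.

```python
# Hedged sketch: guided (JSON-constrained) decoding with the XGrammar backend.
# GuidedDecodingParams, guided_decoding, and guided_decoding_backend are assumed
# names; verify them against the LLM API docs/examples for this release.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams  # assumed import path


def main():
    llm = LLM(
        model="Qwen/Qwen2.5-1.5B-Instruct",   # placeholder model
        guided_decoding_backend="xgrammar",   # assumed backend selector
    )
    schema = '{"type": "object", "properties": {"answer": {"type": "string"}}}'
    params = SamplingParams(
        max_tokens=64,
        guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to the schema
    )
    for output in llm.generate(["Reply with a JSON object."], params):
        print(output.outputs[0].text)


if __name__ == "__main__":  # required entry point for the LLM API (see API Changes)
    main()
```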
API Changes
- [BREAKING CHANGE] Removed the `enable_xqa` argument from `trtllm-build`.
- [BREAKING CHANGE] Chunked context is enabled by default when KV cache and paged context FMHA are enabled on non-RNN-based models.
- [BREAKING CHANGE] Enabled embedding sharing automatically when possible and removed the `--use_embedding_sharing` flag from the convert checkpoint scripts.
- [BREAKING CHANGE] The `if __name__ == "__main__"` entry point is required for both single-GPU and multi-GPU cases when using the `LLM` API. See the sketch after this list.
- [BREAKING CHANGE] Cancelled requests now return empty results.
- Added the `enable_chunked_prefill` flag to the `LlmArgs` of the `LLM` API.
- Integrated BERT and RoBERTa models into the `trtllm-build` command.
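As a reminder of the new entry-point requirement, below is a minimal sketch of an LLM API script. The model name and argument values are placeholders; `enable_chunked_prefill`, `max_batch_size`, and `max_num_tokens` are passed as keyword arguments on the assumption that `LlmArgs` accepts them directly, so confirm the exact names against the API reference for this release.

```python
# Minimal sketch of an LLM API script with the required __main__ guard.
# Keyword argument names/values below are placeholders or assumptions; see LlmArgs.
from tensorrt_llm import LLM, SamplingParams


def main():
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
        enable_chunked_prefill=True,                 # new LlmArgs flag in this release
        max_batch_size=8,                            # runtime parameter (see Key Features)
        max_num_tokens=2048,                         # runtime parameter (see Key Features)
    )
    for output in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
        print(output.outputs[0].text)


# The guard is mandatory for both single-GPU and multi-GPU runs because the
# LLM API may spawn worker processes that re-import this module.
if __name__ == "__main__":
    main()
```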
Model Updates
- Added Qwen2-VL support. Refer to the “Qwen2-VL” section of `examples/multimodal/README.md`.
- Added multimodal evaluation examples. Refer to `examples/multimodal`.
- Added Stable Diffusion XL support. Refer to `examples/sdxl/README.md`. Thanks for the contribution from @Zars19 in #1514.
Fixed Issues
- Fixed unnecessary batch logits post processor calls. (#2439)
- Fixed a typo in the error message. (#2473)
- Fixed the in-place clamp operation usage in smooth quant. Thanks for the contribution from @StarrickLiu in #2485.
- Fixed `sampling_params` to only be set up if `end_id` is None and `tokenizer` is not None in the `LLM` API. Thanks for the contribution from @mfuntowicz in #2573.
Infrastructure Changes
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.11-py3`.
- Updated the base Docker image for the TensorRT-LLM Backend to `nvcr.io/nvidia/tritonserver:24.11-py3`.
- Updated to TensorRT v10.7.
- Updated to CUDA v12.6.3.
- Added support for Python 3.10 and 3.12 to TensorRT-LLM Python wheels on PyPI.
- Updated to ModelOpt v0.21 on the Linux platform, while v0.17 is still used on the Windows platform.
Known Issues
- There is a known AllReduce performance issue on AMD-based CPU platforms with NCCL 2.23.4, which can be worked around by setting `export NCCL_P2P_LEVEL=SYS`.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team