TensorRT-LLM 0.13.0 Release #2270

Shixiaowei02 · 2024-09-30T08:37:56Z

Shixiaowei02
Sep 30, 2024
Maintainer

Hi,

We are very pleased to announce the 0.13.0 version of TensorRT-LLM. This update includes:

Key Features and Enhancements

Supported lookahead decoding (experimental), see docs/source/speculative_decoding.md.
Added some enhancements to the ModelWeightsLoader (a unified checkpoint converter, see docs/source/architecture/model-weights-loader.md).
- Supported Qwen models.
- Supported auto-padding for indivisible TP shape in INT4-wo/INT8-wo/INT4-GPTQ.
- Improved performance on *.bin and *.pth.
Supported OpenAI Whisper in C++ runtime.
Added some enhancements to the LLM class.
- Supported LoRA.
- Supported engine building using dummy weights.
- Supported trust_remote_code for customized models and tokenizers downloaded from Hugging Face Hub.
Supported beam search for streaming mode.
Supported tensor parallelism for Mamba2.
Supported returning generation logits for streaming mode.
Added curand and bfloat16 support for ReDrafter.
Added sparse mixer normalization mode for MoE models.
Added support for QKV scaling in FP8 FMHA.
Supported FP8 for MoE LoRA.
Supported KV cache reuse for P-Tuning and LoRA.
Supported in-flight batching for CogVLM models.
Supported LoRA for the ModelRunnerCpp class.
Supported head_size=48 cases for FMHA kernels.
Added FP8 examples for DiT models, see examples/dit/README.md.
Supported decoder with encoder input features for the C++ executor API.

API Changes

[BREAKING CHANGE] Set use_fused_mlp to True by default.
[BREAKING CHANGE] Enabled multi_block_mode by default.
[BREAKING CHANGE] Enabled strongly_typed by default in builder API.
[BREAKING CHANGE] Renamed maxNewTokens, randomSeed and minLength to maxTokens, seed and minTokens following OpenAI style.
The LLM class
- [BREAKING CHANGE] Updated LLM.generate arguments to include PromptInputs and tqdm.
The C++ executor API
- [BREAKING CHANGE] Added LogitsPostProcessorConfig.
- Added FinishReason to Result.

Model Updates

Supported Gemma 2, see "Run Gemma 2" section in examples/gemma/README.md.

Fixed Issues

Fixed an accuracy issue when enabling remove padding issue for cross attention. (T5 model, large difference in results when remove_input_padding is enabled #1999)
Fixed the failure in converting qwen2-0.5b-instruct when using smoothquant. (convert qwen2-0.5b-instruct failed when using smoothquant #2087)
Matched the exclude_modules pattern in convert_utils.py to the changes in quantize.py. ([Fix] Match exclude_modules pattern in convert_utils.py to quantize.py changes. #2113)
Fixed build engine error when FORCE_NCCL_ALL_REDUCE_STRATEGY is set.
Fixed unexpected truncation in the quant mode of gpt_attention.
Fixed the hang caused by race condition when canceling requests.
Fixed the default factory for LoraConfig. (in python 3.11 and release 0.8.0: ValueError: mutable default <class 'datasets.utils.version.Version'> for field version is not allowed: use default_factory #1323)

Infrastructure Changes

Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
The dependent TensorRT version is updated to 10.4.0.
The dependent CUDA version is updated to 12.5.1.
The dependent PyTorch version is updated to 2.4.0.
The dependent ModelOpt version is updated to v0.15.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

This discussion was created from the release TensorRT-LLM 0.13.0 Release.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TensorRT-LLM 0.13.0 Release #2270

{{title}}

Replies: 0 comments

Select a reply

TensorRT-LLM 0.13.0 Release #2270

Shixiaowei02 Sep 30, 2024 Maintainer

Key Features and Enhancements

API Changes

Model Updates

Fixed Issues

Infrastructure Changes

Replies: 0 comments

Shixiaowei02
Sep 30, 2024
Maintainer