TensorRT-LLM
An open-source NVIDIA library that accelerates large language model (LLM) inference on NVIDIA GPUs through a simplified Python API and state-of-the-art optimizations.
TensorRT-LLM is NVIDIA's core toolkit for high-performance LLM inference on NVIDIA hardware (H100, A100, and other GPUs). It provides a PyTorch-native, easy-to-use Python API for defining LLMs and building highly optimized TensorRT engines, with deployments scaling from a single GPU to multi-node clusters. The library incorporates critical optimizations: custom attention kernels, in-flight batching, paged KV caching, and quantization techniques such as FP8 and INT4 AWQ. These deliver substantial performance gains; NVIDIA reports, for example, up to 3x higher Llama 3.3 70B inference throughput with speculative decoding. It is the definitive solution for deploying fast, efficient generative AI models at scale.
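The paged KV caching mentioned above can be illustrated with a short sketch: instead of reserving one contiguous KV slab per sequence, memory is carved into fixed-size blocks and each sequence keeps a small "page table" of block IDs, so memory is allocated on demand and reclaimed when a request finishes. This is a conceptual toy in plain Python, not TensorRT-LLM's actual data structures; `BLOCK_SIZE`, the class, and all method names are assumptions for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)

class PagedKVCache:
    """Toy paged KV-cache allocator: blocks are handed out on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.page_tables = {}  # seq_id -> list of block IDs owned by that sequence

    def append_token(self, seq_id, pos):
        """Reserve KV storage for the token at position `pos` of a sequence,
        returning the physical slot index where its K/V vectors would live."""
        table = self.page_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                 # crossed a block boundary:
            table.append(self.free_blocks.pop(0))  # allocate a fresh block
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
```

Because blocks are recycled as soon as a request completes, many concurrent sequences of very different lengths can share one fixed memory pool; this is what makes paged caching a natural companion to in-flight batching.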