TensorRT-LLM
An open-source NVIDIA library that accelerates large language model (LLM) inference on NVIDIA GPUs through a simplified Python API and state-of-the-art optimizations.
TensorRT-LLM is NVIDIA's core toolkit for high-performance LLM inference on NVIDIA hardware (H100, A100, and other GPUs). It provides a PyTorch-native, easy-to-use Python API for defining LLMs and building highly optimized TensorRT engines, with deployments scaling from a single GPU to multi-node clusters. The library incorporates critical optimizations: custom attention kernels, in-flight batching, paged KV caching, and quantization techniques such as FP8 and INT4 AWQ. These deliver substantial performance gains; NVIDIA reports, for example, up to 3x higher Llama 3.3 70B inference throughput with speculative decoding. It is the definitive solution for deploying fast, efficient generative AI models at scale.
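The paged KV caching mentioned above can be illustrated with a short sketch: instead of reserving one contiguous KV slab per sequence, memory is carved into fixed-size blocks and each sequence keeps a small "page table" of block IDs, so memory is allocated on demand and reclaimed when a request finishes. This is a conceptual toy in plain Python, not TensorRT-LLM's actual data structures; `BLOCK_SIZE`, the class, and all method names are assumptions for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)

class PagedKVCache:
    """Toy paged KV-cache allocator: blocks are handed out on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.page_tables = {}  # seq_id -> list of block IDs owned by that sequence

    def append_token(self, seq_id, pos):
        """Reserve KV storage for the token at position `pos` of a sequence,
        returning the physical slot index where its K/V vectors would live."""
        table = self.page_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                 # crossed a block boundary:
            table.append(self.free_blocks.pop(0))  # allocate a fresh block
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
```

Because blocks are recycled as soon as a request completes, many concurrent sequences of very different lengths can share one fixed memory pool; this is what makes paged caching a natural companion to in-flight batching.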