FlashAttention-2
FlashAttention-2 re-engineers the attention kernel's parallelism and work partitioning, reaching up to 230 TFLOPS on A100 GPUs and roughly doubling the speed of the original FlashAttention.
Tri Dao's FlashAttention-2 builds on the original FlashAttention approach of keeping the attention computation in fast on-chip SRAM instead of repeatedly reading and writing high-bandwidth memory (HBM). By reducing non-matmul work in the online softmax, parallelizing over the sequence-length dimension to increase occupancy, and improving work partitioning across warps on NVIDIA Ampere and Hopper GPUs, the implementation achieves up to roughly 70% of theoretical peak throughput. It supports head dimensions up to 256, handles causal masking efficiently, and keeps memory linear in sequence length, enabling researchers to scale Transformer context windows to 32k tokens or more without the O(N^2) memory cost of standard attention.
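The core idea described above, computing attention tile by tile with an online (streaming) softmax so the full N x N score matrix is never materialized, can be sketched in plain NumPy. This is only an illustrative reference: the function and parameter names (`tiled_attention`, `block_size`) are hypothetical, and the actual FlashAttention-2 kernel is fused CUDA code that also tiles the query dimension, keeps tiles in SRAM, and partitions work across thread blocks and warps.

```python
import numpy as np

def tiled_attention(q, k, v, block_size=64, causal=False):
    """Block-wise attention with an online softmax (single head).

    Processes keys/values in tiles and keeps running per-row max and
    softmax-denominator statistics, so only one (n, block_size) score
    tile exists at a time instead of the full (n, n) matrix.
    q, k, v: (seq_len, head_dim) arrays.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))        # running (unnormalized) output
    row_max = np.full(n, -np.inf)          # running max per query row
    row_sum = np.zeros(n)                  # running softmax denominator

    for start in range(0, n, block_size):
        kb = k[start:start + block_size]   # key tile
        vb = v[start:start + block_size]   # value tile
        scores = (q @ kb.T) * scale        # (n, block) partial scores
        if causal:
            cols = np.arange(start, start + kb.shape[0])
            mask = cols[None, :] <= np.arange(n)[:, None]
            scores = np.where(mask, scores, -np.inf)

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale earlier tiles
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 64))
    k = rng.standard_normal((256, 64))
    v = rng.standard_normal((256, 64))
    # dense softmax attention as a correctness reference
    s = (q @ k.T) / np.sqrt(64)
    ref = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(tiled_attention(q, k, v), ref)
```

Because the running max and denominator are corrected as each new tile arrives, the result matches dense softmax attention exactly while memory stays linear in sequence length, which is the property that makes long context windows practical.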