Summary: HyperAttention: Long-context Attention in Near-Linear Time (arxiv.org)
9,216 words - PDF document
One Line
Yale and Google developed HyperAttention, an approximate attention mechanism that uses Locality Sensitive Hashing to run substantially faster than FlashAttention on long sequences.
Key Points
- Researchers have developed an approximate attention mechanism called "HyperAttention" to address computational challenges faced by large language models when dealing with long contexts.
- HyperAttention outperforms existing methods like FlashAttention, providing significant speed improvements.
- The algorithm achieves a practical and efficient near-linear time guarantee for attention approximation without requiring bounded entries or stable rank assumptions.
- HyperAttention maintains performance levels comparable to exact computation and has the potential to scale self-attention for both inference and training in significantly long sequences.
- The paper's reference list includes related work on accelerating vision transformers with a linear Taylor attention mechanism that combines low-rank and sparse approximation techniques.
Summaries
16 word summary
Yale and Google created HyperAttention, an approximate attention mechanism that surpasses FlashAttention using Locality Sensitive Hashing.
77 word summary
Yale University and Google Research have developed HyperAttention, an approximate attention mechanism that outperforms existing methods like FlashAttention. It uses Locality Sensitive Hashing (LSH) to identify large entries, providing significant speed improvements. HyperAttention achieves a near-linear time guarantee for attention approximation, supports causal masking, and does not require bounded entries or stable rank assumptions. It demonstrates over a 50x acceleration in forward and backward propagation for sequence lengths of 131k while maintaining performance levels similar to exact computation.
130 word summary
Yale University and Google Research have developed HyperAttention, an approximate attention mechanism that addresses computational challenges in large language models dealing with long contexts. HyperAttention outperforms existing methods like FlashAttention by using Locality Sensitive Hashing (LSH) to identify large entries, providing significant speed improvements. It achieves a near-linear time guarantee for attention approximation, supports causal masking, and does not require bounded entries or stable rank assumptions. The algorithm involves finding large entries of the attention matrix through a black box method and using fine-grained parameters to analyze time complexity. HyperAttention demonstrates over a 50x acceleration in forward and backward propagation for sequence lengths of 131k while maintaining performance levels similar to exact computation. It significantly improves inference and training speeds with minimal performance degradation and offers versatility in various tasks.
427 word summary
Yale University and Google Research have developed an approximate attention mechanism called "HyperAttention" to address computational challenges in large language models dealing with long contexts. HyperAttention features a modular design that integrates other fast low-level implementations, including FlashAttention. Using Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods like FlashAttention, providing significant speed improvements. The researchers validated its performance on various long-context length datasets, demonstrating faster inference times while maintaining comparable performance levels.
Transformers face scalability limitations due to the quadratic complexity of their attention layers. This work presents an algorithm called HyperAttention that achieves a practical and efficient near-linear time guarantee for attention approximation. It supports causal masking and does not require bounded entries or stable rank assumptions. The algorithm involves finding large entries of the attention matrix through a black box method and using fine-grained parameters to analyze time complexity. Empirically, HyperAttention demonstrates significant speed improvements, achieving over a 50x acceleration in forward and backward propagation for sequence lengths of 131k while maintaining performance levels similar to exact computation.
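For reference, the objects involved can be written in the standard softmax-attention notation (a sketch of the usual setup, with any scaling factors folded into the query and key matrices, not a quotation of the paper's exact definitions):

```latex
\begin{align*}
A &= \exp\!\big(QK^{\top}\big) && \text{unnormalized $n\times n$ attention matrix (entrywise exponential)}\\
D &= \operatorname{diag}\!\big(A\,\mathbf{1}_n\big) && \text{diagonal matrix of the row sums of $A$}\\
\operatorname{Att} &= D^{-1}AV && \text{normalized attention output for values $V\in\mathbb{R}^{n\times d}$}
\end{align*}
```

Forming A exactly already takes time quadratic in the sequence length n, which is the cost HyperAttention avoids by approximating D and the product D⁻¹AV.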
The robustness of HyperAttention on different tasks is investigated, and it is discovered that summarization and code completion tasks are more resilient to approximate attention layers than question answering. The concept of sortLSH, a variant of angular locality-sensitive hashing, is introduced to efficiently identify dominant entries within the attention matrix. The algorithm sorts keys and queries based on their hash buckets, allowing large entries to be captured by computing equal-sized blocks along the diagonal.
To approximate the diagonal scaling matrix D, a two-step procedure is proposed. First, sortLSH is used to identify dominant entries within the attention matrix, and then a small subset of keys is randomly selected. This simple approach provides a spectral approximation guarantee for the estimated matrix D. To approximate the matrix product between D⁻¹A and the value matrix V, sampling based on the squared row norms of V is used.
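Here "spectral approximation" can be read in the usual operator-norm sense; this is a hedged paraphrase rather than the paper's exact theorem statement:

```latex
(1-\varepsilon)\,D \;\preceq\; \widetilde{D} \;\preceq\; (1+\varepsilon)\,D,
\qquad\text{equivalently}\qquad
(1-\varepsilon)\,D_{ii} \;\le\; \widetilde{D}_{ii} \;\le\; (1+\varepsilon)\,D_{ii}\quad\text{for all } i,
```

since D is diagonal, so the guarantee amounts to estimating every row sum of the attention matrix up to a (1 ± ε) factor.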
The performance of HyperAttention is evaluated by integrating it into existing LLMs and measuring perplexity and speedup. HyperAttention significantly improves inference and training speeds with minimal performance degradation. Compared to exact computation with FlashAttention, runtime is up to 54x faster without causal masking and 5.4x faster with causal masking for sequence lengths up to 131k.
In conclusion, HyperAttention is an efficient algorithm that provides a near-linear time approximation for attention mechanisms in large language models. It offers significant speed improvements while maintaining performance levels comparable to exact computation. The algorithm is versatile, supports causal masking, and does not require bounded entries or stable rank assumptions.
623 word summary
Yale University and Google Research have developed an approximate attention mechanism called "HyperAttention" to address computational challenges in large language models (LLMs) dealing with long contexts. Previous work has shown that quadratic time is necessary for attention layers unless the attention matrix has bounded entries or low stable rank. The researchers introduced two parameters to measure the hardness of the problem: the max column norm in the normalized attention matrix and the ratio of row norms in the unnormalized attention matrix after removing large entries. HyperAttention features a modular design that integrates other fast low-level implementations, including FlashAttention. Empirically, using Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods like FlashAttention, providing significant speed improvements. The researchers validated its performance on various long-context length datasets, demonstrating faster inference times while maintaining comparable performance levels.
Transformers, successfully applied to various learning tasks, face scalability limitations due to the quadratic complexity of their attention layers. Several approaches have been explored to approximate intermediate matrices in attention layers, but they do not provide end-to-end guarantees or support causal masking. Recent theoretical bounds suggest that entry-wise approximations to the attention matrix are impossible in sub-quadratic time in general; KDEFormer is an exception, providing a provable approximation in sub-quadratic time under the assumption of bounded entries. This work presents an algorithm that achieves a practical and efficient near-linear time guarantee for attention approximation. It supports causal masking and does not require bounded entries or stable rank assumptions. The algorithm involves finding large entries of the attention matrix through a black box method and using fine-grained parameters to analyze time complexity. Empirically, HyperAttention demonstrates significant speed improvements, achieving over a 50x acceleration in forward and backward propagation for sequence lengths of 131k while maintaining performance levels similar to exact computation.
The robustness of HyperAttention on different tasks is investigated, and it is discovered that summarization and code completion tasks are more resilient to approximate attention layers than question answering. The concept of sortLSH, a variant of angular locality-sensitive hashing, is introduced to efficiently identify dominant entries within the attention matrix. The algorithm sorts keys and queries based on their hash buckets, allowing large entries to be captured by computing equal-sized blocks along the diagonal. This approach aligns with modern hardware's block-memory access patterns and can be efficiently parallelized.
To approximate the diagonal scaling matrix D, a two-step procedure is proposed. First, sortLSH is used to identify dominant entries within the attention matrix, and then a small subset of keys is randomly selected. This simple approach provides a spectral approximation guarantee for the estimated matrix D. To approximate the matrix product between D⁻¹A and the value matrix V, sampling based on the squared row norms of V is used.
The performance of HyperAttention is evaluated by integrating it into existing LLMs and measuring perplexity and speedup. HyperAttention significantly improves inference and training speeds with minimal performance degradation. Compared to exact computation with FlashAttention, runtime is up to 54x faster without causal masking and 5.4x faster with causal masking for sequence lengths up to 131k.
In conclusion, HyperAttention is an efficient algorithm that provides a near-linear time approximation for attention mechanisms in large language models. It offers significant speed improvements while maintaining performance levels comparable to exact computation. The algorithm is versatile, supports causal masking, and does not require bounded entries or stable rank assumptions. It has the potential to scale self-attention for both inference and training in significantly long sequences.
The paper's reference list also points to related work on accelerating vision transformers with a linear Taylor attention mechanism. The appendix provides proofs for key lemmas and theorems, as well as corollaries that further support the proposed method.
1175 word summary
Researchers from Yale University and Google Research have developed an approximate attention mechanism called "HyperAttention" to address the computational challenges faced by large language models (LLMs) when dealing with long contexts. Previous work has shown that quadratic time is necessary for attention layers unless the attention matrix has bounded entries or low stable rank. The researchers introduced two parameters to measure the hardness of the problem: the max column norm in the normalized attention matrix and the ratio of row norms in the unnormalized attention matrix after removing large entries.
HyperAttention features a modular design that can integrate other fast low-level implementations, including FlashAttention. Empirically, using Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods like FlashAttention, providing significant speed improvements. The researchers validated the empirical performance of HyperAttention on various long-context length datasets, demonstrating faster inference times and maintaining performance levels comparable to the original models.
Transformers, which have been successfully applied to various learning tasks, face scalability limitations due to the quadratic complexity of their attention layers. Several approaches have been explored to approximate intermediate matrices in attention layers, but they do not provide end-to-end guarantees or support causal masking. Recent theoretical bounds suggest that entry-wise approximations to the attention matrix are impossible in sub-quadratic time. However, a recent work called KDEFormer provides provable approximation in sub-quadratic time under the assumption of bounded entries.
In this work, the researchers provide an algorithm that achieves a practical and efficient near-linear time guarantee for attention approximation. Their approach supports causal masking and does not require bounded entries or stable rank assumptions. The algorithm involves finding large entries of the attention matrix through a black box method and using fine-grained parameters to analyze time complexity. Empirically, HyperAttention demonstrates significant speed improvements, achieving over a 50x acceleration in forward and backward propagation for sequence lengths of 131k. When applied to pretrained LLMs, it maintains performance levels that closely match those of the original models.
The researchers also investigate the robustness of HyperAttention on different tasks and discover that summarization and code completion tasks are more resilient to approximate attention layers than question answering. They introduce the concept of sortLSH, a variant of angular locality-sensitive hashing, to efficiently identify dominant entries within the attention matrix. The algorithm sorts keys and queries based on their hash buckets, allowing large entries to be captured by computing equal-sized blocks along the diagonal. This approach aligns with modern hardware's block-memory access patterns and can be efficiently parallelized.
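To illustrate the bucketing idea, the following is a minimal NumPy sketch rather than the authors' implementation: keys and queries are hashed with random hyperplanes (a simple angular LSH), sorted by bucket, and attention is computed only within equal-sized blocks along the diagonal of the permuted matrix. The hash family, bucket encoding, and block size here are illustrative choices.

```python
import numpy as np

def hyperplane_hash(x, planes):
    """Angular LSH: sign pattern of projections onto random hyperplanes, packed into a bucket id."""
    bits = (x @ planes) > 0                              # (n, num_planes) sign pattern
    return bits @ (1 << np.arange(planes.shape[1]))      # pack bits into an integer bucket id

def sortlsh_block_attention(Q, K, V, num_planes=4, block_size=64, rng=None):
    """Toy sortLSH-style attention: sort by LSH bucket, attend within diagonal blocks."""
    rng = rng or np.random.default_rng(0)
    n, d = Q.shape
    planes = rng.standard_normal((d, num_planes))        # hyperplanes shared by queries and keys

    # Sort queries and keys by hash bucket so that similar vectors become adjacent.
    q_perm = np.argsort(hyperplane_hash(Q, planes), kind="stable")
    k_perm = np.argsort(hyperplane_hash(K, planes), kind="stable")
    Qs, Ks, Vs = Q[q_perm], K[k_perm], V[k_perm]

    out = np.zeros((n, d))
    for start in range(0, n, block_size):                # equal-sized blocks along the diagonal
        sl = slice(start, min(start + block_size, n))
        scores = Qs[sl] @ Ks[sl].T                       # one block of the permuted attention matrix
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[sl] = weights @ Vs[sl]

    unperm = np.empty(n, dtype=int)                      # undo the query permutation
    unperm[q_perm] = np.arange(n)
    return out[unperm]
```

Because each block is dense and contiguous, the per-block work maps directly onto the block-memory access pattern mentioned above and parallelizes across blocks.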
To approximate the diagonal scaling matrix D, the researchers propose a two-step procedure. First, they use sortLSH to identify dominant entries within the attention matrix, and then they randomly select a small subset of keys. This simple approach provides a spectral approximation guarantee for the estimated matrix D. To approximate the matrix product between D⁻¹A and the value matrix V, they use sampling based on the squared row norms of V.
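The two sampling steps can be pictured with a short NumPy sketch; the sample sizes and the use of a purely uniform key sample for estimating D are simplifications (the actual algorithm also folds in the dominant entries located by sortLSH):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 2048, 64, 256                   # sequence length, head dimension, sample size (illustrative)
Q, K, V = (rng.standard_normal((n, d)) / d**0.5 for _ in range(3))

# Step 1 (sketch): estimate the row sums of A = exp(Q K^T), i.e. the diagonal of D,
# from a uniform sample of keys.
key_idx = rng.choice(n, size=m, replace=False)
A_sample = np.exp(Q @ K[key_idx].T)       # n x m slice of the attention matrix
d_est = A_sample.sum(axis=1) * (n / m)    # unbiased estimate of each row sum of A

# Step 2 (sketch): approximate (D^-1 A) V by sampling rows of V with probability
# proportional to their squared norms (Monte-Carlo approximate matrix multiplication).
row_norms = (V ** 2).sum(axis=1)
probs = row_norms / row_norms.sum()
cols = rng.choice(n, size=m, p=probs)
A_cols = np.exp(Q @ K[cols].T) / d_est[:, None]   # sampled columns of D^-1 A
approx = (A_cols / (m * probs[cols])) @ V[cols]   # rescaled sum approximates (D^-1 A) V
```

The second step is the classical row-norm sampling scheme for approximate matrix multiplication referenced later in this summary (Drineas and Kannan).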
The researchers evaluate the performance of HyperAttention by monkey patching it into existing LLMs and measuring perplexity and speedup. They find that HyperAttention can significantly improve inference and training speeds with minimal performance degradation. They also measure the speedup of HyperAttention compared to exact computation using FlashAttention and observe up to 54x faster runtime without causal masking and 5.4x faster runtime with causal masking for sequence lengths up to 131k.
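The monkey-patching setup can be sketched generically in PyTorch; the name-based module matching and the fast_forward argument below are placeholders for illustration, not the authors' code:

```python
import time
import torch

def patch_attention(model, fast_forward):
    """Swap the forward pass of every self-attention module for an approximate one.
    Matching modules by class name is a placeholder; real code would target the
    model's specific attention class."""
    patched = 0
    for module in model.modules():
        if "attention" in type(module).__name__.lower():
            module.forward = fast_forward.__get__(module, type(module))  # bind as a method
            patched += 1
    return patched

@torch.no_grad()
def measure_runtime(model, input_ids, repeats=5):
    """Average forward-pass wall-clock time (on GPU, a torch.cuda.synchronize()
    before reading the clock would make this more accurate)."""
    model.eval()
    start = time.perf_counter()
    for _ in range(repeats):
        model(input_ids)
    return (time.perf_counter() - start) / repeats
```

Perplexity is then compared on held-out data before and after patching, and the runtime ratio between the exact and approximate forward passes gives the reported speedup.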
In conclusion, HyperAttention is an efficient algorithm that provides a near-linear time approximation for attention mechanisms in large language models. It offers significant speed improvements while maintaining performance levels comparable to exact computation. The algorithm is versatile, supports causal masking, and does not require bounded entries or stable rank assumptions. It has the potential to scale self-attention for both inference and training in significantly long sequences.
Among its related work, the paper references ViTALiTy by Yingyan Lin et al., a method for accelerating vision transformers with a linear Taylor attention mechanism that unifies low-rank and sparse approximation techniques. Other notable references include BERT by Jacob Devlin et al., LongNet by Jiayu Ding et al., An Image is Worth 16x16 Words by Alexey Dosovitskiy et al., and Transformers are RNNs by Angelos Katharopoulos et al. These works cover pre-training deep bidirectional transformers, scaling transformers to very long sequence lengths, and applying transformers to image recognition tasks.
The authors also mention the use of fast Monte-Carlo algorithms for approximate matrix multiplication by Petros Drineas and Ravi Kannan, which is relevant to the efficient computation of attention mechanisms. They reference the GLM language model pretraining method by Zhengxiao Du et al. and the LM-infinite model by Chi Han et al., both of which contribute to improving language models.
Additionally, the paper cites works such as Reformer by Nikita Kitaev et al., which focuses on efficient transformers, and Heavy Hitters via Cluster-Preserving Clustering by Kasper Green Larsen et al., which discusses clustering algorithms. The authors also reference Faster Algorithms for Rectangular Matrix Multiplication by François Le Gall, which is relevant to matrix multiplication operations.
Other works mentioned include Learning the Positions in CountSketch by Yi Li et al., Textbooks Are All You Need II: phi-1.5 Technical Report by Yuanzhi Li et al., Concentration by Colin McDiarmid, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel et al., Efficient Content-Based Sparse Attention with Routing Transformers by Aurko Roy et al., Sparse Attention with Learning to Hash by Zhiqing Sun et al., Attention Is All You Need by Ashish Vaswani et al., XLNet by Zhilin Yang et al., Big Bird: Transformers for Longer Sequences by Manzil Zaheer et al., KDEformer: Accelerating Transformers via Kernel Density Estimation by Amir Zandieh et al., and Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting by Haoyi Zhou et al.
The appendix contains proofs omitted from the main text, including the proof of Lemma 1. The lemma shows that the quantity computed in line 3 of the algorithm for estimating D is close to the maximum row sum of the matrix. The proof defines subsets, bounds their cardinalities, and concludes by bounding the operator norm of the error.
The paper also includes the proof of Theorem 1, which demonstrates the spectral approximation guarantee for the proposed method. The proof relies on Lemma 1 and provides a bound on the stable rank of a matrix. The runtime of the algorithm is analyzed, and it is shown that the computation time is dominated by matrix multiplication and attention matrix calculations.
The paper concludes with two corollaries. Corollary 1 discusses collision probabilities in hash functions, and Corollary 2 presents a method for constructing a mask matrix based on the support of certain columns.
In summary, the paper introduces HyperAttention, a near-linear time approximate attention mechanism for long-context transformers. It references relevant works in the field, presents proofs for key lemmas and theorems, and includes corollaries that further support the proposed method.