Summary: Scaling Transformers to 1,000,000,000 Tokens (arxiv.org)
6,326 words - PDF document
One Line
The LongNet Transformer variant can process sequences of up to 1 billion tokens using dilated attention while still performing well on shorter sequences.
Key Points
- LongNet is a Transformer variant that can scale the sequence length to over 1 billion tokens without sacrificing performance on shorter sequences.
- Dilated attention is proposed as a way to expand the attentive field exponentially as the distance grows.
- Sequence length can be scaled by applying sliding windows or convolution modules over the attention, reducing complexity to nearly linear, but this sacrifices the ability to recall early tokens.
- Dilated attention splits the input into segments and sparsifies them along the sequence dimension, reducing computation cost while capturing both long-range and short-range information (see the sketch after this list).
- Dilated attention reduces the computation complexity to O(Nd), but scaling to 1 billion tokens further requires LongNet's distributed training algorithm.
- LongNet is compared with the vanilla Transformer and sparse Transformers; the models differ only in their attention layers.
- The torchscale codebase is used for all experiments.
- The list of references includes papers and articles related to scaling transformers and neural language models from conferences and journals such as ICLR, NeurIPS, and ICML.
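The segment-and-sparsify step in the points above lends itself to a short illustration. Below is a minimal sketch of dilated attention, assuming a single segment length `w` and dilation rate `r`; the function name and shapes are my own, not the paper's code. Each segment is subsampled every r-th token, dense attention runs on the much shorter subsequence, and the outputs are scattered back to their original positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, w, r):
    """q, k, v: (seq_len, d) arrays; w: segment length; r: dilation rate."""
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, w):                      # split into segments
        idx = np.arange(start, min(start + w, n), r)  # sparsify: keep every r-th token
        qi, ki, vi = q[idx], k[idx], v[idx]
        scores = qi @ ki.T / np.sqrt(d)               # dense attention on the
        out[idx] = softmax(scores) @ vi               # much shorter subsequence
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(dilated_attention(q, k, v, w=8, r=2).shape)     # (16, 8)
```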
Summaries
23 word summary
The LongNet Transformer variant can handle sequences up to 1 billion tokens using dilated attention, maintaining performance on shorter sequences.
44 word summary
LongNet is a Transformer variant that can scale the sequence length to 1 billion tokens and beyond without sacrificing performance on shorter sequences. It achieves this through dilated attention, which expands the attentive field exponentially as the distance grows. This reduces the computation complexity to linear.
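As a back-of-the-envelope illustration of that exponential growth (the numbers below are my own, not from the paper): if segment lengths and dilation rates both grow geometrically, each extra attention branch reaches a factor of a farther while the number of tokens actually attended per segment stays constant.

```python
# Toy numbers (assumptions mine): w_i = w0 * a**i, r_i = r0 * a**i.
w0, r0, a = 4, 1, 2
for i in range(5):
    w, r = w0 * a**i, r0 * a**i
    print(f"branch {i}: segment length={w:3d}, dilation={r:2d}, "
          f"tokens attended per segment={w // r}")
```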
337 word summary
This article introduces LongNet, a Transformer variant that can scale sequence length to over 1 billion tokens without sacrificing performance on shorter sequences. The authors propose dilated attention, which expands the attentive field exponentially as the distance grows.
One approach to scaling the sequence length is to decrease the complexity of Transformers. This can be done by implementing sliding windows or convolution modules over the attention, which reduces the complexity to nearly linear. However, this sacrifices the ability to recall early tokens at the beginning of the sequence.
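To see concretely why a fixed window forgets early tokens, here is a toy sliding-window attention mask; this is my own illustration, not code from the paper. Token i may only attend to the w most recent positions, so each row has at most w nonzeros (near-linear total work), and the earliest columns go dark as i grows.

```python
import numpy as np

def sliding_window_mask(n, w):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)   # causal window of width w

print(sliding_window_mask(n=8, w=3).astype(int))
# Each row has at most 3 ones, but row 7 has zeros in columns 0-4:
# the earliest tokens are outside the last token's attentive field.
```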
The text describes the concept of dilated attention, which splits input into segments and sparsifies them along the sequence dimension. This reduces computation cost while capturing both long-range and short-range information. The text also explains the implementation of a mixture of dilated attentions with different segment sizes and dilation rates.
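Continuing the dilated_attention sketch given after the key points above, a mixture over several (segment length, dilation rate) pairs might look like the following. The paper weights the branches dynamically by their softmax denominators; the uniform average here is a deliberate simplification, so treat the weighting as an assumption.

```python
# Assumes q, k, v and dilated_attention from the earlier sketch.
configs = [(4, 1), (8, 2), (16, 4)]   # geometrically growing (w, r) pairs
outputs = [dilated_attention(q, k, v, w, r) for w, r in configs]
mixed = sum(outputs) / len(configs)   # uniform mixture -- a simplification
print(mixed.shape)                    # (16, 8): short- and long-range branches combined
```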
LongNet addresses the challenge of scaling sequence length to 1 billion tokens. Dilated attention reduces the computation complexity to O(Nd), but it is still infeasible to scale the sequence length that far on a single GPU device, so LongNet is parallelized with a distributed training algorithm that partitions the sequence dimension across devices.
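A conceptual sketch of that sequence parallelism follows, simulated on one process and reusing dilated_attention from above; the real algorithm also exchanges keys and values between devices, which I omit here. Each "device" owns a contiguous shard of the sequence and never materializes the full N x N score matrix.

```python
# Simulated devices: each owns a contiguous slice of the sequence.
num_devices = 2
shards = np.array_split(np.arange(16), num_devices)
outputs = [dilated_attention(q[idx], k[idx], v[idx], w=8, r=2)  # runs per device
           for idx in shards]
full_output = np.concatenate(outputs)   # gather step across devices
print(full_output.shape)                # (16, 8)
```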
The torchscale codebase is used for all experiments. The paper compares LongNet with the vanilla Transformer and sparse Transformers, noting that the differences lie in the attention layers. The sequence length of the models is scaled from 2K to 32K, with the batch size adjusted to keep the number of tokens per batch fixed.
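For the scaling setup described above, holding the number of tokens per batch fixed while growing the sequence length reduces to a one-line calculation; the token budget below is a made-up value, not the paper's.

```python
TOKENS_PER_BATCH = 2 ** 19                  # hypothetical budget, value is mine
for seq_len in (2048, 8192, 32768):         # 2K -> 32K as in the comparison
    batch_size = TOKENS_PER_BATCH // seq_len
    print(f"seq_len={seq_len:6d} -> batch_size={batch_size:3d}")
```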
LongNet is a Transformer variant that can scale the sequence length to 1 billion tokens and beyond without losing performance on shorter sequences. It achieves this through dilated attention, which reduces computation complexity. LongNet can be used as a distributed trainer for extremely long sequences.
The remaining text is a list of references to papers and articles related to scaling transformers and neural language models, drawn from conferences and journals such as ICLR, NeurIPS, and ICML. Topics covered include grounding multimodal large language models, linear transformers, length-extrapolatable transformers, and training multi-billion parameter language models.