Summary: Efficient Fine-Tuning with Long Context Sizes (arxiv.org)
9,854 words - PDF document
One Line
LongLoRA extends the context sizes of large language models with shift short attention (S2-Attn) and a revisited parameter-efficient fine-tuning regime, improving results on topic retrieval tasks while reducing computational cost.
Key Points
- LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost.
- LongLoRA introduces shift short attention (S2-Attn), which enables context extension and saves computation.
- LongLoRA achieves strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B.
- LongLoRA is compatible with most existing techniques, such as FlashAttention-2, and existing optimization and infrastructure techniques for LLMs can be applied to it.
- LongLoRA reduces computational costs significantly while achieving comparable performance to full fine-tuning.
Summaries
33 word summary
LongLoRA efficiently extends context sizes in large language models using shift short attention and a revisited parameter-efficient fine-tuning regime. It achieves strong empirical results, outperforms other models on topic retrieval tasks, and reduces computational costs.
79 word summary
LongLoRA is an efficient method for extending context sizes in large language models (LLMs). It incorporates shift short attention (S2-Attn) and revisits parameter-efficient fine-tuning for context expansion. LongLoRA achieves strong empirical results and is compatible with existing techniques. It introduces a dataset called LongQA for supervised fine-tuning and demonstrates performance comparable to full fine-tuning while significantly reducing computational costs. Additionally, it outperforms other LLMs on topic retrieval tasks and can be fine-tuned on a single machine.
128 word summary
LongLoRA is an efficient approach for extending the context sizes of pre-trained large language models (LLMs). It introduces shift short attention (S2-Attn) and revisits parameter-efficient fine-tuning for context expansion. LongLoRA achieves strong empirical results and is compatible with existing techniques. To make it practical, a dataset called LongQA is collected for supervised fine-tuning. By using S2-Attn during fine-tuning, LongLoRA matches the performance of full fine-tuning while reducing computational costs. It achieves good performance on evaluation datasets and outperforms other long-context LLMs on topic retrieval tasks. Efficiency is a key aspect of LongLoRA: it allows fine-tuning on a single machine, cutting computational costs by a large margin. Ablation studies and experiments further demonstrate its effectiveness and efficiency in extending the context length of LLMs.
446 word summary
LongLoRA is an efficient fine-tuning approach for extending the context sizes of pre-trained large language models (LLMs). It addresses the computational cost of training LLMs with long context sizes by introducing shift short attention (S2-Attn) and revisiting the parameter-efficient fine-tuning regime for context expansion. LongLoRA achieves strong empirical results on various tasks and is compatible with existing techniques.
To make LongLoRA practical, a dataset called LongQA is collected for supervised fine-tuning. It contains more than 3k long context question-answer pairs. The code, models, dataset, and demo for LongLoRA are available on GitHub.
In experiments, LongLoRA demonstrates comparable performance to full fine-tuning while significantly reducing computational costs. It achieves this by using S2-Attn during fine-tuning, which retains the original attention architecture during inference. Existing optimization and infrastructure techniques for LLMs can also be applied to LongLoRA.
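As a rough illustration of the idea (a minimal sketch, not the authors' released code), the training-time attention can be written in a few lines of PyTorch: the sequence is split into groups, attention is computed only within each group, and half of the attention heads have their tokens shifted by half a group so that neighboring groups exchange information. The function name `shift_group_attention` and the tensor layout are assumptions made for this sketch, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def shift_group_attention(q, k, v, group_size):
    """Sketch of shift short attention (S2-Attn) for a training step.

    q, k, v have shape (batch, heads, seq_len, head_dim), with seq_len a
    multiple of group_size. Half of the heads are shifted by half a group
    so that adjacent groups can exchange information.
    """
    bsz, n_heads, seq_len, head_dim = q.shape
    half_heads = n_heads // 2
    shift = group_size // 2

    def roll_half_heads(x, offset):
        # Shift tokens only for the second half of the heads.
        shifted = torch.roll(x[:, half_heads:], shifts=offset, dims=2)
        return torch.cat([x[:, :half_heads], shifted], dim=1)

    q, k, v = (roll_half_heads(t, -shift) for t in (q, k, v))

    def to_groups(x):
        # (batch, heads, n_groups, group_size, head_dim): attention stays within a group.
        return x.reshape(bsz, n_heads, seq_len // group_size, group_size, head_dim)

    out = F.scaled_dot_product_attention(to_groups(q), to_groups(k), to_groups(v))
    out = out.reshape(bsz, n_heads, seq_len, head_dim)

    # Undo the shift so outputs align with the original token order.
    return roll_half_heads(out, shift)
```

At inference time this cheap path is simply not used; the fine-tuned model falls back to its standard full attention, which is why the deployed architecture is unchanged.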
LongLoRA fine-tunes LLaMA2 models with extended context lengths and achieves good performance on various evaluation datasets. For example, on the Proof-pile test set, LongLoRA achieves lower perplexity as the evaluation context length increases. It also outperforms other open LLMs with long contexts in topic retrieval tasks.
Efficiency is a key aspect of LongLoRA. It allows for the fine-tuning of LLaMA2 models on a single 8x A100 machine, with the maximum context length reaching 100k for 7B models, 64k for 13B models, and 32k for 70B models. LongLoRA reduces the computational cost by a large margin compared to full fine-tuning.
The efficiency of LongLoRA is further demonstrated by its FLOPs profile. Breaking FLOPs down by model component shows that S2-Attn substantially reduces the attention cost, especially when the context length is large.
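To make that scaling behaviour concrete, here is a back-of-envelope estimate (an illustrative sketch with assumed formulas and sizes, not figures from the paper) of the share of per-layer FLOPs spent on attention, with and without group-wise attention:

```python
def attention_flops_share(seq_len, d_model=4096, group_size=None):
    """Rough per-layer FLOPs split between attention and the rest.

    With group-wise (S2-Attn-style) attention, the quadratic term scales
    with seq_len * group_size instead of seq_len ** 2.
    """
    attn_len = group_size if group_size else seq_len
    attn_flops = 4 * seq_len * attn_len * d_model   # QK^T plus attention @ V
    proj_flops = 8 * seq_len * d_model ** 2         # Q, K, V and output projections
    mlp_flops = 16 * seq_len * d_model ** 2         # feed-forward with 4x hidden size
    return attn_flops / (attn_flops + proj_flops + mlp_flops)

for n in (8_192, 32_768, 100_000):
    full = attention_flops_share(n)
    grouped = attention_flops_share(n, group_size=n // 4)  # quarter-length groups, illustrative
    print(f"{n:>7} tokens: attention share {full:.0%} (full) vs {grouped:.0%} (grouped)")
```

Under these assumptions, attention comes to dominate the per-layer cost at long context lengths, which matches the observation that its proportion grows sharply with context size.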
The study presents LongLoRA, a method for efficiently extending the context length of large language models (LLMs). The authors propose a shift short attention mechanism that approximates the standard self-attention pattern during training. This approach reduces GPU memory cost and training time compared to full fine-tuning, while maintaining minimal accuracy compromise.
The effectiveness of LongLoRA is demonstrated through experiments on long-sequence language modeling and retrieval in long contexts. The models achieve better perplexity with longer context sizes and perform well on retrieval tasks.
Ablation studies analyze the efficiency profile of LongLoRA, showing that the proportion of FLOPs spent on attention sharply increases as the context length grows. The use of S2-Attn reduces this proportion compared to full attention.
An ablation study on the relationship between perplexity and fine-tuning steps reveals that full fine-tuning converges faster early on, but the final gap is small. Different attention patterns during fine-tuning are also investigated.
In conclusion, LongLoRA is an efficient method for extending the context length of LLMs. It achieves comparable performance to full fine-tuning while significantly reducing computational costs.
572 word summary
LongLoRA is an efficient fine-tuning approach for extending the context sizes of pre-trained large language models (LLMs). It addresses the computational cost of training LLMs with long context sizes by introducing shift short attention (S2-Attn) and revisiting the parameter-efficient fine-tuning regime for context expansion. LongLoRA achieves strong empirical results on various tasks and is compatible with existing techniques.
To make LongLoRA practical, a dataset called LongQA is collected for supervised fine-tuning. It contains more than 3k long context question-answer pairs. The code, models, dataset, and demo for LongLoRA are available on GitHub.
In experiments, LongLoRA demonstrates comparable performance to full fine-tuning while significantly reducing computational costs. It achieves this by using S2-Attn during fine-tuning, which retains the original attention architecture during inference. Existing optimization and infrastructure techniques for LLMs can also be applied to LongLoRA.
LongLoRA fine-tunes LLaMA2 models with extended context lengths and achieves good performance on various evaluation datasets. For example, on the Proof-pile test set, LongLoRA achieves lower perplexity as the evaluation context length increases. It also outperforms other open LLMs with long contexts in topic retrieval tasks.
Efficiency is a key aspect of LongLoRA. It allows for the fine-tuning of LLaMA2 models on a single 8x A100 machine, with the maximum context length reaching 100k for 7B models, 64k for 13B models, and 32k for 70B models. LongLoRA reduces the computational cost by a large margin compared to full fine-tuning.
The efficiency of LongLoRA is further demonstrated by its FLOPs profile. Breaking FLOPs down by model component shows that S2-Attn substantially reduces the attention cost, especially when the context length is large.
In conclusion, LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs. It achieves comparable performance to full fine-tuning while significantly reducing computational costs. LongLoRA is compatible with existing techniques and is practical for various tasks.
The study presents LongLoRA, a method for efficiently extending the context length of large language models (LLMs). The authors propose a shift short attention mechanism that approximates the standard self-attention pattern during training. This approach reduces GPU memory cost and training time compared to full fine-tuning, while maintaining minimal accuracy compromise. The models trained with shift short attention retain the original attention architecture during inference, making them compatible with existing infrastructure and optimization techniques.
The effectiveness of LongLoRA is demonstrated through experiments on long-sequence language modeling and retrieval in long contexts. The models achieve better perplexity with longer context sizes. The models also perform well on retrieval tasks, achieving comparable performance to state-of-the-art models.
The study investigates the maximum context length that can be fine-tuned on a single 8x A100 machine. The LLaMA2 7B, 13B, and 70B models are extended to context lengths of 100,000, 65,536, and 32,768 tokens, respectively. However, some perplexity degradation is observed on small context sizes for the extended models.
Ablation studies are conducted to analyze the efficiency profile of LongLoRA. The breakdown of FLOPs across layer types shows that the proportion spent on attention sharply increases as the context length grows, and that S2-Attn reduces this proportion compared to full attention.
The relationship between perplexity and fine-tuning steps is explored in an ablation study. Full fine-tuning converges faster than low-rank training at the beginning, but the final gap is small. The study also investigates different attention patterns during fine-tuning.
In conclusion, LongLoRA is an efficient method for extending the context length of LLMs, achieving performance comparable to full fine-tuning while significantly reducing computational costs.
823 word summary
LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Training LLMs with long context sizes is computationally expensive, but LongLoRA speeds up context extension in two ways. First, it introduces shift short attention (S2-Attn), which enables context extension and saves computation; S2-Attn can be implemented with just two lines of code in training. Second, LongLoRA revisits the parameter-efficient fine-tuning regime for context expansion. It finds that LoRA works well for context extension provided that the embedding and normalization layers are also kept trainable. LongLoRA achieves strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. It extends models' context while retaining their original architectures and is compatible with most existing techniques, such as FlashAttention-2.
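A minimal PyTorch sketch of this fine-tuning recipe is shown below: the base model is frozen, the low-rank adapter weights stay trainable, and the embedding and normalization layers are additionally unfrozen. The parameter-name substrings ("lora_", "embed", "norm") assume a LLaMA-style model with LoRA modules already injected and are illustrative, not the released LongLoRA code.

```python
import torch.nn as nn

def mark_longlora_trainable(model: nn.Module) -> None:
    """Freeze the base model, then re-enable the groups LongLoRA trains:
    LoRA adapter weights plus embedding and normalization layers."""
    for name, param in model.named_parameters():
        # Low-rank adapter matrices (assumes LoRA layers are already injected).
        is_lora = "lora_" in name
        # Embeddings and RMSNorm/LayerNorm weights, kept trainable because
        # plain LoRA alone is found insufficient for long-context adaptation.
        is_embed_or_norm = "embed" in name or "norm" in name
        param.requires_grad = is_lora or is_embed_or_norm

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} of {total:,}")
```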
To make LongLoRA practical, a dataset called LongQA is collected for supervised fine-tuning. It contains more than 3k long context question-answer pairs. The code, models, dataset, and demo for LongLoRA are available on GitHub.
In experiments, LongLoRA demonstrates comparable performance to full fine-tuning while significantly reducing computational costs. It achieves this by using S2-Attn during fine-tuning, which retains the original attention architecture during inference. Existing optimization and infrastructure techniques for LLMs can also be applied to LongLoRA.
LongLoRA fine-tunes LLaMA2 models with extended context lengths and achieves good performance on various evaluation datasets. For example, on the Proof-pile test set, LongLoRA achieves lower perplexity as the evaluation context length increases. It also outperforms other open LLMs with long contexts in topic retrieval tasks.
Efficiency is a key aspect of LongLoRA. It allows for the fine-tuning of LLaMA2 models on a single 8x A100 machine, with the maximum context length reaching 100k for 7B models, 64k for 13B models, and 32k for 70B models. LongLoRA reduces the computational cost by a large margin compared to full fine-tuning.
The efficiency of LongLoRA is further demonstrated by its FLOPs profile. Breaking FLOPs down by model component shows that S2-Attn substantially reduces the attention cost, especially when the context length is large.
In conclusion, LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs. It achieves comparable performance to full fine-tuning while significantly reducing computational costs. LongLoRA is compatible with existing techniques and is practical for various tasks.
The study presents LongLoRA, a method for efficiently extending the context length of large language models (LLMs). The authors propose a shift short attention mechanism that approximates the standard self-attention pattern during training. This approach reduces GPU memory cost and training time compared to full fine-tuning, while maintaining minimal accuracy compromise. The models trained with shift short attention retain the original attention architecture during inference, making them compatible with existing infrastructure and optimization techniques.
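One way to picture how the trained model keeps its original architecture is an attention module that uses the cheap grouped path only while training and falls back to ordinary dense attention in evaluation mode. The class below is a hedged sketch (reusing the illustrative `shift_group_attention` helper from the earlier snippet), not the paper's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class ShiftShortAttention(nn.Module):
    """Grouped, shifted attention during training; standard full attention
    at inference, so the deployed model keeps its original architecture."""

    def __init__(self, d_model: int, n_heads: int, group_size: int):
        super().__init__()
        self.n_heads, self.group_size = n_heads, group_size
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, head_dim).
        q, k, v = (t.reshape(b, n, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))

        if self.training:
            # Cheap approximation used only while fine-tuning.
            o = shift_group_attention(q, k, v, self.group_size)
        else:
            # Unchanged dense attention at inference (FlashAttention-compatible).
            o = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        return self.out(o.transpose(1, 2).reshape(b, n, d))
```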
The effectiveness of LongLoRA is demonstrated through experiments on long-sequence language modeling and retrieval in long contexts. In long-sequence language modeling, the models achieve better perplexity with longer context sizes. For example, increasing the context window size from 8192 to 32768 improves the perplexity of the LLaMA2 7B model from 2.72 to 2.50. Similar improvements are observed for the LLaMA2 13B model. The models also perform well on retrieval tasks, achieving comparable performance to state-of-the-art models.
The study also investigates the maximum context length that can be fine-tuned on a single 8x A100 machine. The LLaMA2 7B, 13B, and 70B models are extended to context lengths of 100,000, 65,536, and 32,768 tokens, respectively. The results demonstrate promising performance on these extremely large settings. However, some perplexity degradation is observed on small context sizes for the extended models, which is a known limitation of position interpolation.
Ablation studies are conducted to analyze the efficiency profile of LongLoRA. The breakdown of FLOPs across layer types shows that the proportion spent on attention sharply increases as the context length grows, and that S2-Attn reduces this proportion compared to full attention.
The relationship between perplexity and fine-tuning steps is explored in an ablation study. The results show that full fine-tuning converges faster than low-rank training at the beginning, but the final gap is small. The study also investigates different attention patterns during fine-tuning. Shifting the pattern between layers is found to be acceptable, while setting all attention heads to pattern 1 or pattern 2 does not work well. Dilated attention performs well in full fine-tuning but not in low-rank adaptation, and fine-tuning with stride sparse attention is harmful.
In conclusion, LongLoRA is an efficient method for extending the context length of LLMs. It reduces GPU memory cost and training time while maintaining minimal accuracy compromise. The proposed shift short attention mechanism approximates the standard self-attention pattern during training and retains the original attention architecture during inference. The method achieves better perplexity in long-sequence language modeling and performs well in retrieval tasks. Ablation studies provide insights into the efficiency profile of LongLoRA and the effects of different attention patterns. Further research is planned to investigate the compatibility of LongLoRA with other types of LLMs and position encodings.