Summary: Effective Long-Context Scaling of Foundation Models (arxiv.org)
12,352 words - PDF document
One Line
Meta's long-context language models (LLMs) support context windows of up to 32,768 tokens, outperform Llama 2 and gpt-3.5-turbo-16k on long-context tasks, and remain strong on coding, math, and knowledge-intensive tasks while maintaining safety comparable to Llama 2 Chat.
Key Points
- Meta presents a series of long-context language models (LLMs) that support context windows of up to 32,768 tokens.
- The models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks compared to Llama 2.
- Context length is an important axis of scaling LLMs, and the models can continually improve their performance as the context length increases.
- The models are continually pretrained from Llama 2 checkpoints with an additional 400 billion tokens formed into long training sequences.
- The models achieve stronger overall performance than gpt-3.5-turbo-16k on a series of long-context benchmarks.
- The models maintain similar safety performance compared to Llama 2 Chat and are safer and less biased compared to other open-source LLMs.
- The paper analyzes positional encodings (RoPE) and compares two methods for extending sequence length: Position Interpolation (PI) and Adjusted Base Frequency (ABF).
- Experiments on length extrapolation and a procedure for generating synthetic self-instruct long-context data demonstrate the models' applicability in real-world scenarios.
Summaries
21 word summary
Meta's long-context language models (LLMs) excel in long-context tasks, coding, math, conversations, and search queries, while maintaining safety and providing insights.
67 word summary
Meta has developed long-context language models (LLMs) that outperform previous models on long-context tasks and improve on regular tasks. Abundant long texts in the pretraining data are not crucial for strong long-context performance, and long-context continual pretraining is more efficient than pretraining from scratch with long sequences. The models achieve strong results in coding, math, knowledge-intensive tasks, multi-turn conversation, and multi-document search-query answering. They maintain Llama 2 Chat's safety performance, and the paper analyzes positional encodings and sequence-length extension methods.
137 word summary
Meta has developed long-context language models (LLMs) that can support context windows of up to 32,768 tokens. These models consistently outperform previous models on long-context tasks and show improvements on regular tasks. Abundant long texts in the pretraining dataset are not crucial for achieving strong performance, and long-context continual pretraining is more efficient than, and as effective as, pretraining from scratch with long sequences. The models achieve on-par or stronger results compared to previous models on standard short-context tasks, particularly in coding, math, and knowledge-intensive tasks. They also demonstrate competitive performance on multi-turn conversation data and multi-document search-query answering data. Safety performance is maintained, and the models are safer and less biased than other open-source LLMs. The paper presents insights into positional encodings, sequence-length extension methods, extrapolation capabilities, and the generation of self-instruct data, contributing to the understanding and applicability of these models in real-world scenarios.
422 word summary
Meta has developed long-context language models (LLMs) that can support context windows of up to 32,768 tokens. These models have undergone extensive evaluation and show consistent improvements on regular tasks and significant improvements on long-context tasks compared to previous models. In fact, the 70B variant of the model can outperform gpt-3.5-turbo-16k on a suite of long-context tasks using a cost-effective instruction tuning procedure.
The paper provides a detailed analysis of the method's individual components, examining the limitations of Llama 2's position encodings and exploring different design choices in the pretraining process. The authors find that having abundant long texts in the pretraining dataset is not crucial for achieving strong performance and that long-context continual pretraining is more efficient than, and as effective as, pretraining from scratch with long sequences.
The models demonstrate power-law scaling behavior, consistently benefiting from longer contexts: performance continually improves as the context length increases up to 32,768 tokens.
To build long-context LLMs with superior performance, the authors continually pretrain from Llama 2 checkpoints with an additional 400 billion tokens formed into long training sequences. The resulting models achieve on-par or stronger results than previous models on standard short-context tasks, particularly in coding, math, and knowledge-intensive tasks.
The authors explore a simple and cost-effective procedure for instruction tuning without human-annotated data: they leverage a pre-built short-prompt dataset and augment it with synthetic self-instruct long data generated by Llama 2 Chat. This approach leads to stronger overall performance on long-context benchmarks covering question answering, summarization, and multi-document aggregation tasks.
The authors conduct human evaluations comparing the models' generation quality with that of proprietary models. The models achieve competitive performance in terms of helpfulness, honesty, and harmlessness on multi-turn conversation data and multi-document search-query answering data.
Ablation experiments justify the design choices: the proposed positional-encoding refinement performs best among the explored variants, adjusting the length distribution of the pretraining data does not provide major benefits, and improvements mostly come from the quality of the data itself.
Safety performance is evaluated on three standard academic benchmarks; the models maintain safety performance similar to previous models while being safer and less biased than other open-source LLMs.
In conclusion, the paper presents long-context LLMs that achieve strong performance on both short and long-context tasks. It also provides insights into positional encodings, sequence-length extension methods, extrapolation capabilities, and the generation of self-instruct data, contributing to the understanding and applicability of these models in real-world scenarios.
556 word summary
Meta has developed a series of long-context language models (LLMs) that can support context windows of up to 32,768 tokens. These models are created through continual pretraining from Llama 2 using longer training sequences and an upsampled dataset of long texts. The models have been extensively evaluated on language modeling, synthetic context probing tasks, and various research benchmarks. The results show consistent improvements on regular tasks and significant improvements on long-context tasks compared to Llama 2. In fact, the 70B variant of the model can outperform gpt-3.5-turbo-16k on a suite of long-context tasks when using a cost-effective instruction tuning procedure.
The paper also provides a detailed analysis of the method's individual components. It examines the limitations of Llama 2's position encodings in modeling long dependencies and explores the impact of different design choices in the pretraining process. The experiments suggest that having abundant long texts in the pretraining dataset is not crucial for achieving strong performance, and that long-context continual pretraining is more efficient than, and as effective as, pretraining from scratch with long sequences.
The models demonstrate power-law scaling behavior, consistently benefiting from longer contexts. Context length is an important factor in scaling LLMs, with performance continually improving as the context length increases up to 32,768 tokens.
To build long-context LLMs with superior performance, the authors continually pretrain from Llama 2 checkpoints with an additional 400 billion tokens formed into long training sequences. The smaller variants are trained with longer sequences, while the larger variants are trained with shorter sequences. The models achieve on-par or stronger results than Llama 2 on standard short-context tasks, particularly in coding, math, and knowledge-intensive tasks.
The authors also explore a simple and cost-effective procedure for instruction tuning without human-annotated data: they leverage a pre-built short-prompt dataset and augment it with synthetic self-instruct long data generated by Llama 2 Chat. This approach leads to stronger overall performance than gpt-3.5-turbo-16k on a series of long-context benchmarks covering question answering, summarization, and multi-document aggregation tasks.
Human evaluations compare the models' generation quality with that of proprietary models, focusing on multi-turn conversation data and multi-document search-query answering data. The models achieve competitive performance against proprietary models in terms of helpfulness, honesty, and harmlessness.
Ablation experiments justify the design choices, covering positional-encoding variants, the pretraining data mix, and the training curriculum. The proposed positional-encoding refinement performs best among the explored variants, and adjusting the length distribution of the pretraining data does not provide major benefits; improvements mostly come from the quality of the data itself.
Safety performance is evaluated on three standard academic benchmarks: TruthfulQA, ToxiGen, and BOLD. The models maintain safety performance similar to Llama 2 Chat and are safer and less biased than other open-source LLMs.
In conclusion, the paper presents a series of long-context LLMs that achieve strong performance on both short and long-context tasks. It also provides insights into positional encodings, sequence-length extension methods, extrapolation capabilities, and the generation of self-instruct data, contributing to the understanding and applicability of these models in real-world scenarios.
949 word summary
Meta presents a series of long-context language models (LLMs) that support context windows of up to 32,768 tokens. These models are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. The models are evaluated extensively on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, the models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks compared to Llama 2. With a cost-effective instruction tuning procedure, the 70B variant of the model can surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
The paper also provides an in-depth analysis of the individual components of the method. It delves into Llama 2's position encodings and discusses their limitations in modeling long dependencies, and it examines the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths. The experiments suggest that having abundant long texts in the pretraining dataset is not the key to achieving strong performance, and they verify that long-context continual pretraining is more efficient than, and similarly effective to, pretraining from scratch with long sequences.
The models demonstrate a clear power-law scaling behavior with respect to context length, consistently benefiting from longer contexts. This suggests that context length is another important axis of scaling LLMs, with performance continually improving as the context length increases up to 32,768 tokens.
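This scaling behavior can be written schematically as a power law relating validation loss to context length; the form below is an illustrative sketch with hypothetical fit parameters, not an equation reported in the paper.

```latex
% Schematic power-law relation between validation loss L and context length c,
% up to the trained window; \alpha, \beta, and L_\infty are hypothetical fit
% parameters, not values from the paper.
L(c) \approx \beta\, c^{-\alpha} + L_{\infty}, \qquad c \le 32{,}768
```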
To build long-context LLMs with superior performance, the authors continually pretrain from Llama 2 checkpoints with an additional 400 billion tokens formed into long training sequences. The smaller 7B/13B variants are trained with 32,768-token sequences, while the 34B/70B variants are trained with 16,384-token sequences. The models are evaluated on standard short-context tasks and achieve on-par or stronger results than Llama 2, particularly on coding, math, and knowledge-intensive tasks.
The authors also explore a simple and cost-effective procedure for instruction tuning without human-annotated data: they leverage a pre-built, large, and diverse short-prompt dataset and augment it with synthetic self-instruct long data generated by Llama 2 Chat. The resulting models achieve stronger overall performance than gpt-3.5-turbo-16k on a series of long-context benchmarks covering question answering, summarization, and multi-document aggregation tasks.
The models are continually pretrained from Llama 2 checkpoints with increased sequence length while keeping the same number of tokens per batch. All models are trained for a total of 400B tokens over 100,000 steps. The larger 34B/70B models require a smaller learning rate to achieve monotonically decreasing validation losses.
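As a rough sanity check, the per-batch token budget implied by these figures can be derived directly; the sketch below is back-of-the-envelope arithmetic based on the totals quoted in this summary, not a configuration taken from the paper.

```python
# Back-of-the-envelope check of the training setup described above
# (derived from the figures in this summary, not from the paper's tables).
total_tokens = 400e9          # 400B tokens of continual pretraining
total_steps = 100_000         # training steps
tokens_per_batch = total_tokens / total_steps
print(f"tokens per batch: {tokens_per_batch:,.0f}")   # 4,000,000

# With a fixed token budget per batch, longer sequences mean fewer
# sequences per batch (sequence lengths as stated for the model sizes).
for size, seq_len in [("7B/13B", 32_768), ("34B/70B", 16_384)]:
    print(f"{size}: ~{tokens_per_batch / seq_len:.0f} sequences per batch")
```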
Human evaluations compare the generation quality of the instruction-finetuned models with that of proprietary models, focusing on multi-turn conversation data and multi-document search-query answering data. The models achieve competitive performance against proprietary models in terms of helpfulness, honesty, and harmlessness.
Ablation experiments justify the design choices, covering positional-encoding variants, the pretraining data mix, and the training curriculum. The proposed positional-encoding refinement performs best among the explored variants, and adjusting the length distribution of the pretraining data does not provide major benefits; the improvements mostly come from the quality of the data itself.
Safety performance is evaluated on three standard academic benchmarks: TruthfulQA, ToxiGen, and BOLD. The models maintain safety performance similar to Llama 2 Chat and are safer and less biased than other open-source LLMs.
In conclusion, the paper presents a series of long-context LLMs that achieve strong performance on both short and long-context tasks.
The paper "Effective Long-Context Scaling of Foundation Models" explores the evaluation and scaling of large language models trained on code. The authors reference several related works that provide insights into training and evaluating large language models. They also mention the importance of long-context scaling in improving the performance and capabilities of these models.
The paper presents a theoretical analysis of rotary positional encodings (RoPE) and compares two methods for extending the sequence length of a trained transformer model: Position Interpolation (PI) and Adjusted Base Frequency (ABF). The authors compare the cosine similarity between the embeddings of consecutive positions for both methods and provide mathematical proofs for their bounds. They conclude that both methods can effectively adapt to extended sequence lengths, but ABF exhibits higher granularity and may be better suited to distinguishing between the embeddings of nearby positions.
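The two adjustments can be illustrated by how they change the standard RoPE frequency computation. The sketch below is a minimal illustration, assuming Llama 2's original 4,096-token context, a 128-dimensional head, and an ABF base of 500,000; these values are assumptions for the example rather than settings quoted in this summary.

```python
import numpy as np

def rope_angles(positions, dim, base=10_000.0, pos_scale=1.0):
    """Rotation angles used by rotary position embeddings (RoPE).

    base      -- RoPE base frequency (10,000 in standard RoPE).
    pos_scale -- multiplier applied to positions; Position Interpolation
                 rescales positions, while plain RoPE and ABF leave this at 1.0.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * pos_scale, inv_freq)  # shape (len(positions), dim // 2)

head_dim = 128                  # illustrative head dimension
pos = np.arange(32_768)         # positions in the extended context window

# Position Interpolation (PI): squeeze the extended positions back into the
# original 4,096-token range by scaling positions down.
angles_pi = rope_angles(pos, head_dim, base=10_000.0, pos_scale=4_096 / 32_768)

# Adjusted Base Frequency (ABF): keep positions unchanged but raise the base,
# which slows the rotation of the low-frequency dimensions (500,000 is an
# assumed value for illustration).
angles_abf = rope_angles(pos, head_dim, base=500_000.0)
```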
To evaluate the extrapolation capabilities of the models, the authors conduct experiments using validation loss and a synthetic FIRST-SENTENCE-RETRIEVAL task. The results show that the 70B model with either RoPE ABF or xPos ABF maintains low loss at sequence lengths beyond those seen during training, indicating effective extrapolation. On the FIRST-SENTENCE-RETRIEVAL task, some performance degradation is observed when extrapolating, but the models perform well overall.
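A minimal sketch of such a first-sentence-retrieval probe is given below; the prompt wording, helper names, and exact-match scoring are assumptions for illustration, not details taken from the paper.

```python
def make_probe(document: str) -> tuple[str, str]:
    """Build a (prompt, expected_answer) pair for one long document."""
    first_sentence = document.split(". ")[0].rstrip(".") + "."
    prompt = (
        f"{document}\n\n"
        "What is the first sentence of the document above? "
        "Reply with that sentence only."
    )
    return prompt, first_sentence

def retrieval_accuracy(generate, documents) -> float:
    """Score a model callable `generate(prompt) -> str` by exact match."""
    hits = sum(
        generate(prompt).strip() == answer
        for prompt, answer in map(make_probe, documents)
    )
    return hits / len(documents)
```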
The paper also discusses the generation of self-instruct data using Llama 2 Chat. The authors describe a process for automatically generating long-context instruct data from short-context models. They split long documents into smaller chunks and use prompts to generate question-answer pairs. The questions are based on the text chunks and are used as reading comprehension tests over the entire document. The authors provide prompts for generating normal answer and short answer data, along with corresponding templates for constructing long question-answer data.
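The chunk-then-ask procedure can be sketched as follows; the chunk size, prompt wording, and `chat` callable are assumptions for illustration, with the short-context chat model standing in for Llama 2 Chat.

```python
import textwrap

QA_PROMPT = textwrap.dedent("""\
    Read the passage below and write one question that can be answered from
    it, followed by the answer on a new line.

    Passage:
    {chunk}

    Question and answer:""")

def build_long_qa_examples(document: str, chat, chunk_words: int = 1_000):
    """Turn one long document into (long prompt, answer) training pairs."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    examples = []
    for chunk in chunks:
        completion = chat(QA_PROMPT.format(chunk=chunk))   # model writes Q and A
        question, _, answer = completion.partition("\n")
        # The question is paired with the *entire* document, so each example
        # acts as a reading-comprehension test over the full long context.
        examples.append({"prompt": f"{document}\n\n{question.strip()}",
                         "response": answer.strip()})
    return examples
```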
Overall, the paper provides valuable insights into the evaluation and scaling of long-context foundation models. The theoretical analysis of positional encodings and the comparison of methods for extending sequence length contribute to the understanding of long-context scaling, and the experiments on extrapolation and the generation of self-instruct data demonstrate the potential and applicability of these models in real-world scenarios.