Summary: Mistral 7B - A High-Performance Language Model (arxiv.org)
3,836 words - PDF document
One Line
Mistral 7B, a language model with 7 billion parameters, outperforms previous models in reasoning, math, and code generation.
Key Points
- Mistral 7B is a 7-billion-parameter language model designed for high performance and efficiency.
- Mistral 7B outperforms previous models in various benchmarks, including reasoning, mathematics, and code generation.
- Mistral 7B utilizes grouped-query attention (GQA) and sliding window attention (SWA) for faster inference and reduced computational cost (a short GQA sketch follows this list).
- Mistral 7B is released under the Apache 2.0 license and can be easily deployed on various platforms.
- Mistral 7B can be fine-tuned for specific tasks such as instruction following; the instruction-tuned Mistral 7B - Instruct outperforms the Llama 2 13B - Chat model.
- Measured against "equivalent model sizes" of Llama 2, Mistral 7B compresses knowledge more effectively than its parameter count would suggest.
- System prompts can be used with Mistral 7B to enforce guardrails and ensure safe and appropriate responses.
- Mistral 7B can also perform content moderation via self-reflection, classifying prompts or generated answers.
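To make the GQA point above concrete, here is a minimal sketch (not the paper's code) of key-value head sharing, using the head counts the paper reports for Mistral 7B (32 query heads, 8 key-value heads, head dimension 128):

```python
import torch

# Mistral 7B uses 32 query heads but only 8 key-value heads, so each
# group of 4 query heads reads the same K tensor and the KV cache
# shrinks 4x. Shapes below are for a single token position.
n_heads, n_kv_heads, head_dim = 32, 8, 128

q = torch.randn(n_heads, head_dim)      # one query vector per query head
k = torch.randn(n_kv_heads, head_dim)   # shared key vectors

# Map each query head to its group's key head by repeating K.
k_shared = k.repeat_interleave(n_heads // n_kv_heads, dim=0)  # (32, 128)
scores = (q * k_shared).sum(-1) / head_dim ** 0.5             # per-head logits
```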
Summaries
19 word summary
Mistral 7B, a high-performance language model with 7 billion parameters, surpasses previous models in reasoning, math, and code generation.
67 word summary
Mistral 7B is a high-performance language model with 7 billion parameters that surpasses previous models in reasoning, math, and code generation. It utilizes grouped-query attention and sliding window attention mechanisms to improve speed and reduce memory usage. Mistral 7B can be fine-tuned for specific tasks, outperforming other models in benchmarks and enabling content moderation. It demonstrates the potential for efficient and effective language models in real-world applications.
286 word summary
Mistral 7B is a high-performance language model with 7 billion parameters that outperforms previous models in reasoning, mathematics, and code generation benchmarks. It achieves this through the use of grouped-query attention (GQA) and sliding window attention (SWA) mechanisms that improve inference speed and reduce memory requirements. Mistral 7B is released under the Apache 2.0 license and comes with a reference implementation for easy deployment.
Mistral 7B is based on the transformer architecture; its key hyperparameters include the model dimension, number of layers, head dimension, hidden dimension, number of attention heads, number of key-value heads, window size, context length, and vocabulary size. The introduction of sliding window attention (SWA) allows longer sequences to be handled effectively at a reduced computational cost.
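As a hedged illustration of SWA (the helper below is ours, not the reference implementation), the attention mask simply restricts each query position to the preceding W positions; the paper uses W = 4096:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: position i attends to j with
    i - window < j <= i (causal, limited to the last `window` tokens)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

# Scores outside the mask are set to -inf before the softmax, so each
# token sees at most `window` previous tokens per layer; stacking layers
# lets information propagate further than the window itself.
mask = sliding_window_mask(seq_len=8, window=4)
```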
Mistral 7B outperforms previous Llama models on benchmarks covering commonsense reasoning, world knowledge, reading comprehension, math, and code generation. It achieves superior performance compared to Llama 2 13B and approaches the coding performance of CodeLlama 7B without sacrificing performance on non-code benchmarks. The model's efficiency is evaluated using "equivalent model sizes" of Llama models, and Mistral 7B is found to compress knowledge more effectively.
Mistral 7B can be fine-tuned for specific tasks: the chat model fine-tuned from it, Mistral 7B - Instruct, outperforms the Llama 2 13B - Chat model on both human and automated benchmarks. For front-facing applications, a system prompt can enforce guardrails by guiding the model's output to comply with specified constraints, and Mistral 7B can also perform content moderation by classifying prompts or generated answers.
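To illustrate guardrail enforcement, here is a minimal sketch quoting the system prompt recommended in the paper; the prompt-assembly helper is our own, assuming the [INST] ... [/INST] chat format used by Mistral 7B - Instruct:

```python
# The system prompt text is the one recommended in the paper; the
# helper function is a hypothetical illustration, not official code.
SYSTEM_PROMPT = (
    "Always assist with care, respect, and truth. Respond with utmost "
    "utility yet securely. Avoid harmful, unethical, prejudiced, or "
    "negative content. Ensure replies promote fairness and positivity."
)

def build_guarded_prompt(user_message: str) -> str:
    """Prepend the guardrail system prompt to a user message, using the
    [INST] instruction format of Mistral 7B - Instruct."""
    return f"<s>[INST] {SYSTEM_PROMPT}\n\n{user_message} [/INST]"
```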
In conclusion, Mistral 7B demonstrates that language models can deliver high performance while maintaining efficiency. It provides a valuable contribution to the development of affordable, efficient, and high-performing language models for real-world applications.
399 word summary
Mistral 7B is a high-performance language model with 7 billion parameters that demonstrates that high performance and efficiency can be delivered together. It outperforms previous models, including Llama 2 13B and Llama 1 34B, in various benchmarks such as reasoning, mathematics, and code generation. Mistral 7B achieves this by leveraging grouped-query attention (GQA) and sliding window attention (SWA) mechanisms, which improve inference speed and reduce memory requirements. The model is released under the Apache 2.0 license and is accompanied by a reference implementation for easy deployment.
Mistral 7B is based on the transformer architecture, with key hyperparameters including the model dimension, number of layers, head dimension, hidden dimension, number of attention heads, number of key-value heads, window size, context length, and vocabulary size. The model introduces sliding window attention (SWA), in which each token attends to a limited number of tokens from the previous layer, handling longer sequences effectively at a reduced computational cost. Mistral 7B also utilizes a rolling buffer cache to limit cache size and reduce memory usage without degrading model quality.
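A minimal sketch of the rolling buffer idea (shapes and names are ours, assuming the paper's window of 4096): keys and values for position i are written to slot i mod W, so the cache never grows past W entries, and older entries, which fall outside the attention window anyway, are simply overwritten:

```python
import torch

WINDOW = 4096                      # sliding window size from the paper
N_KV_HEADS, HEAD_DIM = 8, 128      # key-value head shape from the paper

# Fixed-size cache: memory stays constant regardless of sequence length.
k_cache = torch.zeros(WINDOW, N_KV_HEADS, HEAD_DIM)
v_cache = torch.zeros(WINDOW, N_KV_HEADS, HEAD_DIM)

def store(pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Write keys/values for absolute position `pos` into slot pos % WINDOW,
    overwriting entries that have fallen out of the attention window."""
    slot = pos % WINDOW
    k_cache[slot] = k
    v_cache[slot] = v
```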
In terms of results, Mistral 7B outperforms Llama models in various benchmarks such as commonsense reasoning, world knowledge, reading comprehension, math, and code generation. It achieves superior performance compared to Llama 2 13B on all metrics and approaches the coding performance of CodeLlama 7B without sacrificing performance on non-code benchmarks. The model's efficiency is also evaluated using "equivalent model sizes" of Llama models, and Mistral 7B is found to compress knowledge more effectively.
Mistral 7B can be fine-tuned for specific tasks: Mistral 7B - Instruct, a chat model fine-tuned from it, outperforms the Llama 2 13B - Chat model on both human and automated benchmarks. For front-facing applications, a system prompt can enforce guardrails by guiding the model's output to comply with specified constraints. Mistral 7B can also perform content moderation by classifying prompts or generated answers as acceptable or as falling into categories such as illegal activities, hateful content, or unqualified advice.
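As a hedged sketch of self-reflection moderation (the prompt wording and label list below are illustrative assumptions, not the paper's exact self-reflection prompt), the model is simply asked to classify a text against a fixed set of categories:

```python
# Illustrative label set, echoing the categories named in the summary;
# the prompt text is an assumption, not the paper's exact wording.
LABELS = [
    "acceptable",
    "illegal activities",
    "hateful content",
    "unqualified advice",
]

def moderation_prompt(text: str) -> str:
    """Build a self-reflection prompt asking the model to classify `text`."""
    return (
        "[INST] Classify the following text into one of these categories: "
        + ", ".join(LABELS)
        + f".\n\nText: {text}\n\nCategory: [/INST]"
    )
```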
In conclusion, Mistral 7B demonstrates that language models can deliver high performance while maintaining efficiency. The model's performance surpasses previous models in various benchmarks and its efficiency is evaluated in terms of training cost, inference cost, and model capabilities. Mistral 7B provides a valuable contribution to the development of affordable, efficient, and high-performing language models for real-world applications.