Summary Challenges and Applications of Large Language Models arxiv.org
54,315 words - PDF document
One Line
Large Language Models (LLMs) face challenges such as misaligned behavior, outdated knowledge, and brittle evaluations, but they find applications in chatbots, computational biology, and computer programming; holistic benchmarking suites like HELM help standardize evaluation, and model editing techniques are being explored.
Key Points
- Large language models (LLMs) face challenges with misaligned behavior, outdated knowledge, brittle evaluations, and indistinguishability from human-written text.
- LLM research often lacks rigorous experimental designs and reproducibility.
- Applications of LLMs include chatbots, computational biology, and computer programming.
- Tokenization is a process that breaks words into smaller units called tokens.
- Training smaller models intensively upfront can offset larger inference costs in the future.
- Large language models have different approaches for conditioning on tokens before and after masked ones.
- LLMs possess the capability of task learning and can acquire new input-label mappings.
- LLMs often generate outputs that don't align with human values, and pre-training with human feedback can improve alignment.
Summaries
34 word summary
Large Language Models (LLMs) present challenges with misaligned behavior, outdated knowledge, and brittle evaluations. Applications include chatbots, computational biology, and computer programming. Holistic benchmarking suites like HELM standardize evaluation methods. Model editing techniques and retrieval augmentation can address outdated knowledge.
112 word summary
Large Language Models (LLMs) have become prevalent in machine learning, but they face challenges with misaligned behavior, outdated knowledge, brittle evaluations, and indistinguishability from human-written text. Applications include chatbots, computational biology, and computer programming.
Large language models (LLMs) face challenges related to bias, toxicity detection, prompt injections, and outdated knowledge. Holistic benchmarking suites like HELM standardize evaluation methods. Model editing techniques and retrieval augmentation can address outdated knowledge. Watermarks remain detectable even after generated text is rewritten or mixed into longer hand-written documents.
This summary includes a list of references to various research papers and articles related to large language models and their applications in different fields. These include topics such as context length, few-shot learning, privacy attacks, attention mechanisms, code generation, social biases, and the evaluation of language models.
2103 word summary
Large Language Models (LLMs) have quickly become prevalent in machine learning, but there are still challenges and application areas to explore. This paper aims to establish a systematic set of open problems and successes to help ML researchers understand the current state of the field.
Large language models face challenges with misaligned behavior, outdated knowledge, brittle evaluations, and indistinguishability from human-written text. They also lack experimental designs and reproducibility. Applications include chatbots, computational biology, and computer programming.
Challenges in large language models include static evaluations, lack of experimental designs, and reproducibility. LLM design decisions are made before deployment, behavioral challenges occur during deployment, and science challenges hinder academic progress. Creative work, knowledge work, law, and medicine are among the application areas surveyed.
Challenges and Applications of Large Language Models: This review addresses the challenges and applications of large language models (LLMs). The challenges include data contamination, unfathomable datasets, and the presence of near-duplicates. These challenges can lead to inflated performance estimates.
Over 1% of tokens emitted by large language models are part of a memorized sequence, including personally identifiable information. The diversity and size of pre-training datasets impact downstream performance. Fine-tuning models on multiple tasks with few examples per task has been shown to improve generalization.
Tokenization is a process that breaks words into smaller units called tokens. Subword tokenization is commonly used, but it has drawbacks. Byte-level tokenization is an alternative that can be used with subword tokenizers or to define a limited vocabulary.
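As a concrete illustration of subword tokenization, here is a minimal byte-pair-encoding-style merge loop. This is a toy sketch: the corpus, the merge count, and the tie-breaking are illustrative choices, not details from the paper.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs and return the most common one.
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; byte-level tokenizers would start from raw bytes,
# which guarantees coverage of any input with a fixed base vocabulary.
tokens = list("lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # ['lowe', 'r', ' ', 'lowe', 's', 't']
```

Real tokenizers learn the merge table once over a large corpus and then apply it deterministically at encoding time.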
Training smaller models intensively upfront can offset larger inference costs in the future. Scaling laws for performance prediction differ between upstream and downstream setups. The majority of training costs go towards pre-training, which requires significant compute hours and resources. Performance increases with larger compute budgets.
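The train-small-to-serve-cheap trade-off can be sketched with the common rule of thumb of roughly 6 FLOPs per parameter per training token and 2 per parameter per generated token at inference. The model sizes and token counts below are illustrative assumptions, not figures from the paper.

```python
def train_flops(params, tokens):
    # Rough estimate: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

def inference_flops(params, tokens_served):
    # Forward pass only: ~2 FLOPs per parameter per generated token.
    return 2 * params * tokens_served

# Hypothetical comparison: a 70B model trained on 300B tokens vs. a 13B
# model trained on 1T tokens, each then serving 1T tokens of inference.
big = train_flops(70e9, 300e9) + inference_flops(70e9, 1e12)
small = train_flops(13e9, 1000e9) + inference_flops(13e9, 1e12)
print(small < big)  # True: the smaller, longer-trained model is cheaper overall
```

At high enough serving volume, the extra upfront training compute of the small model is amortized by its cheaper per-token inference.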
Large language models have different approaches for conditioning on tokens before and after masked ones. Span Corruption replaces contiguous token sequences with a unique masking token. Masked Language Modeling hides tokens by replacing them with a special [MASK] token.
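The two objectives can be contrasted in a short sketch. The token positions, sentinel names, and example sentence are illustrative, not taken from any particular implementation.

```python
MASK, SENTINELS = "[MASK]", ["<X>", "<Y>", "<Z>"]

def masked_lm(tokens, positions):
    # BERT-style: replace individual tokens with a [MASK] token.
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def span_corruption(tokens, spans):
    # T5-style: replace each contiguous span with a unique sentinel token;
    # the model is trained to emit the sentinel-delimited targets.
    out, targets, i = [], [], 0
    for sentinel, (start, length) in zip(SENTINELS, spans):
        out += tokens[i:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + length]
        i = start + length
    return out + tokens[i:], targets

toks = "the quick brown fox jumps over the lazy dog".split()
print(masked_lm(toks, {1, 4}))
print(span_corruption(toks, [(1, 2), (5, 1)]))
```

In the span-corruption setup, the model reconstructs each masked span after its sentinel, so one sequence-to-sequence step covers several masked regions.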
Large Language Models (LLMs) present challenges in training and inference due to their size. Model parallelism and pipeline parallelism are strategies used to distribute the model and data across multiple devices, reducing waiting times and maximizing computation resources. Techniques such as stacking
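Pipeline parallelism's reduction of idle time can be visualized with a toy GPipe-style forward schedule. This is a schematic sketch, not any framework's actual scheduler.

```python
def pipeline_schedule(n_stages, n_microbatches):
    # Stage s runs microbatch m at time step s + m, so once the pipeline
    # fills, all stages compute concurrently on different microbatches.
    steps = n_stages + n_microbatches - 1
    grid = [["." for _ in range(steps)] for _ in range(n_stages)]
    for s in range(n_stages):
        for m in range(n_microbatches):
            grid[s][s + m] = str(m)
    return ["".join(row) for row in grid]

for row in pipeline_schedule(n_stages=3, n_microbatches=4):
    print(row)
# 0123..
# .0123.
# ..0123
```

Each row is a pipeline stage holding a slice of the model's layers; splitting the batch into microbatches means later stages start working after one step instead of waiting for the entire batch to traverse earlier stages.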
Large language models have achieved competitive performance with minimal training data. Techniques such as soft prompts, scaling activations, and memory-efficient optimization have been explored. Efficient attention mechanisms can be achieved through hardware modifications or sub-quadratic approximations such as attention sparsity patterns.
Large language models face challenges in computation, routing, decoding strategies, and software efficiency. Efficient attention mechanisms and positional embedding schemes are explored to handle longer context lengths.
Efficient attention mechanisms that can process longer inputs are being developed to address the limited context of large language models (LLMs). These mechanisms include Luna, which uses nested linear attention functions, and alternative attention mechanisms that require less memory and compute resources.
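One sub-quadratic family replaces the softmax with a kernel feature map so attention can be computed associatively. Below is a minimal kernelized linear-attention sketch in the spirit of these approaches (not Luna specifically); the shapes and the elu-based feature map are illustrative choices.

```python
import numpy as np

def elu_feature(x):
    # Positive feature map phi(x) = elu(x) + 1, keeping all entries > 0.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Associativity lets us form phi(K)^T V once: O(n d^2) instead of
    # the O(n^2 d) cost of materializing the full attention matrix.
    fq, fk = elu_feature(q), elu_feature(k)
    kv = fk.T @ v                 # (d, d_v) fixed-size summary of keys/values
    z = fq @ fk.sum(axis=0)       # per-query normalizer
    return (fq @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Because the key/value summary has fixed size, cost grows linearly in sequence length, which is what makes much longer contexts tractable.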
Relative Positional Bias and ALiBi are methods that bias attention computations in large language models. While some positional encoding schemes offer better generalization to long sequences, their reliability is unclear. Fine-tuning pre-trained models is insufficient for length generalization,
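ALiBi can be sketched directly: each head subtracts a linearly growing penalty from attention scores as the query-key distance increases. This is a simplified causal version for illustration.

```python
def alibi_bias(n_heads, seq_len):
    # Head-specific slopes form a geometric sequence 2^-1, 2^-2, ...
    slopes = [2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]
    # The penalty grows linearly with distance and is added to the
    # pre-softmax attention scores (causal: only keys at or before the query).
    return [[[-s * (q - k) if q > k else 0.0 for k in range(seq_len)]
             for q in range(seq_len)] for s in slopes]

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0][3])  # [-1.5, -1.0, -0.5, 0.0]
```

Because the bias depends only on relative distance, not absolute position, a model trained with it can be run on sequences longer than any seen in training, though, as noted above, the reliability of such length generalization is unclear.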
Worked few-shot examples illustrate step-by-step arithmetic: Lisa has 5 easy peelers and buys 2 nets with 6 each, for a total of 5 + 2 × 6 = 17; the cafeteria has 37 bananas and buys 5 bunches of 5 bananas each, for a total of 37 + 5 × 5 = 62.
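The arithmetic in both worked examples checks out:

```python
easy_peelers = 5 + 2 * 6  # 5 on hand plus 2 nets of 6 each
bananas = 37 + 5 * 5      # 37 on hand plus 5 bunches of 5 each
print(easy_peelers, bananas)  # 17 62
```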
Large language models (LLMs) possess the capability of task learning, which involves acquiring new input-label mappings. The order of few-shot examples provided to LLMs significantly affects their performance. Various explanations for the in-context learning (ICL) phenomenon have been proposed.
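Order sensitivity is easy to see mechanically: every permutation of the demonstrations yields a different prompt string. The sentiment task below is an invented toy example, not from the paper.

```python
import itertools

examples = [("great movie", "positive"), ("dull plot", "negative"),
            ("loved it", "positive")]

def build_prompt(order, query):
    # Concatenate demonstrations as input-label pairs, then append the query.
    demos = "\n".join(f"Review: {x}\nLabel: {y}" for x, y in order)
    return f"{demos}\nReview: {query}\nLabel:"

# Three demonstrations give 3! = 6 distinct prompts -- and, empirically,
# different orderings can produce very different model accuracy.
prompts = {build_prompt(p, "not bad") for p in itertools.permutations(examples)}
print(len(prompts))  # 6
```

This is why prompt-order calibration and example-selection methods are an active research topic.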
In large language models, hallucinations can occur when the output cannot be verified or contradicts the source content. Retrieval augmentation, where external knowledge is used to ground the model's input, can help mitigate hallucinations. Various approaches, such as retrieving relevant documents to ground generation, have been explored.
Large Language Models (LLMs) often generate outputs that don't align with human values. Pre-training with human feedback (PHF) during the pre-training stage improves alignment. Conditional training is the most effective PHF approach.
RLHF can lead to unwanted effects in language models, such as repeating a user's political views and expressing strong political and religious opinions. Self-improvement techniques, such as fine-tuning on self-generated data, have been used to align models with human preferences.
Research areas related to red teaming and debate aim to evaluate the safety and usefulness of large language models (LLMs) during training. LLMs can improve factuality and reasoning through self-play and short statement evaluations. However, this approach requires multiple model generations, which increases compute costs.
Large language models (LLMs) face challenges related to bias, toxicity detection, prompt injections, and outdated knowledge. Bias in LLMs arises from the inclusion of web-crawled data containing political discourse, hate speech, and other media biases.
Holistic benchmarking suites like HELM standardize evaluation methods and cover a wide range of capabilities. Language models are also benchmarked on tests designed for humans. Model editing techniques and retrieval augmentation can address outdated knowledge. Large language models achieve human-level performance on some of these exams.
Low-entropy tokens are difficult to change, so a "soft" watermark is applied to high-entropy tokens. Watermarks remain detectable in LLM outputs even after the text is rewritten or mixed into longer hand-written documents.
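The green-list detection idea behind such watermarks can be sketched as follows. The hash-based partition and toy vocabulary are illustrative, not the exact published scheme.

```python
import hashlib
import math

def green_list(prev_token, vocab, fraction=0.5):
    # Pseudo-randomly partition the vocabulary into "green" and "red"
    # lists, seeded by the previous token.
    def is_green(tok):
        digest = hashlib.sha256(f"{prev_token}:{tok}".encode()).digest()
        return digest[0] / 256.0 < fraction
    return {t for t in vocab if is_green(t)}

def detection_z_score(tokens, vocab, fraction=0.5):
    # Watermarked text over-uses green tokens; a z-score tests the excess
    # over the `fraction` expected by chance.
    hits = sum(tok in green_list(prev, vocab, fraction)
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - fraction * n) / math.sqrt(fraction * (1 - fraction) * n)

vocab = [f"tok{i}" for i in range(100)]
sample = ["tok3", "tok14", "tok15", "tok9", "tok2", "tok6"]
print(round(detection_z_score(sample, vocab), 2))
```

On the generation side, the sampler slightly boosts green-token logits; a high z-score on a text then indicates watermarking, while the soft boost leaves low-entropy tokens effectively unconstrained.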
Compositional tasks are used to test whether language models can go beyond rote memorization. Large models show no improvement in solving composed problems compared to sub-problems. Transformers reduce compositional tasks to shortcut learning and lack robust generalization.
Large language models are categorized by size and architecture: encoder-only, decoder-only, encoder-decoder, and multilingual models, with parameter counts ranging from 245M to 1.5T.
This summary discusses the challenges and applications of large language models. It highlights various models developed by different organizations and their release dates. The summary also mentions the issue of repeatability in training runs and generations of closed-source, API-served models.
The scheduling and communication strategies between nodes in large language models can be non-deterministic, which can affect the final result. Reproducibility is compromised due to changes in pre-training datasets and non-deterministic parallelism strategies. Commercial, API-served language models can also change without notice.
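The root cause is simple to demonstrate: floating-point addition is not associative, so a different reduction order across devices can change results bit-for-bit.

```python
# Summing the same three numbers in two orders gives two different floats,
# which is why non-deterministic reduction orders break exact repeatability.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # two slightly different sums
```

Distributed training and inference frameworks sum gradients and activations across devices in whatever order communication completes, so run-to-run bitwise divergence is expected unless reduction order is pinned.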
Glaese et al. propose Sparrow, a chatbot based on a large language model (LLM) called Chinchilla. Various applications of LLMs are discussed, including chatbots, genomics, computational biology, and computer programming.
Several large language models have been developed for specific applications, such as genomic analysis and code generation. For genomic analysis, models like GenSLMs, Nucleotide Transformers, and HyenaDNA have been trained on gene sequences to predict new variants and genomic properties.
Training phi-1 with filtered datasets and synthetic data achieves near-SOTA results with fewer parameters. Long-range dependencies in code repositories can be addressed using retrieval-based frameworks like RepoCoder. PolyCoder is a multilingual programming LLM.
The challenges and applications of large language models (LLMs) are discussed. LLMs are used for story generation, creative tasks, visual creative tasks, knowledge work, and data analysis. The inability of LLMs to keep the entire generated work within their context window limits long-form generation.
GPT-4 uses a modular prompting framework and performs well but underperforms compared to human data analysts. Galactica LLM is trained specifically for scientific knowledge work. GPT-3.5 achieves high accuracy on qualitative sections and shows potential.
Large language models (LLMs) have been evaluated for their ability to complete judicial opinions and medical question answering tasks. GPT-4 outperforms GPT-3.5 in medical benchmarks, but issues of erroneous generations and bias remain.
LLMs have been used for various applications, including improving GPT-3.5's performance on reasoning benchmarks and breaking down mathematical word problems. In the medical field, LLMs have been applied to extract data from medical sources and disambiguate medical terms.
GPT-3.5/4 outperforms existing algorithms in causal benchmarks, while ChatGPT performs poorly. LLMs are used to simulate human behavior, analyze behavioral characteristics, and simulate social relationships. LLMs are limited in their ability to faithfully simulate human behavior.
Large language models (LLMs) have been used in various research areas, including planning in simulated worlds and modeling human behavior in social sciences and psychology. LLMs have shown potential in replicating human judgments and behaviors, although larger models tend to perform better at replicating them.
Das et al. (2022) developed QAmeleon, a multilingual question-answering model trained with only 5 examples. Ahia et al. (2023) studied tokenization costs in commercial language models.
This document contains a list of references to various papers and studies on large language models, including topics such as collaborative inference, fine-tuning, parameter efficiency, memorization, adversarial alignment, and more.
This excerpt includes a list of references to various papers and articles related to large language models.
Transformer-XL, an attentive language model beyond a fixed-length context, is discussed along with other relevant models and approaches in the field of computational linguistics.
This document includes various research papers and preprints related to large language models, covering topics such as structured information extraction, analysis of model performance compared to humans, limitations of transformers, reducing hallucination in dialogue systems, mathematical frameworks for transformer circuits, and self-supervised learning.
Excerpted from the document are various references to papers and preprints related to large language models. These references cover topics such as automated formalization of theorem statements, sparse training with mixture-of-experts, social reasoning in language models, and reducing harms through red teaming.
Efficient evolution of human antibodies from general protein language models. Artificial muses: Generative AI chatbots with human-level creativity. Red teaming with coevolution. A theory of emergent in-context learning as implicit structure induction. Classifier-free diffusion guidance.
Large language models have the potential for self-improvement and can be used for creative writing, AI safety, and few-shot learning. They can also be trained as zero-shot planners, for protein structure prediction, and to evaluate and induce personality. Efficient
GeDi: Generative discriminator guided sequence generation; Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense; Subword regularization improves neural network translation models; SentencePiece: A simple and language independent subword tokenizer.
This document contains a collection of research papers on various challenges and applications of large language models. The papers cover topics such as designing effective instruction tuning, measuring the effects of training data, overcoming prompt order sensitivity, maximizing communication efficiency, and analyzing leakage of personally identifiable information.
This summary includes a selection of research papers and projects related to large language models, covering topics such as healthcare, screenplay writing, model editing, text detection, cross-task generalization, genomic sequence modeling, code synthesis, and image generation.
The challenges and applications of large language models are discussed in various papers and technical reports. These include issues related to the speed of ChatGPT (GPT-4), asynchronous pipelines for processing large corpora, and the use of the fairseq toolkit for sequence modeling.
This excerpt includes various references to papers, blog posts, and conference proceedings related to large language models, transformer frameworks, scaling language models, and training techniques. It also mentions specific topics such as legal information extraction, arithmetic and symbolic induction, and sentiment analysis.
In a document discussing challenges and applications of large language models, various sources and studies are referenced. Topics include model training, non-deterministic inference, automated evaluation methods, object hallucination in image captioning, the false consensus effect, and risk psychology.
Several sources and papers related to large language models and their applications are referenced in the text excerpt. These sources cover topics such as quantifying the capabilities of language models, grammatical error correction, knowledge-enhanced pre-training, and legal aspects of language models.
Several studies and papers on large language models for various applications have been referenced, including models for science, email understanding, dialog applications, data-to-text generation, authorship attribution, math word problem solving, robotics, biomedical text, and protein generalization.
This document provides a list of references to various papers and articles related to large language models. It includes studies on model hallucinations, optimization, alignment, benchmarking, prompting, vulnerability, parallelism, fine-tuning, and other topics.