Summary: Embers of Autoregression: Understanding Large Language Models (arxiv.org)
48,510 words - PDF document
One Line
Language models are biased towards common tasks and patterns, making systematic reasoning challenging, though careful prompting may provide some assistance.
Key Points
- Large language models (LLMs) are heavily influenced by the probability of their outputs, even for deterministic tasks, favoring high-probability sequences over low-probability ones
- LLMs exhibit systematic biases and limitations in tasks like shift ciphers, reversals, and counting, suggesting their reasoning is constrained by autoregressive structure
- However, LLMs can also exhibit more flexible reasoning, as shown in their ability to solve Pig Latin and article swapping tasks
- LLMs' performance is shaped by the statistical properties of the language they are trained on, highlighting the importance of understanding their biases and limitations
- Probing LLMs' capabilities across diverse tasks provides valuable insights into their inner workings and the interplay of different cognitive capabilities
Summaries
22 word summary
LLMs are biased towards common tasks and patterns, not systematic reasoning. Careful prompting may help, but overcoming language modeling constraints remains challenging.
47 word summary
LLMs like GPT-3.5 and GPT-4 are biased towards common tasks, high-probability outputs/inputs, and statistical patterns in training data, rather than systematic reasoning. While impressive, they are language models, not general reasoning systems. Careful prompting and calibration may help, but overcoming language modeling constraints remains a key challenge.
109 word summary
Large language models (LLMs) like GPT-3.5 and GPT-4 are trained for next-word prediction, leading to three "embers of autoregression": sensitivity to task frequency, output probability, and input probability. LLMs perform better on common tasks and high-probability outputs/inputs, even for deterministic tasks. This suggests LLMs are heavily influenced by statistical patterns in their training data rather than systematic logical reasoning. While LLMs can achieve impressive results, they are fundamentally language models, not general reasoning systems. Careful prompting, scaling, and calibration may help, but overcoming the constraints of language modeling remains a key challenge. Adopting a teleological perspective can aid in understanding LLM strengths and weaknesses and can guide future developments in language modeling.
406 word summary
Large language models (LLMs) like GPT-3.5 and GPT-4 are trained to perform the task of next-word prediction on internet text. Analyzing them in light of this training objective (a "teleological approach") can help explain certain quirks and limitations of these models when they are applied to tasks beyond next-word prediction.
We identify three key "embers of autoregression" that stem from this training objective:
1. Sensitivity to task frequency: LLMs perform better on tasks that are common in their training data, even if the tasks have equivalent complexity. For example, GPT-4 achieves higher accuracy on common Pig Latin variants compared to rare ones.
2. Sensitivity to output probability: LLMs are more accurate when the correct answer is high-probability text, even for deterministic tasks. For instance, GPT-4 is more accurate at reversing high-probability sentences than low-probability ones.
3. Sensitivity to input probability: LLMs sometimes perform better on deterministic tasks when the input text is high-probability, though input probability is less influential than output probability. GPT-4 is more accurate at decoding high-probability sentences in a simple cipher compared to low-probability ones.
We test these predictions across a range of tasks, including shift ciphers, Pig Latin, acronyms, and linear functions. The results demonstrate that LLMs' autoregressive training causes them to be highly sensitive to the statistical properties of their inputs and outputs, even for deterministic tasks.
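To make the shift-cipher task concrete, here is a minimal sketch of the decoding step the models are asked to perform; the function and example strings are illustrative and not taken from the paper.

```python
def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher by moving each letter back `shift` positions."""
    decoded = []
    for ch in ciphertext:
        if ch.islower():
            decoded.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        elif ch.isupper():
            decoded.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            decoded.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(decoded)

# rot-13 is common in internet text; rot-2 is equally simple but rare.
print(shift_decode("Fgnl", 13))  # -> "Stay"
print(shift_decode("Uvca", 2))   # -> "Stay"
```

The two calls require the same computation; only the frequency of the cipher in training data differs.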
For example, in shift cipher decoding, GPT-4 had 51% accuracy for high-probability outputs but only 13% for low-probability ones. Similar effects were seen in word reversal and Pig Latin translation tasks, where accuracy increased with output probability.
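The high-probability versus low-probability split above relies on scoring candidate sentences with a language model. Below is a minimal sketch of one way such scores can be computed, assuming GPT-2 from the Hugging Face transformers library as the scoring model (the specific scorer and example sentences are assumptions, not necessarily the paper's setup).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text: str) -> float:
    """Total log-probability the scoring model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.shape[1] - 1)

# Higher (less negative) scores mark "high-probability" sentences.
print(sentence_log_prob("The quick brown fox jumps over the lazy dog."))
print(sentence_log_prob("Dog lazy the over jumps fox brown quick the."))
```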
However, input probability had mixed effects, with significant impacts only in some cases. This suggests output probability is a more pervasive factor in influencing LLM performance.
Beyond these three effects, the teleological approach also provides a unifying perspective on other LLM behaviors, such as lack of embodiment, sensitivity to wording, difficulty on meaning-dependent tasks, and limitations in compositionality and systematicity.
While LLMs can be scaled up to achieve impressive results on many tasks, our findings highlight that they are fundamentally language models, not general reasoning systems. Their performance is heavily influenced by statistical patterns in their training data rather than systematic application of logical rules.
Careful prompting, scaling, and calibration may help mitigate these issues, but overcoming the fundamental constraints of language modeling remains an important challenge for advancing AI capabilities. Adopting a teleological perspective when analyzing LLMs can help us better understand their strengths and weaknesses and guide future developments in language modeling.
1712 word summary
Large language models (LLMs) have become widely adopted, but understanding their strengths and limitations is crucial. The authors argue that to develop a holistic understanding of these systems, we must consider the problem they were trained to solve: next-word prediction over internet text. This "teleological approach" leads to three key factors that influence LLM accuracy: task probability, target output probability, and input probability.
The authors predict that LLMs will achieve higher accuracy when these probabilities are high, even in deterministic settings. To test this, they evaluate GPT-3.5 and GPT-4 on various tasks, revealing surprising failure modes. For example, GPT-4's accuracy at decoding a simple cipher drops from 51% for high-probability outputs to only 13% for low-probability outputs.
These results show that AI practitioners should be cautious when using LLMs in low-probability situations. The authors conclude that we should not evaluate LLMs as if they are human, but rather treat them as a distinct system shaped by the pressures of next-word prediction. This teleological approach provides a framework for understanding LLM behavior and predicting their performance across different tasks and contexts.
Large language models (LLMs) like GPT-4 are trained to perform the task of next-word prediction on Internet text. Analyzing them in light of this training objective, which we refer to as the "teleological approach," can help explain certain quirks and limitations of these models when applied to tasks beyond next-word prediction.
We hypothesize three key "embers of autoregression" that stem from this training objective:
1. Sensitivity to task frequency: LLMs perform better on tasks that are common in their training data, even if the tasks have equivalent complexity. For example, GPT-4 achieves higher accuracy on common Pig Latin variants compared to rare ones.
2. Sensitivity to output probability: LLMs are more accurate when the correct answer is high-probability text, even for deterministic tasks. For instance, GPT-4 is more accurate at reversing high-probability sentences than low-probability ones.
3. Sensitivity to input probability: LLMs sometimes perform better on deterministic tasks when the input text is high-probability, though input probability is less influential than output probability. GPT-4 is more accurate at decoding high-probability sentences in a simple cipher compared to low-probability ones.
Beyond these three effects, the teleological approach also provides a unifying perspective on other LLM behaviors, such as lack of embodiment, sensitivity to wording, difficulty on meaning-dependent tasks, and limitations in compositionality and systematicity. Understanding these "embers of autoregression" is crucial as LLMs become more widely deployed, to anticipate their failures and biases.
Large language models (LLMs) are statistical systems trained on internet text, making them sensitive to the statistics of word sequences in that text. We make three key predictions about LLM performance:
1) LLMs will perform worse on rare tasks than on frequent tasks, even when task complexity is equal.
2) LLMs will be sensitive to the probability of the answers they must produce, even for deterministic tasks.
3) LLMs will be sensitive to the probability of their inputs, but to a lesser degree than output probability.
We test these predictions across a range of tasks, including shift ciphers, Pig Latin, acronyms, and linear functions. For shift ciphers, LLMs perform much better on the common rot-13 cipher than on rare shift levels. Similarly, they excel at the frequent Pig Latin variant but struggle with the rare Boar Etruscan. For acronyms, they are far more accurate at forming first-letter acronyms than second-letter ones. And they apply the common Celsius-to-Fahrenheit linear function much better than a rare but similar function.
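As one concrete illustration, the two acronym variants differ only in which character is taken from each word, so they are computationally identical; the example words below are chosen for illustration and do not come from the paper.

```python
def acronym(words: list[str], letter_index: int) -> str:
    """Form an acronym from the letter at `letter_index` of each word."""
    return "".join(word[letter_index] for word in words).upper()

words = ["general", "purpose", "technology"]
print(acronym(words, 0))  # first-letter acronym (common in text): "GPT"
print(acronym(words, 1))  # second-letter acronym (rare):          "EUE"
```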
These results cannot be explained by task complexity, as the rare and common variants are equally straightforward computationally. Rather, they demonstrate that LLMs' autoregressive training causes them to be highly sensitive to the statistical properties of their inputs and outputs, even for deterministic tasks. This highlights the importance of considering LLMs' probabilistic nature when evaluating their capabilities.
Large language models (LLMs) like GPT-3.5 and GPT-4 perform better on common task variants than rare ones, even when the tasks are deterministic. This suggests LLMs are sensitive to the probability of the task or output.
In shift cipher decoding, GPT-4 had 51% accuracy for high-probability outputs but only 13% for low-probability ones. Models often produced incorrect but higher-probability outputs. Similar effects were seen in word reversal and Pig Latin translation tasks.
For article swapping and acronym formation, accuracy increased with output probability, even when input probability was controlled. This indicates output probability is a key factor.
Counting performance also varied with output probability - models were more accurate when counting to common numbers. Input probability had mixed effects, with significant impacts only in some cases. This suggests input probability is less pervasive than output probability in influencing LLM performance.
Overall, these results demonstrate LLMs are heavily influenced by the probability of their outputs, even for deterministic tasks. This highlights their tendency to favor high-probability sequences, which can lead to errors when the correct answer is low-probability. The findings situate LLM abilities within a broader teleological perspective focused on probability.
Our experiments have revealed several key insights about the behavior of large language models (LLMs) like GPT-3.5 and GPT-4:
LLMs are more sensitive to output probability than input probability. They perform better on tasks involving high-probability outputs, even when the inputs are low-probability. This effect was observed for tasks like decoding shift ciphers and forming acronyms.
However, input probability can have a large effect in certain cases, such as when models are asked to provide the birthdays of public figures. Models performed much better on common figures than rare ones, suggesting they have memorized information about frequent entities rather than applying a systematic function.
LLMs also display sensitivity to the wording and formatting of prompts. For example, they performed better at applying a linear function (the Celsius-to-Fahrenheit conversion) when the prompt explicitly mentioned the context, compared to just presenting the abstract equation. Similarly, their accuracy on multiplication varied widely depending on whether the numbers were presented as digits, words, or in alternating capitalization.
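As a concrete illustration of the linear-function comparison: the Celsius-to-Fahrenheit conversion is the common function, while the second function below is a hypothetical rare counterpart of the same form, included only to show that both are equally easy to compute.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Common linear function, ubiquitous in training text."""
    return (9 / 5) * c + 32

def rare_linear(c: float) -> float:
    """Hypothetical rare function of the same form f(x) = a*x + b."""
    return (7 / 4) * c + 31

print(celsius_to_fahrenheit(100))  # 212.0
print(rare_linear(100))            # 206.0
```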
These findings highlight that LLMs are fundamentally language models, not general reasoning systems. Their performance is heavily influenced by statistical patterns in their training data rather than systematic application of logical rules.
While these limitations may seem obvious in hindsight, they are not always accounted for when evaluating or deploying LLMs. Our teleological approach of analyzing the problems LLMs were trained to solve can help uncover such model-specific biases and failure modes. Careful prompting, scaling, and calibration may help mitigate these issues, but overcoming the fundamental constraints of language modeling remains an important challenge for advancing AI capabilities.
Our analysis of large language models (LLMs) reveals that they are fundamentally shaped by the statistical properties of their training data and architecture, rather than exhibiting human-like reasoning or understanding. We find that LLMs struggle with tasks that require meaning-based processing, such as tracking entities or handling negation, as they are primarily models of language rather than thought. Additionally, LLMs are limited by their autoregressive nature, which prevents them from planning ahead or correcting errors in their previous output.
We also show that LLM performance is heavily influenced by the probability distribution of their training data. They exhibit biases and rely on spurious correlations, and can even memorize specific sentences that are frequently repeated in their training corpus. Furthermore, the choice of input representation, such as subword tokenization, can impact their performance on character-level manipulations.
While LLMs can be scaled up to achieve impressive results on many tasks, we find that this scaling does not fully overcome the fundamental limitations imposed by their training objective and architecture. Nonetheless, we recognize that LLMs also have beneficial properties, such as their generality and ability to learn efficiently from natural data.
Our findings highlight the importance of adopting a teleological perspective when analyzing LLMs, focusing on the specific problem they are designed to solve rather than anthropomorphizing their capabilities. This approach can help us better understand the strengths and weaknesses of these systems and guide future developments in language modeling.
Large language models (LLMs) have shown impressive capabilities, but also raise important questions about their inner workings and limitations. This paper explores the "embers of autoregression": residual effects of the next-word-prediction training objective that persist even when LLMs are applied to tasks far removed from language modeling. The authors investigate how this autoregressive training shapes the models' reasoning abilities through a series of targeted behavioral tests.
Key findings include:
- LLMs exhibit systematic biases and limitations in tasks like shift ciphers, reversals, and counting, suggesting their reasoning is constrained by autoregressive structure.
- However, LLMs can also exhibit more flexible reasoning, as shown in their ability to solve Pig Latin and article swapping tasks.
- The authors propose that LLMs may combine autoregressive and more flexible reasoning, with the balance depending on the task and prompting.
- Overall, the results highlight the need to better understand the interplay of different cognitive capabilities in large language models, and how to effectively leverage their strengths while mitigating their limitations.
Large language models (LLMs) like GPT-3.5 and GPT-4 can perform a variety of language tasks, from shift ciphers and Pig Latin to counting and sorting. Statistical analyses reveal key insights about their capabilities:
For shift ciphers, LLMs show better performance on the more common rot-13 cipher compared to rot-2. Their accuracy is influenced by the probability of the input and output, with higher-probability examples yielding better results. Prompting techniques like step-by-step and chain-of-thought can further improve performance.
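The prompt wording below is illustrative rather than the paper's exact phrasing; it shows the difference between asking for the answer directly and asking the model to work step by step on the same rot-13 input.

```python
message = "Fgnl pnyz."  # rot-13 encoding of "Stay calm."

direct_prompt = (
    f"Here is a message encoded in rot-13: {message}\n"
    "Output only the decoded message."
)

step_by_step_prompt = (
    f"Here is a message encoded in rot-13: {message}\n"
    "Decode it letter by letter, showing your work, "
    "then give the decoded message on the final line."
)
```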
In Pig Latin tasks, LLMs are sensitive to the specific variant used, with more common variants like "-way" yielding higher accuracy. As with shift ciphers, output probability is a key factor, while input probability has little effect.
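A minimal sketch of the Pig Latin transformation, assuming the common rule of moving the leading consonant cluster to the end of the word and appending a suffix; the rare suffix shown is illustrative rather than one of the paper's specific variants.

```python
VOWELS = "aeiou"

def pig_latin(word: str, suffix: str) -> str:
    """Move the leading consonant cluster to the end and append `suffix`."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + suffix
    return word + suffix  # no vowel: just append the suffix

print(pig_latin("string", "ay"))   # common variant: "ingstray"
print(pig_latin("string", "oof"))  # rare variant:   "ingstroof"
```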
For sorting tasks, LLMs are much better at alphabetical order than reverse alphabetical order, reflecting the relative frequencies of these orderings in natural language. Ascending and descending order show a smaller gap in performance.
Counting tasks reveal that LLMs are highly sensitive to the probability of the output number, and the magnitude of the number matters more for character-based counting than for word-based counting. Input probability can also influence counting accuracy, especially for rare numbers.
Overall, these findings demonstrate that LLMs' performance is shaped by the statistical properties of the language they are trained on, highlighting the importance of understanding their biases and limitations. Probing their capabilities across diverse tasks provides valuable insights into their inner workings.