Summary: Limits of Transformers on Compositionality (arxiv.org)
20,179 words - PDF document
One Line
Transformers struggle with complex, multi-step reasoning and compositional operations: they fail to generalize beyond the complexity seen in training and cannot reliably plan and compose multiple steps into an overall correct chain of reasoning.
Key Points
- Transformers struggle with complex, multi-step reasoning tasks and compositional operations.
- They fail to generalize beyond the complexity seen in the training data and often collapse the depth of compositional operations.
- They excel at low-complexity tasks but degrade sharply on higher-complexity and out-of-distribution cases.
- They rely on pattern matching rather than general reasoning, a weakness on tasks that require true multi-step composition.
- These may be inherent limitations of Transformers on high-complexity compositional tasks; further research is needed to address them.
Summary
861-word summary
Understanding the limitations of Transformers in compositional reasoning is crucial for developing more reliable and robust AI systems. This knowledge helps researchers, developers, and policymakers make informed decisions about applying Transformers across domains, and shedding light on these limitations contributes to a deeper understanding of what these models can and cannot do. The analysis can guide future research toward models with improved performance on complex tasks requiring compositional reasoning. The authors foresee no negative societal impacts: the work analyzes the reasons behind Transformers' failures and successes without introducing any new model or dataset, and future work may build on these findings.

The experiments used different data splits, including problem size and the depth and width of the computation graph. Models were fine-tuned on tasks such as multiplication, dynamic programming, and puzzles, and the performance of several language models was evaluated, including GPT4, ChatGPT, LLaMA, and FlanT5, in both zero-shot and few-shot settings. The results showed a lack of generalization to out-of-domain examples and a decline in performance as task complexity increased.
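The evaluation described above, scoring accuracy against problem size to expose the drop on harder instances, can be sketched minimally as follows. The function names and the `toy_model` stand-in are illustrative, not the paper's code; any model API can be plugged in behind `model_answer`.

```python
# Sketch: exact-match accuracy binned by problem size, to surface the
# performance decline on larger (out-of-distribution) instances.
from collections import defaultdict

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold.strip()

def accuracy_by_size(examples, model_answer):
    """examples: list of (prompt, gold_answer, problem_size) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for prompt, gold, size in examples:
        total[size] += 1
        if exact_match(model_answer(prompt), gold):
            correct[size] += 1
    return {size: correct[size] / total[size] for size in total}

# Hypothetical stand-in model: handles 2-digit multiplication, fails larger.
def toy_model(prompt):
    a, b = map(int, prompt.split("*"))
    return str(a * b) if a < 100 and b < 100 else "0"

examples = [("12*34", "408", 2), ("99*81", "8019", 2), ("123*456", "56088", 3)]
print(accuracy_by_size(examples, toy_model))  # {2: 1.0, 3: 0.0}
```

Binning by size (or by computation-graph depth) is what makes the in-distribution/out-of-distribution contrast visible in a single table.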
The cost of fine-tuning GPT3 for the multiplication task was approximately $12 million for four epochs on question-answer pairs. Overall, the experiments highlight the limitations of Transformers in compositionality and in generalizing to out-of-domain examples. The paper includes a sample scratchpad for the puzzle task, a final solution to the puzzle, and a step-by-step reasoning process; it also describes the clue types used and how the experimental data were constructed. For the multiplication task, it provides an example prompt and scratchpad along with the process of multiplying two numbers. Appendices supply additional details, and the paper closes with references to related work on language models, reasoning, problem-solving, and neural networks.
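As a rough illustration of what a multiplication scratchpad can look like — the paper's exact prompt format differs, so this is only a sketch of the idea — the product is decomposed into digit-wise partial products that are then summed:

```python
# Sketch: generate a step-by-step multiplication scratchpad by expanding
# one factor digit by digit into place-value partial products.
def multiplication_scratchpad(a: int, b: int) -> str:
    lines = [f"Compute {a} * {b}."]
    partials = []
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place
        partials.append(partial)
        lines.append(f"{a} * {digit} * 10^{place} = {partial}")
    lines.append(f"Sum: {' + '.join(map(str, partials))} = {a * b}")
    return "\n".join(lines)

print(multiplication_scratchpad(12, 34))
```

Each line of the scratchpad corresponds to one node of the task's computation graph, which is what lets errors be localized to individual reasoning steps.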
Transformers have limitations in handling complex, multi-step reasoning tasks and compositional operations. They struggle to generalize beyond the complexity seen in the training data and often collapse the depth of compositional operations. They may perform well on single-step reasoning but face challenges in combining multiple steps effectively; despite impressive empirical results, these fundamental limitations suggest that full mastery of such tasks is difficult to reach. Transformers show weaknesses on tasks that require true multi-step compositional operations and struggle with precise compositional reasoning, and theoretical findings show that errors escalate rapidly as problem size increases. These limitations highlight the need for further investigation and for models capable of robust generalization and systematic problem-solving.

Transformers exhibit signs of memorization during training, as they can produce correct outputs despite incorrect intermediate computations, yet they struggle to plan and compose multiple steps into overall correct reasoning. While they can perform single-step reasoning, they rely on pattern matching rather than general reasoning. Restoration errors occur at a higher rate than local errors, suggesting that models propagate errors through their computations. Transformers excel at low-complexity tasks but struggle with higher complexity and out-of-distribution cases, tending to guess partially correct answers without fully understanding the task. Pre-training alone is not sufficient to teach models how to combine basic operations into compositional reasoning, though Transformers perform better with explicit reasoning through scratchpads.
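The error categories mentioned above can be made concrete with a small classifier over computation-graph nodes. The definitions here follow the summary's terminology and are a sketch, not the paper's code: a node whose inputs were correct but whose output is wrong is a local error; wrong inputs leading to a wrong output is a propagated error; and a correct output despite wrong inputs is a restoration error, a sign of memorization rather than computation.

```python
# Sketch: classify each computation-graph node by comparing whether its
# inputs and its output match the gold computation.
def classify_node(inputs_correct: bool, output_correct: bool) -> str:
    if output_correct:
        return "correct" if inputs_correct else "restoration"
    return "local" if inputs_correct else "propagation"

# Example: a step that is right even though it consumed a wrong
# intermediate value is classified as a restoration error.
print(classify_node(inputs_correct=False, output_correct=True))  # restoration
```

Counting these categories over many problems is what supports the claims above: a high share of restoration errors points to memorized subresults, while abundant propagation errors show how a single early mistake corrupts the rest of the chain.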
Performance deteriorates as problems become more complex, and zero-shot and few-shot settings highlight the limits of Transformers when learning without explicit guidance. The experimental setup tests different models and configurations on three representative compositional tasks: multi-digit multiplication, logic grid puzzles (Einstein's puzzle), and dynamic programming problems, with the limitations analyzed through computation graphs. While Transformers perform well on tasks involving basic reasoning operations, they struggle with tasks that require multi-step reasoning, tending toward shallow, rote learning rather than deep, holistic understanding. The study suggests that Transformers may have inherent limitations in solving high-complexity compositional tasks and that further research is needed to address them.
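The computation-graph view used above can be sketched minimally: represent the solution procedure as a DAG whose nodes are intermediate values and whose edges feed one step's output into the next, then measure reasoning depth as the longest input-to-answer path. The node names below are illustrative, not the paper's notation.

```python
# Sketch: a computation graph as a dict mapping each node to the parent
# nodes it depends on; depth = length of the longest dependency chain.
def graph_depth(graph):
    memo = {}
    def depth(node):
        if node not in memo:
            parents = graph.get(node, [])
            memo[node] = 1 + max((depth(p) for p in parents), default=0)
        return memo[node]
    return max(depth(n) for n in graph)

# Illustrative graph for 12 * 34 via partial products:
g = {
    "12*4": [],                      # digit-wise partial products
    "12*3": [],
    "shift(12*3)": ["12*3"],         # scale by place value
    "sum": ["12*4", "shift(12*3)"],  # final addition
}
print(graph_depth(g))  # 3
```

Depth and width of this graph are exactly the complexity axes used for the data splits: a model that handles depth-2 graphs in training is then probed on deeper, wider graphs at test time.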