Summary: "Auto-Regressive Next-Token Predictors: A Theoretical Framework" (arxiv.org)
8,770 words - PDF document
One Line
Auto-Regressive Next-Token Predictors (ARNPs) such as GPT-3 and GPT-4 simplify function learning by treating each token as both an input and a label; the paper shows that even simple next-token predictors generate coherent text and solve arithmetic tasks, and it studies the trade-off between length complexity and other complexity measures.
Key Points
- Auto-Regressive Next-Token Predictors (ARNPs) are universal learners capable of solving complex tasks.
- ARNPs trained on the task of next-token prediction can approximate any function efficiently computed by a Turing machine.
- Linear ARNPs can compute any Turing computable function, demonstrating the power of linear models.
- Auto-regressive learning allows for supervision on intermediate steps in the computation process and enables the learner to compute non-linear functions.
- Length complexity, which measures the number of intermediate tokens required to approximate a target function, plays a crucial role in the performance of ARNPs.
- Experimental results show the effectiveness of simple next-token predictors in generating coherent text and solving arithmetic tasks.
- Chain-of-Thought (CoT) reasoning and scratchpad techniques enhance the performance of language models in logical reasoning and arithmetic tasks.
- Theoretical investigations into language models and transformers contribute to a better understanding of their capabilities and constraints.
Summaries
41 word summary
ARNPs like GPT-3 and GPT-4 generate coherent responses. They treat each token as both input and label, simplifying function learning. Next-token predictors are effective in generating text and solving arithmetic tasks. Research examines the trade-off between length complexity and other measures.
62 word summary
Auto-Regressive Next-Token Predictors (ARNPs) like GPT-3 and GPT-4 are powerful models that can generate coherent and contextually relevant responses. They simplify complex function learning by treating each token as both an input and a label. Experimental results show the effectiveness of next-token predictors in generating text and solving arithmetic tasks. Ongoing research explores the trade-off between length complexity and other complexity measures.
151 word summary
Auto-Regressive Next-Token Predictors (ARNPs) like GPT-3 and GPT-4 are powerful models that can generate coherent and contextually relevant responses. They can efficiently approximate any function computed by a Turing machine. Linear ARNPs can implement any target function. ARNPs simplify complex function learning by treating each token as both an input and a label, enabling the computation of non-linear functions. Length complexity, which measures the number of intermediate tokens needed to approximate a target function, is crucial for ARNP performance. Experimental results demonstrate the effectiveness of simple next-token predictors in generating coherent text and solving arithmetic tasks. Chain-of-Thought (CoT) reasoning and scratchpad techniques enhance language model performance. MLP-based architectures show promise as alternatives to transformers in language modeling tasks. Ongoing research explores the trade-off between length complexity and other complexity measures. The theoretical underpinnings of language models and transformers are relatively unexplored, but recent investigations shed light on their capabilities and constraints.
455 word summary
Auto-Regressive Next-Token Predictors (ARNPs) like GPT-3 and GPT-4 are powerful models that can solve complex tasks by generating coherent and contextually relevant responses. ARNPs trained on next-token prediction can efficiently approximate any function computed by a Turing machine. The length complexity, which measures the number of intermediate tokens needed to approximate a target function, is crucial for ARNP performance. Linear ARNPs can compute any Turing computable function, demonstrating that linear models can implement any target function.
Auto-regressive learning simplifies complex function learning by treating each token as both an input and a label. This enables the learner to compute non-linear functions, unlike classical supervised learning. However, there is a complexity trade-off with ARNPs. They require long sequences of tokens to detail the internal computations of the target, and length complexity quantifies the number of intermediate tokens necessary to learn a specific concept. Length complexity can be traded off with sample complexity or computational complexity for certain tasks.
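To illustrate how a single sequence supervises every intermediate step, the sketch below (an illustrative assumption about the general recipe, not code from the paper) turns one token sequence into (prefix, next-token) training pairs, so that each token serves once as a label and afterwards as part of the input.

```python
# Minimal sketch of auto-regressive supervision: every token in a sequence
# is used once as a prediction target and then as part of the context.
# The function name and token representation are illustrative, not from the paper.

def next_token_pairs(tokens):
    """Turn one token sequence into (prefix, next_token) training examples."""
    pairs = []
    for t in range(1, len(tokens)):
        prefix = tokens[:t]   # everything seen so far is the input
        target = tokens[t]    # the very next token is the label
        pairs.append((prefix, target))
    return pairs

# Example: a short story-like sequence, as in TinyStories-style training data.
sequence = "once upon a time there was a tiny robot".split()
for prefix, target in next_token_pairs(sequence):
    print(prefix, "->", target)
```

Each pair is an ordinary supervised example, which is why the same idea applies to any predictor, however simple.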
Experimental results show that simple next-token predictors are effective. A linear next-token predictor trained on the TinyStories dataset generates plausible and grammatically sound stories. Additionally, a shallow Multi-Layer Perceptron (MLP) outperforms GPT-4 in multiplying two 4-digit numbers, achieving comparable results to Goat, a transformer trained for arithmetic tasks. These results highlight the power of next-token predictors in solving complex tasks.
Chain-of-Thought (CoT) reasoning and scratchpad techniques enhance language model performance in logical reasoning and arithmetic tasks. Theoretical investigations into CoT reasoning in auto-regressive models contribute to understanding their capabilities. The length complexity measure allows the study of how intermediate token sequences influence the difficulty of learning problems.
Language models for arithmetic tasks have gained interest, but they struggle with straightforward arithmetic operations. Structuring language models using an algorithmic pipeline can improve efficiency in arithmetic tasks. MLP-based architectures show promise as alternatives to transformers in language modeling tasks.
The theoretical underpinnings of language models and transformers are relatively unexplored. Early investigations show that transformers can emulate any Turing machine, and recent work demonstrates that transformers can simulate automata using few layers. The inductive biases of self-attention have been studied, showing that bounded-norm transformer networks can represent sparse functions with logarithmically scaling sample complexity. Language models trained with CoT can efficiently learn arbitrary Turing machines.
In conclusion, auto-regressive next-token predictors are universal learners that can efficiently approximate any function computed by a Turing machine. The power of language models lies in the auto-regressive training scheme rather than a specific architectural choice. Length complexity is crucial in the learning process, and ongoing research explores its trade-off with other complexity measures. Experimental results show the effectiveness of simple next-token predictors in generating coherent text and solving arithmetic tasks. Theoretical investigations contribute to understanding the capabilities and constraints of language models and transformers.
642 word summary
Auto-Regressive Next-Token Predictors (ARNPs) are universal learners capable of solving complex tasks. These models, such as GPT-3 and GPT-4, are trained on large amounts of text data and learn to generate coherent and contextually relevant responses. Despite their simplicity, ARNPs trained on the task of next-token prediction can approximate any function efficiently computed by a Turing machine. The length complexity, which measures the number of intermediate tokens required to approximate a target function, plays a crucial role in the performance of ARNPs. Linear ARNPs, where the next-token probability is a linear function of the input sequence, can compute any Turing computable function. This theoretical result demonstrates that linear models can implement practically any target function of interest.
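To make the notion of a linear ARNP concrete, here is a minimal NumPy sketch in which the next-token logits are a linear function of the flattened input context. The fixed context window, one-hot encoding, and greedy decoding are assumptions for illustration; the paper's exact construction differs.

```python
import numpy as np

# Minimal sketch of a linear next-token predictor (illustrative assumptions:
# fixed context window, one-hot tokens, greedy decoding). The weight matrix W
# would normally be learned by minimizing cross-entropy on (prefix, next-token) pairs.

VOCAB = 50      # vocabulary size (illustrative)
CONTEXT = 8     # fixed context window length (illustrative)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(VOCAB, VOCAB * CONTEXT))  # linear map: context -> logits

def encode(context):
    """One-hot encode the last CONTEXT tokens, left-padded with token 0."""
    padded = [0] * (CONTEXT - len(context)) + list(context[-CONTEXT:])
    x = np.zeros(VOCAB * CONTEXT)
    for i, tok in enumerate(padded):
        x[i * VOCAB + tok] = 1.0
    return x

def generate(prompt, steps):
    """Greedy auto-regressive decoding with a purely linear predictor."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = W @ encode(tokens)       # next-token scores are linear in the input
        tokens.append(int(np.argmax(logits)))
    return tokens

print(generate([3, 1, 4], steps=5))
```

The model itself is linear, yet the auto-regressive loop feeds each prediction back as input, which is what lets such a simple predictor realize complex, non-linear end-to-end behavior.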
In supervised learning, the learner only has access to the input sequence and the target label, making it difficult to learn complex functions. In auto-regressive learning, by contrast, the learner treats each token as both an input and a label, providing supervision on the intermediate steps of the computation. This significantly simplifies the learning task and enables even simple learners to compute non-linear functions, something that is not possible in the classical supervised setting.
While ARNPs can serve as proficient learners, this power comes at a cost: they require long sequences of intermediate tokens that detail the internal computation of the target. This motivates the introduction of length complexity, a measure that counts the number of intermediate tokens the model needs in order to learn a particular concept class. Length complexity can be traded off against sample complexity or computational complexity when learning certain tasks.
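As a concrete, hypothetical illustration of length complexity, the snippet below encodes the same multiplication problem in two ways: directly, with zero intermediate tokens, and with a scratchpad that spells out the partial products. The extra tokens in the second encoding are exactly what length complexity counts; the specific scratchpad format is an assumption, not the paper's.

```python
# Hypothetical illustration of length complexity: the same target function
# (multiplication, shown here on small numbers) written with zero intermediate
# tokens vs. with a scratchpad of intermediate steps.

def direct_example(a, b):
    # No intermediate tokens: the model must produce the answer in one shot.
    return f"{a}*{b}={a * b}"

def scratchpad_example(a, b):
    # Intermediate tokens expose the internal computation (partial products),
    # turning one hard prediction into several easy next-token predictions.
    tens, ones = divmod(b, 10)
    p1, p2 = a * tens * 10, a * ones
    return f"{a}*{b}={a}*{tens * 10}+{a}*{ones}={p1}+{p2}={p1 + p2}"

print(direct_example(13, 24))      # 13*24=312
print(scratchpad_example(13, 24))  # 13*24=13*20+13*4=260+52=312
```

More intermediate tokens generally make each next-token prediction easier, at the price of longer training and inference sequences, which is the trade-off the paper studies.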
Experimental results demonstrate the effectiveness of simple next-token predictors. A linear next-token predictor trained on the TinyStories dataset generates plausible and grammatically sound stories. Additionally, a shallow Multi-Layer Perceptron (MLP) outperforms GPT-4 in the task of multiplying two 4-digit numbers, achieving comparable results to Goat, a 7B-parameter transformer trained for arithmetic tasks. These results highlight the power of next-token predictors and their ability to solve complex tasks.
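The summary does not reproduce the paper's training code, but a shallow MLP next-token predictor for digit sequences could look roughly like the following PyTorch sketch; the vocabulary, context length, and hidden width are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the paper's exact setup): a shallow MLP that reads a
# fixed window of digit/operator tokens and predicts the next token. Trained with
# cross-entropy on next-token prediction, such a model can emit the digits of a
# product one at a time, optionally after scratchpad tokens.

VOCAB = 14     # digits 0-9 plus a few symbols such as '*', '+', '=', pad (assumed)
CONTEXT = 32   # fixed context window (assumed)
HIDDEN = 1024  # single hidden layer width (assumed)

class ShallowMLPPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 16)      # small learned token embedding
        self.net = nn.Sequential(
            nn.Flatten(),                          # concatenate the context window
            nn.Linear(CONTEXT * 16, HIDDEN),
            nn.ReLU(),
            nn.Linear(HIDDEN, VOCAB),              # logits over the next token
        )

    def forward(self, token_ids):                  # token_ids: (batch, CONTEXT)
        return self.net(self.embed(token_ids))

model = ShallowMLPPredictor()
dummy = torch.randint(0, VOCAB, (4, CONTEXT))      # a batch of 4 contexts
print(model(dummy).shape)                          # torch.Size([4, 14])
```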
Chain-of-Thought (CoT) reasoning and scratchpad techniques have been shown to enhance the performance of language models in logical reasoning and arithmetic tasks. Theoretical investigations into CoT reasoning in auto-regressive models contribute to a better understanding of their capabilities. The length complexity measure allows for the study of the influence of intermediate token sequences on the difficulty of learning problems.
Language models for arithmetic tasks have gained significant interest. While these models have demonstrated promising capacity for solving mathematical problems, they often encounter difficulties in executing straightforward arithmetic operations. Structuring language models to perform calculations using an algorithmic pipeline can enhance their efficiency in arithmetic tasks. MLP-based architectures have shown promise as alternatives to transformers in language modeling tasks.
The theoretical underpinnings of language models and transformers remain relatively unexplored. Early investigations have established the universality of transformers and their ability to emulate any Turing machine. Recent work has demonstrated that transformers can simulate automata using few layers. The inductive biases of self-attention have been studied, showing that bounded-norm transformer networks can represent sparse functions with logarithmically scaling sample complexity. The ability of language models to learn computationally challenging problems using CoT has been explored, demonstrating that arbitrary Turing machines can be efficiently learned by language models trained with CoT.
In conclusion, auto-regressive next-token predictors are universal learners capable of approximating any function efficiently computed by a Turing machine. The power of language models can be attributed to the auto-regressive training scheme and not necessarily to a specific architectural choice. Length complexity plays a crucial role in the learning process, and its trade-off with other complexity measures is an area of ongoing research. Experimental results demonstrate the effectiveness of simple next-token predictors in generating coherent text and solving arithmetic tasks. Theoretical investigations contribute to a better understanding of the capabilities and constraints of language models and transformers.