Summary Uncovering Mesa-Optimization Transformers in Deep Learning arxiv.org
26,992 words - PDF document
One Line
Researchers propose a mesa-layer with a forget factor to improve deep learning model performance, building on the architectural bias of autoregressive Transformers towards mesa-optimization.
Key Points
- Transformers' superior performance in deep learning may stem from an architectural bias towards mesa-optimization.
- Autoregressive Transformers use gradient-based mesa-optimization algorithms for prediction.
- The Sherman-Morrison formula can be used to avoid memory overhead in mesa-optimization Transformers during the backward pass.
- Autoregressively-trained Transformers can be repurposed for few-shot learning tasks and consecutive task learning.
- Greedy local learning algorithms in deep learning models achieve strong performance in natural tasks without top-down information.
- A mesa-layer with a forget factor improves the performance of deep learning models.
- The computation of the mesa-layer in deep learning involves backward-pass methods via Sherman-Morrison and the implicit function theorem.
- A K-step truncated Neumann series can be used to optimize the forward pass in deep learning.
Summaries
37 word summary
Deep learning Transformers have an architectural bias towards mesa-optimization. Autoregressive Transformers implement gradient-based mesa-optimization algorithms. Researchers propose a mesa-layer with a forget factor to improve model performance, based on the recursive least squares problem with forgetting.
79 word summary
Transformers in deep learning have a bias towards mesa-optimization, a learned process running within the forward pass of the model. Autoregressive Transformers use gradient-based mesa-optimization algorithms for prediction.
Researchers propose a generalized mesa-layer with a forget factor to improve the performance of deep learning models. They use the recursive least squares problem with forgetting, which is widely used in online learning literature. The backward pass can be computed recursively using automatic differentiation tools.
1026 word summary
Transformers' superior performance in deep learning is attributed to their architectural bias towards mesa-optimization, a learned process within the forward pass of the model. Autoregressive Transformers trained on sequence modeling tasks implement gradient-based mesa-optimization algorithms for prediction.
The excerpt discusses the use of mesa-gradient descent in Transformers for predicting future inputs. It introduces a one-step mesa-gradient descent construction and explores the limitations of stacking it over multiple layers.
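The one-step construction can be sketched numerically (a minimal illustration with made-up dimensions and data, not the paper's exact setup): a single gradient step on an in-context least-squares objective, starting from zero weights, produces an update built from sums of outer products, which is the kind of computation a linear self-attention layer can express.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 512
W_true = rng.normal(size=(d, d))     # hypothetical ground-truth linear map
X = rng.normal(size=(T, d))          # in-context inputs x_1..x_T
Y = X @ W_true.T                     # targets y_i = W_true @ x_i

# One gradient step on L(W) = 1/2 sum_i ||W x_i - y_i||^2 from W = 0:
# the gradient at W = 0 is -sum_i y_i x_i^T, so a step of size eta gives
# W_1 = eta * Y^T X -- a sum of outer products over past tokens.
eta = 1.0 / T
W_1 = eta * Y.T @ X

x_test = rng.normal(size=d)
y_hat = W_1 @ x_test                 # prediction after a single mesa-step
```

With enough context tokens, the single-step estimate W_1 already lands close to the generating map, since (1/T) X^T X concentrates around the identity.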
The document discusses the use of mesa-optimization Transformers in deep learning. It explains that the memory overhead can be avoided by applying the Sherman-Morrison formula in reverse during the backward pass. However, this implementation is not parallelizable during training.
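The idea behind the Sherman-Morrison trick can be illustrated with a short NumPy sketch (an illustrative example, not the authors' implementation): a rank-one update maintains a running matrix inverse, and applying the identity in reverse recovers the previous inverse, so intermediate inverses need not be stored for the backward pass.

```python
import numpy as np

def sm_update(R, s):
    """Rank-one Sherman-Morrison update: (A + s s^T)^{-1} from R = A^{-1}."""
    Rs = R @ s
    return R - np.outer(Rs, Rs) / (1.0 + s @ Rs)

def sm_downdate(R, s):
    """Reverse the update: recover A^{-1} from R = (A + s s^T)^{-1}."""
    Rs = R @ s
    return R + np.outer(Rs, Rs) / (1.0 - s @ Rs)

rng = np.random.default_rng(1)
d, lam = 5, 1.0
A = lam * np.eye(d)                  # running regularized second-moment matrix
R = np.eye(d) / lam                  # its inverse, maintained incrementally
for _ in range(10):
    s = rng.normal(size=d)
    A += np.outer(s, s)
    R = sm_update(R, s)

# Undoing the last update in the backward pass instead of storing every
# intermediate inverse:
R_prev = sm_downdate(R, s)
```

Each update and downdate costs O(d^2) instead of the O(d^3) of a fresh inversion.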
In this study, the authors analyze deep linear and softmax attention-only Transformers with multiple self-attention layers. They find that the weights of trained models exhibit clean structure and can be described by a compressed algorithm with fewer parameters. A linear regression probing analysis supports this interpretation.
Autoregressively-trained Transformers can be repurposed for few-shot learning tasks, demonstrating in-context learning capabilities and the ability to learn multiple tasks consecutively. Prompt tuning and the use of prefix prompts further improve the performance of these models.
Transformer models trained on sequence prediction tasks under a standard autoregressive objective can develop gradient-based inference algorithms. These algorithms can be repurposed to solve supervised in-context learning tasks. Reverse-engineering findings are currently limited to simple linear prediction tasks.
The study introduces greedy local learning algorithms in deep learning models, which only use bottom-up information and do not require global error information. This approach has connections to research on local learning rules in theoretical neuroscience. Strong performance is achieved in natural tasks without top-down information.
Several papers are referenced in this document, each focusing on different aspects of deep learning and optimization in Transformers. The papers cover topics such as the learning abilities of Transformers, the role of demonstrations in in-context learning, and training neural networks with local error signals.
The text excerpt includes references to various research papers and conference presentations related to deep learning and optimization in machine learning. These include papers on self-attention with linear complexity, reasoning in large language models, predictive networks, error backpropagation algorithms, and adaptive switching.
This text excerpt discusses the computation of the mesa-layer in deep learning, including backward-pass methods via Sherman-Morrison and the implicit function theorem, as well as a parallel backward pass through a Neumann series approximation. It also covers the visualization of weights.
This text excerpt discusses multi-layer accelerated mesa-gradient descent and the analysis of contracting linear dynamics. It also mentions the experimental details, including training Transformers on linear dynamical systems, testing trained Transformers on few-shot in-context learning, and language modeling experiments.
Researchers propose a generalized mesa-layer with a forget factor to improve the performance of deep learning models. They use the recursive least squares problem with forgetting, which is widely used in the online learning literature. The required matrix inverse can be updated recursively.
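A minimal sketch of recursive least squares with a forget factor (textbook RLS, used here as a stand-in for the generalized mesa-layer's internal problem; dimensions and data are made up) shows how each new observation updates the solution online while old observations are exponentially down-weighted:

```python
import numpy as np

def rls_with_forgetting(S, Y, gamma=0.9, delta=1.0):
    """Recursive least squares with forget factor gamma.

    Minimizes sum_i gamma^(t-i) ||y_i - W s_i||^2 plus a decaying
    ridge term induced by the initialization P_0 = I / delta.
    """
    d_in, d_out = S.shape[1], Y.shape[1]
    W = np.zeros((d_out, d_in))
    P = np.eye(d_in) / delta              # running inverse covariance
    for s, y in zip(S, Y):
        Ps = P @ s
        k = Ps / (gamma + s @ Ps)         # gain vector
        W = W + np.outer(y - W @ s, k)    # correct with the a-priori error
        P = (P - np.outer(k, Ps)) / gamma # Sherman-Morrison with forgetting
    return W, P

rng = np.random.default_rng(5)
T, d = 50, 3
S = rng.normal(size=(T, d))
Y = S @ rng.normal(size=(d, d))
W, _ = rls_with_forgetting(S, Y)
```

The recursion matches the direct weighted ridge solution at every step, at O(d^2) cost per token.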
Accumulating the right error signal and using automatic differentiation tools allows the full backward pass to be computed recursively in deep learning. The backward pass can be implemented using a series of equations involving the computation of derivatives and the vector-Jacobian product trick.
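The vector-Jacobian product trick can be illustrated with a toy chain of linear maps (a hypothetical example, not the paper's model): the backward pass propagates a single error vector through transposed maps, so the full Jacobian product is never materialized.

```python
import numpy as np

rng = np.random.default_rng(2)
Ws = [rng.normal(size=(6, 6)) for _ in range(4)]   # a chain of linear maps
x = rng.normal(size=6)

# Forward pass: h = W4 W3 W2 W1 x
h = x
for W in Ws:
    h = W @ h

# Backward pass via vector-Jacobian products: propagate one error vector g
# through the transposed maps, one O(d^2) product per layer, instead of
# forming the d x d Jacobian product explicitly.
g = np.ones(6)                                     # upstream error dL/dh
for W in reversed(Ws):
    g = W.T @ g                                    # one VJP per layer

# Explicit Jacobian, for comparison only: J = W4 @ W3 @ W2 @ W1.
J = np.linalg.multi_dot(list(reversed(Ws)))
```

This is exactly what reverse-mode automatic differentiation does internally: the error signal is accumulated recursively, layer by layer.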
The forward pass in deep learning can be optimized using a K-step truncated Neumann series. This approach involves repeating a slightly altered linear self-attention layer K times, allowing for efficient computation of terms for all time steps in parallel.
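A truncated Neumann series approximates a matrix inverse by summing powers of (I − αA); each term reuses the previous one, which is what makes the per-step computation repeatable and parallelizable across time steps. A small NumPy sketch (illustrative values; a large K is used here only to verify convergence, whereas the point of truncation is to use a small K):

```python
import numpy as np

def neumann_inv(A, K, alpha):
    """K-step truncated Neumann series:
    A^{-1} ~= alpha * sum_{k=0}^{K} (I - alpha*A)^k, valid when the
    spectral radius of (I - alpha*A) is below 1."""
    d = A.shape[0]
    M = np.eye(d) - alpha * A
    term = np.eye(d)                       # M^0
    approx = np.eye(d)
    for _ in range(K):
        term = term @ M                    # next power of M
        approx = approx + term
    return alpha * approx

rng = np.random.default_rng(3)
d = 4
S = rng.normal(size=(d, 8))
A = np.eye(d) + S @ S.T                    # regularized second-moment matrix
alpha = 1.0 / np.linalg.norm(A, 2)         # step size ensuring convergence
approx = neumann_inv(A, K=400, alpha=alpha)
```

Truncating at small K trades inversion accuracy for a fixed, parallel-friendly amount of computation.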
The text excerpt discusses the reverse-engineering of faint additional structure resulting from a modified mesa-objective function. Attention maps of the mesa-hybrid and linear-hybrid Transformers trained on the Pile dataset are observed to have stable off-diagonals, indicating clean structure.
In the study, the researchers observed a diagonal structure in the weight products of trained Transformers. This structure was found to be sufficient for approximating the final prediction as well as other computations. The weight matrix products showed stable values across block-diagonals.
This document discusses the parametrization and interpretation of Transformers in deep learning. It introduces the idea of using gradient descent and past-token averaging to predict the next token in a sequence. The authors hypothesize that past-token averaging helps overcome the sub-optimality of single-step gradient descent.
When using the induced target transformation on new data, the prediction obtained is equivalent to standard gradient descent after a correction. Linear self-attention weight matrices can implement this multi-step case. To implement a d-step algorithm in a Transformer, specific weight configurations are required.
The authors argue that the Transformer can solve the problem differently by using a preconditioning matrix H_t, which improves single-step gradient descent performance. They provide a theoretical construction showing how Transformers can approximate the inverse term (S_{t-1} S_{t-1}^T)^{-1}.
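The benefit of such a preconditioner can be illustrated with a toy regression (made-up data; H below is the exact regularized inverse, which the Transformer is argued to approximate): a single preconditioned gradient step from zero weights recovers the regularized least-squares solution, while a plain single step does not.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 4, 64
W_true = rng.normal(size=(d, d))
S = rng.normal(size=(d, T))               # in-context inputs as columns
Y = W_true @ S                            # targets

lam = 1e-3
H = np.linalg.inv(S @ S.T + lam * np.eye(d))   # preconditioning matrix

# Plain single gradient step from W = 0 (crude step size eta = 1/T):
W_gd = (1.0 / T) * Y @ S.T
# Preconditioned single step: the same gradient, multiplied by H, solves
# the regularized least-squares problem in one shot.
W_pre = Y @ S.T @ H

err_gd = np.linalg.norm(W_gd - W_true)
err_pre = np.linalg.norm(W_pre - W_true)
```

The preconditioned step collapses what would otherwise require many plain gradient steps into a single update.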
We analyze the performance of single-layer, two-head, key-size-20 Transformers trained on constructed tokens. The models are compared to exact gradient descent, a single gradient update step, and a single mesa-layer. The optimal learning rate for gradient descent is tuned for these comparisons.
For models trained on constructed tokens, fixed learning rates of 7e-4 and 9e-5 were used for the interpolations. The learnable regularization parameter was initialized to 1 for every mesa-head.
We define the prediction vector as g_t = -S_t S_{t-1} W_{t,inverse probe} e_t. We compute the loss per token and layer of this prediction model by comparing it with the actual targets for one batch.
The findings of the study show gradually increasing probing results for implicit target probings, outperforming an update step of gradient descent. The last layer of the model has worse results due to the update step on the optimization problem. The sensitivity analyses indicate strong robustness.
Prompt tuning improves performance in regression tasks. The hybrid-mesa model outperforms the linear model on multi-task problems. A softmax-only model's performance decreases without EOS tokens. Language modeling experiments use standard hyperparameter values and a GPT-2 Transformer architecture.