Summary: Schema-learning and rebinding in in-context learning (arxiv.org)
12,163 words - PDF document
One Line
The paper proposes clone-structured causal graphs (CSCGs) as an effective model for understanding in-context learning in large language models.
Key Points
- In-context learning (ICL) in large language models (LLMs) can be understood using clone-structured causal graphs (CSCGs).
- Schema-learning and rebinding are mechanisms of in-context learning.
- The Bayesian inference perspective is insufficient to explain the properties of ICL.
- CSCGs can learn and infer the latent concepts underlying the GINC dataset.
- Overallocating clones in a CSCG improves in-context accuracy.
- The "dax" test evaluates a model's ability to absorb new words from a single presentation.
- The document includes references to various research papers and articles related to schema-learning, rebinding, and in-context learning.
- Tables and figures present the average in-context accuracy of different tasks based on CSCG overallocation.
Summaries
30 word summary
This paper examines in-context learning (ICL) in large language models (LLMs) and proposes clone-structured causal graphs (CSCGs) for understanding ICL. CSCGs are shown to acquire reusable schemas and rebind them to novel content.
34 word summary
This paper explores the mechanisms of in-context learning (ICL) in large language models (LLMs) and proposes an alternative approach using clone-structured causal graphs (CSCGs) to understand ICL. The authors demonstrate that CSCGs can acquire schemas during training and rebind them to novel tokens at test time.
512 word summary
This paper explores the mechanisms of in-context learning (ICL) in large language models (LLMs) and proposes an alternative approach using clone-structured causal graphs (CSCGs) to understand ICL. The authors demonstrate that CSCGs can acquire schemas during training and rebind them to novel tokens at test time.
The excerpt discusses the concepts of schema-learning and rebinding in in-context learning. It introduces the clone-structured causal graph (CSCG) model, which uses a transition tensor and an emission matrix to represent action-conditional dynamics and observation probabilities.
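To make this parameterization concrete, here is a minimal numpy sketch of a CSCG's transition tensor and emission matrix. The sizes, names, and clone-to-token assignment are illustrative assumptions, not the paper's code; the one property it preserves is that each clone emits exactly one token, so multiple clones of the same token share an emission column.

```python
import numpy as np

# Minimal CSCG parameterization sketch (sizes are illustrative assumptions).
n_actions, n_clones, n_obs = 4, 20, 10
rng = np.random.default_rng(0)

# Transition tensor T[a, z, z']: probability of moving from clone z to
# clone z' under action a; each (a, z) slice is normalized to a distribution.
T = rng.random((n_actions, n_clones, n_clones))
T /= T.sum(axis=2, keepdims=True)

# Emission matrix E[z, x]: probability of emitting token x from clone z.
# Rows are one-hot: every clone is bound to a single token, and several
# clones of the same token share an emission column.
clone_to_token = np.repeat(np.arange(n_obs), n_clones // n_obs)
E = np.zeros((n_clones, n_obs))
E[np.arange(n_clones), clone_to_token] = 1.0
```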
Schema-learning and rebinding are mechanisms of in-context learning and emergence. The fast rebinding algorithm (Algorithm 1) updates the emission matrix of a clone-structured causal graph (CSCG): it identifies the latent states and time steps at which the prompt surprises the model and rebinds the corresponding clones to the new tokens.
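The sketch below illustrates the rebinding idea in a simplified, single-action form: run a forward pass over the prompt, flag time steps where the observed token is surprising under the current model, and rebind the most likely clone at those steps by editing its emission row. The function name and the surprisal threshold are assumptions; the paper's Algorithm 1 differs in its details.

```python
import numpy as np

def fast_rebind_sketch(T, E, tokens, thresh=1e-3):
    """Simplified rebinding sketch (not the paper's exact Algorithm 1).
    T: (n_clones, n_clones) transition matrix, E: (n_clones, n_obs)
    emission matrix with one-hot rows, tokens: prompt as token ids."""
    n_clones = T.shape[0]
    alpha = np.full(n_clones, 1.0 / n_clones)  # uniform initial belief
    E = E.copy()
    for x in tokens:
        pred = alpha @ T                # predictive belief over clones
        p_x = pred @ E[:, x]            # probability of the observed token
        if p_x < thresh:                # surprising token: rebind a clone
            z = int(np.argmax(pred))    # most likely clone under the belief
            E[z] = 0.0
            E[z, x] = 1.0               # bind clone z to the new token x
        alpha = pred * E[:, x]          # filtered belief
        alpha /= alpha.sum()
    return E
```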
The Bayesian inference perspective on in-context learning (ICL) is insufficient to explain the properties of ICL discussed in the following sections: context-sensitive, transitively generalizing storage and retrieval alone cannot account for them. In addition to learning the layout of its training environment, a model must be able to rebind that layout to novel observations.
The study focuses on the ability of a clone-structured causal graph (CSCG) to learn and infer latent concepts in the GINC dataset. Trained with 50 clones per token, the model achieves accurate prompt completion because inference progressively localizes the latent state as more of the prompt is observed.
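A simplified sketch of prompt completion under this view: forward filtering over the prompt localizes the latent clone state, after which the transition matrix is rolled forward and the most likely token is emitted at each step. This is an illustrative reconstruction, not the paper's exact decoding procedure.

```python
import numpy as np

def complete_prompt(T, E, prompt, n_steps=5):
    """Filter the prompt to localize the latent clone state, then roll
    the transition matrix forward greedily. Assumes the prompt has
    nonzero probability under the model (illustrative sketch)."""
    n_clones = T.shape[0]
    alpha = np.full(n_clones, 1.0 / n_clones)
    for x in prompt:                   # localization: the belief sharpens
        alpha = (alpha @ T) * E[:, x]
        alpha /= alpha.sum()
    completion = []
    for _ in range(n_steps):           # greedy generation from the belief
        alpha = alpha @ T
        x = int(np.argmax(alpha @ E))  # most likely next token
        completion.append(x)
        alpha = alpha * E[:, x]
        alpha /= alpha.sum()
    return completion
```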
The study used a test set of 100 prompts consisting of instructions and tokens. During training, clones were allocated to tokens based on the number of distinct contexts each token appears in within the training data, and different overallocation ratios were tested. The results showed that CSCGs with larger overallocation ratios achieve higher in-context accuracy.
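The allocation step might look like the sketch below, where each token receives clones in proportion to the number of distinct contexts it occurs in. The context definition (previous and next token) and the ratio are assumptions for illustration.

```python
from collections import defaultdict

def allocate_clones(corpus, ratio=2.0):
    """Allocate `ratio` times as many clones as a token has distinct
    (previous token, next token) contexts in the training corpus.
    Both the context definition and the ratio are illustrative."""
    contexts = defaultdict(set)
    for seq in corpus:
        for i, tok in enumerate(seq):
            prev = seq[i - 1] if i > 0 else None
            nxt = seq[i + 1] if i + 1 < len(seq) else None
            contexts[tok].add((prev, nxt))
    return {tok: max(1, int(ratio * len(ctxs))) for tok, ctxs in contexts.items()}

# Tokens that occur in more contexts receive more clones:
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
print(allocate_clones(corpus, ratio=2.0))
```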
In the study, the researchers conducted a "dax" test to evaluate a model's ability to absorb new words from a single presentation. They trained a CSCG on the PreCo dataset for coreference resolution and tested it on prompts in which familiar words were replaced with novel tokens such as "dax".
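Constructing such a test prompt is mechanically simple, as the sketch below shows: a known word is replaced everywhere by a novel token, so any correct completion must come from binding the new token within the prompt itself. The helper is hypothetical; the paper's protocol on PreCo is more involved.

```python
def make_dax_prompt(sentences, word, novel="dax"):
    """Replace every occurrence of `word` with a novel token so the
    model must bind it from a single in-context presentation.
    (Hypothetical helper for illustration.)"""
    return [[novel if tok == word else tok for tok in s] for s in sentences]

prompt = make_dax_prompt([["the", "cat", "sat"], ["a", "cat", "ran"]], "cat")
# -> [['the', 'dax', 'sat'], ['a', 'dax', 'ran']]
```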
This document contains a list of references to research papers and articles related to schema-learning, rebinding, in-context learning, and other topics in artificial intelligence and machine learning, including interpretability and attention mechanisms.
This text excerpt includes references to various papers on schema-learning and rebinding in in-context learning. It discusses the EM algorithm for learning the emission matrix of a CSCG with a fixed transition matrix, and describes the prompt completion algorithm, which considers a single most likely decoding of the prompt when generating the continuation.
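A simplified sketch of one such EM iteration: with the transition matrix held fixed, forward-backward gives per-step clone posteriors (E-step), and the emission matrix is re-estimated from the expected counts (M-step). Variable names and the scaling details are assumptions, not the paper's appendix verbatim.

```python
import numpy as np

def em_emission_step(T, E, seqs, eps=1e-12):
    """One EM iteration for the emission matrix with T held fixed.
    T: (n_clones, n_clones), E: (n_clones, n_obs), seqs: token-id lists."""
    n_clones, n_obs = E.shape
    counts = np.zeros_like(E)
    for seq in seqs:
        L = len(seq)
        alpha = np.zeros((L, n_clones))
        beta = np.ones((L, n_clones))
        alpha[0] = E[:, seq[0]] / n_clones
        alpha[0] /= alpha[0].sum() + eps
        for t in range(1, L):                  # forward pass (scaled)
            alpha[t] = (alpha[t - 1] @ T) * E[:, seq[t]]
            alpha[t] /= alpha[t].sum() + eps
        for t in range(L - 2, -1, -1):         # backward pass (scaled)
            beta[t] = T @ (beta[t + 1] * E[:, seq[t + 1]])
            beta[t] /= beta[t].sum() + eps
        gamma = alpha * beta                   # per-step clone posteriors
        gamma /= gamma.sum(axis=1, keepdims=True) + eps
        for t, x in enumerate(seq):            # accumulate expected counts
            counts[:, x] += gamma[t]
    return counts / (counts.sum(axis=1, keepdims=True) + eps)  # M-step
```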
The excerpted text presents a table of numerical values for different scenarios. The table is divided into four sections, one for each number of clones (10, 50, 100, and 1000); within each section, columns report in-context accuracy for the different tasks and prompts.
The summary includes the following key points:
- Table 1 shows the in-context accuracy of a CSCG with different numbers of clones trained on the GINC dataset.
- Tables 2 and 3 present the natural language instructions used for the list manipulation tasks.
The document discusses schema-learning and rebinding in in-context learning. It presents tables and figures showing the average in-context accuracy of different tasks as a function of CSCG clone overallocation; the results indicate that overallocation improves performance and accuracy.
The table shows the average in-context accuracy for different tasks and prompts, measured as a function of the CSCG's overallocation ratio. The tasks include listing elements, reversing lists, repeating lists, and shifting lists.