Summary: Geometric Interpretation of Transformers for NLP (arxiv.org)
10,328 words - PDF document
One Line
The paper develops a geometric interpretation of transformers in NLP, emphasizing layer normalization and supporting its findings with experiments and visualizations on a pre-trained GPT-2 model.
Key Points
- The authors present a novel geometric interpretation of transformers in natural language processing (NLP).
- Layer normalization projects latent features onto the surface of a lower-dimensional hyper-sphere.
- The WQK matrix acts as an affine transformation that overlaps queries and keys on the hyper-sphere, while the WVO matrix maps attention outputs from the hyper-sphere back to the original embedding space.
- The iterative refinement process within transformers is visualized using dimensionality reduction techniques.
- Understanding the geometric properties of transformers can improve their interpretability and performance in NLP tasks.
Summaries
20 word summary
This paper presents a geometric perspective on transformers in NLP, focusing on layer normalization and providing experimental evidence and visualizations.
103 word summary
This paper presents a geometric perspective on transformers in NLP, with a focus on layer normalization. The authors propose a theoretical framework that breaks down transformer computation into a residual stream and attention/feed-forward updates. They show that feed-forward module updates can be represented as a linear combination of sub-updates. Layer normalization is equivalent to projecting features onto a hyperplane and scaling the projection, so transformers model word particles on a hyper-sphere. The authors analyze each component of the transformer from a geometric perspective, including layer normalization, the WQK and WVO matrices, and the iterative refinement process. Experimental evidence and visualizations support the interpretation and enhance the interpretability of transformer models.
128 word summary
The paper introduces a novel geometric perspective on transformers in NLP, focusing on layer normalization. The authors propose a theoretical framework that decomposes transformer computation into a residual stream and attention/feed-forward updates. They demonstrate that feed-forward module updates can be represented as a linear combination of sub-updates from the module's second layer weight matrix. Layer normalization is shown to be equivalent to projecting features onto a hyperplane and scaling the projection. Transformers are depicted as processes that model word particles along the surface of a hyper-sphere. The authors analyze each component of the transformer from a geometric perspective, including layer normalization, the WQK and WVO matrices, and the iterative refinement process. Experimental evidence and visualizations enhance the interpretability of transformer models for researchers seeking to improve NLP performance.
390 word summary
The paper "Geometric Interpretation of Transformers for NLP" introduces a novel geometric perspective on transformers in natural language processing (NLP). The authors focus on layer normalization and validate their insights by analyzing a pre-trained GPT-2 model. They propose a theoretical framework that decomposes the transformer computation into a residual stream and attention/feed-forward updates. The authors demonstrate that the updates from the feed-forward module can be represented as a linear combination of sub-updates given by the weight matrix of the feed-forward module's second layer. They also use these ideas to enable zero-shot model stitching between different language models.
The authors present a complementary perspective on the geometric interpretation of layer normalization. They prove that layer normalization is equivalent to projecting features onto a hyperplane and scaling the projection. They connect these ideas, depicting transformers as processes that model word particles along the surface of a hyper-sphere.
The authors analyze each component of the transformer from a geometric perspective, starting with layer normalization. They show how layer normalization constrains input features to lie on the surface of a hyper-sphere. They also discuss the role of the WQK matrix as an affine transformation that overlaps queries and keys, and the role of the WVO matrix as a key-value mapping from the hyper-sphere back to R^d. The authors review the key-value interpretation of the feed-forward module proposed by previous work.
In experiments with pre-trained GPT-2 weights, the authors measure the impact of layer normalization on embedding vectors and find that projection onto the hyper-sphere does not modify their orientation. They also analyze the distribution of top tokens from the word embedding matrix and observe that considering scaling and bias parameters shifts the distribution towards common words.
The authors probe attention heads at different layers using normalized representations of common nouns. They find that some heads preserve the meaning of queries, while others look for preceding keys or establish contextual associations. However, meaningful patterns in deeper layers are not identified.
In conclusion, the paper offers a novel geometric interpretation of transformers in NLP, providing insights into their inner mechanisms. The authors contribute to a deeper understanding of transformer operations, including layer normalization, the WQK and WVO matrices, and the iterative refinement process. The experimental evidence and visualizations presented enhance the interpretability of transformer models for researchers seeking to improve their performance in NLP tasks.
569 word summary
The paper titled "Geometric Interpretation of Transformers for NLP" presents a novel geometric interpretation of transformers in natural language processing (NLP). The authors introduce a geometric perspective that sheds light on the inner workings of transformer operations, with a focus on layer normalization. They validate their insights by probing a pre-trained GPT-2 model and demonstrate clear query-key attention patterns in early layers.
The authors build on previous work and propose a theoretical framework that decomposes the transformer computation into two main components: a residual stream and attention/feed-forward updates. They decompose the operations within the transformer and show that the updates from the feed-forward module can be represented as a linear combination of sub-updates given by the weight matrix of the feed-forward module's second layer. They also use these ideas to interpret the outcomes of each transformer operation in relation to the canonical space and weights, enabling zero-shot model stitching between different language models.
A complementary perspective to this line of work comes from the geometric interpretation of layer normalization. The authors prove that layer normalization is equivalent to projecting features onto a hyperplane defined by a vector and scaling the projection by a factor. They connect these ideas under a single interpretation, depicting transformers as processes that model the trajectory of word particles along the surface of a hyper-sphere.
The authors analyze each component of the transformer from a geometric perspective, starting with layer normalization. They demonstrate how layer normalization constrains input features to lie on the surface of a hyper-sphere. They then consider the role of the WQK matrix as an affine transformation that overlaps queries and keys, and the role of the WVO matrix as a key-value mapping from the hyper-sphere back to R^d. They also review the key-value interpretation of the feed-forward module proposed by previous work.
In experiments using pre-trained GPT-2 weights, the authors measure the impact of layer normalization on the position of embedding vectors and find that projection onto the hyper-sphere does not modify their orientation. They also analyze the top and bottom tokens from the word embedding matrix under different measurement settings and observe that considering scaling and bias parameters shifts the distribution of top tokens towards common words.
The authors further probe attention heads at different layers using normalized representations of common nouns. They find that some heads preserve the meaning of queries, while others look for keys that precede them or establish contextual associations. However, they do not identify meaningful patterns in deeper layers.
In conclusion, the paper presents a novel geometric interpretation of transformers in NLP. The authors provide insights into the inner mechanisms of transformer operations and offer an intuitive understanding of transformers as processes that model the trajectory of word particles along the surface of a hyper-sphere. Through their analysis of layer normalization, the WQK and WVO matrices, and the iterative refinement process, they contribute to a deeper understanding of how transformers operate and provide insights into their interpretability.
Overall, this paper sheds light on the geometric interpretation of transformers for NLP tasks. It explores the role of layer normalization, the WQK and WVO matrices, and iterative refinement in the transformation process. The experimental evidence and visualizations presented offer valuable insights into the inner workings of transformer models and their interpretability. By understanding the geometric properties of transformers, researchers can gain a better understanding of their behavior and potentially improve their performance in various NLP tasks.
1044 word summary
In this paper, the authors present a novel geometric interpretation of transformers in natural language processing (NLP). Transformers have greatly advanced the field of NLP, but understanding their internal mechanisms remains a challenge. The authors introduce a geometric perspective that sheds light on the inner workings of transformer operations. They focus on layer normalization, which confines latent features to a hyper-sphere and enables attention to shape the semantic representation of words on this surface.
The authors validate their insights by probing a pre-trained GPT-2 model with 124M parameters. They find clear query-key attention patterns in early layers and confirm previous observations about the subject-specific nature of attention heads in deeper layers. By harnessing these geometric insights, the authors present an intuitive understanding of transformers as processes that model the trajectory of word particles along a hyper-sphere.
The transformer architecture has had a significant impact on artificial intelligence (AI) and is used in advanced conversational AI systems and state-of-the-art applications in natural language processing, computer vision, robotics, and more. Previous work on the interpretability of transformers has focused on analyzing weights in relation to the word embedding space used in input and output layers. The authors build on this work and propose a theoretical framework that decomposes the transformer computation into two main components: a residual stream and attention/feed-forward updates.
The authors decompose the operations within the transformer and show that the updates from the feed-forward module can be represented as a linear combination of sub-updates given by the weight matrix of the feed-forward module's second layer. They also incorporate these ideas to interpret the outcomes of each transformer operation in relation to the canonical space and weights, enabling them to do zero-shot model stitching by "translating" between different language models.
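As a rough illustration of that decomposition (a minimal sketch with made-up dimensions, not the authors' code), the feed-forward output can be rewritten as a sum of the columns of its second weight matrix, each weighted by the corresponding hidden activation:

```python
# Minimal numeric check: W2 @ gelu(W1 x + b1) + b2 equals the sum of W2's
# columns weighted by the hidden activations, i.e. a linear combination of
# "sub-updates". Sizes here are arbitrary toy values.
import numpy as np

d, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

x = rng.normal(size=d)
a = gelu(W1 @ x + b1)                                  # one activation per sub-update

ffn_out = W2 @ a + b2                                  # usual forward pass
sub_update_sum = sum(a[i] * W2[:, i] for i in range(d_ff)) + b2

print(np.allclose(ffn_out, sub_update_sum))            # True: the update decomposes
```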
A complementary perspective to this line of work comes from the geometric interpretation of layer normalization. The authors prove that layer normalization is equivalent to projecting features onto a hyperplane defined by a vector and scaling the projection by a factor. They show that these properties are crucial for the attention mechanism to either attend to all keys equally or avoid the problem of having "unselectable" keys.
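This equivalence is easy to check numerically. The sketch below (assuming standard layer normalization, with the learned scale and bias omitted) verifies that normalizing a vector is the same as projecting it onto the hyperplane orthogonal to the all-ones vector and rescaling the projection to radius sqrt(d):

```python
# Numeric check of the claim (gamma and beta omitted): layer normalization =
# project onto the hyperplane orthogonal to the all-ones vector, then rescale
# the projection to norm sqrt(d), placing the feature on a hyper-sphere.
import numpy as np

d = 768                                   # GPT-2 small hidden size
x = np.random.randn(d)

layer_norm = (x - x.mean()) / x.std()     # standard layer normalization

ones = np.ones(d) / np.sqrt(d)            # unit vector along the all-ones direction
projected = x - (x @ ones) * ones         # projection onto the hyperplane
on_sphere = np.sqrt(d) * projected / np.linalg.norm(projected)

print(np.allclose(layer_norm, on_sphere))     # True (up to float error)
print(np.linalg.norm(on_sphere), np.sqrt(d))  # both ~27.7: the hyper-sphere radius
```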
The authors connect these ideas under a single interpretation, depicting transformers as processes that model the trajectory of word particles along the surface of a hyper-sphere. They provide an overview of this interpretation, where the input token "Traveling" is embedded as a word particle using an embedding matrix and projected onto a hyper-sphere using layer normalization. Each subsequent layer in the transformer determines the path that the particle will follow along the surface of the hyper-sphere, culminating in the region closest to the next token.
The authors analyze each component of the transformer from a geometric perspective, starting with layer normalization. They demonstrate how layer normalization constrains input features to lie on the surface of a hyper-sphere. They then consider the role of the WQK matrix in terms of geometric transformations on this hyper-sphere and the WVO matrix as a key-value mapping from the hyper-sphere back to R^d. They also review the key-value interpretation of the feed-forward module proposed by previous work. Finally, they discuss the role of the embedding matrix in the transformer's output probabilities.
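The following sketch (with hypothetical, small dimensions) illustrates what is meant by treating WQK and WVO as single matrices: folding a head's query and key projections into WQK = WQ WK^T leaves the attention logits unchanged, and folding the value and output projections into WVO = WV WO maps a key's contribution straight back to R^d:

```python
# Toy check that the per-head projections fold into the combined matrices
# W_QK and W_VO without changing what the head computes.
import numpy as np

d, d_head = 16, 4
rng = np.random.default_rng(1)
W_Q, W_K = rng.normal(size=(d, d_head)), rng.normal(size=(d, d_head))
W_V, W_O = rng.normal(size=(d, d_head)), rng.normal(size=(d_head, d))

x_q, x_k = rng.normal(size=d), rng.normal(size=d)

logit_two_step = (x_q @ W_Q) @ (x_k @ W_K)         # usual query . key
logit_folded = x_q @ (W_Q @ W_K.T) @ x_k           # one combined W_QK
print(np.allclose(logit_two_step, logit_folded))   # True

update_two_step = (x_k @ W_V) @ W_O                # value, then output projection
update_folded = x_k @ (W_V @ W_O)                  # one combined W_VO, lands in R^d
print(np.allclose(update_two_step, update_folded)) # True
```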
In experiments using pre-trained GPT-2 weights, the authors measure the impact of layer normalization on the position of embedding vectors and find that projection onto the hyper-sphere does not modify their orientation. They also analyze the top and bottom tokens from the word embedding matrix under different measurement settings and observe that considering scaling and bias parameters shifts the distribution of top tokens towards common words.
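A hedged sketch of this kind of measurement is shown below; it is an assumed setup rather than the paper's exact probe, comparing token rankings from the GPT-2 embedding matrix with and without the final layer-norm scale (gamma) and bias (beta):

```python
# Assumed probe (not necessarily the paper's exact measurement): rank GPT-2
# tokens by embedding norm, with and without the final layer-norm scale and
# bias applied, and print the top tokens under each setting.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

E = model.wte.weight.detach()            # [vocab, 768] word embedding matrix
gamma = model.ln_f.weight.detach()       # final layer-norm scale
beta = model.ln_f.bias.detach()          # final layer-norm bias

raw_scores = E.norm(dim=1)               # plain embedding norms
scaled_scores = (E * gamma + beta).norm(dim=1)

for name, scores in [("raw", raw_scores), ("with gamma/beta", scaled_scores)]:
    top = scores.topk(5).indices.tolist()
    print(name, [tokenizer.decode([i]) for i in top])
```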
The authors further probe attention heads at different layers using normalized representations of common nouns. They find that some heads preserve the meaning of queries, while others look for keys that precede them or establish contextual associations. However, they do not identify meaningful patterns in deeper layers.
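A probing experiment along these lines could look like the sketch below (an assumed setup, not the authors' exact procedure): layer-normalized embeddings of a few common nouns are pushed through the query and key projections of one head in GPT-2's first block, and each query is matched to the key it scores highest against:

```python
# Assumed probing setup: feed normalized noun embeddings through one head's
# fused query/key projection in GPT-2's first block and report, for each
# query noun, the key noun with the highest attention logit.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
d, n_head = model.config.n_embd, model.config.n_head   # 768, 12
head = 0                                                # head index to probe

nouns = [" dog", " cat", " car", " house", " tree"]
ids = [tokenizer.encode(n)[0] for n in nouns]

with torch.no_grad():
    emb = model.wte.weight[ids]                         # [5, 768] embeddings
    block = model.h[0]
    x = block.ln_1(emb)                                 # normalized representations
    q, k, _ = block.attn.c_attn(x).split(d, dim=1)      # fused QKV projection
    d_head = d // n_head
    q_h = q[:, head * d_head:(head + 1) * d_head]
    k_h = k[:, head * d_head:(head + 1) * d_head]
    scores = (q_h @ k_h.T) / d_head ** 0.5              # query-key logits

for i, noun in enumerate(nouns):
    j = scores[i].argmax().item()
    print(f"query {noun!r} attends most to key {nouns[j]!r}")
```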
The authors thus present a novel geometric interpretation of transformers in NLP; their insights shed light on the inner mechanisms of transformer operations and support an intuitive understanding of transformers as processes that move word particles along the surface of a hyper-sphere.
The paper then works through the geometric intuition behind each component and how it contributes to the transformation of input tokens. The authors discuss layer normalization and its role in projecting latent features onto a lower-dimensional hypersphere, and provide experimental evidence that word embeddings in GPT-2 are distributed across different directions of the hypersphere. Furthermore, they show that the parameters of the final normalization layer are important for obtaining high-scoring tokens consistent with high-frequency tokens in English.
The paper also examines the WQK and WVO matrices as transformations related to the hypersphere. The WQK matrix is seen as an affine transformation that overlaps queries and keys, while the WVO matrix serves as a key-value map between the hypersphere and the original embedding space. Probing experiments are conducted to test these intuitions, revealing insights into the role of query-key attention in earlier layers and the subject-specific nature of the WVO matrix in attention heads at deeper layers.
The authors then integrate these ideas and examine the impact of each component on the residual stream. They provide visual evidence of how the iterative refinement process works within transformers by leveraging dimensionality reduction techniques. Using UMAP projection, they demonstrate how the representation of a token shifts from its original meaning to the meaning of the next token as it progresses through the network.
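A rough sketch of this kind of visualization (not the authors' script; it assumes the transformers and umap-learn packages) collects the residual-stream state of a token at every GPT-2 layer and projects the trajectory to 2-D with UMAP:

```python
# Rough sketch: stack the last token's hidden state from every GPT-2 layer
# into a per-layer trajectory and project it to 2-D with UMAP.
import torch
import umap
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("Traveling is fun", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: (num_layers + 1) tensors of shape [1, seq_len, 768];
# take the last token's state at every layer as its trajectory.
trajectory = torch.stack([h[0, -1] for h in outputs.hidden_states]).numpy()

coords = umap.UMAP(n_components=2, n_neighbors=5).fit_transform(trajectory)
for layer, (u, v) in enumerate(coords):
    print(f"layer {layer:2d}: ({u:7.2f}, {v:7.2f})")
```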
In conclusion, the paper presents a new interpretation of transformers based on their geometric properties. The authors highlight the importance of layer normalization, the role of WQK and WVO matrices, and the iterative refinement process in understanding the behavior of transformer models. These findings contribute to a deeper understanding of how transformers operate and provide insights into their interpretability.
Overall, this paper sheds light on the geometric interpretation of transformers for NLP tasks. It explores the role of layer normalization, WQK and WVO matrices, and iterative refinement in the transformation process. The experimental evidence and visualizations presented offer valuable insights into the inner workings of transformer models and their interpretability. By understanding the geometric properties of transformers, researchers can gain a better understanding of their behavior and potentially improve their performance in various NLP tasks.