Summary: Composable Function-Preserving Expansions for Transformer Architectures (arxiv.org)
8,218 words - PDF document
One Line
This paper proposes six composable function-preserving expansions for transformer architectures, addressing the expensive, time-consuming process of training neural networks from scratch whenever the network's scale is increased.
Key Points
- Training state-of-the-art neural networks is computationally expensive and time-consuming.
- Increasing the scale of a neural network usually means re-initializing all parameters from scratch, which hinders the transfer of knowledge from smaller models.
- The paper proposes six composable function-preserving expansions for transformer architectures.
- The document discusses different transformations that can be applied to expand and modify the dimensions of a Transformer architecture.
- The paper references related research papers and technical reports on transformer architectures and neural networks.
Summaries
23 word summary
Training neural networks is expensive and time-consuming. Increasing network scale often requires starting from scratch. This paper proposes six function-preserving expansions for transformers.
38 word summary
Training state-of-the-art neural networks is computationally expensive and time-consuming. Increasing the scale of a neural network usually requires starting from scratch and randomly initializing all parameters, hindering knowledge transfer. This paper proposes six composable function-preserving expansions for transformer architectures.
350 word summary
Training state-of-the-art neural networks is computationally expensive and time-consuming. Increasing the scale of a neural network usually requires starting from scratch and randomly initializing all parameters, which hinders the transfer of knowledge from smaller models. This paper proposes six composable function-preserving expansions for transformer architectures that allow a trained model to be enlarged without changing its function.
The text excerpt discusses composable function-preserving expansions for transformer architectures. It introduces the equations for the feed-forward layers and the multi-head attention (MHA) component. The size of the internal dimension of the MLP component is denoted as "p".
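For reference, a generic form of the transformer feed-forward (MLP) block with internal dimension p is shown below. The symbols are illustrative and may differ from the paper's exact notation.

```latex
% Generic transformer MLP with internal (hidden) width p; \sigma is the
% activation function (e.g. ReLU or GELU). Notation is illustrative, not
% necessarily the paper's exact symbols.
\[
\mathrm{MLP}(x) = \sigma\!\left(x\,W_{1} + b_{1}\right) W_{2} + b_{2},
\qquad
W_{1} \in \mathbb{R}^{h \times p},\;
W_{2} \in \mathbb{R}^{p \times h},
\]
% where h is the transformer hidden size and p is the internal dimension
% that the MLP expansion enlarges.
```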
The document discusses different transformations that can be applied to expand and modify the dimensions of a Transformer architecture. The first transformation discussed is the MLP expansion, which increases the dimension of the internal representation of the MLP. This is done by applying specific parameter transformations that leave the model's output unchanged.
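The summary does not spell out the exact initialization, but one common way to make such an MLP expansion function-preserving is to let the new input-projection columns be arbitrary and zero-initialize the new output-projection rows, so the added hidden units contribute nothing. A minimal NumPy sketch under that assumption:

```python
import numpy as np

# Minimal sketch of a function-preserving MLP expansion (assumed mechanism:
# new columns of W1 / entries of b1 are arbitrary, new rows of W2 are zero,
# so the added hidden units contribute nothing to the output).
h, p, p_new = 8, 16, 24            # hidden size, old and expanded MLP width
rng = np.random.default_rng(0)

W1, b1 = rng.normal(size=(h, p)), rng.normal(size=p)
W2, b2 = rng.normal(size=(p, h)), rng.normal(size=h)

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2   # ReLU MLP

# Expand: append arbitrary columns to W1 and entries to b1, zero rows to W2.
W1_exp = np.concatenate([W1, rng.normal(size=(h, p_new - p))], axis=1)
b1_exp = np.concatenate([b1, rng.normal(size=p_new - p)])
W2_exp = np.concatenate([W2, np.zeros((p_new - p, h))], axis=0)

x = rng.normal(size=(4, h))
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_exp, b1_exp, W2_exp, b2))
```

Only the initialization is constrained; after expansion, the new parameters can be trained freely.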
The paper then discusses further transformations that expand Transformer architectures while preserving the overall function of the model. One such transformation, "head addition," adds new attention heads to the multi-head attention component while leaving its output unchanged.
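Analogously, one way to add a head without changing the MHA output is to zero-initialize the slice of the output projection that corresponds to the new head. The sketch below assumes that mechanism and uses simplified single-sequence attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Sketch of function-preserving head addition (assumed mechanism: the new
# head's rows of the output projection W_O are zero-initialized, so the
# concatenated MHA output is unchanged).
h, d_k, n_heads = 8, 4, 2
rng = np.random.default_rng(1)

def make_head():
    return [rng.normal(size=(h, d_k)) for _ in range(3)]    # W_Q, W_K, W_V

heads = [make_head() for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, h))                    # output projection

def mha(x, heads, W_O):
    outs = []
    for W_Q, W_K, W_V in heads:
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        outs.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)     # per-head attention
    return np.concatenate(outs, axis=-1) @ W_O

# Add one head with arbitrary Q/K/V weights but a zero block appended to W_O.
heads_exp = heads + [make_head()]
W_O_exp = np.concatenate([W_O, np.zeros((d_k, h))], axis=0)

x = rng.normal(size=(5, h))
assert np.allclose(mha(x, heads, W_O), mha(x, heads_exp, W_O_exp))
```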
The document discusses six transformations that can be applied to a transformer model to increase its scale. These transformations include increasing the size of the MLP internal representation, the number of attention heads, the size of the attention heads' output representation, the size of the attention input representation, the size of the transformer layers' input/output representation, and the number of layers.
The references cover various research papers and technical reports related to transformer architectures and neural networks. These papers discuss topics such as scaling vision transformers, growing neural networks using gradient information, deep sparse rectifier neural networks, and Gaussian error linear units (GELUs).
The document also includes references to papers and preprints related to transformer architectures and language models. One paper discusses staged training for transformer language models, while another focuses on learning to grow pretrained models for efficient transformer training.
The final part of the summary discusses the application of the composable function-preserving expansions to transformer architectures. The text excerpt includes equations and proofs related to the expansion transformations and to layer addition in transformer models. The main ideas presented are:
- The expansion equations demonstrate that the transformed (expanded) parameters produce the same output as the original model for any input, so the function is preserved exactly (a minimal layer-addition sketch follows below).
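For layer addition, a common function-preserving recipe (assumed here, since the summary omits the exact constraints) is to initialize the new residual block so that it acts as the identity, e.g. by zero-initializing its final projections. A simplified sketch with only an MLP sublayer and no normalization:

```python
import numpy as np

# Sketch of function-preserving layer addition (assumption: the new residual
# block is initialized to act as the identity by zeroing its final projection,
# so x + sublayer(x) == x). Attention and normalization are omitted for brevity.
h, p = 8, 16
rng = np.random.default_rng(2)

def new_identity_layer():
    return {
        "W1": rng.normal(size=(h, p)), "b1": rng.normal(size=p),  # arbitrary
        "W2": np.zeros((p, h)),        "b2": np.zeros(h),         # zeroed
    }

def layer_forward(x, params):
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)     # ReLU MLP
    return x + hidden @ params["W2"] + params["b2"]               # residual

x = rng.normal(size=(4, h))
assert np.allclose(layer_forward(x, new_identity_layer()), x)     # identity
```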