Summary: Reducing Parameters in the Transformer Architecture for Improved Efficiency (arxiv.org)
9,015 words - PDF document
One Line
The paper improves efficiency in the Transformer architecture by reducing parameters, specifically in the Feed Forward Network (FFN), and experimentally evaluates the impact of removing or sharing the FFN.
Key Points
- The authors of this paper explore the role of the Feed Forward Network (FFN) in the Transformer architecture and find that it is highly redundant despite its significant parameter usage.
- The authors aim to improve the efficiency of the Transformer architecture by reducing the number of parameters, particularly in the feed-forward networks (FFNs) of the encoder and decoder.
- The Local Neighborhood Similarity (LNS) method is used to measure the similarity between the semantic spaces of different models in natural language processing.
- The authors conduct experiments on reducing the number of parameters in the Transformer architecture while improving efficiency, finding that the encoder and decoder FFNs contribute differently and that sharing one FFN across the encoder layers can lead to improvements.
- Sharing feed-forward networks (FFNs) in the Transformer architecture consistently lowers similarity scores and decreases redundancy within the network.
- The authors experiment with different models and configurations to analyze their impact on accuracy and inference speed; dropping the decoder FFNs in the Deep Encoder Shallow Decoder model yields further improvements.
- Strategies for reducing the number of parameters in neural machine translation were explored, including different ways of sharing feed-forward networks (FFNs) within a module of N layers.
Summaries
32 word summary
This paper aims to improve efficiency in the Transformer architecture by reducing parameters, particularly in the Feed Forward Network (FFN). Experimental investigation is conducted to assess the effects of removing the FFN.
45 word summary
The authors of this paper aim to improve the efficiency of the Transformer architecture by reducing the number of parameters. They focus on the redundancy of the Feed Forward Network (FFN) and conduct experiments to determine the impact of removing it. The Local Neighborhood Similarity (LNS) metric is used to assess how these changes affect the models' semantic spaces.
479 word summary
The authors of this paper explore the role of the Feed Forward Network (FFN) in the Transformer architecture and find that it is highly redundant despite its significant parameter usage. They conduct experiments and find that they can substantially reduce the number of parameters by removing the FFN from the decoder layers and sharing a single FFN across the encoder layers.
The authors of this study aim to improve the efficiency of the Transformer architecture by reducing the number of parameters. They focus on the feed-forward networks (FFNs) in the encoder and decoder, which make up the majority of the parameter budget.
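For intuition on why the FFNs dominate the parameter budget, the rough count below uses standard Transformer-base-style dimensions (d_model = 512, d_ff = 2048, 6 layers per stack). These numbers are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Rough parameter count for a Transformer-base-like model (illustrative assumption,
# not the paper's exact configuration).
d_model, d_ff, n_layers = 512, 2048, 6  # assumed dimensions

# Each FFN has two projections, d_model -> d_ff and d_ff -> d_model (biases ignored).
ffn_params_per_layer = 2 * d_model * d_ff

# Each attention block has Q, K, V and output projections: 4 * d_model^2.
attn_params_per_layer = 4 * d_model * d_model

print(f"FFN per layer:  {ffn_params_per_layer:,}")   # 2,097,152
print(f"Attn per layer: {attn_params_per_layer:,}")  # 1,048,576

# Sharing one FFN across all 6 encoder layers keeps 1 copy instead of 6.
saved = (n_layers - 1) * ffn_params_per_layer
print(f"Saved by sharing the encoder FFN: {saved:,}")  # 10,485,760 (~10.5M)
```

The FFN is roughly twice the size of the attention block per layer, which is why sharing or removing it cuts the parameter count so sharply.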
The Local Neighborhood Similarity (LNS) method is used to measure the similarity between the semantic spaces of different models. LNS determines similarity based on the similarity of sentence neighbors in the two spaces: the LNS of a sentence between two models is given by the overlap between that sentence's nearest neighbors in the two semantic spaces.
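The paper's exact LNS formula is not reproduced in this summary; the sketch below is one plausible reading, assuming LNS is the mean overlap of each sentence's k nearest neighbors across the two models' semantic spaces. The function name, the choice of cosine similarity, and k are assumptions.

```python
import numpy as np

def local_neighborhood_similarity(reprs_a, reprs_b, k=10):
    """Sketch of an LNS-style score (an assumption, not the paper's exact definition).

    reprs_a, reprs_b: (n_sentences, dim) sentence representations from two models.
    Returns the mean overlap of each sentence's k nearest neighbors across the two spaces.
    """
    def knn_sets(reprs):
        # Cosine similarity between all sentence pairs (metric choice is an assumption).
        normed = reprs / np.linalg.norm(reprs, axis=1, keepdims=True)
        sims = normed @ normed.T
        np.fill_diagonal(sims, -np.inf)  # exclude the sentence itself
        # Indices of the k most similar sentences for each sentence.
        return [set(np.argsort(-row)[:k]) for row in sims]

    neigh_a, neigh_b = knn_sets(reprs_a), knn_sets(reprs_b)
    overlaps = [len(a & b) / k for a, b in zip(neigh_a, neigh_b)]
    return float(np.mean(overlaps))
```

A score of 1.0 means every sentence keeps exactly the same neighborhood in both models; lower scores indicate the modified model's semantic space has drifted from the original.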
In this study, the authors investigate different configurations of the Transformer architecture to improve efficiency. They use dropout rates of 0.1, 0.3, and 0 for different datasets and models, and train the models using fp16.
The authors conducted experiments to reduce the number of parameters in the Transformer architecture while improving efficiency. They found that the encoder and decoder FFNs have different contributions, with the decoder's being more redundant. By sharing one FFN across the encoder and dropping it from the decoder, they substantially reduce the parameter count while maintaining accuracy.
The excerpt discusses the benchmark scores and normalized similarity scores for several models in the Transformer architecture. It highlights that sharing feed-forward networks (FFNs) leads to consistently lower similarity scores and decreased redundancy within the network, and that the One Wide FFN model achieves a favorable trade-off between accuracy and parameter count.
The study focuses on reducing parameters in the Transformer architecture to improve efficiency. The authors experiment with different models and configurations to analyze the impact on accuracy and inference speed. They find that dropping the decoder FFNs in the Deep Encoder Shallow Decoder model results in further improvements.
The article discusses a method for reducing the parameters in the Transformer architecture to improve efficiency. The authors found that by sharing the feed-forward network (FFN) across all encoder layers, removing it from the decoder layers, and increasing the dimension of the shared encoder FFN, accuracy and inference speed can be maintained or improved with a smaller parameter budget.
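As a rough illustration of this final configuration, here is a minimal PyTorch-style sketch under assumed dimensions, not the authors' implementation: one wide FFN module is created once and reused by every encoder layer, while decoder layers keep only attention (masking and other details are omitted for brevity).

```python
import torch.nn as nn

d_model, d_wide_ff, n_heads, n_layers = 512, 4096, 8, 6  # assumed sizes

# A single wide FFN, instantiated once and shared by all encoder layers.
shared_ffn = nn.Sequential(
    nn.Linear(d_model, d_wide_ff), nn.ReLU(), nn.Linear(d_wide_ff, d_model)
)

class EncoderLayer(nn.Module):
    def __init__(self, ffn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn  # the same module object in every layer -> shared parameters
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.ffn(x))

class DecoderLayer(nn.Module):
    """Decoder layer with the FFN removed: self-attention and cross-attention only.
    Causal masking is omitted here for brevity."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, y, memory):
        y = self.norm1(y + self.self_attn(y, y, y)[0])
        return self.norm2(y + self.cross_attn(y, memory, memory)[0])

encoder = nn.ModuleList([EncoderLayer(shared_ffn) for _ in range(n_layers)])
decoder = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])
```

Widening the single shared FFN spends some of the saved parameters where they matter most (the encoder), while the FFN-less decoder keeps inference fast.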
The remaining excerpts list the references cited in the paper: papers and conference proceedings from natural language processing and machine translation, covering topics such as parameter efficiency in Transformer architectures, scaling laws for neural machine translation, measuring statistical dependence, and deep residual learning for image recognition.
In a study on the efficiency of the Transformer architecture, the authors investigate strategies for reducing the number of parameters in neural machine translation. They explore different ways of sharing feed-forward networks (FFNs) within a module of N layers, including sequence and cycle patterns, as sketched below.
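To make these sharing patterns concrete, the sketch below assigns N layers to a smaller pool of FFNs under sequence-style and cycle-style sharing. The exact assignment rules used in the paper are not reproduced here; this follows the common convention for these names and the function is hypothetical.

```python
def ffn_assignment(n_layers, n_ffns, mode):
    """Map each of n_layers layers to one of n_ffns shared FFNs.

    'sequence': consecutive layers share the same FFN (e.g. 0,0,1,1,2,2).
    'cycle':    FFNs are reused in a repeating cycle   (e.g. 0,1,2,0,1,2).
    The exact variants studied in the paper may differ; this is an illustration.
    """
    if mode == "sequence":
        group = n_layers // n_ffns
        return [min(i // group, n_ffns - 1) for i in range(n_layers)]
    if mode == "cycle":
        return [i % n_ffns for i in range(n_layers)]
    raise ValueError(f"unknown mode: {mode}")

print(ffn_assignment(6, 3, "sequence"))  # [0, 0, 1, 1, 2, 2]
print(ffn_assignment(6, 3, "cycle"))     # [0, 1, 2, 0, 1, 2]
```

Sharing a single FFN across all layers is the limiting case of either pattern with n_ffns = 1.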