Summary: Designing Stable and Transferable Sparse Models (arxiv.org)
19,416 words - PDF document
One Line
The document explores the design, stability, and performance trade-offs of sparse expert models, highlighting their advantages in various modalities and discussing potential advancements for future research.
Key Points
- Stabilizing sparse models often leads to a tradeoff with model quality.
- The router z-loss stabilizes models without quality degradation.
- Sparse models require careful consideration of stability and quality tradeoffs.
- Sparse expert models have shown success in various modalities such as language processing, image recognition, and speech recognition.
- The paper provides insights into the design, stability, and performance tradeoffs of sparse expert models.
Summaries
424 word summary
The document discusses the design and optimization of sparse models, highlighting the importance of stability and trade-offs in model quality. It explores the use of sparse expert models on natural language processing benchmarks and the advantages of sparse models in terms of reduced carbon footprint and training energy cost. The paper presents a large-scale study comparing sparse and dense models, as well as architectural, routing, and model design principles for efficient sparse models. The study also introduces a router z-loss to resolve instability issues and provides a design guide for sparse expert models.
Sparse models require careful consideration of stability and quality trade-offs. Top-n routing can be used to route each token to multiple experts, and the router and the experts are the two core components of a sparse layer. Load balancing techniques and a larger capacity factor can improve stability but increase memory and computation costs. Training instabilities in sparse models are worse than in standard models; stabilizing techniques such as constraining activations and gradients help, but most of them trade away model quality, whereas the router z-loss stabilizes models without quality degradation.
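As an illustration of the router z-loss idea, the sketch below penalizes the squared log-sum-exp of the router logits for each token, which discourages the logits from growing large enough to cause roundoff problems in the softmax. The function name and array shapes are assumptions for this example, not details taken from the paper's implementation.

```python
import numpy as np

def router_z_loss(router_logits: np.ndarray) -> float:
    """Mean squared log-sum-exp of the pre-softmax router logits.

    router_logits: array of shape [num_tokens, num_experts].
    Large logit magnitudes make this penalty grow, nudging the router
    toward small, numerically well-behaved values.
    """
    # Numerically stable log-sum-exp over the expert dimension.
    max_logit = router_logits.max(axis=-1, keepdims=True)
    log_z = np.squeeze(max_logit, axis=-1) + np.log(
        np.exp(router_logits - max_logit).sum(axis=-1)
    )
    return float(np.mean(log_z ** 2))
```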
The document highlights the performance of sparse models compared to previous approaches on tasks such as question answering and summarization. It discusses the benefits of sparse models across modalities, including language processing, image recognition, and speech recognition, and mentions potential advancements in routing algorithms and regularization techniques that could improve the quality of sparse models.
The document discusses the design of stable and transferable sparse models, exploring the architecture and training objective of these models. It mentions the use of sentinel tokens, gradient noise, rectified linear units (ReLU), and regularized dropout in neural networks, as well as the question of whether model modifications transfer across different implementations and applications.
The document presents various modifications and experiments conducted to design stable and transferable sparse models. It explores routing decisions using word embeddings and additional dense feed-forward network (FFN) layers to improve model quality. The effectiveness of these modifications is demonstrated through tables comparing different model variations.
The document also discusses the optimization of sparse models, exploring techniques such as mixing pre-training and fine-tuning data, load balancing terms, and noise introduced during pre-training. It covers communication costs, the mesh layout for data, model, and expert parallelism, and top-n routing algorithms. The document also explores batch prioritized routing and mentions experiments with negative results.
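To make the top-n routing and capacity-factor ideas concrete, here is a small, purely illustrative top-2 routing sketch: each token is sent to its two highest-probability experts, each expert accepts at most a fixed number of tokens set by a capacity factor, and overflow tokens are dropped. The function name, shapes, and the 1.25 default are assumptions for this example rather than details taken from the paper; batch prioritized routing would additionally order tokens by their router probability before applying the capacity limit, instead of processing them in sequence order.

```python
import numpy as np

def top2_route(router_logits, capacity_factor=1.25):
    """Illustrative top-2 token routing with a fixed per-expert capacity.

    router_logits: [num_tokens, num_experts] array of raw router scores.
    Returns (assignments, gates): for each token, the experts that accepted
    it and the corresponding gate values (softmax probabilities).
    """
    num_tokens, num_experts = router_logits.shape

    # Softmax over the expert dimension to get gate probabilities.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Each expert processes at most this many tokens; the rest are dropped.
    capacity = int(capacity_factor * num_tokens / num_experts)
    load = np.zeros(num_experts, dtype=int)

    assignments, gates = [], []
    for t in range(num_tokens):
        top2 = np.argsort(probs[t])[::-1][:2]   # two highest-probability experts
        kept_experts, kept_gates = [], []
        for e in top2:
            if load[e] < capacity:              # expert still has room
                load[e] += 1
                kept_experts.append(int(e))
                kept_gates.append(float(probs[t, e]))
        assignments.append(kept_experts)        # may be empty if both experts are full
        gates.append(kept_gates)
    return assignments, gates
```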
Overall, the document provides insights into the design, stability, and performance trade-offs of sparse expert models. It contributes to the understanding and improvement of sparse models and presents potential avenues for future research in architectural design.
1562 word summary
The document discusses the design and optimization of sparse models. The authors experimented with various techniques to improve the fine-tuning of sparse models, including mixing pre-training and fine-tuning data, adding load balancing terms, and introducing noise during pre-training. They also explored adding explicit expert positional information and information about dropped tokens to the router, and they report the negative results of several of these ideas. The document provides details on communication costs for distributed models, discusses the mesh layout for data, model, and expert parallelism, and presents results on top-n routing algorithms. It also covers the sensitivity of the fine-tuning protocol, describes the pre-training dataset used, and explores batch prioritized routing for lower capacity factors.
The document then describes the modifications and experiments conducted to design stable and transferable sparse models. Initial experiments showed that routing on the word embedding alone hurt model quality, but using the word embedding in addition to the layer's normal hidden activation improved performance in the authors' setting; similar methods inspired by previous work did not yield significant improvements. The researchers found that adding more multiplicative interactions into the network improved the quality of sparse models, and that inserting an extra dense feed-forward network (FFN) layer immediately before or after each sparse layer significantly improved quality. The effectiveness of these modifications is demonstrated through tables comparing different model variations. The document also covers auxiliary losses, load balancing, and token routing techniques that encourage a uniform distribution of tokens across experts and improve model performance. Overall, the researchers identify promising avenues for future architectural research on sparse models.
The document also touches on a long list of cited related work: gradient noise for improving learning in deep networks; rectified linear units (ReLU) in restricted Boltzmann machines; topic-aware convolutional neural networks for extreme summarization; whether transformer modifications transfer across implementations and applications; regularized dropout (R-Drop) for neural networks; conditional computation and sparse-MLP architectures; the M6-10T sharing-delinking paradigm for efficient multi-trillion-parameter pretraining; prefix-tuning for optimizing continuous prompts; parameter-efficient prompt tuning; BASE layers and Switch Transformers for scaling; the GShard approach for scaling giant models with conditional computation; task-level mixture-of-experts for efficient inference; language models with conditional computation; 8-bit optimizers via block-wise quantization; dense passage retrieval for open-domain question answering; Gaussian error linear units (GELUs); language models for question answering; batch normalization for accelerating deep network training; parameter-efficient transfer learning for NLP; layer normalization and routing mechanisms in neural networks; mixture-of-experts architectures for scaling language models; language modeling with routed transformers; simple and efficient sparsity; mixtures of experts with applications to multi-task learning; disentangling domains for modular language modeling; unified scaling laws for routed language models and language models as few-shot learners; bidirectional transformers for language understanding; semantic parsing on Freebase; adaptive mixtures of local experts and hierarchical mixtures of experts; hierarchical image databases for computer vision; and long short-term memory (LSTM) networks.
Several further observations are highlighted. Sparse models with more multiplicative interactions can improve model performance. Future precision formats may consider compressed exponent ranges for training certain classes of models, and lower-precision training is discussed as a way to stabilize models. Generalizing findings from small to large scale can be challenging but is important for designing stable models. Adaptive computation in sparse models allows different computation to be applied to different inputs. Routing algorithms play a crucial role in the performance of sparse models, and there is room for improvement in this area. Sparse expert models have shown success in modalities such as language processing, image recognition, and speech recognition, and there is potential for further advancements in routing algorithms and regularization techniques to improve their quality. Pre-training on multilingual data can result in unpredictable dynamics, and the variance of sequences per group across batches can affect model stability. Expert specialization is observed in sparse models, particularly in how different types of tokens are handled; the entropy of routing varies across layers and between the encoder and decoder, and further research is needed to better leverage sparsity and expert specialization in the decoder.
The document also explores the architecture and training objective of these models, highlighting the lack of expert specialization in the decoder, the routing of tokens among experts, and the use of sentinel tokens in the decoder. The performance of the sparse models is compared to previous state-of-the-art approaches on tasks including question answering and summarization; the results show that the sparse models outperform, or achieve performance comparable to, dense models. The document concludes by discussing the limitations of the sparse models and potential improvements.
Sparse models were studied in the context of designing stable and transferable models, and the SuperGLUE benchmark was used to evaluate their performance. The study found that increasing the train and eval capacity factors improved model quality. It also considered the number of experts, recommending top-2 routing and at most one expert per core. Inserting sentinel tokens during fine-tuning improved performance on the Grammar Error Correction task, and sparse models proved robust to dropped tokens during fine-tuning. Finally, the study highlighted the sensitivity of sparse models to batch size and learning rate.
Sparse and dense models call for different fine-tuning protocols, with sparse models benefiting from smaller batch sizes and higher learning rates. Fine-tuning only a subset of model parameters can improve generalization and reduce memory usage. Sparse models are prone to overfitting and may require additional regularization techniques; they converge faster during fine-tuning but may underperform dense models on smaller tasks. Sparse expert models are sensitive to roundoff errors because of the exponential functions used in routing, so choosing the right numerical precision format is important for efficiency and stability. The total loss during pre-training is a combination of the cross-entropy loss, an auxiliary load balance loss, and the router z-loss (written out schematically after the list below).
Summary:
1. Stabilizing sparse models often leads to a tradeoff with model quality.
2. The router z-loss stabilizes models without quality degradation.
3. Removing multiplicative interactions and injecting model noise improve stability.
4. Constraining activations and gradients can stabilize models but may worsen quality.
5. Training instabilities in sparse models are worse than in standard models.
6. Load balancing techniques and the capacity factor can improve stability but increase memory and computation costs.
7. The router and the experts are the core components of sparse models.
8. Top-n routing can be used to route tokens to multiple experts.
9. Sparse models require careful consideration of stability and quality tradeoffs.
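The combined pre-training objective mentioned above can be written schematically as a weighted sum of the three terms; the coefficient symbols below are illustrative, since the summary does not state the weights used:

$$
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{CE}} \;+\; c_B\,\mathcal{L}_{\text{balance}} \;+\; c_z\,\mathcal{L}_{z}
$$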
Paragraph 1: The Mixture-of-Experts (MoE) layer was originally introduced in LSTM-based models and was later incorporated into the Transformer. The MoE layer routes token representations to experts based on gate values. The gate value for each expert is produced by the router, and the output of the layer is a weighted sum of each selected expert's computation.
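One standard way to write the gating described in Paragraph 1, with a router weight matrix $W_r$, token representation $x$, $N$ experts $E_1,\dots,E_N$, and the set $\mathcal{T}$ of top-n selected experts (the notation is the conventional one, not taken verbatim from the document):

$$
p_i(x) \;=\; \frac{e^{(W_r x)_i}}{\sum_{j=1}^{N} e^{(W_r x)_j}}, \qquad
y \;=\; \sum_{i \in \mathcal{T}} p_i(x)\, E_i(x).
$$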
Paragraph 2: Sparse expert models replace neural network layers with a set of experts, each with unique weights. These models have been shown to achieve state-of-the-art performance across various natural language processing benchmarks.
Paragraph 3: The paper aims to increase the practicality and reliability of sparse models by studying the trade-offs between model quality and stability. It introduces a router z-loss that resolves instability issues and provides a design guide for sparse expert models.
Paragraph 4: The paper presents a large-scale study of the quality-stability trade-offs, a fine-tuning analysis comparing sparse and dense models, and architectural, routing, and model design principles for efficient sparse models.
Paragraph 5: The paper discusses the advantages of sparse expert neural networks, including reduced carbon footprint and training energy cost. It also mentions the challenges of training sparse models.
Overall, the paper contributes to the understanding and improvement of sparse expert models, providing insights into their design, stability, and performance trade-offs.