Summary of Pre-training Modular Transformers for Multilingual NLP

Summary Pre-training Modular Transformers for Multilingual NLP aclanthology.org

11,049 words - PDF document - View PDF document

One Line

Pfeiffer et al. propose language-specific modules to enhance the performance and scalability of NLP models in multilingual settings.

Slides

Slide Presentation (10 slides)

Copy slides outline Copy embed code Download as Word

Pre-training Modular Transformers for Multilingual NLP

Source: aclanthology.org - PDF - 11,049 words - view

Introducing the Curse of Multilinguality

• Multilingual NLP models often suffer from a drop in per-language performance as they cover more languages.

Solution: Language-Specific Modules

• Pre-training modular models with language-specific components can mitigate the curse of multilinguality.

• X-MOD models outperform conventional non-modular models (SHARED models) on various tasks.

• Adding language-specific capacity during pre-training is crucial for mitigating negative interference between languages.

Improved Performance with X-MOD Models

• X-MOD models show improved monolingual and cross-lingual performance.

• Language-specific components help in achieving positive transfer between languages.

• Longer training and more update steps improve the performance of X-MOD models.

Scalability and Flexibility

• The scalability of the approach allows for the addition of languages post-hoc without sacrificing performance.

• X-MOD models can be extended to new languages, making them suitable for a large number of languages.

Comparison to Adapter-Based Approaches

• Adapters added after pre-training do not mitigate the curse of multilinguality.

• Performance of adapters strongly correlates with the performance of fully shared models.

• Importance of adding language-specific capacity during pre-training is highlighted.

Impact of Training Steps on Performance

• Longer training results in improved performance.

• More update steps are needed for modularity to take effect in X-MOD models.

Implications for Multilingual NLP Models

• Pre-training modular transformers with language-specific components improves performance on multilingual NLP tasks.

• Challenges of multilinguality in NLP can be addressed by incorporating language-specific information.

Contributing to Pre-training Methods

• Importance of considering language-specific information in model design.

• Benefits of incorporating language-specific information in modular transformers.

• Future research can focus on developing more effective and efficient multilingual NLP models.

Pre-training Modular Transformers for Multilingual NLP

• Pre-training modular models with language-specific components mitigates the curse of multilinguality.

• X-MOD models outperform SHARED models, showing improved performance.

• Scalability allows for the addition of languages post-hoc without sacrificing performance.

• Language-specific capacity during pre-training is crucial for success.

• Overall, this approach improves multilingual NLP and contributes to the understanding of handling multilinguality.

Key Points

Pre-training modular transformers with language-specific components can mitigate the curse of multilinguality in multilingual natural language processing (NLP) models.
X-MOD models outperform conventional non-modular models (SHARED models) on various tasks, showing improved monolingual and cross-lingual performance.
Adding language-specific capacity during pre-training is crucial for mitigating the negative interference between languages.
Longer training and more update steps improve the performance of X-MOD models.
The scalability of the approach allows for the addition of languages post-hoc without sacrificing performance.

Summaries

19 word summary

Pfeiffer et al. address multilinguality in NLP models with language-specific modules, improving performance for various tasks and maintaining scalability.

56 word summary

Pfeiffer et al. propose a solution to the curse of multilinguality in NLP models by introducing language-specific modules in their X-MOD models. Pre-training these modules from the start improves monolingual and cross-lingual performance for tasks such as NLI, NER, and QA. The study shows that adding languages post-hoc does not decrease performance, making the model scalable.

123 word summary

In their study titled "Lifting the Curse of Multilinguality by Pre-training Modular Transformers," Pfeiffer et al. propose a solution to the curse of multilinguality in multilingual natural language processing (NLP) models. They introduce language-specific modules in their Cross-lingual Modular (X-MOD) models and pre-train them from the start. The experiments conducted on natural language inference (NLI), named entity recognition (NER), and question answering (QA) tasks show that X-MOD models mitigate negative interference between languages and enable positive transfer, resulting in improved monolingual and cross-lingual performance. The study also demonstrates that adding languages post-hoc does not decrease performance, making their model scalable to new languages. Overall, pre-training modular models with language-specific components from the start can lift the curse of multilinguality and improve cross-lingual performance.

437 word summary

In their study, "Lifting the Curse of Multilinguality by Pre-training Modular Transformers," Pfeiffer et al. propose a solution to the curse of multilinguality in multilingual natural language processing (NLP) models. They introduce language-specific modules in their Cross-lingual Modular (X-MOD) models, pre-training the modules from the start. The authors conducted experiments on natural language inference (NLI), named entity recognition (NER), and question answering (QA) tasks, comparing the performance of X-MOD models to conventional non-modular models (SHARED models) on increasing sets of languages. The results showed that X-MOD models mitigate negative interference between languages and enable positive transfer, resulting in improved monolingual and cross-lingual performance.

The study demonstrated that adding languages post-hoc does not decrease performance, making their model scalable to new languages. Comparisons with adapter-based approaches revealed the importance of language-specific capacity during pre-training for mitigating the curse of multilinguality.

The impact of the number of update steps on X-MOD model performance was analyzed, showing that longer training improved performance, indicating the need for more update steps for modularity to take effect.

Overall, pre-training modular models with language-specific components from the start can lift the curse of multilinguality and improve cross-lingual performance. The scalability of their approach and potential to cover all languages were emphasized.

In conclusion, Pfeiffer et al. present a novel approach to addressing the curse of multilinguality in multilingual NLP models. Pre-training modular models with language-specific components mitigates negative interference between languages and achieves positive transfer. Their approach enables the addition of languages post-hoc without a drop in performance, making their model scalable to a large number of languages.

The study explores pre-training modular transformers for multilingual NLP, comparing SHARED and X-MOD models. X-MOD consistently outperforms SHARED, suggesting that language-specific components help mitigate negative interference caused by multilinguality.

The impact of training steps on model performance is investigated, showing that as the number of training steps increases, the X-MOD model becomes more competitive with SHARED, especially with a small number of languages.

The performance of pre-trained and added languages on various datasets consistently shows that X-MOD outperforms SHARED. Language selection for pre-training is analyzed, providing details about language families, scripts, and results on perplexity, XNLI, and NER for each set of languages.

The results demonstrate that pre-training modular transformers with language-specific components improves performance on multilingual NLP tasks, addressing the challenges of multilinguality and improving model generalization across languages.

The study contributes to the research on pre-training methods for multilingual NLP, highlighting the importance of language-specific information in model design and demonstrating the benefits of incorporating such information in modular transformers. The findings inform future research on developing more effective and efficient multilingual NLP models.

555 word summary

In the study "Lifting the Curse of Multilinguality by Pre-training Modular Transformers," Pfeiffer et al. propose a solution to the issue of the curse of multilinguality in multilingual natural language processing (NLP) models. They introduce language-specific modules in their Cross-lingual Modular (X-MOD) models, pre-training the modules from the start. The authors conducted experiments on natural language inference (NLI), named entity recognition (NER), and question answering (QA) tasks, comparing the performance of X-MOD models to conventional non-modular models (SHARED models) on increasing sets of languages. The results showed that X-MOD models mitigate negative interference between languages and enable positive transfer, resulting in improved monolingual and cross-lingual performance.

The authors demonstrated that their approach allows for the addition of languages post-hoc without a drop in performance, making their model scalable to new languages. They also compared X-MOD models to adapter-based approaches and found that adding language-specific capacity during pre-training was crucial for mitigating the curse of multilinguality.

The study analyzed the impact of the number of update steps on X-MOD model performance and found that longer training improved performance, indicating that more update steps were needed for modularity to take effect.

Overall, the study showed that pre-training modular models with language-specific components from the start can lift the curse of multilinguality and improve cross-lingual performance. The authors emphasized the scalability of their approach and its potential for covering all languages of the world.

In conclusion, Pfeiffer et al. present a novel approach to addressing the curse of multilinguality in multilingual NLP models. By pre-training modular models with language-specific components, they mitigate negative interference between languages and achieve positive transfer. Their approach enables the addition of languages post-hoc without a drop in performance, making their model scalable to a large number of languages.

The study explores pre-training modular transformers for multilingual NLP, comparing SHARED and X-MOD models. The researchers evaluate the models on various tasks and find that X-MOD consistently outperforms SHARED, suggesting that language-specific components in X-MOD help mitigate negative interference caused by multilinguality.

The impact of training steps on model performance is investigated, and it is found that as the number of training steps increases, the X-MOD model becomes more competitive with SHARED, especially with a small number of languages. This indicates the effectiveness of the added language-specific components in handling multilinguality.

The study also evaluates the performance of pre-trained and added languages on various datasets, consistently showing that X-MOD outperforms SHARED. An analysis of language selection for pre-training is included, providing details about language families, scripts, and results on perplexity, XNLI, and NER for each set of languages.

The results demonstrate that pre-training modular transformers with language-specific components improves performance on multilingual NLP tasks. The findings have implications for the development of multilingual NLP models, addressing the challenges of multilinguality and improving model generalization across languages.

In conclusion, the study investigates pre-training modular transformers for multilingual NLP and shows that incorporating language-specific components improves performance on various tasks. The findings have implications for the development of multilingual NLP models and contribute to understanding how to handle multilinguality in pre-training.

852 word summary

In the study "Lifting the Curse of Multilinguality by Pre-training Modular Transformers" by Jonas Pfeiffer et al., the authors address the issue of the curse of multilinguality in multilingual natural language processing (NLP) models. These models often suffer from a drop in per-language performance as they cover more languages. The authors propose a solution to this problem by introducing language-specific modules in their Cross-lingual Modular (X-MOD) models. Unlike previous approaches that add language-specific components after pre-training, the authors pre-train the modules from the start.

The authors conducted experiments on three downstream tasks: natural language inference (NLI), named entity recognition (NER), and question answering (QA). They compared the performance of their X-MOD models to conventional non-modular models (referred to as SHARED models) on increasing sets of languages. The results showed that the X-MOD models not only mitigated the negative interference between languages but also enabled positive transfer, resulting in improved monolingual and cross-lingual performance.

Furthermore, the authors demonstrated that their approach allowed for the addition of languages post-hoc without a measurable drop in performance. This means that their model can be extended to new languages without limiting its usage to a set of pre-trained languages.

The authors also compared their X-MOD models to adapter-based approaches, such as MAD-X. They found that the additional capacity provided by adapters added after pre-training was not able to mitigate the curse of multilinguality. The performance of the adapters strongly correlated with the performance of the corresponding fully shared models. This highlights the importance of adding language-specific capacity during pre-training.

The authors analyzed the impact of the number of update steps on the performance of their X-MOD models. They found that longer training resulted in improved performance, suggesting that more update steps were needed for modularity to take effect.

Overall, the results of the study showed that pre-training modular models with language-specific components from the start can lift the curse of multilinguality and improve cross-lingual performance. The authors emphasized the scalability of their approach, as their model can be extended to new languages post-hoc without sacrificing performance. They also highlighted the potential of their approach for covering all languages of the world.

In conclusion, the study by Pfeiffer et al. presents a novel approach to addressing the curse of multilinguality in multilingual NLP models. By pre-training modular models with language-specific components, the authors were able to mitigate negative interference between languages and achieve positive transfer. Their approach enables the addition of languages post-hoc without a drop in performance, making their model scalable to a large number of languages.

The study explores pre-training modular transformers for multilingual natural language processing (NLP). The authors compare two model variants: SHARED, which uses a shared vocabulary across all languages, and X-MOD, which incorporates language-specific components.

The researchers evaluate the performance of the models on various tasks, including machine translation, named entity recognition (NER), and question answering. They find that the X-MOD model consistently outperforms the SHARED model on these tasks. The results suggest that the language-specific components in X-MOD help mitigate the negative interference caused by multilinguality.

The authors also investigate the impact of training steps on model performance. They find that as the number of training steps increases, the X-MOD model becomes more competitive with the SHARED model, especially when the number of languages is small. This indicates that the added language-specific components in X-MOD are effective in handling multilinguality.

In addition to evaluating the performance of pre-trained languages, the researchers also evaluate the performance of added languages. They report results on the MLQA, XQuAD, and NER datasets for both pre-trained and added languages. The X-MOD model consistently outperforms the SHARED model on these datasets as well.

The study includes an analysis of language selection for pre-training. The researchers provide details about the selection of languages, including their language families and scripts. They also discuss how they trained models on different numbers of languages and report results on perplexity, XNLI, and NER for each set of languages.

Overall, the results demonstrate that pre-training modular transformers with language-specific components can improve performance on multilingual NLP tasks. The X-MOD model consistently outperforms the SHARED model on various datasets, indicating the effectiveness of incorporating language-specific information.

The findings of this study have implications for the development of multilingual NLP models. By incorporating language-specific components, researchers can improve the performance of pre-trained models on a wide range of languages and tasks. This approach can help address the challenges of multilinguality in NLP and improve the generalization capabilities of models across languages.

The study contributes to the growing body of research on pre-training methods for multilingual NLP. It highlights the importance of considering language-specific information in model design and demonstrates the benefits of incorporating such information in modular transformers. The findings can inform future research on developing more effective and efficient multilingual NLP models.

In conclusion, the study presents an investigation into pre-training modular transformers for multilingual NLP. The results show that incorporating language-specific components in models can improve performance on various tasks. The findings have implications for the development of multilingual NLP models and contribute to the understanding of how to handle multilinguality in pre-training.

Raw indexed text (69,373 chars / 11,049 words / 2,717 lines)

Lifting the Curse of Multilinguality

by Pre-training Modular Transformers

Jonas Pfeiffer ∗1,2,3 , Naman Goyal 3 , Xi Victoria Lin 3 , Xian Li 3 ,

James Cross 3 , Sebastian Riedel 3 , Mikel Artetxe 3

New York University, 2 TU Darmstadt,

Meta AI

Abstract

...

FF Down

Feed

Forward

Add & Norm

Multi-Head

Attention

Figure 1: A transformer layer of our proposed modular

architecture. The dark blue and green components illus-

trate the modular layers, which are language specific.

The Multi-Head Attention and Feed-Forward compo-

nents are shared by all languages.

Recent work on multilingual NLP has focused on

pre-training transformer-based models (Vaswani

et al., 2017) on concatenated corpora of a large

number of languages (Devlin et al., 2019; Conneau

et al., 2020). These multilingual models have been

shown to work surprisingly well in cross-lingual

settings, despite the fact that they do not rely on

direct cross-lingual supervision (e.g., parallel data

or translation dictionaries; Pires et al., 2019; Wu

and Dredze, 2019; Artetxe et al., 2020; Hu et al.,

2020; K et al., 2020; Rust et al., 2021).

However, recent work has uncovered fundamen-

tal limitations of multilingual transformers. Con-

neau et al. (2020) observe that pre-training a model

with a fixed capacity on an increasing amount of

languages only improves its cross-lingual perfor-

mance up to a certain point, after which perfor-

Work done while interning at Meta AI.

FF Down

FF Up

Add & Norm

Introduction

∗

FF Up

Add & Norm

Multilingual pre-trained models are known to

suffer from the curse of multilinguality, which

causes per-language performance to drop as

they cover more languages. We address this is-

sue by introducing language-specific modules,

which allows us to grow the total capacity of

the model, while keeping the total number of

trainable parameters per language constant. In

contrast with prior work that learns language-

specific components post-hoc, we pre-train the

modules of our Cross-lingual Modular (X-

M OD ) models from the start. Our experiments

on natural language inference, named entity

recognition and question answering show that

our approach not only mitigates the negative

interference between languages, but also en-

ables positive transfer, resulting in improved

monolingual and cross-lingual performance.

Furthermore, our approach enables adding lan-

guages post-hoc with no measurable drop in

performance, no longer limiting the model us-

age to the set of pre-trained languages.

mance drops can be measured—a phenomenon

known as the curse of multilinguality (Figure 2).

As such, prior work had to find a trade-off between

supporting more languages and obtaining better

performance on a smaller set of languages.

In this work, we address this problem by in-

troducing language-specific, modular components

during pre-training (Figure 1). Our Cross-lingual,

Modular (X-M OD ) language model shares the ma-

jority of the transformer parameters between all pre-

training languages, while providing each language

with individual capacity to learn idiosyncratic in-

formation without increasing the total number of

trainable parameters per language. While previous

adapter-based approaches (Figure 3a) extend pre-

trained multilingual language models (LMs) with

modular components after pre-training, we add

modular components during pre-training, thereby

3479

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics:

Human Language Technologies, pages 3479 - 3495

X-Mod

4.25

4.00

3.75

# Languages

(a) Mean Perplexity.

shared

# Languages

(b) Mean Performance on XNLI and NER.

Figure 2: Average (a) perplexity and (b) transfer performance on XNLI and NER across pre-trained languages

when training on an increasing number of languages. Each model has seen the same amount of examples in

each language. Lower perplexity and higher downstream score indicate better performance. Refer to Figure 4 for

per-task performance, and Appendix A for per-language performance.

preparing the model to be extended to new lan-

guages post-hoc. Our experiments on natural lan-

guage inference (NLI), named entity recognition

(NER), and question answering (QA) demonstrate

that our modular architecture not only is effective at

mitigating interference between languages, but also

achieves positive transfer, resulting in improved

monolingual and cross-lingual performance. In ad-

dition, we show that X-M OD can be extended to

unseen languages, with no measurable drop in per-

formance, by learning its corresponding modules

and leaving the shared parameters frozen. All in

all, we propose a multilingual architecture that can

scale to a large number of languages without any

loss in performance, and can be further extended

to new languages after pre-training. 1

Background and related work

We provide a background on multilingual and mod-

ular language modelling, as well as approaches that

extend LMs to new languages.

2.1 Multilingual transformers

Recent LMs (Devlin et al., 2019; Conneau et al.,

2020), based on transformer architectures (Vaswani

et al., 2017) and pre-trained on massive amounts

of multilingual data, have surpassed (static) cross-

lingual word embedding spaces (Ruder et al., 2019;

Glavas et al., 2019) for cross-lingual transfer in

NLP (Pires et al., 2019; Wu and Dredze, 2019;

Wu et al., 2020; Hu et al., 2020; K et al., 2020).

Transformer-based models are 1) pre-trained on

textual corpora using Masked Language Modelling

Code and pre-trained models are available at:

https://github.com/pytorch/fairseq/tree/main/examples/xmod.

Head Head

...

Emb Emb

Head Head

...

Emb Emb

Head Head

...

Head Head

...

Emb Emb

(a) Adapter-based

...

Emb Emb

(b) X-M OD

Figure 3: Our proposed architecture in comparison

to adapter-based approaches. (a) Previous approaches

¬ utilize non-modular pre-trained transformer models

and extend them with modular adapter components.

(b) We ¬ pre-train the transformer with modular units

from the get-go, preparing the model to be extended

with additional modular units later on. Yellow and

light blue components indicate standard Multi-Head

Attention and Feed-Forward layers. The remaining

(non-gray) components are bottleneck (modular) units.

Grayed-out components are frozen.

(MLM). They are then 2) fine-tuned on labelled

data of a downstream task in a source language and

3) directly applied to perform inference in a target

language (Hu et al., 2020).

2.2

Modular language models

Modular approaches have a long standing history

in NLP, preceding pre-trained models (Andreas

et al., 2016). They have recently re-gained in-

terest for transformer-based models, where mix-

3480ture of experts (MoE; Shazeer et al., 2017) ap-

proaches have enabled training trillion parame-

ters models in a distributed fashion (Fedus et al.,

2021). More recently modular MoE approaches

have been shown to improve domain-specific pre-

training of LMs (Gururangan et al., 2021). In a

similar trend, ‘expert’ modules have been added

to (non-modular) pre-trained LMs post-hoc, pre-

dominantly referred to as adapters (Rebuffi et al.,

2017, 2018; Houlsby et al., 2019). Next to being ex-

tremely parameter (Houlsby et al., 2019; Mahabadi

et al., 2021a; He et al., 2022) and training efficient

(Pfeiffer et al., 2020a; Rücklé et al., 2021), these

modular approaches allow models to be extended

to new data settings (Chen et al., 2019; Rücklé

et al., 2020), where newly learned knowledge can

be combined (Stickland and Murray, 2019; Wang

et al., 2021a; Pfeiffer et al., 2021a; Lauscher et al.,

2020a; Mahabadi et al., 2021b; Poth et al., 2021),

or stacked for combinatory cross-lingual (Pfeiffer

et al., 2020b, 2021b; Üstün et al., 2020; Vidoni

et al., 2020; Ansell et al., 2021b,a; Wang et al.,

2021b) as well as NMT scenarios (Bapna and Fi-

rat, 2019; Philip et al., 2020; Chronopoulou et al.,

2020; Le et al., 2021; Üstün et al., 2021; Stickland

et al., 2021; Garcia et al., 2021).

2.3 Weaknesses, improvements, and

extensions of language models

Next to the curse of multilinguality, recent works

have shown substantially reduced cross-lingual and

monolingual abilities of models for low-resource

languages with smaller pre-training data (Wu and

Dredze, 2020; Hu et al., 2020; Lauscher et al.,

2020b; Artetxe et al., 2020; Pfeiffer et al., 2020b,

2021b; Chau et al., 2020b; Ponti et al., 2020).

K et al. (2020); Artetxe et al. (2020) show that a

shared vocabulary is not necessary for cross-lingual

transfer. Chung et al. (2021) demonstrate that de-

coupling the input embeddings from the predic-

tion head improves the performance on a number

of downstream tasks. Dufter and Schütze (2020)

show that the number of parameters and training

duration is interlinked with the model’s multilin-

gual capability. Chung et al. (2020); Rust et al.

(2021) show that the tokenizer plays an important

role in the per-language downstream task perfor-

mance, which Clark et al. (2022); Xue et al. (2022);

Tay et al. (2021) take to the extreme by proposing

tokenizer-free approaches.

To extend a monolingual LM to other languages,

Artetxe et al. (2020) train a new embedding layer

with a corresponding target-language tokenizer,

while freezing the pre-trained transformer weights.

Tran (2020) extend a monolingual model to new

languages using bilingual corpora. Wang et al.

(2020); Chau et al. (2020a) extend the vocabu-

lary of multilingual models with a small number

of target-language tokens, to improve the perfor-

mance in the target language. Muller et al. (2021)

propose a transliteration based approach, Vernikos

and Popescu-Belis (2021) propose subword map-

pings, and Pfeiffer et al. (2020b, 2021b); Vidoni

et al. (2020); Ansell et al. (2021b) propose adapter-

based approaches to extend multilingual models to

unseen languages.

While these approaches achieve considerable

performance gains over unseen languages, they are

outperformed by standard full fine-tuning methods

for seen languages. One can further argue that, as

the pre-trained models have already been cursed by

multilinguality, the adapter-based approaches build

upon sub-optimal parameter initializations. 2 In our

work, we consequently aim to 1) modularize the

model from the start to prepare the model to be 2)

extendable to new languages post-hoc.

Proposed approach

We propose X-M OD , a modular multilingual archi-

tecture that combines shared and language-specific

parameters. In contrast to prior work, we pre-

train modular models from the get-go. Our mod-

els can be extended to new languages after pre-

training, and used for cross-lingual transfer learn-

ing in downstream tasks.

Architecture. As illustrated in Figure 1, we

extend the transformer-based architecture from

mBERT (Devlin et al., 2019) and XLM-R (Con-

neau et al., 2020) by incorporating language-

specific modules—bottleneck feed-forward layers—

at every transformer layer. We learn a separate

module for each language, whereas the attention

and feed-forward components are shared. While

the total number of parameters of the model grows

linearly with the number of languages, the train-

ing and inference cost does not increase (as mea-

sured in FLOPs), as only the module in the relevant

language is used for each input. Inspired by the

adapter 3 architecture of Pfeiffer et al. (2021a) we

We investigate this claim further in §6.2.

The term ‘adapter’ refers to newly introduced layers

within a pre-trained (frozen) model. These layers adapt the

3481

3place our ‘modules’ after the LayerNorm of the

feed-forward transformer block, and the residual

connection is placed after the LayerNorm; 4 the Lay-

erNorm before and after the modular component is

shared. 5

Pre-training procedure. Similar to Conneau et al.

(2020), we pre-train our model on MLM on com-

bined monolingual corpora in multiple languages.

Examples of each language are passed through

the shared embedding matrix as well as the multi-

head attention and feed-forward components at

each layer. As each layer contains a language-

specific modular component, the examples are

routed through the respective designated modular

bottleneck layer. Given that each example only

requires access to a single module, modules can

be efficiently stored on only a subset of GPUs in

distributed training.

Extending to new languages. The modular de-

sign of our model allows us to extend it to new

languages after pre-training. To that end, we learn

new embeddings and adapter modules for the tar-

get language through MLM, while the rest of the

components are frozen. 6 Consequently, we are able

to extend the model to a new language by learning

a small number of new parameters, without affect-

ing performance in the set of pre-trained languages.

Following Pfeiffer et al. (2021b), we learn a new

subword vocabulary for the added languages, and

initialize the embeddings of lexically overlapping

tokens from the original embedding matrix.

Fine-tuning on downstream tasks. To transfer

the models to cross-lingual downstream tasks, we

fine-tune the shared weights only on the source

language data, while keeping the modular compo-

nents and the embedding layer frozen. We follow

the standard fine-tuning procedure of adding a pre-

diction head on top of the CLS token. We then

replace the source language modules (as well as

embedding layer for added languages) with the tar-

get language parameters, passing the text of the

target language through the model. 7

representations of the pre-trained mode; we train these mod-

ular components together with the transformer weights, and

therefore refer to them as modules.

We find that the residual connection proposed by Pfeiffer

et al. (2021a) results in training instabilities when trained

together with the transformer weights.

Preliminary results showed that sharing the LayerNorm

results in better cross-lingual transfer performance.

Following Artetxe et al. (2020) we train positional em-

beddings.

We initially also experimented with stacking adapters on

Experimental design

We detail the baseline and models (§4.1), and their

training (§4.2) and evaluation settings (§4.3).

4.1

Model variants

We pre-train separate models for all combinations

along the following axes:

X-M OD vs. SHARED . To evaluate the effective-

ness of our X-M OD model, we aim to compare

ourselves to a conventional non-modular architec-

ture. However, simply removing the modular com-

ponent would be unfair, as the number of FLOPs

and trainable parameters per language would not

be the same—both in terms of pre-training, as

well as fine-tuning. Consequently, for our base-

line model—where all parameters should be fully

shared between all languages—we include a single

bottleneck layer right after the Feed-Forward com-

ponent. Effectively, this is the same architecture

as our X-M OD model, just with a single module

that is shared by all languages. We refer to this

as the SHARED model throughout this paper. 8 To

extend the SHARED model to unseen languages,

we follow Artetxe et al. (2020) and only learn a

new embedding layer, freezing the transformer pa-

rameters. To fine-tune the SHARED model on a

downstream task, we freeze the embedding layer,

as well as the (single) module, thereby fine-tuning

an equal amount of parameters on the downstream

task as the X-M OD model. 9

13 vs. 30 vs. 60 vs. 75 languages. So as to under-

stand how each approach is affected by the curse

of multilinguality, we pre-train the X-M OD and

SHARED models on 4 increasing sets of languages.

We start with an initial set of 13 typologically di-

verse languages that we evaluate on, and add addi-

tional languages for larger sets of 30, 60, and 75

languages. In addition, we keep a set of 7 held-out

languages that we extend the pre-trained models

to. Table 1 lists the specific languages in each

top of the language modules similar to Pfeiffer et al. (2020b,

2021b). While this approach is considerably more parameter

efficient, we find that fine-tuning all shared weights slightly

outperformed the adapter-based approach.

Extending the total number of shared parameters would

be unfair, as X-M OD and SHARED would not have the same

FLOPs nor the same number of trainable parameters when

fine-tuning.

Adapter-based approach such as MAD-X (Pfeiffer et al.,

2020b) would be an alternative. However, this would require

training on languages twice—once during pre-training, and

once when adding adapters—which is not directly comparable

to X-M OD . Nonetheless, we report results in §6.2.

3482Pre-trained

languages

13-LANGS en, ar, fr, hi, ko, ru, th, vi, ta, id, fi, sw, ka

30-LANGS 13-LANGS + cs, eu, hr, hu, hy, it, lt, ml, mn, ms, pl, ro, si, sk, sq, sv, tl

60-LANGS 30-LANGS + af, am, be, bn, ca, cy, da, eo, et, fa, ga, gl, gu, ha, is, ku, la, lv, mk, ne, nl, no, ps,

pt, sa, sd, sl, so, sr, te

75-LANGS 60-LANGS + as, br, bs, fy, gd, jv, kn, mg, mr, om, or, pa, su, xh, yi,

Added languages

bg, de, el, es, tr, ur, zh,

Table 1: Selection of languages. We pre-train different models on 4 sets of languages, and further extend them to

a set of held-out languages post-hoc. We evaluate on XNLI (languages in bold), NER (underlined languages) and

XQuAD/MLQA (languages in italic). For more details about the language selection, see Appendix C.

group. The selection and split of initial as well as

added languages is motivated by typological and

geographical diversity, as well as the availability of

downstream task evaluation data.

Controlling for total vs. per-language updates.

Conneau et al. (2020) investigated the effect of

adding more languages during pre-training, while

training on an equal number of update steps. How-

ever, increasing the number of languages while

keeping the number of updates constant results in

the model seeing less data in each individual lan-

guage. As such, it remains unclear if the curse of

multilinguality happens because of negative inter-

ference, or simply because the number of updates

for each specific language is smaller. So as to un-

derstand this, we compare (1) training on an equal

number of update steps and (2) training on an equal

number of seen examples per language. We start

with the set of 13 languages (Table 1) and train the

respective models for 125k update steps. When

adding more languages, we compare (1) training

models on each set of languages for 125k update

steps, and (2) increasing the number of update steps

such that the models are trained on the same num-

ber of examples in each of the initial 13 languages.

For the latter, this amounts to training for 195k,

265k and 269k update steps, respectively.

4.2 Training details

Data and hyperparameters. We sample lan-

guages with α = 0.7 and train our models with

a batch size of 2048 across 64 V100 GPUs on

the CC100 dataset (Conneau et al., 2020) using

fairseq (Ott et al., 2019). All our models extend the

base transformer architecture, with 12 layers and

768 dimensions. Modules are implemented with

a bottleneck size of 384. The shared transformer

weights account for 270M parameters, whereas

each individual module accounts for 7M parame-

ters. We train our models with a linear learning

rate decay peaking at 7e−4 during pre-training and

1e−4 when adding languages.

Vocabulary. As we aim to identify the impact

of modularity on the curse of multilinguality, we

control for consistent tokenization across the differ-

ent axes. We therefore tokenize using the XLM-R

vocabulary for all our pre-training experiments. 10

However, for languages added post-hoc, we learn a

new SentencePiece tokenizer for each of the target

language, 11 as the languages potentially use scripts

unseen by the original tokenizer.

4.3

Evaluation

We conduct experiments on NLI, NER, and QA.

In all cases, we fine-tune the model on English

and measure the zero-shot transfer performance in

other languages. For NLI we train on MultiNLI

(Williams et al., 2018) and evaluate on XNLI (Con-

neau et al., 2018). For QA, we train on SQuAD

(Rajpurkar et al., 2016) and evaluate on XQuAD

(Artetxe et al., 2020) and MLQA (Lewis et al.,

2020). For NER, we use WikiANN (Pan et al.,

2017; Rahimi et al., 2019). We experiment with

learning rates 1e−4, 3e−4, and 5e−4 and train for

3 or 5 epochs for QA and 5 or 10 epochs for NER

and NLI. For NER and NLI we take the hyperpa-

rameter setting performing best on the development

sets, averaged across the pre-trained languages (Ta-

ble 1). For SQuAD we take the best performing

checkpoint evaluated on the English development

set, and report the cross-lingual test set results. 12

All results are averaged across 5 random seed runs.

Rust et al. (2021) have previously demonstrated the im-

pact of the multilingual tokenizer on the downstream task

performance: languages underrepresented in the sub-word

vocabulary exhibit considerable performance drops when com-

pared to vocabularies dedicated to the respective language.

We train the new tokenizers for a vocabulary size of 30k.

In contrast to NER and NLI, the cross-lingual evaluation

benchmarks of SQuAD do not provide a development set for

each target language on the basis of which the best checkpoint

can be selected. Consequently, we select the checkpoint based

3483Average Pre-Trained Languages

125k

30 40 50 60

number of languages

125k

0.59 125k

0.58

125k

0.810

0.805

0.800

73.0 125k

72.5

72.0

71.5

71.0

10 20

125k

30 40 50 60

number of languages

125k

30 40 50 60

number of languages

125k

30 40 50 60

number of languages

X-Mod

shared

Average Added Languages

0.64 125k

125k

0.62

0.57

0.55

125k

72.5

125k

0.56

125k

73.0

72.0

Average Pre-Trained Languages

Average Added Languages

73.5 125k

125k

Source Language (English)

0.820 125k

0.815

Source Language (English)

84.0 125k

83.5

83.0

82.5

82.0

10 20

X-Mod

shared

0.60

0.58

30 40 50 60

number of languages

30 40 50 60

number of languages

195k

265k

125k

269k

30 40 50 60

number of languages

Source Language (English)

0.820 125k

195k

265k

0.810

0.805

74.0

30 40 50 60

number of languages

Average Pre-Trained Languages

195k

265k

0.58

0.57

30 40 50 60

number of languages

125k

195k

265k

269k

X-Mod

shared

73.5

73.0

269k

0.59

0.815

269k

72.5

0.60 125k

269k

265k

195k

125k

84.5

84.0

83.5

83.0

82.5

(a) All models are trained for 125k update steps. Models trained on more languages have seen less examples in each language.

Source Language (English)

Average Pre-Trained Languages

Average Added Languages

74.5

0.56

30 40 50 60

number of languages

0.650 125k

0.625

0.600

0.575

0.550

10 20

30 40 50 60

number of languages

Average Added Languages

195k

265k

30 40 50 60

number of languages

269k

X-Mod

shared

(b) Models trained on more languages are trained longer. All models have seen the same amount of examples in each language.

Figure 4: Test set results on XNLI (top) and NER (bottom) for models trained on different numbers of languages.

Source Language (English) only includes scores of the source language. Average Pre-Trained Languages includes

all evaluation languages that the model was pre-trained on. Average Added Languages includes all languages that

were added to the model after pre-training. Scores are averaged across all languages and random seeds.

Results and discussion

We present results for pre-trained languages in §5.1

and added languages in §5.2.

5.1 Pre-trained languages

In Figure 4 we plot downstream task results of

models pre-trained on different amounts of lan-

guages. Table 2 reports the individual language per-

formance for the models trained on 60 languages.

The Curse of Multilinguality. Conneau et al.

(2020) showed that multilingual LMs trained on in-

creasing amounts of languages, while maintaining

the number of update steps, exhibit drops in down-

stream task XNLI performance. We reproduce

these results, both in terms of language modelling

perplexity (Figure 2a), 13 as well as downstream

on the best performance on the English development set.

For per-language perplexity see Appendix A.

task performance on XNLI and NER (Figure 4a).

We further find that the curse of multilinguality

does not only happen because the total number of

update steps per language decreases, but also when

all SHARED models are trained on the same num-

ber of examples per language (Figure 4b). This

confirms that fully shared architectures suffer from

negative interference.

Lifting the Curse. While for the SHARED model

we witness negative interference between lan-

guages in terms of perplexity, the X-M OD model is

able to maintain performance, and even improves

for a subset of languages. We observe similar

patterns in the downstream task performance: In

both our experimental setups—(1) we control for

the number of update steps (Figure 4a); (2) we

control for the number of per-language seen ex-

amples (Figure 4b)—our X-M OD model—in con-

trast to the SHARED model—is able to maintain, or

3484NER

XNLI

XQuAD

MLQA

X-M OD

SHARED

X-M OD

SHARED

X-M OD

SHARED

X-M OD

SHARED

en ar fr hi ko ru th vi ta id fi sw ka avg

81.4

81.5 78.9

74.1 77.2

74.7 70.1

64.4 53.0

46.0 59.1

58.3 2.8

4.0 66.2

63.7 51.1

52.5 50.5

51.5 78.6

74.4 73.4

57.2 67.3

61.5 62.8

58.8

84.4

82.8 71.2

69.2 77.6

75.6 68.3

66.6 -

- 74.1

73.2 71.7

68.5 73.4

72.5 -

- -

- 66.9

62.1 -

- 73.5

72.5

85.1

83.8 68.1

64.6 -

- 67.5

65.8 -

- 75.0

72.7 66.3

63.0 74.9

72.6 -

- -

- 72.8

70.4

80.1

79.6 58.6

53.6 -

- 60.7

58.7 -

- -

- 67.5

64.9 -

- -

- 66.7

64.2

Table 2: Pre-trained language results for the modular and shared model variants, pre-trained on the set of 60

languages for 265k update steps. For NER and MLQA we report F 1 , for XNLI accuracy scores. Scores are

averaged across all 5 random seeds of the best hyperparameter setting, evaluated on the development set.

NER

XNLI

MLQA

bg de el es tr ur zh avg

77.6

74.9 75.1

66.3 75.2

69.6 71.9

49.1 72.6

64.8 54.7

50.4 21.6

9.2 64.1

54.9

SHARED 77.4

76.3 75.4

74.1 76.2

74.9 78.5

77.3 72.4

71.0 64.9

64.3 73.8

71.4 74.1

72.8

X-M OD

SHARED -

- 63.8

58.9 -

- 68.6

66.7 -

- -

- 61.7

56.5 64.8

60.7

X-M OD

SHARED

X-M OD

5.2

Table 3: Results for added languages, for models pre-

trained on the set of 60 languages for 265k update steps.

We report F 1 and accuracy scores which are averaged

across all 5 random seeds of the best hyperparameter

setting on the development set.

even outperform model variants trained on less lan-

guages. These results demonstrate that the added

per-language capacity is sufficient for the model to

adequately represent all languages.

Surprisingly, X-M OD not only maintains per-

formance, but actually slightly improves while we

increase the number of languages we pre-train on.

This is even the case for settings where the model

sees less examples in the target language. This

suggests that increasing the language diversity can

have a positive impact on the model’s cross-lingual

representation capability.

X-M OD vs SHARED . Overall, the X-M OD model

pre-trained on 60 languages achieves the best cross-

lingual performance. 14 Our results on XNLI, NER,

MLQA, and XQuAD in Table 2 demonstrate con-

sistent performance gains over the SHARED model

for every task and across (almost) all high- as well

as low-resource languages.

We find that the X-M OD model trained on 75 languages

is less stable than the versions trained on less languages. We

think that this can be attributed to the 15 added languages

being extremely low resource—we only train for an additional

4k update steps—resulting in the respective randomly initial-

ized modules being updated very infrequently. This variance

could potentially be mitigated by training for longer.

Extending to unseen languages

We further evaluate the cross-lingual performance

of languages added in the second step; (1) on the

architectural side—comparing the SHARED with

the X-M OD modelling variant—and (2) by com-

paring the performance when pre-training on the

language, vs. when adding the language post-hoc.

Modular vs Shared. We evaluate if the additional

per-language capacity improves the extendability

of the X-M OD model. On the right in Figure 4a

we plot the results for added languages on XNLI

(top) and NER (bottom). Similarly, we plot the

results for the models where we control for the

number of seen examples per target language in

Figure 4b. We find that the X-M OD model consis-

tently outperforms the SHARED model, with a peak

performance when pre-training on 60 languages,

demonstrating that the language specific capacity

is beneficial for adding new languages post-hoc.

We report results for the 60 language versions in

Table 3, demonstrating the consistent advantage of

the X-M OD over the SHARED model.

Pre-training vs Adding Languages. To evaluate

if there is a measurable difference on downstream

performance for languages that we pre-train on vs.

those we add post-hoc, we train 2 models on differ-

ent initial sets of languages, adding the respectively

missing ones in the second step. So as to under-

stand if the typological similarity of languages has

impact on the downstream task performance, we

split the initial and added languages (Table 1) of

our previous experiments into two parts. The first

split consists of languages where the model was

pre-trained on at least one language of the same

language family (e.g. English vs. German). The

second split consists of languages that are part of

a unique language family, i.e. the model was not

3485Model 1 pre-trained

X-Mod

Model 2 pre-trained

English

Language iso Family Script Model 1 Model 2

English

German

French

Spanish

Russian

Ukranian

Hindi

Urdu

Arabic

Hebrew en

he IE: Germanic

IE: Germanic

IE: Romance

IE: Slavic

IE: Iranian

Afro-Asiatic

Afro-Asiatic Latin

Latin

Cyrillic

Devanagari

Arabic

Hebrew pre-train

add

pre-train

add

pre-train

add

pre-train

add

pre-train

add add

pre-train

add

pre-train

add

pre-train

add

pre-train

add

pre-train

Vietnamese

Thai

Korean

Japanese

Greek

Turkish vi

tr Austro-Asiatic

Kra-Dai

Koreanic

Japonic

IE: Hellenic

Turkic Latin

Thai

Korean

Japanese

Greek

Latin pre-train

pre-train

add

add add

add

pre-train

73.0

72.5

125k

250k

72.0

125k

250k

Figure 6: Results on XNLI when when pre-training on

13 languages for 125k and 250k update steps.

M OD has the potential to cover all languages of the

world, as the model has the capability to be adapted

to new languages post-hoc.

Further analysis

We further analyze the impact of the number of

update steps on X-M OD (§6.1) and compare our

method to adapter-based approaches (§6.2).

6.1

Table 4: Selection of 2 sets of languages that we either

pre-train on, or add post-hoc. The last 6 languages in

the list are part of language families which are unique

in the total list of languages we pre-train on (Table 1),

i.e. none of our models was pre-trained on a language

of the same family.

pre-trained on a language of the same family (Ta-

ble 4). Consequently, we pre-train two models on

two sets of languages, adding the respective other

set post-hoc. 15

Our XNLI results (Figure 5) demonstrate that

the per-language performance is on par when pre-

training vs. when adding the language post-hoc. 16

We also find that the family does not have a measur-

able effect on the performance of the language. Our

results therefore suggest that it is sufficient to train

X-M OD on only a subset of languages for which

sufficient pre-training data exists. Essentially, X-

Pre-Trained Langs

73.5

Figure 5: XNLI test set accuracy of X-M OD mod-

els pre-trained on different languages in comparison to

those added post-hoc (Table 4).

shared

In previous experiments, the modular model trained on

60 languages achieved the best performance. Therefore, the

models in these experiments are also trained on 60 languages.

Both models are trained on the same additional languages, i.e.

the 60-LANGS of Table 1, where only the 13-LANGS differ.

The models have seen an equal amount of examples in

the respective languages in each case.

The importance of update steps

In Figure 4 we have witnessed a slight edge of

the SHARED model over the X-M OD model, when

training on only 13 languages and only training

for 125k update steps. Dufter and Schütze (2020)

found that it requires a large number of update steps

for a model pre-trained on multiple languages to

become multilingual; with the added per-language

capacity we hypothesize that update steps also play

an important role for modular models. We com-

pare the downstream task performance of mod-

els pre-trained on 13 languages, when training for

125k with 250k update steps in Figure 6. When

training for longer we find that the X-M OD model

begins to outperforms the SHARED model in the

source language, while almost closing the gap in

the cross-lingual setting. This supports the hypoth-

esis that the X-M OD model requires more update

steps when training only on a small number of lan-

guages, in order for modularity to “kick-in”.

6.2

X-M OD vs. Adapters

As illustrated in Figure 3, from an architecture per-

spective X-M OD is similar to previously proposed

multilingual Adapter-based methods (MAD-X;

Pfeiffer et al., 2020b). MAD-X utilizes a pre-

trained massively multilingual transformer-based

model and fine-tunes newly introduced adapter

weights on languages the model has seen during

pre-training, and ones the model has not been

3486Source Language (English)

84.0 125k 125k

125k 125k

83.0

83.5

82.5

82.0

81.5

number of languages

Average Pre-Trained Languages

125k

125k 125k

73.0 125k

72.5

72.0

71.5

71.0

70.5

70.0

number of languages

X-Mod

shared

shared_nm

Adapters

Figure 7: Comparison on XNLI of X-M OD and shared

models with an Adapter baseline, all models are pre-

trained for 125k update steps.

trained on. For a fair comparison in terms of seen

examples and number of update steps we train a

transformer model without module components

(shared_nm) for 100k update steps on the respec-

tive languages (Table 1). We subsequently train

adapters on each of the target languages for an-

other 25k update steps. 17 We report results in com-

parison to X-M OD in Figure 7, here results for

shared_nm are for a model that was trained for

125k update steps to instantiate a fair comparison.

Our results demonstrate that the additional capac-

ity of adapters added after pre-training is not able

to mitigate the curse of multilinguality which has al-

ready had a catastrophic impact on the shared trans-

former weights; the performance of the adapters

strongly correlates with the performance of the cor-

responding fully shared model shared_nm. Conse-

quently, adding language-specific capacity during

pre-training is important, as the curse of multilin-

guality cannot be lifted post-hoc.

Conclusions

In this paper, we have evaluated the effectiveness

of modular multilingual language modelling across

multiple axes. We have demonstrated that by

providing additional per-language capacity, while

maintaining the total number of trainable parame-

ters per language, we are not only able to mitigate

negative interference between languages, but ad-

ditionally achieve positive transfer. Our results

suggest that it is sufficient to train our proposed

X-M OD model only on a subset of languages for

which sufficient amounts of textual data is avail-

We follow Pfeiffer et al. (2020b) and train adapter weights

with a learning rate of 0.0001. While they have found that

cross-lingual transfer performance of adapters converges at

∼20k update-steps, we would like to stress that our experi-

mental setup is only one of multiple different valid versions.

A more thorough investigation to find the optimal number of

update steps for pre-training and subsequent adapter training

is necessary, which was out of scope for this work.

able. Unseen languages can be added post-hoc,

with no measurable drop in performance on XNLI.

By pre-training the model in a modular fashion, we

thus mitigate negative interference of idiosyncratic

information, while simultaneously preparing the

model to be extendable to unseen languages.

While in this work we have simulated language

adding scenarios with a held out set of languages, in

future work we aim to evaluate the performance on

truly low-resource languages such as MasakhaNER

(Adelani et al., 2021) and AmericasNLI (Ebrahimi

et al., 2021). We further aim to evaluate the cross-

lingual transfer performance from typologically

more diverse source languages, besides English.

Acknowledgments

We thank Samuel Broscheit for insightful feedback

and suggestions on a draft of this paper, as well

as the ARR reviewers and meta-reviewers for their

valuable comments.

References

David Ifeoluwa Adelani, Jade Z. Abbott, Graham

Neubig, Daniel D’souza, Julia Kreutzer, Constan-

tine Lignos, Chester Palen-Michel, Happy Buza-

aba, Shruti Rijhwani, Sebastian Ruder, Stephen

Mayhew, Israel Abebe Azime, Shamsuddeen Has-

san Muhammad, Chris Chinenye Emezue, Joyce

Nakatumba-Nabende, Perez Ogayo, Aremu An-

uoluwapo, Catherine Gitau, Derguene Mbaye, Jesu-

joba O. Alabi, Seid Muhie Yimam, Tajuddeen Gwad-

abe, Ignatius Ezeani, Rubungo Andre Niyongabo,

Jonathan Mukiibi, Verrah Otiende, Iroro Orife,

Davis David, Samba Ngom, Tosin P. Adewumi, Paul

Rayson, Mofetoluwa Adeyemi, Gerald Muriuki,

Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka

Odu, Eric Peter Wairagala, Samuel Oyerinde,

Clemencia Siro, Tobius Saul Bateesa, Temilola

Oloyede, Yvonne Wambui, Victor Akinode, Deb-

orah Nabagereka, Maurice Katusiime, Ayodele

Awokoya, Mouhamadane Mboup, Dibora Gebrey-

ohannes, Henok Tilaye, Kelechi Nwaike, Degaga

Wolde, Abdoulaye Faye, Blessing Sibanda, Ore-

vaoghene Ahia, Bonaventure F. P. Dossou, Kelechi

Ogueji, Thierno Ibrahima Diop, Abdoulaye Diallo,

Adewale Akinfaderin, Tendai Marengereke, and Sa-

lomey Osei. 2021. MasakhaNER: Named Entity

Recognition for African Languages. In Transac-

tions of the Association for Computational Linguis-

tics 2021.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and

Dan Klein. 2016. Learning to compose neural net-

works for question answering. In NAACL HLT 2016,

The 2016 Conference of the North American Chap-

ter of the Association for Computational Linguistics:

3487Human Language Technologies, San Diego Califor-

nia, USA, June 12-17, 2016, pages 1545–1554. The

Association for Computational Linguistics.

Alan Ansell, Edoardo Maria Ponti, Anna Korhonen,

and Ivan Vulic. 2021a. Composable sparse fine-

tuning for cross-lingual transfer. arXiv preprint.

Alan Ansell, Edoardo Maria Ponti, Jonas Pfeiffer, Se-

bastian Ruder, Goran Glavaš, Ivan Vulić, and Anna

Korhonen. 2021b. MAD-G: Multilingual adapter

generation for efficient cross-lingual transfer. In

Findings of the Association for Computational Lin-

guistics: EMNLP 2021, pages 4762–4781, Punta

Cana, Dominican Republic. Association for Compu-

tational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.

2020. On the cross-lingual transferability of mono-

lingual representations. In Proceedings of the 58th

Annual Meeting of the Association for Computa-

tional Linguistics, pages 4623–4637, Online. Asso-

ciation for Computational Linguistics.

Ankur Bapna and Orhan Firat. 2019. Simple, scal-

able adaptation for neural machine translation. In

Proceedings of the 2019 Conference on Empirical

Methods in Natural Language Processing and the

9th International Joint Conference on Natural Lan-

guage Processing, EMNLP-IJCNLP 2019, Hong

Kong, China, November 3-7, 2019, pages 1538–

1548. Association for Computational Linguistics.

Ethan C. Chau, Lucy H. Lin, and Noah A. Smith.

2020a. Parsing with multilingual BERT, a small cor-

pus, and a small treebank. In Findings of the Associ-

ation for Computational Linguistics: EMNLP 2020,

pages 1324–1334, Online. Association for Computa-

tional Linguistics.

Ethan C. Chau, Lucy H. Lin, and Noah A. Smith.

2020b. Parsing with multilingual bert, a small tree-

bank, and a small corpus. In Proceedings of the

2020 Conference on Empirical Methods in Natu-

ral Language Processing: Findings, EMNLP 2020,

Online Event, 16-20 November 2020, pages 1324–

1334.

Vincent S. Chen, Sen Wu, Alexander J. Ratner, Jen

Weng, and Christopher Ré. 2019. Slice-based learn-

ing: A programming model for residual learning in

critical data slices. In Advances in Neural Infor-

mation Processing Systems 32: Annual Conference

on Neural Information Processing Systems 2019,

NeurIPS 2019, December 8-14, 2019, Vancouver,

BC, Canada, pages 9392–9402.

Alexandra Chronopoulou, Dario Stojanovski, and

Alexander Fraser. 2020. Reusing a Pretrained Lan-

guage Model on Languages with Limited Corpora

for Unsupervised NMT. In Proceedings of the 2020

Conference on Empirical Methods in Natural Lan-

guage Processing (EMNLP), pages 2703–2711, On-

line. Association for Computational Linguistics.

Hyung Won Chung, Thibault Févry, Henry Tsai,

Melvin Johnson, and Sebastian Ruder. 2021. Re-

thinking embedding coupling in pre-trained lan-

guage models. In 9th International Conference on

Learning Representations, ICLR 2021, Virtual Event,

Austria, May 3-7, 2021. OpenReview.net.

Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and

Jason Riesa. 2020. Improving multilingual models

with language-clustered vocabularies. In Proceed-

ings of the 2020 Conference on Empirical Methods

in Natural Language Processing, EMNLP 2020, On-

line, November 16-20, 2020, pages 4536–4546. As-

sociation for Computational Linguistics.

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John

Wieting. 2022. CANINE: pre-training an efficient

tokenization-free encoder for language representa-

tion. Transactions of the Association for Computa-

tional Linguistics, 10.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal,

Vishrav Chaudhary, Guillaume Wenzek, Francisco

Guzmán, Edouard Grave, Myle Ott, Luke Zettle-

moyer, and Veselin Stoyanov. 2020. Unsupervised

cross-lingual representation learning at scale. In

Proceedings of the 58th Conference of the Associ-

ation for Computational Linguistics, ACL 2020, Vir-

tual Conference, July 6-8, 2020, pages 8440–8451.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad-

ina Williams, Samuel Bowman, Holger Schwenk,

and Veselin Stoyanov. 2018. XNLI: Evaluating

cross-lingual sentence representations. In Proceed-

ings of the 2018 Conference on Empirical Methods

in Natural Language Processing, pages 2475–2485,

Brussels, Belgium. Association for Computational

Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and

Kristina Toutanova. 2019. BERT: pre-training of

deep bidirectional transformers for language under-

standing. In Proceedings of the 2019 Conference

of the North American Chapter of the Association

for Computational Linguistics: Human Language

Technologies, NAACL-HLT 2019, Minneapolis, MN,

USA, June 2-7, 2019, Volume 1 (Long and Short Pa-

pers), pages 4171–4186.

Philipp Dufter and Hinrich Schütze. 2020. Identifying

elements essential for BERT’s multilinguality. In

Proceedings of the 2020 Conference on Empirical

Methods in Natural Language Processing (EMNLP),

pages 4423–4437, Online. Association for Computa-

tional Linguistics.

Abteen Ebrahimi, Manuel Mager, Arturo Oncevay,

Vishrav Chaudhary, Luis Chiruzzo, Angela Fan,

John Ortega, Ricardo Ramos, Annette Rios, Ivan

Vladimir, Gustavo A. Giménez-Lugo, Elisabeth

Mager, Graham Neubig, Alexis Palmer, Rolando

A. Coto Solano, Ngoc Thang Vu, and Katharina

Kann. 2021. AmericasNLI: Evaluating Zero-shot

Natural Language Understanding of Pretrained Mul-

tilingual Models in Truly Low-resource Languages.

arXiv preprint.

3488William Fedus, Barret Zoph, and Noam Shazeer. 2021.

Switch Transformers: Scaling to Trillion Parameter

Models with Simple and Efficient Sparsity. arXiv

preprint. First Workshop on Knowledge Extraction and Inte-

gration for Deep Learning Architectures, pages 43–

49, Online. Association for Computational Linguis-

tics.

Xavier Garcia, Noah Constant, Ankur Parikh, and

Orhan Firat. 2021. Towards continual learning for

multilingual machine translation via vocabulary sub-

stitution. In Proceedings of the 2021 Conference of

the North American Chapter of the Association for

Computational Linguistics: Human Language Tech-

nologies, pages 1184–1192, Online. Association for

Computational Linguistics. Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and

Goran Glavaš. 2020b. From zero to hero: On the

limitations of zero-shot language transfer with mul-

tilingual Transformers. In Proceedings of the 2020

Conference on Empirical Methods in Natural Lan-

guage Processing (EMNLP), pages 4483–4499, On-

line.

Goran Glavas, Robert Litschko, Sebastian Ruder, and

Ivan Vulic. 2019. How to (properly) evaluate cross-

lingual word embeddings: On strong baselines, com-

parative analyses, and some misconceptions. In Pro-

ceedings of the 57th Conference of the Association

for Computational Linguistics, ACL 2019, Florence,

Italy, July 28- August 2, 2019, Volume 1: Long Pa-

pers, pages 710–721.

Suchin Gururangan, Mike Lewis, Ari Holtzman,

Noah A. Smith, and Luke Zettlemoyer. 2021. Demix

layers: Disentangling domains for modular lan-

guage modeling. arXiv preprint.

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-

Kirkpatrick, and Graham Neubig. 2022. Towards a

unified view of parameter-efficient transfer learning.

In 10th International Conference on Learning Rep-

resentations, ICLR 2022, Virtual Conference, April

25 - 29, 2022.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzkeb-

ski, Bruna Morrone, Quentin de Laroussilhe, An-

drea Gesmundo, Mona Attariyan, and Sylvain Gelly.

2019. Parameter-efficient transfer learning for NLP.

In Proceedings of the 36th International Conference

on Machine Learning, ICML 2019, 9-15 June 2019,

Long Beach, California, USA, pages 2790–2799.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra-

ham Neubig, Orhan Firat, and Melvin Johnson.

2020. XTREME: A massively multilingual multi-

task benchmark for evaluating cross-lingual gener-

alization. In Proceedings of the 37th International

Conference on Machine Learning, ICML 2020, 12-

18 July 2020, Virtual Conference.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and

Dan Roth. 2020. Cross-lingual ability of multi-

lingual BERT: an empirical study. In 8th Inter-

national Conference on Learning Representations,

ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

2020.

Anne Lauscher, Olga Majewska, Leonardo F. R.

Ribeiro, Iryna Gurevych, Nikolai Rozanov, and

Goran Glavaš. 2020a. Common sense or world

knowledge? investigating adapter-based knowledge

injection into pretrained transformers. In Proceed-

ings of Deep Learning Inside Out (DeeLIO): The

Hang Le, Juan Miguel Pino, Changhan Wang, Jiatao

Gu, Didier Schwab, and Laurent Besacier. 2021.

Lightweight adapter tuning for multilingual speech

translation. In Proceedings of the 59th Annual Meet-

ing of the Association for Computational Linguis-

tics and the 11th International Joint Conference on

Natural Language Processing, ACL/IJCNLP 2021,

(Volume 2: Short Papers), Virtual Event, August 1-

6, 2021, pages 817–824. Association for Computa-

tional Linguistics.

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian

Riedel, and Holger Schwenk. 2020. MLQA: Evalu-

ating cross-lingual extractive question answering. In

Proceedings of the 58th Annual Meeting of the Asso-

ciation for Computational Linguistics, pages 7315–

7330, Online. Association for Computational Lin-

guistics.

Rabeeh Karimi Mahabadi, James Henderson, and Se-

bastian Ruder. 2021a. Compacter: Efficient low-

rank hypercomplex adapter layers. Advances in Neu-

ral Information Processing Systems 34: Annual Con-

ference on Neural Information Processing Systems

2021, NeurIPS 2021.

Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa

Dehghani, and James Henderson. 2021b. Parameter-

efficient multi-task fine-tuning for transformers via

shared hypernetworks.

In Proceedings of the

59th Annual Meeting of the Association for Com-

putational Linguistics and the 11th International

Joint Conference on Natural Language Processing,

ACL/IJCNLP 2021, (Volume 1: Long Papers), Vir-

tual Event, August 1-6, 2021, pages 565–576. Asso-

ciation for Computational Linguistics.

Benjamin Muller, Antonios Anastasopoulos, Benoît

Sagot, and Djamé Seddah. 2021. When being un-

seen from mBERT is just the beginning: Handling

new languages with multilingual language models.

In Proceedings of the 2021 Conference of the North

American Chapter of the Association for Computa-

tional Linguistics: Human Language Technologies,

pages 448–462, Online. Association for Computa-

tional Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela

Fan, Sam Gross, Nathan Ng, David Grangier, and

Michael Auli. 2019. fairseq: A fast, extensible

toolkit for sequence modeling. In Proceedings of

3489the 2019 Conference of the North American Chap-

ter of the Association for Computational Linguistics

(Demonstrations), pages 48–53, Minneapolis, Min-

nesota. Association for Computational Linguistics.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel

Nothman, Kevin Knight, and Heng Ji. 2017. Cross-

lingual name tagging and linking for 282 languages.

In Proceedings of the 55th Annual Meeting of the As-

sociation for Computational Linguistics, ACL 2017,

Vancouver, Canada, July 30 - August 4, Volume 1:

Long Papers, pages 1946–1958.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé,

Kyunghyun Cho, and Iryna Gurevych. 2021a.

AdapterFusion: Non-destructive task composition

for transfer learning. In Proceedings of the 16th

Conference of the European Chapter of the Associ-

ation for Computational Linguistics: Main Volume,

pages 487–503, Online. Association for Computa-

tional Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aish-

warya Kamath, Ivan Vulić, Sebastian Ruder,

Kyunghyun Cho, and Iryna Gurevych. 2020a.

AdapterHub: A Framework for Adapting Transform-

ers. In Proceedings of the 2020 Conference on

Empirical Methods in Natural Language Process-

ing (System Demonstrations), EMNLP 2020, Virtual

Conference, 2020.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Se-

bastian Ruder. 2020b. MAD-X: An Adapter-Based

Framework for Multi-Task Cross-Lingual Transfer.

In Proceedings of the 2020 Conference on Empirical

Methods in Natural Language Processing (EMNLP),

pages 7654–7673, Online. Association for Computa-

tional Linguistics.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebas-

tian Ruder. 2021b. UNKs Everywhere: Adapting

Multilingual Language Models to New Scripts. In

Proceedings of the 2021 Conference on Empirical

Methods in Natural Language Processing, EMNLP

2021, Online, November , 2021.

Jerin Philip, Alexandre Berard, Matthias Gallé, and

Laurent Besacier. 2020. Monolingual adapters for

zero-shot neural machine translation. In Proceed-

ings of the 2020 Conference on Empirical Methods

in Natural Language Processing (EMNLP), pages

4465–4470, Online. Association for Computational

Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.

How multilingual is multilingual bert?

In Pro-

ceedings of the 57th Conference of the Association

for Computational Linguistics, ACL 2019, Florence,

Italy, July 28- August 2, 2019, Volume 1: Long Pa-

pers, pages 4996–5001.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska,

Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020.

XCOPA: A multilingual dataset for causal common-

sense reasoning. In Proceedings of the 2020 Con-

ference on Empirical Methods in Natural Language

Processing (EMNLP), pages 2362–2376, Online. As-

sociation for Computational Linguistics.

Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, and Iryna

Gurevych. 2021. What to pre-train on? efficient

intermediate task selection. In Proceedings of the

2021 Conference on Empirical Methods in Natural

Language Processing, EMNLP 2021, Virtual Event

/ Punta Cana, Dominican Republic, 7-11 November,

2021, pages 10585–10605. Association for Compu-

tational Linguistics.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019.

Massively multilingual transfer for NER. In Pro-

ceedings of the 57th Conference of the Association

for Computational Linguistics, ACL 2019, Florence,

Italy, July 28- August 2, 2019, Volume 1: Long Pa-

pers, pages 151–164.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and

Percy Liang. 2016. SQuAD: 100,000+ Questions

for Machine Comprehension of Text. In Proceed-

ings of the 2016 Conference on Empirical Meth-

ods in Natural Language Processing, EMNLP 2016,

Austin, Texas, USA, November 1-4, 2016, pages

2383–2392.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea

Vedaldi. 2017. Learning multiple visual domains

with residual adapters. In Advances in Neural Infor-

mation Processing Systems 30: Annual Conference

on Neural Information Processing Systems 2017, 4-

9 December 2017, Long Beach, CA, USA, pages

506–516.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea

Vedaldi. 2018. Efficient parametrization of multi-

domain deep neural networks. In 2018 IEEE Confer-

ence on Computer Vision and Pattern Recognition,

CVPR 2018, Salt Lake City, UT, USA, June 18-22,

2018, pages 8119–8127.

Andreas Rücklé, Gregor Geigle, Max Glockner,

Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna

Gurevych. 2021. Adapterdrop: On the efficiency

of adapters in transformers. In Proceedings of the

2021 Conference on Empirical Methods in Natural

Language Processing, EMNLP 2021, Virtual Event

/ Punta Cana, Dominican Republic, 7-11 November,

2021, pages 7930–7946. Association for Computa-

tional Linguistics.

Andreas Rücklé, Jonas Pfeiffer, and Iryna Gurevych.

2020.

Multicqa: Zero-shot transfer of self-

supervised text matching models on a massive scale.

In Proceedings of the 2020 Conference on Empirical

Methods in Natural Language Processing, EMNLP

2020, Online, November 16-20, 2020, pages 2471–

2486. Association for Computational Linguistics.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard.

2019. A survey of cross-lingual embedding models.

Journal of Artificial Intelligence Research, 65:569–

631.

3490Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian

Ruder, and Iryna Gurevych. 2021. How good is

your tokenizer? on the monolingual performance of

multilingual language models. In Proceedings of the

59th Annual Meeting of the Association for Compu-

tational Linguistics and the 11th International Joint

Conference on Natural Language Processing (Vol-

ume 1: Long Papers), pages 3118–3135, Online. As-

sociation for Computational Linguistics.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,

Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and

Jeff Dean. 2017. Outrageously large neural net-

works: The sparsely-gated mixture-of-experts layer.

In 5th International Conference on Learning Rep-

resentations, ICLR 2017, Toulon, France, April 24-

26, 2017, Conference Track Proceedings. OpenRe-

view.net.

Asa Cooper Stickland, Alexandre Berard, and Vassilina

Nikoulina. 2021. Multilingual domain adaptation

for NMT: decoupling language and domain informa-

tion with adapters. In Proceedings of the Sixth Con-

ference on Machine Translation, WMT@EMNLP

2021, Online Event, November 10-11, 2021, pages

578–598. Association for Computational Linguis-

tics.

Asa Cooper Stickland and Iain Murray. 2019. BERT

and pals: Projected attention layers for efficient

adaptation in multi-task learning. In Proceedings

of the 36th International Conference on Machine

Learning, ICML 2019, 9-15 June 2019, Long Beach,

California, USA, volume 97 of Proceedings of Ma-

chine Learning Research, pages 5986–5995. PMLR.

Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Prakash

Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin,

Simon Baumgartner, Cong Yu, and Donald Met-

zler. 2021. Charformer: Fast character transform-

ers via gradient-based subword tokenization. arXiv

preprint.

Ke M. Tran. 2020. From english to foreign languages:

Transferring pre-trained language models. arXiv

preprint.

Ahmet Üstün, Alexandre Berard, Laurent Besacier, and

Matthias Gallé. 2021. Multilingual unsupervised

neural machine translation with denoising adapters.

In Proceedings of the 2021 Conference on Empirical

Methods in Natural Language Processing, EMNLP

2021, Virtual Event / Punta Cana, Dominican Re-

public, 7-11 November, 2021, pages 6650–6662. As-

sociation for Computational Linguistics.

Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and

Gertjan van Noord. 2020. UDapter: Language adap-

tation for truly Universal Dependency parsing. In

Proceedings of the 2020 Conference on Empirical

Methods in Natural Language Processing (EMNLP),

pages 2302–2315, Online. Association for Computa-

tional Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob

Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz

Kaiser, and Illia Polosukhin. 2017. Attention Is All

You Need. In Advances in Neural Information Pro-

cessing Systems 30: Annual Conference on Neural

Information Processing Systems 2017, 4-9 Decem-

ber 2017, Long Beach, CA, USA, pages 5998–6008.

Giorgos Vernikos and Andrei Popescu-Belis. 2021.

Subword mapping and anchoring across languages.

In Findings of the Association for Computational

Linguistics: EMNLP 2021, pages 2633–2647, Punta

Cana, Dominican Republic. Association for Compu-

tational Linguistics.

Marko Vidoni, Ivan Vulić, and Goran Glavaš. 2020.

Orthogonal language and task adapters in zero-shot

cross-lingual transfer. In arXiv preprint.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei,

Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin

Jiang, and Ming Zhou. 2021a. K-adapter: Infusing

knowledge into pre-trained models with adapters. In

Findings of the Association for Computational Lin-

guistics: ACL/IJCNLP 2021, Online Event, August

1-6, 2021, volume ACL/IJCNLP 2021 of Findings

of ACL, pages 1405–1418. Association for Compu-

tational Linguistics.

Xinyi Wang, Yulia Tsvetkov, Sebastian Ruder, and Gra-

ham Neubig. 2021b. Efficient test time adapter en-

sembling for low-resource language varieties. In

Findings of the Association for Computational Lin-

guistics: EMNLP 2021, pages 730–737, Punta

Cana, Dominican Republic. Association for Compu-

tational Linguistics.

Zihan Wang, Karthikeyan K, Stephen Mayhew, and

Dan Roth. 2020. Extending multilingual BERT to

low-resource languages. In Findings of the Associ-

ation for Computational Linguistics: EMNLP 2020,

pages 2649–2656, Online. Association for Computa-

tional Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman.

2018. A broad-coverage challenge corpus for sen-

tence understanding through inference. In Proceed-

ings of the 2018 Conference of the North American

Chapter of the Association for Computational Lin-

guistics: Human Language Technologies, Volume

1 (Long Papers), pages 1112–1122, New Orleans,

Louisiana. Association for Computational Linguis-

tics.

Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettle-

moyer, and Veselin Stoyanov. 2020. Emerging cross-

lingual structure in pretrained language models. In

Proceedings of the 58th Conference of the Associa-

tion for Computational Linguistics, ACL 2020, Vir-

tual Conference, July 6-8, 2020, pages 6022–6034.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, be-

cas: The surprising cross-lingual effectiveness of

BERT. In Proceedings of the 2019 Conference on

Empirical Methods in Natural Language Processing

and the 9th International Joint Conference on Natu-

ral Language Processing (EMNLP-IJCNLP), pages

3491833–844, Hong Kong, China. Association for Com-

putational Linguistics.

Linting Xue, Aditya Barua, Noah Constant, Rami Al-

Rfou, Sharan Narang, Mihir Kale, Adam Roberts,

and Colin Raffel. 2022. Byt5: Towards a token-free

future with pre-trained byte-to-byte models. Trans-

actions of the Association for Computational Lin-

guistics 2022.

Intermediate checkpoints

Our results in §6.1 suggest that, when the number

of languages is small, X-M OD becomes more com-

petitive with SHARED as the number of training

steps increases. So as to understand if this behav-

ior also holds for models covering more languages,

we evaluate intermediate checkpoints for the 60-

LANG model on XNLI. As shown in Figure 8,

we find that the X-M OD model continuously out-

performs the SHARED model. This suggests that

the SHARED model immediately suffers from neg-

ative interference between languages, while the

added, language-specific components of the X-

M OD model are able to mitigate the curse of mul-

tilinguality, resulting in considerable performance

gains at all evaluated checkpoints.

shared

English

50k 100k 150k 200k

265k

Pre-Trained Langs

50k 100k 150k 200k

265k

Figure 8: Results on XNLI using intermediate check-

points of the models trained on 60 languages.

Additional results

We report MLQA and XQuAD results on pre-

trained languages in Tables 5 and 6, respectively,

and MLQA results on added languages in Table 7.

Table 8 report NER results on more languages.

Figures 9, 10 and 11 report per-language results

as we increase the amount of languages on lan-

guage modeling perplexity, XNLI and NER, re-

spectively.

X-Mod

Shijie Wu and Mark Dredze. 2020. Are all languages

created equal in multilingual BERT? In Proceedings

of the 5th Workshop on Representation Learning for

NLP, pages 120–130, Online. Association for Com-

putational Linguistics.

X-M OD

SHARED

F 1 / EM ar

F 1 / EM hi

F 1 / EM vi

F 1 / EM avg

F 1 / EM

80.1 / 66.9

79.6 / 66.5 58.6 / 38.9

53.6 / 33.9 60.7 / 42.4

58.7 / 40.4 67.5 / 46.1

64.9 / 43.8 66.7 / 48.6

64.2 / 46.2

Table 5: Average F 1 and Exact Match results for pre-

trained languages, on the test set of MLQA for the

X-M OD and SHARED model variants, pre-trained on

the set of 60 languages for 265k update steps. Bold

numbers indicate better performance for the respective

language.

X-M OD

SHARED

F 1 / EM ar

F 1 / EM hi

F 1 / EM ru

F 1 / EM th

F 1 / EM vi

F 1 / EM avg

F 1 / EM

85.1 / 73.4

83.8 / 72.1 68.1 / 52.4

64.6 / 48.5 67.5 / 50.3

65.8 / 48.3 75.0 / 57.8

72.7 / 54.5 66.3 / 52.6

63.0 / 48.0 74.9 / 54.6

72.6 / 52.1 72.8 / 56.9

70.4 / 53.9

Table 6: Average F 1 and Exact Match results for pre-

trained languages, on the test set of XQuAD for the

X-M OD and SHARED model variants, pre-trained on

the set of 60 languages for 265k update steps. Bold

numbers indicate better performance for the respective

language.

Language selection

We provide more details about our selection of

languages in Table 9.

X-M OD

SHARED

F 1 / EM es

F 1 / EM zh

F 1 / EM avg

F 1 / EM

63.8 / 48.9

58.9 / 44.1 68.8 / 50.3

66.7 / 48.3 61.7 / 36.4

56.5 / 32.2 64.8 / 45.2

60.7 / 41.5

Table 7: Average F 1 and Exact Match results for added

languages, on the test set of MLQA for the X-M OD

and SHARED model variants, pre-trained on the set of

60 languages for 265k update steps. Bold numbers in-

dicate better performance for the respective language.

3492X-M OD

SHARED

en af ar bn et eu fa fi fr hi hu id it ka ko ru sw ta th vi avg

81.4

81.5 78.9

74.1 43.5

44.2 63.2

62.4 76.2

70.7 62.2

58.1 44.3

40.3 78.6

74.4 77.2

74.7 70.1

64.4 78.3

74.2 50.5

51.5 78.7

75.5 67.3

61.5 53.0

46.0 59.1

58.3 73.4

57.2 51.1

52.5 2.8

4.0 66.2

63.7 62.8

59.5

Table 8: Average F 1 results for pre-trained languages, on the test set of NER for the X-M OD and SHARED model

variants, pre-trained on the set of 60 languages. Bold numbers indicate better performance for the respective

language.

4.4 5.5

4.2 5.0

4.0

# Languages

4.0

3.5

3.2

5.0

# Languages

3.0

# Languages

X-Mod

shared

3.4

3.5

2.8

5.5

4.5

3.2

2.9

5.0

3.4

3.0

# Languages

4.5

3.1

4.5

4.0

3.2

4.5

# Languages

Figure 9: Perplexity when training on more languages. Each model has seen the same amount of examples in

each language. Lower perplexity indicates better performance.

X-Mod

shared

84.0

83.5

83.0

71.0 77.5

70.5 77.0

70.0 76.5

69.5 76.0

75.5

30 40 50 60

number of languages

73.5

73.0

30 40 50 60

number of languages

30 40 50 60

number of languages

72.5

30 40 50 60

number of languages

(a) Pre-Trained Languages

X-Mod

77.5

shared

72.5

74.5 75.0

74.0 74.5

64.5

71.5 64.0

71.0 63.5

30 40 50 60

number of languages

63.0

78.0

77.5

77.0

65.0

72.0

78.5

75.5

75.0

76.5

76.0

75.5

77.0

70.5

76.0

30 40 50 60

number of languages

73.5

73.0

72.5

72.0

71.5

71.0

30 40 50 60

number of languages

(b) Added Languages

Figure 10: Testset results on XNLI of pre-trained (top) and added (bottom) languages trained on different numbers

of languages. Models trained on more languages are trained for longer → all models have seen the same amount

of examples in each individual language. Scores are averaged across all random seeds.

3493X-Mod

0.8175

shared

0.8150 0.46

0.8125 0.44

0.8100

0.40

0.8050

0.70

0.78

0.79

0.78

0.77

0.76

0.75

0.74

0.42

0.8075

0.74

0.54

0.68 0.50 0.66 0.52

0.66 0.48 0.64 0.50

0.64 0.46 0.62

0.67

0.46

0.65

0.64

0.525

0.70

0.60

0.66

0.48

0.62

0.76

0.48

0.520

0.58 0.65 0.515

0.56 0.60 0.510

0.54

0.050

0.045

0.040

0.035

0.030

30 40 50 60

number of languages

(a) Pre-Trained Languages

X-Mod

shared

0.77

0.76

0.72

0.700

0.20

0.50

0.70

0.68

0.15

0.45

0.66

30 40 50 60

number of languages

0.40

0.5

0.55

0.72

0.6

0.70

0.675

0.7

0.74

0.725

0.75

0.750

0.10

30 40 50 60

number of languages

30 40 50 60

number of languages

(b) Added Languages

Figure 11: Testset results on NER of pre-trained (top) and added (bottom) languages trained on different numbers

of languages. Models trained on more languages are trained for longer → all models have seen the same amount

of examples in each individual language. Scores are averaged across all random seeds.

3494Language iso Family

Afrikaans

Albanian

Amharic

Arabic

Armenian

Assamese

Basque

Belarusian

Bengali

Bosnian

Breton

Bulgarian

Catalan

Chinese

Croatian

Czech

Danish

Dutch

English

Estonian

Esperanto

Finnish

French

Frisian

Galician

Georgian

German

Greek

Gujarati

Hausa

Hebrew

Hindi

Hungarian

Icelandic

Indonesian

Irish

Italian

Japanese

Javanese

Kannada

Korean

Kurdish

Latin af

la IE:Germanic

IE:Albanian

Afro-Asiatic

IE:Armenian

IE:Iranian

Isolate

IE:Slavic

IE:Iranian

IE:Slavic

IE:Celtic

IE:Slavic

IE:Romance

Sino-Tibetan

IE:Slavic

IE:Germanic

Uralic

Constructed

Uralic

IE:Romance

IE:Germanic

IE:Romance

Kartvelian

IE:Germanic

IE:Hellenic

IE:Iranian

Afro-Asiatic

IE:Iranian

Uralic

IE:Germanic

Austronesian

IE:Celtic

IE:Romance

Japonic

Austronesian

Dravidian

Koreanic

IE:Iranian

IE:Romance

Script

Latin

Amharic

Arabic

Armenian

Assamese

Latin

Cyrillic

Bengali

Latin

Cyrillic

Latin

Chinese

Latin

Georgian

Latin

Greek

Gujarati

Latin

Hebrew

Devanagari

Latin

Japanese

Latin

Kannada

Korean

Latin

X,(+)

30 60 75 Language iso Family

X,(+)

X X

X,(+)

X X

X Latvian

Lithuanian

Macedonian

Malagasy

Malay

Malayalam

Marathi

Mongolian

Nepali

Norwegian

Oriya

Oromo

Pashto

Persian

Polish

Portuguese

Punjabi

Romanian

Russian

Sanskrit

Scottish Gaelic

Serbian

Sindhi

Sinhala

Slovak

Slovenian

Somali

Spanish

Sundanese

Swahili

Swedish

Tagalog

Tamil

Telugu

Thai

Turkish

Ukrainian

Urdu

Vietnamese

Welsh

Xhosa

Yiddish lv

yi IE:Slavic

IE:Slavic

Austronesian

Dravidian

IE:Iranian

Mongolian

IE:Iranian

IE:Germanic

IE:Iranian

Afro-Asiatic

IE:Iranian

IE:Slavic

IE:Romance

IE:Iranian

IE:Romance

IE:Slavic

IE:Iranian

IE:Germanic

IE:Slavic

IE:Iranian

IE:Slavic

Afro-Asiatic

IE:Romance

Austronesian

Niger-Congo

IE:Germanic

Austronesian

Dravidian

Kra-Dai

Turkic

IE:Slavic

IE:Iranian

Austroasiatic

IE:Celtic

Niger-Congo

IE:Germanic

X X

+ + + +

X X,(+) X,(+) X

X,(+) X

X,(+) +

X,(+)

+,(X)

+,(X) X

+,(X)

X,(+) +,(X)

X,(+)

X X

+,(X) X

+,(X)

X,(+) X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

Script

Latin

Cyrillic

Latin

Malayalam

Devanagari

Cyrillic

Devanagari

Latin

Odia

Ge’ez

Arabic

Latin

Gurmukhi

Latin

Cyrillic

Devanagari

Latin

Cyrillic

Arabic

Sinhala

Latin

Tamil

Telugu

Thai

Latin

Cyrillic

Arabic

Latin

Hebrew

30 60 75

X X

X X X

X X

X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

+,(X) +,(X)

X X X

X,(+)

+,(X)

X,(+) X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

+,(X)

X,(+)

Table 9: List of languages we pre-train Xon or add + in the different sets (13, 30, 60, 75). (·) indicates the

respectively different pre-training/added languages of models 1 and 2 as described in §5.2 and Table 4. IE stands

for Indo-European.

3495