Summary: Scaling Multilingual Corpora and Language Models (arxiv.org)
22,987 words - PDF document
One Line
The authors propose scaling Large Language Models (LLMs) horizontally to a large number of predominantly low-resource languages and demonstrate this through the creation of Glot500-m, while also examining cross-lingual transfer and coverage of diverse languages and scripts.
Key Points
- The NLP community has focused on scaling Large Language Models (LLMs) vertically for high-resource languages.
- This paper proposes scaling LLMs horizontally to a large number of predominantly low-resource languages with Glot500-m.
- Glot500-m is a multilingual model trained on a 600GB corpus covering over 500 diverse languages.
- Glot500-m outperforms XLM-R-B on a range of tasks for both head and tail language-scripts, except for POS tagging on head language-scripts.
- Glot500-m performs better for languages it was pretrained on, but can also improve performance for languages not covered by XLM-R if enough data is collected.
Summaries
31 word summary
The NLP community has focused on scaling Large Language Models (LLMs) vertically, but the authors propose scaling horizontally to low-resource languages. They create Glot500-m and study cross-lingual transfer and language coverage.
77 word summary
The NLP community has primarily focused on scaling Large Language Models (LLMs) vertically for high-resource languages. However, the authors propose scaling LLMs horizontally to a large number of predominantly low-resource languages. They create Glot500-m, a multilingual model trained on a 600GB corpus covering over 500 diverse languages. The excerpt also references research papers and conference proceedings on multilingual language models and natural language processing, covering topics such as transfer learning, benchmarking dialectal Arabic-English machine translation, and masked language model scoring.
925 word summary
The NLP community has primarily focused on scaling Large Language Models (LLMs) vertically for high-resource languages. In this paper, the authors propose scaling LLMs horizontally to a large number of predominantly low-resource languages. They create Glot500-m.
The curse of multilinguality has been studied for high-resource languages, but Glot500-m allows for investigation in a more realistic setting. Glot500-m is a multilingual model trained on a 600GB corpus covering over 500 diverse languages.
The article discusses the scaling of multilingual corpora and language models. It mentions that some languages are written in multiple scripts, and each language-script is treated as a separate entity. A 3-gram character-level language model is trained for each language-script (a minimal sketch follows).
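The excerpt gives no implementation details for these per-language-script models; the following is a minimal sketch of a character-level 3-gram language model with add-one smoothing and a perplexity function over raw text. The class and smoothing choice are illustrative assumptions, not the paper's actual tooling.

```python
import math
from collections import Counter


class CharTrigramLM:
    """Character-level 3-gram language model with add-one (Laplace) smoothing."""

    def __init__(self, text: str):
        self.vocab = set(text) | {"$"}  # "$" pads the start of the string
        self.trigrams = Counter()
        self.bigrams = Counter()
        padded = "$$" + text
        for i in range(2, len(padded)):
            self.trigrams[padded[i - 2:i + 1]] += 1
            self.bigrams[padded[i - 2:i]] += 1

    def logprob(self, history: str, char: str) -> float:
        """log P(char | two preceding characters); unseen trigrams get the add-one floor."""
        num = self.trigrams[history + char] + 1
        den = self.bigrams[history] + len(self.vocab)
        return math.log(num / den)

    def perplexity(self, text: str) -> float:
        """Per-character perplexity of `text` under this model."""
        padded = "$$" + text
        total = sum(self.logprob(padded[i - 2:i], padded[i]) for i in range(2, len(padded)))
        return math.exp(-total / len(text))
```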
We merge tokens with XLM-R's vocabulary, adding 100K new tokens. The probabilities of genuinely new tokens are taken from SentencePiece. The new tokenizer changes 0.2% to 50% of tokens in head languages, but this does not prevent Glot500-m from performing well on head language-scripts, as the task comparison below shows.
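The excerpt does not spell out the merging procedure; below is a rough sketch of how pieces from a freshly trained SentencePiece model could be appended to XLM-R's existing SentencePiece model, copying each genuinely new piece's score from the new model. The function name and file paths are placeholders, and the paper's exact procedure may differ.

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2


def merge_vocabularies(base_path: str, new_path: str, out_path: str) -> int:
    """Append pieces that exist only in the model at `new_path` to the model at `base_path`."""
    base, new = sp_pb2.ModelProto(), sp_pb2.ModelProto()
    base.ParseFromString(open(base_path, "rb").read())
    new.ParseFromString(open(new_path, "rb").read())

    existing = {p.piece for p in base.pieces}
    added = 0
    for p in new.pieces:
        if p.piece not in existing:
            piece = base.pieces.add()
            piece.piece = p.piece
            piece.score = p.score  # log probability copied from the new SentencePiece model
            added += 1

    with open(out_path, "wb") as f:
        f.write(base.SerializeToString())
    return added


# Placeholder paths; in practice these would be XLM-R's model and a newly trained one.
# merge_vocabularies("xlm_r.model", "glot500_new.model", "merged.model")
```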
We compare Glot500-m and XLM-R-B on various tasks. Glot500-m supports 354 language-scripts and outperforms XLM-R-B on all tasks for both head and tail language-scripts, except for POS tagging on head language-scripts.
Glot500-m outperforms XLM-R-B in terms of pseudoperplexity, particularly for tail language-scripts. The training progress of Glot500-m shows rapid improvement at the beginning but slows down later, especially for tail languages. A sketch of the pseudoperplexity computation follows.
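Pseudoperplexity here corresponds to masked language model scoring (one of the works referenced later in the document): each token is masked in turn and the model's probability of the true token is averaged. The sketch below uses the Hugging Face transformers API with xlm-roberta-base as a stand-in, since the excerpt does not name a Glot500-m checkpoint.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# xlm-roberta-base is a stand-in; the excerpt does not give a Glot500-m checkpoint name.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()


def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and average the negative log-likelihood of the true token."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        # Positions 0 and -1 hold the <s> and </s> special tokens; skip them.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))


print(pseudo_perplexity("Glot500-m covers more than 500 languages."))
```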
Glot500-m performs better for languages it was pretrained on, but can also improve performance for languages not covered by XLM-R if enough data is collected. The difference in coverage between Glot500-m and XLM-R is partially predictive of performance.
Referenced works include Composable sparse fine-tuning for cross-lingual transfer; Massively multilingual sentence embeddings for zero-shot cross-lingual transfer; Empirical models for an Indic language continuum; and ParaCrawl: Web-scale acquisition of parallel corpora.
This part of the document lists references from various papers and conferences related to scaling multilingual corpora and language models, including papers on cross-lingual language model pre-training and on investigating language relationships in multilingual sentence encoders.
Other referenced resources include Mapping languages: the corpus of global language use; Ethnologue: Languages of the world; How to adapt pretrained multilingual models to 1600 languages; Habibi, a multi-dialect multi-national Arabic song lyrics corpus; and work on Arabic dialect identification.
Further references include the Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages (Dublin, Ireland); Many-to-English machine translation tools, data, and pretrained models; and XL-Sum.
Taku Kudo and John Richardson presented SentencePiece, a subword tokenizer for neural text processing. Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya introduced the IIT Bombay English-Hindi parallel corpus.
This excerpt includes references to various research papers and conference proceedings related to multilingual language models and natural language processing. The mentioned works explore topics such as transfer learning, benchmarking dialectal Arabic-English machine translation, masked language model scoring, and parallel sentence mining.
Perplexity is used to measure how well a language model predicts test data. The divergence between two languages is computed from the perplexity values in both directions, taking the maximum (written out below). The study evaluates the proposed approach using language family trees as a baseline.
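Written out, with PPL_l(D) denoting the perplexity of the character 3-gram model of language-script l on held-out data D (the notation is ours, based only on the description above):

```latex
\mathrm{PPL}_{l}(D) = \exp\!\Big(-\frac{1}{|D|}\sum_{i=1}^{|D|}\log P_{l}(c_i \mid c_{i-2}, c_{i-1})\Big),
\qquad
\mathrm{div}(l_1, l_2) = \max\big(\mathrm{PPL}_{l_1}(D_{l_2}),\; \mathrm{PPL}_{l_2}(D_{l_1})\big)
```

Taking the maximum of the two directions makes the divergence symmetric, so closely related language-scripts receive a low score only if each model predicts the other's text well.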
The document discusses scaling multilingual corpora and language models. It mentions various tools and resources used in the study, including the head language set and various language models. Detailed results for the different tasks and languages are reported in tables, including perplexity numbers for all languages.
The following excerpts consist of lists of language-script pairs and numerical values. The pairs represent different languages written in different scripts, and the values are accuracy scores for three models (XLM-R-B, XLM-R-L, and Glot500-m) on a sentence retrieval task, reported per language-script.
Table 17 shows the F1 scores of XLM-R-B, XLM-R-L, and Glot500-m on NER. The scores are listed for various language-scripts, such as ori-Orya, oss-Cyrl, and pan-Guru.
The excerpt consists of a long list of language-script pairs and their corresponding F1 scores for the XLM-R-B, XLM-R-L, and Glot500-m models on text classification. The list covers various scripts, such as Latin and Cyrillic, organized as a table with one row per language-script.
The excerpt includes a long list of numerical values and language-script pairs. The accuracy of different language models (XLM-R-B, XLM-R-L, and Glot500-m) in round trip alignment is provided for each language-script pair.
The excerpt presents a table showing the accuracy of XLM-R-B, XLM-R-L, and Glot500-m on Round Trip Alignment. The table includes language-script pairs and corresponding accuracy scores. The language-script pairs are listed in the first column.
The document provides a table showing perplexity values for various languages covered by Glot500-m. The table includes language-script pairs, as well as perplexity scores for two language models (XLM-R-B and XLM-R-L) and the Glot500-m model.
Perplexity scores for various languages covered by Glot500-m are provided in Tables 24 and 25. The tables include language-script pairs and perplexity scores for the XLM-R-B, XLM-R-L, and Glot500-m models.