Summary: "Cabrita: Closing the Gap for Foreign Languages" (arxiv.org)
4,751 words - PDF document
One Line
Cabrita is a methodology that improves pre-trained models for foreign languages by introducing a more efficient tokenizer.
Key Points
- Cabrita is a methodology that addresses the limitations of pre-trained models in foreign languages by introducing a new tokenizer.
- Adapting a Large Language Model to a new language presents challenges stemming from tokenizer behavior.
- The study utilized a TPU v3-8 for training and performed 128 accumulation steps to achieve the target batch size.
- The Cabrita approach offers comparable performance to conventional continued pre-training and enhanced inference efficiency.
- openCabrita3B consistently outperforms GPT-J.
- Employing larger-scale models could yield promising results for foreign language processing.
- The document discusses various language models and tokenizers used for foreign languages, particularly focusing on Portuguese.
Summaries
17 word summary
Cabrita is a methodology that improves pre-trained models for foreign languages by introducing a more efficient tokenizer.
43 word summary
Cabrita is a methodology that aims to address the limitations of pre-trained models in foreign languages by introducing a new tokenizer. The default tokenizer for the Portuguese language in the OpenLLaMA model is overly verbose, resulting in the division of text into small parts.
236 word summary
Cabrita is a methodology that aims to address the limitations of pre-trained models in foreign languages. The main challenge is the high cost associated with training models from scratch. To overcome this, Cabrita relies on available pre-trained models but introduces a new, more efficient tokenizer.
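A minimal sketch of the tokenizer-swap idea, assuming a SentencePiece workflow like the one LLaMA-family models use. The corpus file, vocabulary size, and checkpoint name are illustrative stand-ins, not the paper's exact settings.

```python
# Sketch: train a Portuguese SentencePiece tokenizer and attach it to a
# pre-trained model. File names and hyperparameters are illustrative.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, LlamaTokenizer

# 1) Train a new tokenizer on a Portuguese corpus (hypothetical file).
spm.SentencePieceTrainer.train(
    input="portuguese_corpus.txt",  # assumed corpus file
    model_prefix="pt_tokenizer",
    vocab_size=32000,               # illustrative, not the paper's figure
    model_type="bpe",
)

# 2) Load the new tokenizer and a pre-trained base model.
tokenizer = LlamaTokenizer(vocab_file="pt_tokenizer.model")
model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")

# 3) Resize the embeddings to the new vocabulary, then continue
#    pre-training on target-language text (training loop not shown).
model.resize_token_embeddings(len(tokenizer))
```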
Adapting a Large Language Model to a new language presents challenges with tokenizer behavior. The default tokenizer for the Portuguese language in the OpenLLaMA model is overly verbose for non-English examples, resulting in the division of text into small parts.
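This verbosity is easy to measure: tokenize a Portuguese sentence and count tokens per word. A small check using the public OpenLLaMA checkpoint (the model name is an assumption; the sentence is arbitrary):

```python
# Rough tokenizer-verbosity check: tokens per word for Portuguese text
# under the original OpenLLaMA tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

text = "A raposa marrom rápida pula sobre o cachorro preguiçoso."
ids = tok(text)["input_ids"]
words = len(text.split())

print(f"{len(ids)} tokens for {words} words ({len(ids) / words:.2f} tokens/word)")
```

A verbose tokenizer inflates this ratio, which both shortens the effective context window and slows generation.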
The study utilized a TPU v3-8 for training, with batches of 16 sequences of 2048 tokens each. 128 gradient-accumulation steps were performed to reach the target effective batch of 2048 samples.
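The batch arithmetic is 16 samples per step times 128 accumulation steps, giving the 2048-sample effective batch. An illustrative gradient-accumulation loop in PyTorch, with toy stand-ins for the model and data since the paper's TPU training code is not reproduced here:

```python
# Toy gradient-accumulation loop: per-step batches of 16, accumulated
# over 128 steps, give an effective batch of 16 * 128 = 2048 samples.
import torch
from torch import nn

model = nn.Linear(8, 1)  # stand-in for the 3B-parameter model
data = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(256)]

ACCUM_STEPS = 128  # accumulation steps from the summary
optimizer = torch.optim.AdamW(model.parameters())
loss_fn = nn.MSELoss()

for step, (x, y) in enumerate(data):           # each x: a batch of 16
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average
    loss.backward()                            # gradients accumulate in-place
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # one update per 2048 samples
        optimizer.zero_grad()
```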
The Cabrita approach, which involves adapting the tokenizer, offers a performance level comparable to conventional continued pre-training, with the added benefit of enhanced inference efficiency. The performance of openCabrita3B is satisfactory, consistently outperforming GPT-J.
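The efficiency gain follows from token counts alone: autoregressive decoding runs once per generated token, so a tokenizer that emits fewer tokens for the same text cuts decode steps roughly in proportion. A back-of-envelope sketch with assumed counts:

```python
# Back-of-envelope decode-cost comparison; both counts are hypothetical.
tokens_default = 58  # same sentence under the verbose default tokenizer
tokens_adapted = 31  # under the adapted Portuguese tokenizer

print(f"~{tokens_default / tokens_adapted:.1f}x fewer decode steps")
```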
The authors express their conviction that employing larger-scale models could yield promising results for foreign language processing, citing a successful experiment with Chinese language models as a basis for this line of thinking.
The document discusses various language models and tokenizers used for foreign languages, particularly Portuguese. It mentions models such as GPT-2, MPT, Falcon, OpenLLaMA, and BERTaú, along with their respective vocabulary sizes.