Summary: Length Generalization in Arithmetic Transformers (arxiv.org)
9,366 words - PDF document
One Line
Transformers struggle with integer arithmetic and with generalizing to longer sequences; relative position embeddings enable length generalization for addition, and train set priming extends it to multiplication.
Key Points
- Transformers struggle with simple tasks like integer arithmetic and generalizing to longer sequences.
- Relative position embeddings enable length generalization for addition tasks but fail for multiplication.
- Train set priming, by adding long sequences to the training set, improves models' ability to generalize to larger multiplication examples.
- The number of priming examples required scales logarithmically with the number of training examples and linearly with the extrapolation length.
- Relative position embeddings perform better than absolute position embeddings in terms of length generalization.
- Priming the train set with 35-digit numbers allows extrapolation to 35-digit operands, while priming on numbers of every length from 6 to 35 digits enables extrapolation to all lengths up to 35.
- Model failures in addition tasks are observed in cases involving three or more carries and two consecutive carries.
- Open questions for future research include extending priming to other mathematical problems and exploring priming in natural language processing tasks.
Summaries
122 word summary
Transformers struggle with learning arithmetic and generalizing to longer sequences. Relative position embeddings aid length generalization for addition but not multiplication. Train set priming, adding long sequences to training sets, enables models trained on 5-digit x 3-digit multiplications to generalize to 35 x 3 examples. The paper explores potential applications of priming beyond arithmetic. Previous works focused on in-distribution settings and failed in out-of-distribution experiments. Relative position embeddings improve generalization in natural language processing but have not been extensively studied for arithmetic tasks. The authors propose train set priming to enable length generalization in multiplication tasks. Relative position embeddings outperform absolute position embeddings. The document concludes with open questions for future research and additional experiments on modular arithmetic and element-wise addition tasks.
232 word summary
This paper examines the challenges faced by transformers when learning arithmetic and generalizing to longer sequences. The authors find that relative position embeddings enable length generalization for addition tasks, but not for multiplication. To address this, they propose train set priming, which involves adding long sequences to the training set. Models trained on 5-digit x 3-digit multiplications can then generalize to 35 x 3 examples. The authors also discuss potential applications of priming beyond arithmetic.
Previous work on learning arithmetic with transformers has focused on in-distribution settings; out-of-distribution experiments and extrapolation to longer operands have proven disappointing.
While relative position embeddings have been shown to improve generalization in natural language processing, their impact on arithmetic tasks has been little studied. The experiments in this paper show that models using relative position embeddings can generalize to longer sequences for addition tasks, achieving high accuracy on 10-digit numbers. However, they do not enable length generalization for multiplication tasks.
To address this, the authors propose train set priming, which allows models to generalize to larger multiplication examples by adding a small number of long sequences to the training set. The results also suggest that relative position embeddings perform better than absolute position embeddings in terms of length generalization.
The document concludes by posing open questions for future research and presents additional experiments on modular arithmetic and element-wise addition tasks to compare the performance of different models and methods.
299 word summary
This paper explores the challenges faced by transformers in learning arithmetic and generalizing to longer sequences. The authors find that relative position embeddings enable length generalization for addition tasks, but not for multiplication. To address this, they propose train set priming, which involves adding long sequences to the training set. Models trained on 5-digit x 3-digit multiplications can then generalize to 35 x 3 examples. The authors also discuss potential applications of priming beyond arithmetic.
The paper discusses the struggles of transformers with simple tasks like arithmetic and the need for models to extrapolate small-number arithmetic to larger integers. Previous works have focused on in-distribution settings, but out-of-distribution experiments and extrapolation have proven disappointing.
The authors explain that absolute position embeddings hinder generalization in transformers, while relative position embeddings have been shown to improve generalization in natural language processing. However, their impact on arithmetic tasks has been little studied.
The experiments show that models using relative position embeddings can generalize to longer sequences for addition tasks, achieving high accuracy on 10-digit numbers. However, they do not enable length generalization for multiplication tasks.
To address this failure for multiplication, the authors propose train set priming. By adding a small number of long sequences to the training set, models can generalize to larger multiplication examples. The number of priming examples required scales logarithmically with the number of training examples.
The results also suggest that relative position embeddings perform better than absolute position embeddings in terms of length generalization. RPE models seem to learn all digits simultaneously, while APE models learn each position independently.
The document concludes by posing open questions for future research and presents additional experiments on modular arithmetic and element-wise addition tasks. The results show the performance of different models and methods in these tasks.
999 word summary
This paper explores the challenges faced by transformers in learning basic integer arithmetic and generalizing to longer sequences. The authors find that relative position embeddings enable length generalization for addition tasks, allowing models trained on 5-digit numbers to perform 15-digit sums. However, this method fails for multiplication. To address this, the authors propose train set priming, which involves adding a small number of long sequences to the training set. They demonstrate that models trained on 5-digit x 3-digit multiplications can generalize to 35 x 3 examples with the use of priming. The authors also show that the priming sample size scales with the logarithm of the training set size and discuss potential applications of priming beyond arithmetic.
The paper begins by discussing the success of transformers in various domains but notes their struggle with simple tasks like integer arithmetic. The absence of large numbers in the training data limits the mathematical ability of large language models, and the authors propose that models must be able to extrapolate small number arithmetic to larger integers. Previous works on learning arithmetic with transformers have focused on in-distribution settings, where the training and test sets are drawn from the same distribution. However, out-of-distribution experiments and extrapolation to larger numbers have proven disappointing.
The authors then discuss the concept of length generalization in transformers, which has been widely studied. They explain that absolute position embeddings (APEs), used in many implementations, hinder generalization because they mix the representation of a token with the embedding of its position in the sequence. Several papers have proposed using relative position embeddings (RPEs) or weighted attention schemes to improve generalization in natural language processing (NLP), but their impact on arithmetic tasks has been little studied.
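To make the APE/RPE distinction concrete, here is a minimal sketch of a relative position bias in the style of Shaw et al. and T5, where attention logits receive a learned bias that depends only on the (clipped) distance between query and key. This is illustrative only and not necessarily the exact RPE variant used in the paper.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned bias per head and per clipped relative distance, added to
    attention logits instead of mixing position into the token embeddings."""

    def __init__(self, num_heads: int, max_distance: int = 16):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        # relative distance key_pos - query_pos, clipped so offsets never seen
        # during training reuse the bias of the largest distance that was seen
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        rel = rel + self.max_distance               # shift to [0, 2*max_distance]
        return self.bias(rel).permute(2, 0, 1)      # (heads, seq, seq), added to attention scores
```

Because distances beyond `max_distance` are clipped to the same bucket, positions outside the training range still map to a learned bias, which is one intuition for why RPEs can extend to longer sequences.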
In their experiments, the authors train models on 5-digit operations and investigate their ability to generalize to numbers with up to 20 digits for addition and 35 digits for multiplication. They find that models using relative position embeddings can generalize to longer sequences for addition tasks, achieving high accuracy on 10-digit numbers and reasonable accuracy on 15-digit numbers. However, the use of relative position embeddings does not enable length generalization for multiplication tasks.
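As a concrete illustration of the task format, the sketch below encodes an addition problem digit by digit; the vocabulary and output convention here are assumptions for illustration, not the paper's exact scheme.

```python
def encode_addition(a: int, b: int) -> tuple[list[str], list[str]]:
    """Digit-level encoding of an addition problem (illustrative token scheme)."""
    src = list(str(a)) + ["+"] + list(str(b))
    tgt = list(str(a + b))
    return src, tgt

# encode_addition(12345, 678) -> (['1','2','3','4','5','+','6','7','8'], ['1','3','0','2','3'])
```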
To address the failure of relative position embeddings for multiplication, the authors propose train set priming. By adding a small number of long sequences to the training set, models trained on 5-digit x 3-digit multiplications can generalize to 35 x 3 examples. The authors show that the priming sample size scales as the logarithm of the training set size.
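A minimal sketch of train set priming under these assumptions: a large set of in-distribution 5x3 multiplications plus a handful of long 35x3 examples mixed into the same training set. The function names are hypothetical, and the default of fifty priming examples follows the count reported later; the sampling details are illustrative.

```python
import random

def sample_mult_example(n_digits_a: int, n_digits_b: int) -> tuple[int, int, int]:
    """Draw one multiplication example with operands of the given digit lengths."""
    a = random.randint(10 ** (n_digits_a - 1), 10 ** n_digits_a - 1)
    b = random.randint(10 ** (n_digits_b - 1), 10 ** n_digits_b - 1)
    return a, b, a * b

def primed_train_set(n_train: int = 5000, n_prime: int = 50):
    """In-distribution 5x3 examples plus a few long 35x3 'priming' examples."""
    data = [sample_mult_example(5, 3) for _ in range(n_train)]
    data += [sample_mult_example(35, 3) for _ in range(n_prime)]
    random.shuffle(data)
    return data
```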
The document then discusses length generalization in arithmetic transformers in more detail, comparing fine-tuning and train set priming as ways to improve accuracy on long multiplication tasks. The experiments use a standard UTransformer with the hyperparameters specified in the paper.
The results show that training the model on 5x3 multiplications and then fine-tuning it on 35x3 multiplications improves accuracy on the longer problems. Priming the training set with fifty 35x3 examples also leads to high accuracy. The number of priming examples required scales logarithmically with the number of training examples and linearly with the extrapolation length.
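Expressed as a rough rule of thumb; the constants and the way the two reported trends combine are assumptions of this sketch, not values from the paper.

```python
import math

def priming_examples_needed(n_train: int, extrapolation_len: int,
                            a: float = 1.0, b: float = 1.0) -> float:
    """Rule of thumb from the reported trends: logarithmic in the number of
    training examples, linear in the extrapolation length. The constants a, b
    and the additive combination are illustrative assumptions."""
    return a * math.log(n_train) + b * extrapolation_len
```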
Curriculum priming, in which the priming examples are split across several lengths, is generally ineffective, except when priming on a mixture of 34- and 35-digit numbers.
Priming the train set with 35-digit numbers allows extrapolation to 35-digit operands but not to other lengths; priming on every length from 6 to 35 digits enables extrapolation to all lengths up to 35.
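A sketch of the mixed-length variant, assuming priming operands are drawn uniformly over every length from 6 to 35 digits (the exact sampling scheme is not specified in this excerpt).

```python
import random

def mixed_length_priming(n_prime: int, min_digits: int = 6, max_digits: int = 35):
    """Priming examples spread over every first-operand length from 6 to 35
    digits, paired with a 3-digit second operand. Sketch only."""
    examples = []
    for _ in range(n_prime):
        n = random.randint(min_digits, max_digits)
        a = random.randint(10 ** (n - 1), 10 ** n - 1)
        b = random.randint(100, 999)
        examples.append((a, b, a * b))
    return examples
```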
The results also suggest that relative position embeddings (RPEs) perform better than absolute position embeddings (APEs) in terms of length generalization. RPE models seem to learn all digits simultaneously, while APE models learn each position independently.
The document presents an analysis of failure cases in addition tasks, focusing on the role of carries and incorrect digits. It is observed that model failures mostly occur in additions involving at least three carries and two consecutive carries. Errors concentrate on the first and second positions of the sums.
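The failure analysis hinges on counting carries; a small helper like the one below reproduces the two statistics mentioned (total carries and longest run of consecutive carries), though the paper's exact bookkeeping may differ.

```python
def carry_profile(a: int, b: int) -> tuple[int, int]:
    """Total carries and longest run of consecutive carries when adding a and b."""
    total, run, longest, carry = 0, 0, 0, 0
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        carry = s // 10
        if carry:
            total += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
        a //= 10
        b //= 10
    return total, longest

# carry_profile(999, 1) -> (3, 3): three carries, all consecutive
```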
Train set priming is effective not only for RPE models on multiplication but also for APE models, although APE models require a larger priming rate than RPE models.
The document concludes by posing open questions for future research, including the extension of priming to other mathematical problems, investigating compositionality limits, understanding the theoretical aspects of priming, and exploring priming in natural language processing tasks.
Additional experiments on modular arithmetic are presented, showing the results of APE and RPE models on digitwise addition and multiplication tasks. APE models can be primed to length generalize in multiplication, but with a larger priming rate. Plots are provided to illustrate the digit order by which RPE and APE models make correct predictions.
Table 4 reports extrapolation results for modular addition with the UTransformer in both Base and Large configurations, giving the accuracy reached on 100,000-example test sets. The model manages to extrapolate when the modulus is a power of 10, but fails when the modulus is 128 or 101. Table 5 reports similar results for modular multiplication: again the model extrapolates when the modulus is a power of 10 but fails for 128 and 101. The difference in performance may be related to 101 being a prime number and 128 being a power of 2.
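The task itself is simple to state; the assert below also illustrates the arithmetic fact that with a power-of-10 modulus the answer depends only on the operands' trailing digits, whereas with 101 or 128 every digit matters.

```python
def modular_add(a: int, b: int, modulus: int) -> int:
    # Target of the modular addition task: the sum reduced by the modulus.
    return (a + b) % modulus

# With a power-of-10 modulus the result depends only on the trailing digits:
assert modular_add(123456789, 987654321, 100) == modular_add(89, 21, 100) == 10
# With modulus 101 (prime) or 128 (power of 2) every digit of both operands matters.
```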
Table 6 reports the extrapolation results for element-wise addition using the UTransformer model with different position embedding methods. The RPE models manage to length generalize, while the APE models fail. Figure 8 shows the digitwise accuracy of the APE model on element-wise addition, indicating that the model performs well on the leftmost digits seen during training but fails on the rightmost ones.
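Assuming "element-wise addition" means adding corresponding digits independently with no carry propagation (the excerpt does not define the task precisely), a minimal version looks like this:

```python
def elementwise_add(a_digits: list[int], b_digits: list[int]) -> list[int]:
    """Assumed definition: corresponding digits added modulo 10, no carries."""
    return [(x + y) % 10 for x, y in zip(a_digits, b_digits)]

# elementwise_add([9, 5, 7], [3, 4, 8]) -> [2, 9, 5]
```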
Additional experiments on priming for multiplication are presented in Figure 9. Priming the model with a small percentage of the training set leads to successful extrapolation to larger multiplication tasks. Figure 10 shows the digitwise accuracy on the training examples, while Figure 11 shows the digitwise prediction on the test examples.
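The digitwise accuracy plotted in those figures can be computed per output position as in this sketch (the tensor shapes are assumptions):

```python
import torch

def digitwise_accuracy(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Fraction of correct predictions at each digit position, for batches of
    predicted and target digit sequences of shape (batch, n_digits)."""
    return (pred == target).float().mean(dim=0)
```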