Investigating the Limitations of Transformers with Simple Arithmetic Tasks
Rodrigo Nogueira, Zhiying Jiang & Jimmy Lin
David R. Cheriton School of Computer Science, University of Waterloo

ABSTRACT
The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values. We find that how a number is represented in its surface form has a strong influence on the model's accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., "32"), and it struggles to learn with character-level representations (e.g., "3 2"). By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation. This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot learn addition rules that are independent of the length of the numbers seen during training. Code to reproduce our experiments is available at https://github.com/castorini/transformers-arithmetic
1 INTRODUCTION
Abstraction and composition are two important themes in the study of human languages, made possible by different linguistic representations. Although treatments in different linguistic traditions vary, representations at the lexical, syntactic, and semantic levels are a common feature in nearly all theoretical studies of human language, and until relatively recently, these representations were explicitly "materialized" in language processing pipelines (for example, semantic role labeling takes as input a syntactic parse).

However, with the advent of pretrained transformer models, these intermediate representations no longer have any explicit "reality": while various studies have found evidence of syntactic and semantic knowledge in these models (Tenney et al., 2019), it is no longer possible to isolate, for example, a subject–verb relation in a specific part of the model. With transformers, the only input to the model is the surface form of text combined with supplemental embeddings (e.g., positional embeddings, and in the case of BERT, segment embeddings).

What are the consequences of this exclusive focus on the surface form of text? Some might say, nothing, as bigger models, better pretraining objectives, etc. will lead us to models that are capable of reasoning (Brown et al., 2020). We believe this to be an untenable position and present a case study in simple arithmetic tasks where having the right representation is the difference between a nearly-impossible-to-learn task and an easy-to-learn task. Our work shows that it is possible to "inject" representations into transformer models by simple manipulations of the input sequence (in our case, explicitly enumerating the semantics of digit positions), and that doing so makes it possible for off-the-shelf models to easily perform simple arithmetic, whereas it is nearly impossible without our manipulations.

While we present only a case study, our findings have broader implications for various language analysis tasks. First, although the end-to-end training enabled by neural networks is a powerful tool, having the right representation is also crucial. Second, we demonstrate a simple way in which representations can be "injected" into transformer models in a completely transparent manner, without any need to re-pretrain. This work points out a path that might allow us to combine the best of both worlds: leveraging the power of pretraining, with additional guidance from our understanding of the problem domain.

However, we find that even explicit semantic representations have their limits. Despite our best efforts, we find that models cannot extrapolate, i.e., they fail to perform simple arithmetic when evaluated on inputs whose length distribution differs from the one seen during training. This appears to be a problem that neither larger models, more compute, nor more data can solve.

2 RELATED WORK
Recent studies have explored the numerical capabilities learned by neural networks trained on large amounts of text (Talmor et al., 2019; Jiang et al., 2019; Naik et al., 2019; Wallace et al., 2019; Lin et al., 2020; Johnson et al., 2020; Mishra et al., 2020). A common finding is that the learned embeddings capture magnitude (e.g., 2 < 3), but many models fail to capture numeracy (e.g., two = 2) (Naik et al., 2019; Wallace et al., 2019; Ren & Du, 2020; Zhang et al., 2020). Character-level models such as ELMo (Peters et al., 2018) have stronger numeracy than subword models such as BERT (Devlin et al., 2019), perhaps because two numbers that are similar in value can have very different subword tokenizations (Wallace et al., 2019). Our work shows that characters are adequate representations for small to medium numbers, but they are not sufficient when dealing with large numbers, which require precise position representations for each digit.

However, independently of the tokenization method, pretrained word embeddings have trouble extrapolating to numbers unseen during training (Wallace et al., 2019). Some alternatives to improve the extrapolation capabilities of neural models include augmenting pretraining corpora with numerical texts (Geva et al., 2020; Chu et al., 2020) or using scientific notation to represent numbers (Zhang et al., 2020). Similarly, better numerical skills can be achieved by augmenting input texts with pre-computed numerical computations (Andor et al., 2019) or by explicitly inferring mathematical equations from natural language text (Zou & Lu, 2019a;b; Li et al., 2019; Liu et al., 2019; Shi, 2020).

Special architectures have also been proposed for arithmetic tasks (Kaiser & Sutskever, 2015; Kalchbrenner et al., 2015; Price et al., 2016; Trask et al., 2018). Many of these models are capable of summing numbers larger than the ones seen during training. In contrast, more general-purpose architectures fail to extrapolate on numerical tasks (Joulin & Mikolov, 2015; Dehghani et al., 2018; Schlag et al., 2019).

Others have proposed neural–symbolic hybrids, which are typically composed of a neural model that converts inputs to continuous vector representations and a symbolic component that applies rules over these vectors (Ran et al., 2019). However, a body of evidence has shown that neural networks can perform reasoning tasks. For instance, a modern pretrained model with self-attention that uses the right level of input representation can outperform neural–symbolic hybrids on artificial reasoning tasks that require answering questions from videos (Ding et al., 2020). Deep learning models have also been successfully applied to symbolic integration, to solving differential equations (Lample & Charton, 2019), and to automated theorem proving (Polu & Sutskever, 2020).

Furthermore, it is not clear how architectures specialized for particular tasks can be adapted to simultaneously perform the range of tasks a human is capable of. Our work instead focuses on a general-purpose architecture that can be applied to almost all natural language processing tasks.

Novel ways of encoding token positions in the transformer architecture have been proposed, but they were mostly evaluated on natural language processing tasks, showing small performance gains (Ke et al., 2020; He et al., 2020; Wang et al., 2019; Huang et al., 2020).
We instead expose the limitations of subword tokenizers and positional encodings using simple arithmetic tasks.

Datasets such as DROP (Dua et al., 2019), EQUATE (Ravichander et al., 2019), and Mathematics Questions (Saxton et al., 2018) test numerical reasoning; they contain examples that require comparing, sorting, and performing other complex mathematical tasks. This work instead focuses on isolating the failure cases of the transformer architecture by studying how it performs simple arithmetic tasks. We argue that this is a necessary skill for solving more complex reasoning tasks.

3 METHODOLOGY
3.1 TASKS
Our tasks are the addition and subtraction of two numbers. We cast them as sequence-to-sequence tasks in which both the inputs to the models and the target outputs are treated as sequences of tokens. For the addition task, an example input is "What is 52 plus 148?" and the target output is "200". For the subtraction task, an example input is "What is 20 minus 185?" and the target output is "-165".

We programmatically generate training, development, and test sets of different sizes depending on the experiment. The input template is always "What is [number1] [operation] [number2]?", where [number1] and [number2] are randomly sampled numbers and [operation] is either "plus" or "minus". In Section 4, we study different ways of representing [number1] and [number2] and their corresponding answer. We use two different methods to sample numbers for the training, development, and test sets, which are described below.
Balanced sampling:
To generate training and development sets, we first set the maximum number of digits D and then create each example as follows: we first sample d from [2, D] and then independently sample [number1] and [number2] from [10^(d-1), 10^d - 1]. We then compute the answer according to the operation (i.e., either addition or subtraction). This method ensures that the set will have a roughly equal proportion of d-digit numbers, where d ∈ [2, D].

Random sampling:
To generate test sets, we sample [number1] and [number2] independently from [0, 10^D - 1]. This results in approximately 90% of the numbers having D digits, 9% having D-1 digits, and so on. This unbalanced set is intentional and aims at evaluating models on the largest numbers they were trained on. We study how different sampling methods influence model performance in Appendix D. Since each example is independently sampled, there is a chance that a pair of numbers will occur both in the training and test sets. However, this is very unlikely when dealing with large numbers. For extrapolation experiments, this never happens by design.
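A minimal sketch of the two sampling procedures and the question template (our own illustration; the function names are not taken from the released code):

```python
import random

def balanced_example(D, operation, rng=random):
    # Balanced sampling: pick a digit count d uniformly from [2, D],
    # then sample each operand uniformly among d-digit numbers.
    d = rng.randint(2, D)
    n1 = rng.randint(10 ** (d - 1), 10 ** d - 1)
    n2 = rng.randint(10 ** (d - 1), 10 ** d - 1)
    return format_example(n1, n2, operation)

def random_example(D, operation, rng=random):
    # Random sampling: sample operands uniformly from [0, 10^D - 1],
    # so roughly 90% of the operands have exactly D digits.
    n1 = rng.randint(0, 10 ** D - 1)
    n2 = rng.randint(0, 10 ** D - 1)
    return format_example(n1, n2, operation)

def format_example(n1, n2, operation):
    question = f"What is {n1} {'plus' if operation == 'add' else 'minus'} {n2}?"
    answer = str(n1 + n2 if operation == 'add' else n1 - n2)
    return question, answer

# Example: one balanced training pair with up to 5-digit operands.
print(balanced_example(5, "add"))
```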
Regular vs. inverse orders: Auto-regressive models such as the ones used in this work generate the output sequence token by token. Thus, to produce the first digit of the answer, which is the most significant one, the model has to perform all the carry operations. In the addition example from above, to produce the first digit "2", the model has to perform the carry operation for the units digits (2 and 8), and then the carry for the tens digits (5 and 4). Hence, the model has to perform the digit-wise addition (or subtraction) of all the digits in the question before generating the first digit of the answer. We call this generation order "regular".

Another way to produce an answer is by generating the least significant digits first. This order is perhaps easier to learn than the "regular" order because, to decode each digit, the model only needs to add (or subtract) single digits and check whether the previous digit-wise operation had a carry. We call this generation order "inverse". We compare models trained on these two orders in Section 5.
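To make the distinction concrete, the sketch below (ours, not the paper's code) performs schoolbook addition while emitting digits in the inverse order; note that each output digit requires only one digit-wise sum plus a carry, whereas the regular order requires resolving all carries first:

```python
def add_inverse(a: str, b: str) -> str:
    # Emit the digits of a + b least-significant first ("inverse" order).
    i, j, carry, out = len(a) - 1, len(b) - 1, 0, []
    while i >= 0 or j >= 0 or carry:
        s = carry + (int(a[i]) if i >= 0 else 0) + (int(b[j]) if j >= 0 else 0)
        out.append(str(s % 10))  # one digit-wise sum, no lookahead needed
        carry = s // 10
        i, j = i - 1, j - 1
    return " ".join(out)

print(add_inverse("52", "148"))  # "0 0 2", i.e., 200 read in reverse
```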
Metric:
Our metric is accuracy. That is, the model receives a score of one if its output matches the target output exactly. Otherwise, it receives a score of zero.
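For completeness, exact-match accuracy amounts to the following (a trivial helper of our own):

```python
def accuracy(predictions, targets):
    # Exact match: one point per prediction identical to its target string.
    assert len(predictions) == len(targets)
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```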
3.2 MODELS
We experiment with both the "vanilla" (i.e., non-pretrained) transformer and T5. As we study slightly different aspects of the representation for both models, the description of the experiments will be separated into two parts.

T5 (Raffel et al., 2020) is a sequence-to-sequence model pretrained on a large corpus of texts. The main idea behind the model is to cast every natural language processing task—for example, machine translation, question answering, and classification—as feeding a sequence-to-sequence model some input text and training it to generate some output text. We follow this same approach and feed the model the arithmetic questions and answers described in Section 3.1.

Orthography        Example                    Notes
DECIMAL            832                        default representation
CHARACTER          8 3 2
FIXED-CHARACTER    0 8 3 2
UNDERSCORE         8_3_2
WORDS              eight hundred thirty-two   leverages pretraining
10-BASED           8 100 3 10 2
10E-BASED          8 10e2 3 10e1 2 10e0

Table 1: The number representations (orthographies) studied in this work, illustrated with the number 832.
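As an illustration of this text-to-text interface, the snippet below feeds an addition question to an off-the-shelf T5 checkpoint via the HuggingFace transformers library (this is not the paper's training code, and an un-fine-tuned checkpoint will generally answer incorrectly; it only shows the interface):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" corresponds to the model we refer to as T5-220M.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("What is 52 plus 148?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```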
We experimented with all T5 model sizes except for T5-11B due to its computational cost. We refer to T5-small, T5-base, and T5-large as T5-60M, T5-220M, and T5-770M, respectively, to easily distinguish models by their numbers of parameters.

4 EXPERIMENTS ON ORTHOGRAPHY
Previous studies have recognized that commonly used subword tokenization techniques are not ideal for representing numbers (Wallace et al., 2019; Henighan et al., 2020; Saxton et al., 2018; Lample & Charton, 2019), although none of them studied the problem in depth. Here, we investigate how seven different number representations (i.e., their orthographies), illustrated in Table 1, impact the accuracy of the models on the arithmetic tasks:
DECIMAL: Digits are represented in the Hindu–Arabic numeral form (also called decimal form).
CHARACTER: Digits are separated by a white space, thus allowing the model to work on vectors that always represent single digits.
FIXED-CHARACTER: In the character representation above, it is hard to determine the significance of a digit from relative position embeddings because relative positions change on a per-example basis. For example, the least significant digits are in the third and sixth positions in the sequence "What is 5 plus 3 2?", but in the fourth and seventh positions in the sequence "What is 3 2 plus 1 5?". We conjecture that it is easier to learn the significance of each digit if digits always occupy the same absolute positions in the sequence. To achieve this, we introduce the FIXED-CHARACTER representation, in which all numbers are padded to the same maximum number of digits. For example, if this number is four, then "32" is represented as "0 0 3 2".
UNDERSCORE: Digits are separated by an underscore token. A possible advantage of this representation is that the model can learn to find the significance of a digit by counting the number of underscores to its right, down to the least significant digit.
WORDS: Numbers are converted to words using the num2words package (https://github.com/savoirfairelinux/num2words). We can anticipate two advantages of this representation: (1) the T5 model was pretrained on large amounts of textual data, so it likely knows that "hundred" is larger than "ten" (Zhang et al., 2020); (2) digits are surrounded by tokens that describe their significance ("hundred", "thousand", etc.), thus making it easier to find which two digits in the input sequence should be added (or subtracted).
10-BASED: Digits are separated by powers of 10, which we call position tokens. This representation allows the model to find the significance of a digit by simply inspecting its left or right tokens.

10E-BASED: Digits are separated by powers of 10 represented using scientific notation. This orthography has a more compact representation for the position tokens of large numbers than the 10-BASED orthography. For example, in the 10-BASED orthography, the position token of the most significant digit of a 60-digit number occupies 60 characters (i.e., "1" followed by 59 zeros). In the 10E-BASED orthography, this position token occupies only 5 characters (i.e., "10e59").
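To make the orthographies concrete, here is a sketch of an encoder for each representation (our own illustration; the function name and structure are not taken from the released code):

```python
from num2words import num2words  # pip install num2words

def encode(n: int, orthography: str, max_digits: int = 4) -> str:
    digits = str(n)
    if orthography == "decimal":
        return digits
    if orthography == "character":
        return " ".join(digits)
    if orthography == "fixed_character":
        # Pad every number to the same maximum number of digits.
        return " ".join(digits.zfill(max_digits))
    if orthography == "underscore":
        return "_".join(digits)
    if orthography == "words":
        return num2words(n)
    if orthography == "10based":
        # Interleave digits with plain powers of 10, e.g. "3 10 2".
        parts = []
        for i, d in enumerate(digits):
            parts.append(d)
            power = len(digits) - 1 - i
            if power > 0:
                parts.append("1" + "0" * power)
        return " ".join(parts)
    if orthography == "10ebased":
        # Follow every digit with its position token, e.g. "3 10e1 2 10e0".
        return " ".join(
            f"{d} 10e{len(digits) - 1 - i}" for i, d in enumerate(digits)
        )
    raise ValueError(orthography)

for o in ["decimal", "character", "fixed_character", "underscore",
          "words", "10based", "10ebased"]:
    print(o, "->", encode(32, o))
```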
Figure 1: Accuracy of different orthographies on the addition task when using a small training dataset of 1,000 examples. (x-axis: maximum number of digits; y-axis: test accuracy. Legend, illustrated with the number 32: 10E-BASED "3 10e1 2 10e0"; 10-BASED "3 10 2"; WORDS "thirty-two"; UNDERSCORE "3_2"; FIXED-CHARACTER "0 0 3 2"; CHARACTER "3 2"; DECIMAL "32".)

4.1 RESULTS
We present results in Figure 1. Each point in the graph represents the mean accuracy of a T5-220M trained with five different sets of 1,000 addition examples for 100 epochs. A separate development set of 1,000 examples was used to select the best checkpoint of each run. Error bars correspond to 95% confidence intervals. Here we only experiment with the regular order. The values on the x-axis represent the maximum number of digits used for training and testing. We use a maximum of 30-digit numbers because some representations, such as WORDS, would result in input sequences that have too many tokens (e.g., more than 512), and hence prohibitively long training times.

In the DECIMAL representation, the model barely learns addition of 2-digit numbers, and it fails to learn addition of larger numbers, i.e., it has an accuracy of zero for 5 digits or more. One explanation for this failure is that numbers are not systematically tokenized into digits: how a number is tokenized depends on the training data of the tokenizer. For instance, "132" might be tokenized as "1" and "32", whereas "232" might be tokenized as "23" and "2". Hence, the model would have to learn that sometimes the vector of a token refers to a single digit, other times to two digits, etc. It might be hard to learn (i.e., need more examples) to map a vector to a number when the number of digits represented by the vector changes irregularly.
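One can inspect this behavior directly with the T5 tokenizer; the exact splits depend on the SentencePiece vocabulary, so we make no claim about which particular numbers split how, only that nearby numbers can split inconsistently:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# Nearby numbers can be split into subwords in inconsistent ways, so the
# model sees vectors that cover one, two, or three digits at a time.
for number in ["132", "232", "1000", "1001"]:
    print(number, "->", tokenizer.tokenize(number))
```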
The CHARACTER and UNDERSCORE representations have much higher accuracies than DECIMAL, thus showing that it is easier to learn when vectors represent single digits. Both representations exhibit decreasing accuracies as we increase the number of digits, until reaching an accuracy of zero at 15-digit addition. One explanation for this failure is that, since digits with the same significance have different positions in each example, the model has to count the number of digits to the right of a digit in order to find its significance. With larger numbers, counting becomes harder.

The FIXED-CHARACTER representation achieves higher accuracies than CHARACTER and UNDERSCORE for numbers with more than 12 digits, thus showing that the model can learn to memorize digit positions to find their significance. However, with an accuracy of approximately 20% for numbers with 15 digits, the memorization strategy eventually breaks down. We argue that it is hard to learn relative positional embeddings that encode the distance between two tokens precisely, which is a requirement for this task.
The WORDS representation shows stable accuracies in the range of 40–60% from 5 to 15 digits. Our hypothesis for this stability is that the intrinsic position tokens present in this representation (e.g., "hundred", "thousand") make it easier for the model to find and sum two digits that are far apart in the input sequence. However, for 20 digits or more, the models fail at the task. Pretraining might have contributed to the high accuracies on 15 digits or fewer because the model might have already seen these numbers in this representation in the pretraining corpus. On the other hand, it is
very unlikely that the corpus contains numbers of 20 digits or more expressed in plain English. We further investigate the impact of pretraining in Appendix B.

With up to 15 digits, the 10-BASED and 10E-BASED representations achieve accuracies close to 100%. Our explanation for their success is the explicit position tokens added between the digits, which allow the model to inspect the tokens to the left or right of a digit to know its significance. Since the 10-BASED representation needs more tokens as we increase the number of digits, we use the 10E-BASED representation in the remainder of this paper due to its compactness.

In Appendix A, we show that all representations can reach accuracies of 97% or more when enough training data is provided. Results here, however, show that representations do matter when training data is scarce.

Figure 2: Addition performance of the vanilla transformer with different position encoding methods (panel title: "Transformer Addition - Position Encoding"). Legend: 10E, 10, and CHAR representations, each with Pos-Masked With TGT, Pos-Masked No TGT, and Sinusoidal encodings.

4.2 POSITION EMBEDDINGS
In this section, we study the impact of various position embeddings on the addition task. Since pretraining from scratch is a costly process, we experiment only with small transformer models trained without pretraining.

The architecture of the transformer follows Vaswani et al. (2017), except that we use 4 layers each for the encoder and the decoder. We look into the effect of representation and positional encoding on addition from 2 digits to 9 digits. Due to the cost of these experiments, we choose a subset of the representations studied in the previous section: 10E-BASED, 10-BASED, and
CHARACTER.

The dataset is split into training and test sets with a ratio of 9:1. For 3–9 digit addition, we randomly generate 10,000 samples for the whole dataset. For 2-digit addition, we use all combinations of addends a ∈ [10, 99], which results in fewer than 10,000 samples. The models are trained for 55 epochs.

We find that the original positional encoding of Vaswani et al. (2017) fails to learn addition effectively, as shown in Figure 2. This might be due to the correlation introduced by two heterogeneous signals: the token embedding and the absolute positional encoding (Ke et al., 2020). Therefore, we designed a position-wise masked embedding for this task.

More specifically, for an n-digit number whose embedding is e with embedding size d, we set e[u:v] = 1 for the i-th digit in the number, where u = int(d/n) * (n - i) and v = int(d/n) * (n - i + 1). We set all other position embedding values to 0. Note that i follows the "Big-Endian" style (e.g., i = 3 for "2" in the number "271"). However, during inference, digit information is not provided for the target sequence, as we do not know the number of digits of the decoded number in advance. So we face a format discrepancy between training and inference. To investigate how this discrepancy affects the result, we train the model in two different ways: with the target position provided and without the target position provided (the position encoding for the target is a zero vector). Note that in both cases, position encoding is provided for the source sequence during both training and inference, and position encoding is not provided for the target sequence during inference. The results are shown in Figure 2, labeled as "WITH TGT" and "NO TGT", respectively. We label our position-wise masked embedding as "Pos-Masked" and the original representation as "Sinusoidal".

Consistent with previous experiments, 10E-BASED performs best given the same position encoding and training strategy. Comparing "WITH TGT" and "NO TGT", we can see that training with target position encoding creates fluctuations among different digits. In general, it performs worse than training without target position encoding given the same encoding representation. Unsurprisingly, under our experimental setting, whether the target position is provided is not as important as having the same format between training and inference.
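A minimal sketch of this position-wise masked embedding, under our reading of the definition above (the function name and tensor layout are our own, not the authors'):

```python
import torch

def pos_masked_embedding(n: int, d: int) -> torch.Tensor:
    # Position encoding for one n-digit number with embedding size d:
    # the i-th digit (i = n for the most significant digit, matching the
    # example "i = 3 for the digit 2 in 271") gets ones in the slice
    # [int(d/n) * (n - i), int(d/n) * (n - i + 1)); everything else is 0.
    enc = torch.zeros(n, d)
    width = d // n
    for pos in range(n):      # pos = 0 is the most significant digit
        i = n - pos           # index used in the formula from the text
        u, v = width * (n - i), width * (n - i + 1)
        enc[pos, u:v] = 1.0
    return enc

print(pos_masked_embedding(3, 12))  # each digit gets its own 4-dim block
```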
5 EXPERIMENTS ON EXTRAPOLATION
One advantage of working with arithmetic tasks is that the rules to be learned are well defined andrelatively simple. Thus, it is easy to verify if models learned such rules by evaluating them onnumbers that are larger than the ones they were trained on. If successful, such a model would haveno problem in correctly summing or subtracting arbitrarily long numbers.In this section, we investigate how models of different sizes perform interpolation and extrapolationtasks. We train T5-60M, T5-220M, T5-770M, and T5-3B models on numbers that are sampled usingthe “balanced” method explained in Section 3.1. Models are trained 100k iterations using batches of128 examples and a learning rate of − . We save checkpoints every 2,000 iterations, and the bestcheckpoint is chosen using a separate validation set of 10,000 examples. The models are evaluatedon a test set of 10,000 examples with numbers sampled using the “random” method.On interpolation experiments, the models are trained and evaluated on up to 60-digit numbers. Onextrapolation experiments, the models are trained on up to 50-digit numbers and evaluated on 60-digit numbers. We use that many digits for training because the models could not extrapolate withfewer; see additional discussion about this limitation in Section 5.1.The results presented in Table 2 show that models of all sizes successfully perform interpolationtasks. Two exceptions are T5-60M on the subtraction tasks, which achieve 0.934 and 0.830 accuracyfor inverse and regular orders, respectively. Nevertheless, compared to extrapolation results, thesenumbers are high enough to consider them as successful runs.On extrapolation tasks, T5-3B succeeds on almost all of them, whereas smaller models fail moreoften. Even on tasks where T5-220M achieves reasonable accuracy (0.862 and 0.641 on additionand subtraction using regular order, respectively), T5-3B outperforms T5-220M by large margins.This result provides evidence that larger models might perform better on data whose distribution isoutside its training data distribution. However, it remains to be investigated if this trend holds formore complex tasks, especially those involving natural language.The difference in accuracy is negligible between regular and inverse orders on interpolation tasks.However, models trained and evaluated on the regular order show higher extrapolation accuracythan those that use the inverse order. For example, T5-220M fails to extrapolate on both additionand subtraction tasks when using the inverse order (i.e., accuracy is zero), but it performs betterwhen using the regular order, with accuracy between 60–90%. This result is perhaps surprisingsince one would expect that the inverse order would be easier to learn (for reasons explained inSection 3.1). Supported by recent work, we suspect that the problem is related to the bias of selectingthe termination (i.e., end-of-sequence) token when the generated sequence becomes longer thanthose seen during training (Newman et al., 2020). In the inverse order, the answer is generated fromleast to most significant digit, so the model might have a tendency to select the termination token ight after it generates the most significant digit seen during training. In the regular order, however,the model has to predict the full length of the sequence before emitting the first and second tokens.For example, the first two tokens of the answer to the question +10 are “2” and “10e60”. 
This explicit length prediction allows the model to better generalize to longer sequences, but it appears to be insufficient to induce models to learn addition rules that are independent of the length of the numbers seen during training (more below).

5.1 LIMITATIONS
We observe high variance in accuracy in the extrapolation experiments. For example, during the training of a T5-770M model on up to 30-digit numbers, the accuracy ranges from 20% to 50% when evaluated on 60-digit numbers. Extrapolation accuracy also oscillates between 20–40 percentage points when changing the seed used for training data generation.

Extrapolation is hardly achieved when training on fewer than 50 digits, regardless of the model size. For example, T5-220M, T5-770M, and T5-3B trained on 15 digits show an accuracy of zero when evaluated on 20 digits.

Beyond a critical amount, increasing the training data does not improve extrapolation accuracy. For example, when trained on up to 30-digit numbers and evaluated on 60-digit numbers, a T5-770M showed a similar accuracy range (20%–50%) when trained with either 100k, 1M, or 10M examples. As training progresses, interpolation accuracy always reaches 100%, but extrapolation accuracy starts to decrease after some number of training steps. The number of training steps after which this drop occurs varies dramatically between runs that differ only in the seed used to generate the training data. We are unable to isolate the cause of this behavior, mainly due to the high variance of extrapolation accuracy.

Contrary to the hypothesis of Newman et al. (2020), we find that the end-of-sequence token does not seem to be the cause of extrapolation failures. For example, when a T5-770M model trained on 30-digit numbers is evaluated on 60-digit numbers, it correctly generates the first 23 position tokens (i.e., from "10e60" until "10e38") but then suddenly skips to position token "10e27" and continues generating the correct position tokens until the last one ("10e0"). Here we show one such sequence:

1 10e60 0 10e59 1 10e58 2 10e57 3 10e56 0 10e55 2 10e54 7 10e53 0 10e52 1 10e51 0 10e50 3 10e49 9 10e48 0 10e47 5 10e46 3 10e45 1 10e44 5 10e43 3 10e42 6 10e41 3 10e40 6 10e39 0
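Failures like the skipped position tokens above are easy to detect mechanically; the following small checker (our own utility, not part of the paper's code) decodes a 10E-BASED answer and reports missing powers:

```python
def check_10ebased(output: str):
    # Parse "d 10eK d 10eK ..." pairs, recover the integer value, and
    # report any powers of ten skipped between the largest and smallest.
    tokens = output.split()
    pairs = [(int(tokens[i]), int(tokens[i + 1][3:]))
             for i in range(0, len(tokens) - 1, 2)]
    powers = [p for _, p in pairs]
    missing = sorted(set(range(min(powers), max(powers) + 1)) - set(powers))
    value = sum(d * 10 ** p for d, p in pairs)
    return value, missing

value, missing = check_10ebased("1 10e60 0 10e59 1 10e58")
print(missing)  # [] -- this well-formed prefix skips no position tokens
```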
6 CONCLUSION

Rumelhart et al. (1985) wrote in their seminal "Backpropagation" paper that:
Unfortunately, this (addition) is the one problem we have found that reliably leads the system into local minima.
Almost four decades later, despite the remarkable progress of neural networks, we confirm that learning addition is indeed hard for these models. Better symbolic representations can make the problem easier, but, still, none of the models is able to find a solution for addition and subtraction problems that works for any input length.

Although larger models perform better than smaller ones, our extrapolation experiments show that not even 3B-parameter models can learn simple arithmetic rules. This work provides evidence that improving tokenizers, positional encodings (Section 4.1), and decoding algorithms (Section 5.1) are promising directions for future exploration.

ACKNOWLEDGMENTS
This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. In addition, we would like to thank Google Cloud for credits to support this work.

REFERENCES
Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. Giving BERT a calculator: Finding operations and arguments with reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5949–5954, 2019.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.

Jui Chu, Chung-Chi Chen, Hen-Hsen Huang, and Hsin-Hsi Chen. Learning to generate correct numeric values in news headlines. In Companion Proceedings of the Web Conference 2020, pp. 17–18, 2020.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.

David Ding, Felix Hill, Adam Santoro, and Matt Botvinick. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures. arXiv preprint arXiv:2012.08508, 2020.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–2378, 2019.

Mor Geva, Ankit Gupta, and Jonathan Berant. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 946–958, July 2020.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.

Zhiheng Huang, Davis Liang, Peng Xu, and Bing Xiang. Improve transformer models with better relative position embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 3327–3335, 2020.

Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, Haofen Wang, and Kewei Tu. Learning numeral embeddings. arXiv preprint arXiv:2001.00003, 2019.

Devin Johnson, Denise Mak, Andrew Barker, and Lexi Loessberg-Zahl. Probing for multilingual numerical understanding in transformer-based language models. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 184–192, 2020.

Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. Advances in Neural Information Processing Systems, 28:190–198, 2015.

Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

Guolin Ke, Di He, and Tie-Yan Liu. Rethinking the positional encoding in language pre-training. arXiv preprint arXiv:2006.15595, 2020.

Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2019.

Jierui Li, Lei Wang, Jipeng Zhang, Yan Wang, Bing Tian Dai, and Dongxiang Zhang. Modeling intra-relation in math word problems with different functional multi-head attentions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6162–6167, 2019.

Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. Birds have four legs?! NumerSense: Probing numerical commonsense knowledge of pre-trained language models. arXiv preprint arXiv:2005.00683, 2020.

Qianying Liu, Wenyv Guan, Sujian Li, and Daisuke Kawahara. Tree-structured decoding for solving math word problems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2370–2379, 2019.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.

Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, and Chitta Baral. Towards question format independent numerical reasoning: A set of prerequisite tasks. arXiv preprint arXiv:2005.08516, 2020.

Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, and Eduard Hovy. Exploring numeracy in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3374–3380, 2019.

Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The EOS decision and length extrapolation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 276–291, 2020.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, 2018.

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.

Eric Price, Wojciech Zaremba, and Ilya Sutskever. Extensions and limitations of the neural GPU. arXiv preprint arXiv:1611.00736, 2016.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2474–2484, 2019.

Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 349–361, 2019.

Yuanhang Ren and Ye Du. Enhancing the numeracy of word embeddings: A linear algebraic perspective. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 170–178. Springer, 2020.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2018.

Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.

Hongjie Shi. A sequence-to-sequence approach for numerical slot-filling dialog systems. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 272–277, 2020.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics—on what language model pre-training captures. arXiv preprint arXiv:1912.13283, 2019.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601, 2019.

Andrew Trask, Felix Hill, Scott E. Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pp. 8035–8044, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5310–5318, 2019.

Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. In International Conference on Learning Representations, 2019.

Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. Do language embeddings capture scales? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 292–299, 2020.

Yanyan Zou and Wei Lu. Quantity tagger: A latent-variable sequence labeling approach to solving addition-subtraction word problems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5246–5251, 2019a.

Yanyan Zou and Wei Lu. Text2Math: End-to-end parsing text into math expressions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5330–5340, 2019b.

A IMPACT OF DATA SIZE
In Section 4.1, we show that the choice of orthography has a large impact on the addition task when training data is scarce (i.e., 1,000 training examples). In this section, we investigate how these orthographies perform with varying amounts of training data. We train and evaluate T5-220M on the addition task of up to 30-digit numbers using the regular order. Due to the high computational cost of training this model on millions of examples, we reduce the number of epochs depending on the dataset size, as detailed in Table 3. We select the best checkpoint using a validation set of 10,000 examples and evaluate the models on a test set of 10,000 examples.

Results are shown in Figure 3. The 10E-BASED representation presents the best results for training sizes of 1,000 and 10,000 examples, followed by 10-BASED, WORDS, UNDERSCORE, CHARACTER, and DECIMAL. For larger datasets such as 10M examples, almost all representations achieve more than 99.9% accuracy. The exception is the DECIMAL representation, which still has a high error of 2.1% even when trained with 10M examples.

We conclude that with enough training data, models can learn the addition task regardless of the representation. The limitations of some representations are exposed only when training data is small.

Table 3: Number of training epochs used for each dataset size (sizes range from 1,000 to 10,000,000 examples).
Figure 3: Accuracy of different number representations when varying the amount of training examples. The task is the addition of up to 30-digit numbers.
B PRETRAINED VS. FROM SCRATCH
One hypothesis for the high interpolation accuracy reported in Section 4, despite using a small number of training examples, is that the model has already seen addition and subtraction examples during pretraining. To test this hypothesis, we compare pretrained models with models trained from scratch on the addition task. In this experiment, the models never see the same training example more than once; that is, they are not limited by training data.

Figure 4 shows that both pretrained T5-220M and T5-3B need approximately ten times fewer training examples (and compute) than models trained from scratch to reach 100% accuracy on the addition of 60-digit numbers.
Figure 4: Accuracy of pretrained models vs. models trained from scratch with respect to the number of training examples (x-axis: millions of examples seen during training, log scale; y-axis: test accuracy). Models are trained and evaluated on numbers with up to 60 digits in length.
C ACCURACY ON DIFFERENT BASES
Here we propose another way to test how pretraining can impact a model's ability to learn arithmetic. We hypothesize that a model might have difficulty learning bases other than base 10 (i.e., decimal) because such examples rarely occur in the pretraining corpus. To test this hypothesis, we train a T5-220M model on addition examples using binary, ternary, decimal, and base-19 numbers. While there might indeed be examples of binary addition in the pretraining corpus, our expectation is that it contains few (if any?) examples of addition using base-19 numbers. We use the 10E-BASED orthography and the inverse order due to its slightly better accuracy, as reported in Table 2. We also evaluate models trained from scratch, i.e., without pretraining on the masked language modeling task.

We report the mean accuracy and 95% confidence intervals of a model trained with five different sets of 1,000 addition examples for 100 epochs. A separate development set of 1,000 examples was used to select the best checkpoint of each run. We trained and evaluated on numbers equivalent to 15 decimal digits.

For these experiments, we use only 1,000 training examples, since experiments in Appendix A show that models can successfully learn with enough training data; too much data thus defeats the purpose of measuring the impact of pretraining. Results are shown in Table 4. The pretrained model has no problem learning the binary, ternary, and decimal bases, but its accuracy degrades slightly on base 19. Since it is unlikely that the pretrained model has encountered substantial numbers of examples of addition in rare bases (i.e., ternary and 19), it seems that pretraining helps on this task in ways other than simple memorization. (Hernandez et al. (2021) provide more evidence for this claim.)

To show that the task is not easy, we also report in the table that models trained from scratch fail to learn the task regardless of the base. This result is expected, since a large number of parameters (220M) must be learned from scratch using just 1,000 examples.

Table 4: Test accuracy (mean ± 95% CI) on the addition of numbers in bases 2, 3, 10, and 19, for models trained from scratch vs. pretrained, using the 10E-BASED orthography.
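As a sketch of how such examples can be generated, the helper below writes a number in an arbitrary base with explicit position tokens (our own illustration; the paper does not specify its exact formatting for non-decimal bases):

```python
def to_base_10ebased(n: int, base: int) -> str:
    # Write n in the given base, each digit followed by an explicit
    # position token, e.g. to_base_10ebased(5, 2) -> "1 10e2 0 10e1 1 10e0".
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    digits = digits[::-1] or [0]
    k = len(digits) - 1
    return " ".join(f"{d} 10e{k - i}" for i, d in enumerate(digits))

# A base-19 addition example with 15-decimal-digit operands.
a, b = 123456789012345, 987654321098765
question = f"What is {to_base_10ebased(a, 19)} plus {to_base_10ebased(b, 19)}?"
answer = to_base_10ebased(a + b, 19)
```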
D IMPACT OF DIFFERENT LENGTH DISTRIBUTIONS
Here we investigate to what extent a mismatch between the length distributions of the training and test sets is problematic for the addition task. We train T5-220M models on 100,000 examples, select the best checkpoint using a development set of 10,000 examples, and evaluate on another 10,000 examples. Here we use the regular order. Training and test sets are generated using either the balanced or the random sampling method described in Section 3.1.

Results are shown in Table 5. When trained on the balanced distribution, the model succeeds on both the random and balanced evaluation sets. When trained on the random distribution, it succeeds on the random evaluation set, but it fails on the balanced evaluation set. In other words, when trained on data where most numbers (i.e., 90%) have 60 digits, it does not learn to sum numbers with fewer digits. This shows that models have problems performing addition on sequences shorter than the ones seen during training. This is complementary to the results presented in Section 5, which show that models cannot generate sequences longer than the ones seen during training.

                  Test
Train             Balanced    Random
Balanced          1.000       1.000
Random            0.014       1.000

Table 5: Accuracy on 60-digit addition. Balanced means sampling an equal proportion of numbers with 2–60 digits. Random means uniformly sampling from [0, 10^60 - 1], which results in approximately 90% of the numbers having 60 digits, 9% having 59 digits, and so on.