Deep Learning for Symbolic Mathematics

Guillaume Lample∗ Facebook AI Research [email protected]
François Charton∗ Facebook AI Research [email protected]

ABSTRACT
Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. In this paper, we show that they can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations. We propose a syntax for representing mathematical problems, and methods for generating large datasets that can be used to train sequence-to-sequence models. We achieve results that outperform commercial Computer Algebra Systems such as Matlab or Mathematica.
1 INTRODUCTION
A longstanding tradition in machine learning opposes rule-based inference to statistical learning (Rumelhart et al., 1986), and neural networks clearly stand on the statistical side. They have proven to be extremely effective in statistical pattern recognition and now achieve state-of-the-art performance on a wide range of problems in computer vision, speech recognition, natural language processing (NLP), etc. However, the success of neural networks in symbolic computation is still extremely limited: combining symbolic reasoning with continuous representations is now one of the challenges of machine learning.

Only a few studies have investigated the capacity of neural networks to deal with mathematical objects, and apart from a small number of exceptions (Zaremba et al., 2014; Loos et al., 2017; Allamanis et al., 2017; Arabshahi et al., 2018b), the majority of these works focus on arithmetic tasks like integer addition and multiplication (Zaremba & Sutskever, 2014; Kaiser & Sutskever, 2015; Trask et al., 2018). On these tasks, neural approaches tend to perform poorly, and require the introduction of components biased towards the task at hand (Kaiser & Sutskever, 2015; Trask et al., 2018).

In this paper, we consider mathematics, and particularly symbolic calculations, as a target for NLP models. More precisely, we use sequence-to-sequence models (seq2seq) on two problems of symbolic mathematics: function integration and ordinary differential equations (ODEs). Both are difficult, for trained humans and computer software. For integration, humans are taught a set of rules (integration by parts, change of variable, etc.) that are not guaranteed to succeed, and Computer Algebra Systems use complex algorithms (Geddes et al., 1992) that explore a large number of specific cases.
For instance, the complete description of the Risch algorithm (Risch, 1970) for function integration is more than 100 pages long.

Yet, function integration is actually an example where pattern recognition should be useful: detecting that an expression is of the form $yy'(y^2+1)^{-1/2}$ suggests that its primitive will contain $\sqrt{y^2+1}$. Detecting this pattern may be easy for small expressions $y$, but becomes more difficult as the number of operators in $y$ increases. However, to the best of our knowledge, no study has investigated the ability of neural networks to detect patterns in mathematical expressions.

We first propose a representation of mathematical expressions and problems that can be used by seq2seq models, and discuss the size and structure of the resulting problem space. Then, we show how to generate datasets for supervised learning of integration and first and second order differential equations. Finally, we apply seq2seq models to these datasets, and show that they achieve a better performance than state-of-the-art computer algebra programs, namely Matlab and Mathematica.

∗ Equal contribution.

2 MATHEMATICS AS A NATURAL LANGUAGE
2.1 EXPRESSIONS AS TREES
Mathematical expressions can be represented as trees, with operators and functions as internal nodes, operands as children, and numbers, constants and variables as leaves. For instance, the expressions $2 + 3 \times (5 + 2)$, $3x^2 + \cos(2x) - 1$, and $\frac{\partial^2 \psi}{\partial x^2} - \frac{1}{\nu^2} \frac{\partial^2 \psi}{\partial t^2}$ each correspond to such a tree. [tree diagrams omitted]

Trees disambiguate the order of operations, take care of precedence and associativity and eliminate the need for parentheses. Up to the addition of meaningless symbols like spaces, punctuation or redundant parentheses, different expressions result in different trees. With a few assumptions, discussed in Section A of the appendix, there is a one-to-one mapping between expressions and trees.

We consider expressions as sequences of mathematical symbols. $2 + 3$ and $3 + 2$ are different expressions, as are $\sqrt{4x}$ and $2\sqrt{x}$, and they will be represented by different trees. Most expressions represent meaningful mathematical objects. $x / 0$, $\sqrt{-2}$ or $\log(0)$ are also legitimate expressions, even though they do not necessarily make mathematical sense.

Since there is a one-to-one correspondence between trees and expressions, equality between expressions will be reflected over their associated trees, as an equivalence: since $2 + 3 = 5 = 12 - 7 = 1 \times 5$, the four trees corresponding to these expressions are equivalent.

Many problems of formal mathematics can be reframed as operations over expressions, or trees. For instance, expression simplification amounts to finding a shorter equivalent representation of a tree. In this paper, we consider two problems: symbolic integration and differential equations. Both boil down to transforming an expression into another, e.g. mapping the tree of an equation to the tree of its solution. We regard this as a particular instance of machine translation.

2.2 TREES AS SEQUENCES
Machine translation systems typically operate on sequences (Sutskever et al., 2014; Bahdanau et al., 2015). Alternative approaches have been proposed to generate trees, such as Tree-LSTM (Tai et al., 2015) or Recurrent Neural Network Grammars (RNNG) (Dyer et al., 2016; Eriguchi et al., 2017). However, tree-to-tree models are more involved and much slower than their seq2seq counterparts, both at training and at inference. For the sake of simplicity, we use seq2seq models, which were shown to be effective at generating trees, e.g. in the context of constituency parsing (Vinyals et al., 2015), where the task is to predict a syntactic parse tree of input sentences.

Using seq2seq models to generate trees requires mapping trees to sequences. To this effect, we use prefix notation (also known as normal Polish notation), writing each node before its children, listed from left to right. For instance, the arithmetic expression $2 + 3 \times (5 + 2)$ is represented as the sequence $[+\ 2\ \times\ 3\ +\ 5\ 2]$. In contrast to the more common infix notation $2 + 3 \times (5 + 2)$, prefix sequences need no parentheses and are therefore shorter. Inside sequences, operators, functions or variables are represented by specific tokens, and integers by sequences of digits preceded by a sign. As in the case of expressions and trees, there exists a one-to-one mapping between trees and prefix sequences.

2.3 GENERATING RANDOM EXPRESSIONS
To create training data, we need to generate sets of random mathematical expressions. However, sampling uniformly among expressions with $n$ internal nodes is not a simple task. Naive algorithms (such as recursive methods or techniques using fixed probabilities for nodes to be leaves, unary, or binary) tend to favour deep trees over broad trees, or left-leaning over right-leaning trees. Here are examples of different trees that we want to generate with the same probability. [four example tree diagrams omitted] In Section C of the appendix, we present an algorithm to generate random trees and expressions, where the four expression trees above are all generated with the same probability.

2.4 COUNTING EXPRESSIONS
We now investigate the number of possible expressions. Expressions are created from a finite set of variables (i.e. literals), constants, integers, and a list of operators that can be simple functions (e.g. cos or exp) or more involved operators (e.g. differentiation or integration). More precisely, we define our problem space as:

• trees with up to $n$ internal nodes
• a set of $p_1$ unary operators (e.g. cos, sin, exp, log)
• a set of $p_2$ binary operators (e.g. $+$, $-$, $\times$, pow)
• a set of $L$ leaf values containing variables (e.g. $x, y, z$), constants (e.g. $e, \pi$) and integers

If $p_1 = 0$, expressions are represented by binary trees. The number of binary trees with $n$ internal nodes is given by the $n$-th Catalan number $C_n$ (Sloane, 1996). A binary tree with $n$ internal nodes has exactly $n + 1$ leaves. Each node and leaf can take respectively $p_2$ and $L$ different values. As a result, the number of expressions with $n$ binary operators is:

$$E_n = C_n\, p_2^n\, L^{n+1} \approx \frac{4^n}{n\sqrt{\pi n}}\, p_2^n\, L^{n+1} \quad\text{with}\quad C_n = \frac{1}{n+1}\binom{2n}{n}$$

If $p_1 > 0$, expressions are unary-binary trees, and the number of trees with $n$ internal nodes is the $n$-th large Schroeder number $S_n$ (Sloane, 1996). It can be computed by recurrence using the following equation:

$$(n+1)\, S_n = 3(2n-1)\, S_{n-1} - (n-2)\, S_{n-2} \qquad (1)$$

Finally, the number $E_n$ of expressions with $n$ internal nodes, $p_1$ unary operators, $p_2$ binary operators and $L$ possible leaves is recursively computed as

$$(n+1)\, E_n = (p_1 + 2Lp_2)(2n-1)\, E_{n-1} - p_1^2 (n-2)\, E_{n-2} \qquad (2)$$

If $p_1 = p_2 = L = 1$, Equation 2 boils down to Equation 1. If $p_2 = L = 1$ and $p_1 = 0$, we have $(n+1) E_n = 2(2n-1) E_{n-1}$, which is the recurrence relation satisfied by the Catalan numbers. The derivations and properties of all these formulas are provided in Section B of the appendix.

In Figure 1, we represent the number of binary trees ($C_n$) and unary-binary trees ($S_n$) for different numbers of internal nodes. We also represent the number of possible expressions ($E_n$) for different sets of operators and leaves.
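The recurrences above are easy to check numerically. The sketch below (function names are ours, and it assumes Equation 2 as reconstructed above, with base cases $E_0 = L$ and $E_1 = p_1 L + p_2 L^2$) recovers the Catalan and large Schroeder numbers as special cases:

```python
from math import comb

def catalan(n):
    # C_n = binom(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

def num_expressions(n_max, p1, p2, L):
    """E_n from the recurrence
       (n+1) E_n = (p1 + 2*L*p2)(2n-1) E_{n-1} - p1^2 (n-2) E_{n-2},
    with base cases E_0 = L (a single leaf) and
    E_1 = p1*L + p2*L**2 (one unary node, or one binary node)."""
    E = [L, p1 * L + p2 * L ** 2]
    for n in range(2, n_max + 1):
        num = (p1 + 2 * L * p2) * (2 * n - 1) * E[n - 1] - p1 ** 2 * (n - 2) * E[n - 2]
        E.append(num // (n + 1))
    return E[: n_max + 1]

# p1 = p2 = L = 1 gives the large Schroeder numbers (unary-binary trees):
print(num_expressions(4, 1, 1, 1))    # [1, 2, 6, 22, 90]
# p1 = 0, p2 = L = 1 gives the Catalan numbers (binary trees):
print(num_expressions(4, 0, 1, 1))    # [1, 1, 2, 5, 14]
# With the operator set used later in the paper (p1 = 15, p2 = 4, L = 11),
# the problem space grows extremely fast:
print(num_expressions(10, 15, 4, 11)[-1])
```

The geometric growth visible in the last line is what Figure 1 plots on a log scale.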
Figure 1:
Number of trees and expressions for different numbers of operators and leaves. $p_1$ and $p_2$ correspond to the number of unary and binary operators respectively, and $L$ to the number of possible leaves. The bottom two curves correspond to the number of binary and unary-binary trees (enumerated by Catalan and Schroeder numbers respectively). The top two curves represent the associated numbers of expressions. We observe that adding leaves and binary operators significantly increases the size of the problem space.

3 GENERATING DATASETS
Having defined a syntax for mathematical problems and techniques to randomly generate expressions, we are now in a position to build the datasets our models will use. In the rest of the paper, we focus on two problems of symbolic mathematics: function integration and solving ordinary differential equations (ODE) of the first and second order.

To train our networks, we need datasets of problems and solutions. Ideally, we want to generate representative samples of the problem space, i.e. randomly generate functions to be integrated and differential equations to be solved. Unfortunately, solutions of random problems sometimes do not exist (e.g. the integrals of $f(x) = \exp(x^2)$ or $f(x) = \log(\log(x))$ cannot be expressed with usual functions), or cannot be easily derived. In this section, we propose techniques to generate large training sets for integration and first and second order differential equations.

3.1 INTEGRATION
We propose three approaches to generate functions with their associated integrals.
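As a minimal illustration of the training-pair format, the backward idea described below (sample a function, differentiate it, and store the pair $(f', f)$) can be sketched with SymPy; the fixed candidate list and function names here are ours:

```python
import sympy as sp

x = sp.symbols("x")

def backward_pairs(functions):
    """Return (problem, solution) pairs: the problem is f', the solution f."""
    pairs = []
    for f in functions:
        f_prime = sp.diff(f, x)   # differentiation always succeeds and is fast
        pairs.append((f_prime, f))
    return pairs

# a fixed list stands in for the random expression generator of Section 2
candidates = [x * sp.sin(x), sp.exp(x) * sp.cos(x), sp.log(x + 1) / x]
for problem, solution in backward_pairs(candidates):
    # sanity check: the stored solution is indeed a primitive of the problem
    assert sp.simplify(sp.diff(solution, x) - problem) == 0
    print(problem, "->", solution)
```

Note that no integration routine is ever called: this is the key property of the backward approach.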
Forward generation (FWD). A straightforward approach is to generate random functions with up to $n$ operators (using methods from Section 2) and calculate their integrals with a computer algebra system. Functions that the system cannot integrate are discarded. This generates a representative sample of the subset of the problem space that can be successfully solved by an external symbolic mathematical framework.

Backward generation (BWD). An issue with the forward approach is that the dataset only contains functions that symbolic frameworks can solve (they sometimes fail to compute the integral of integrable functions). Also, integrating large expressions is time expensive, which makes the overall method particularly slow. Instead, the backward approach generates a random function $f$, computes its derivative $f'$, and adds the pair $(f', f)$ to the training set. Unlike integration, differentiation is always possible and extremely fast even for very large expressions. As opposed to the forward approach, this method does not depend on an external symbolic integration system.

Backward generation with integration by parts (IBP). An issue with the backward approach is that it is very unlikely to generate the integral of simple functions like $f(x) = x^3 \sin(x)$. Its integral, $F(x) = -x^3 \cos(x) + 3x^2 \sin(x) + 6x \cos(x) - 6 \sin(x)$, a function with 15 operators, has a very low probability of being generated randomly. Besides, the backward approach tends to generate examples where the integral (the solution) is shorter than the derivative (the problem), while forward generation favors the opposite (see Figure 2 in Section E of the appendix). To address this issue, we leverage integration by parts: given two randomly generated functions $F$ and $G$, we compute their respective derivatives $f$ and $g$. If $fG$ already belongs to the training set, we know its integral, and we can compute the integral of $Fg$ as:

$$\int Fg = FG - \int fG$$

Similarly, if $Fg$ is in the training set, we can infer the integral of $fG$. Whenever we discover the integral of a new function, we add it to the training set. If neither $fG$ nor $Fg$ is in the training set, we simply generate new functions $F$ and $G$. With this approach, we can generate the integrals of functions like $x^{10} \sin(x)$ without resorting to an external symbolic integration system.

Comparing different generation methods.
Table 1 in Section 4.1 summarizes the differences between the three generation methods. The FWD method tends to generate short problems with long solutions (that computer algebra systems can solve). The BWD approach, on the other hand, generates long problems with short solutions. IBP generates datasets comparable to FWD (short problems and long solutions), without an external computer algebra system. A mixture of BWD and IBP generated data should therefore provide a better representation of problem space, without resorting to external tools. Examples of functions / integrals for the three approaches are given in Table 9 of the appendix.

3.2 FIRST ORDER DIFFERENTIAL EQUATION (ODE 1)

We now present a method to generate first order differential equations with their solutions. We start from a bivariate function $F(x, y)$ such that the equation $F(x, y) = c$ (where $c$ is a constant) can be analytically solved in $y$. In other words, there exists a bivariate function $f$ that satisfies $\forall (x, c),\ F(x, f(x, c)) = c$. By differentiation with respect to $x$, we have that $\forall x, c$:

$$\frac{\partial F(x, f_c(x))}{\partial x} + f_c'(x)\, \frac{\partial F(x, f_c(x))}{\partial y} = 0$$

where $f_c = x \mapsto f(x, c)$. As a result, for any constant $c$, $f_c$ is solution of the first order differential equation:

$$\frac{\partial F(x, y)}{\partial x} + y'\, \frac{\partial F(x, y)}{\partial y} = 0 \qquad (3)$$

With this approach, we can use the method described in Section C of the appendix to generate arbitrary functions $F(x, y)$ analytically solvable in $y$, and create a dataset of differential equations with their solutions.

Instead of generating a random function $F$, we can generate a solution $f(x, c)$, and determine a differential equation that it satisfies. If $f(x, c)$ is solvable in $c$, we compute $F$ such that $F(x, f(x, c)) = c$. Using the above approach, we show that for any constant $c$, $x \mapsto f(x, c)$ is a solution of differential Equation 3. Finally, the resulting differential equation is factorized, and we remove all positive factors from the equation.

A necessary condition for this approach to work is that the generated functions $f(x, c)$ can be solved in $c$. For instance, the function $f(x, c) = c \times \log(x + c)$ cannot be analytically solved in $c$, i.e. the function $F$ that satisfies $F(x, f(x, c)) = c$ cannot be written with usual functions. Since all the operators and functions we use are invertible, a simple condition to ensure the solvability in $c$ is to guarantee that $c$ only appears once in the leaves of the tree representation of $f(x, c)$.
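This construction can be checked mechanically on a concrete example, $f(x, c) = x \log(c/x)$ (the same one developed step by step below); a SymPy sketch, with names of our choosing:

```python
import sympy as sp

# Check of the ODE1 construction on f(x, c) = x*log(c/x).
x, c = sp.symbols("x c", positive=True)
y = sp.Function("y")

f = x * sp.log(c / x)                  # the generated solution f(x, c)

# Solving f = x*log(c/x) in c gives  c = x*exp(f/x),  i.e.  F(x, y) = x*exp(y/x).
F = x * sp.exp(y(x) / x)
assert sp.simplify(F.subs(y(x), f) - c) == 0

# Differentiating F(x, y(x)) in x and removing the positive factor
# exp(y/x)/x leaves the equation  x*y' - y + x = 0.
target = x * sp.diff(y(x), x) - y(x) + x
assert sp.expand(sp.diff(F, x) - target * sp.exp(y(x) / x) / x) == 0

# And f is indeed a solution of  x*y' - y + x = 0  for every constant c.
assert sp.expand(x * sp.diff(f, x) - f + x) == 0
print("y = x*log(c/x) solves x*y' - y + x = 0")
```

The same equation and solution reappear in the evaluation discussion of Section 4.3.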
A straightforward way to generate a suitable $f(x, c)$ is to sample a random function $f(x)$ by the methods described in Section C of the appendix, and to replace one of the leaves in its tree representation by $c$. Below is an example of the whole process:

Generate a random function: $f(x) = x \log(c / x)$
Solve in $c$: $c = x e^{f(x)/x} = F(x, f(x))$
Differentiate in $x$: $e^{f(x)/x} \left( f'(x) - \frac{f(x)}{x} + 1 \right) = 0$
Simplify: $xy' - y + x = 0$

3.3 SECOND ORDER DIFFERENTIAL EQUATION (ODE 2)

Our method for generating first order equations can be extended to the second order, by considering functions of three variables $f(x, c_1, c_2)$ that can be solved in $c_2$. As before, we derive a function of three variables $F$ such that $F(x, f(x, c_1, c_2), c_1) = c_2$. Differentiation with respect to $x$ yields a first order differential equation:

$$\frac{\partial F(x, y, c_1)}{\partial x} + f'_{c_1,c_2}(x)\, \frac{\partial F(x, y, c_1)}{\partial y} \;\Bigg|_{y = f_{c_1,c_2}(x)} = 0$$

where $f_{c_1,c_2} = x \mapsto f(x, c_1, c_2)$. If this equation can be solved in $c_1$, we can infer another three-variable function $G$ satisfying $\forall x,\ G(x, f_{c_1,c_2}(x), f'_{c_1,c_2}(x)) = c_1$. Differentiating with respect to $x$ a second time yields the following equation:

$$\frac{\partial G(x, y, z)}{\partial x} + f'_{c_1,c_2}(x)\, \frac{\partial G(x, y, z)}{\partial y} + f''_{c_1,c_2}(x)\, \frac{\partial G(x, y, z)}{\partial z} \;\Bigg|_{\substack{y = f_{c_1,c_2}(x) \\ z = f'_{c_1,c_2}(x)}} = 0$$

Therefore, for any constants $c_1$ and $c_2$, $f_{c_1,c_2}$ is solution of the second order differential equation:

$$\frac{\partial G(x, y, y')}{\partial x} + y'\, \frac{\partial G(x, y, y')}{\partial y} + y''\, \frac{\partial G(x, y, y')}{\partial z} = 0$$

Using this approach, we can create pairs of second order differential equations and solutions, provided we can generate $f(x, c_1, c_2)$ solvable in $c_2$, and that the corresponding first order differential equation is solvable in $c_1$.
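The second order construction can likewise be checked on a concrete example, $f(x) = c_1 e^x + c_2 e^{-x}$ (the one developed step by step below); SymPy sketch, names ours:

```python
import sympy as sp

# Check of the ODE2 construction on f = c1*exp(x) + c2*exp(-x).
x, c1, c2 = sp.symbols("x c1 c2")

f = c1 * sp.exp(x) + c2 * sp.exp(-x)

# Solve in c2:  c2 = f*exp(x) - c1*exp(2x)  =  F(x, f, c1)
F = f * sp.exp(x) - c1 * sp.exp(2 * x)
assert sp.expand(F - c2) == 0

# Differentiate in x and solve in c1:
#   c1 = (1/2)*exp(-x)*(f' + f)  =  G(x, f, f')
G = sp.Rational(1, 2) * sp.exp(-x) * (sp.diff(f, x) + f)
assert sp.expand(G - c1) == 0

# Differentiating once more eliminates both constants: y'' - y = 0.
assert sp.expand(sp.diff(f, x, 2) - f) == 0
print("y = c1*exp(x) + c2*exp(-x) solves y'' - y = 0")
```

Both constants disappear, which is exactly why the final equation is a valid training problem for any $(c_1, c_2)$.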
To ensure the solvability in $c_2$, we can use the same approach as for first order differential equations, e.g. we create $f_{c_1,c_2}$ so that $c_2$ appears in exactly one leaf of its tree representation. For $c_1$, we employ a simple approach where we simply skip the current equation if we cannot solve it in $c_1$. Although naive, we found that the differential equation can be solved in $c_1$ about 50% of the time. As an example:

Generate a random function: $f(x) = c_1 e^x + c_2 e^{-x}$
Solve in $c_2$: $c_2 = f(x) e^x - c_1 e^{2x} = F(x, f(x), c_1)$
Differentiate in $x$: $e^x \left( f'(x) + f(x) \right) - 2 c_1 e^{2x} = 0$
Solve in $c_1$: $c_1 = \frac{1}{2} e^{-x} \left( f'(x) + f(x) \right) = G(x, f(x), f'(x))$
Differentiate in $x$: $\frac{1}{2} e^{-x} \left( f''(x) - f(x) \right) = 0$
Simplify: $y'' - y = 0$

3.4 DATASET CLEANING

Equation simplification. In practice, we simplify generated expressions to reduce the number of unique possible equations in the training set, and to reduce the length of sequences. Also, we do not want to train our model to predict $x + 1 + 1 + 1 + 1 + 1$ when it can simply predict $x + 5$. As a result, the sequences $[+\ 2\ +\ 3\ x]$ and $[+\ 3\ +\ 2\ x]$ will both be simplified to $[+\ 5\ x]$, as they both represent the expression $x + 5$. Similarly, the expression $\log(e^{x+3})$ will be simplified to $x + 3$, the expression $\cos^2(x) + \sin^2(x)$ will be simplified to $1$, etc. On the other hand, $\sqrt{(x-3)^2}$ will not be simplified to $x - 3$ as we do not make any assumption on the sign of $x - 3$.

Coefficients simplification. In the case of first order differential equations, we modify generated expressions by equivalent expressions up to a change of variable. For instance, $x + x \tan(3) + cx + 1$ will be simplified to $cx + 1$, as a particular choice of the constant $c$ makes these two expressions identical. Similarly, $\log(x^2) + c \log(x)$ becomes $c \log(x)$.

We apply a similar technique for second order differential equations, although simplification is sometimes a bit more involved because there are two constants $c_1$ and $c_2$. For instance, $c_1 - c_2 x / 5 + c_2 + 1$ is simplified to $c_1 x + c_2$, while $c_1 e^{c_2} e^{c_2 x} e^{-1}$ can be expressed as $c_1 e^{c_2 x}$, etc.

We also perform transformations that are not strictly equivalent, as long as they hold under specific assumptions. For instance, we simplify $\tan(\sqrt{c_2}\, x) + \cosh(c_1 + 1) + 4$ to $c_1 + \tan(c_2 x)$, although the constant term can be negative in the second expression, but not in the first one. Similarly, $e^{e^{c_2}} e^{c_1 x} e^{c_1 \log(c_2)}$ is transformed to $c_2 e^{c_1 x}$.

Invalid expressions. Finally, we also remove invalid expressions from our dataset, for instance expressions like $\log(0)$ or $\sqrt{-2}$. To detect them, we compute in the expression tree the values of subtrees that do not depend on $x$. If a subtree does not evaluate to a finite real number (e.g. $-\infty$, $+\infty$ or a complex number), we discard the expression.

4 EXPERIMENTS

4.1 DATASET

For all considered tasks, we generate datasets using the method presented in Section 3, with:

• expressions with up to $n = 15$ internal nodes
• $L = 11$ leaf values in $\{x\} \cup \{-5, \ldots, 5\} \setminus \{0\}$
• $p_2 = 4$ binary operators: $+$, $-$, $\times$, $/$
• $p_1 = 15$ unary operators: exp, log, sqrt, sin, cos, tan, sin⁻¹, cos⁻¹, tan⁻¹, sinh, cosh, tanh, sinh⁻¹, cosh⁻¹, tanh⁻¹

Statistics about our datasets are presented in Table 1. As discussed in Section 3.1, we observe that the backward approach generates derivatives (i.e. inputs) significantly longer than the forward generator. We discuss this in more detail in Section E of the appendix.
                        Forward   Backward   Integration by parts   ODE 1   ODE 2
Training set size        20M        40M              20M             40M     40M
Input length (tokens)    18.9        –                –               –       –
Output length (tokens)   49.6        –                –               –       –
Length ratio             2.7        0.4              2.0             0.4     0.1
Input max length         69         450              226             508     508
Output max length        508        75               206             474     335
Table 1: Training set sizes and lengths of expressions (in tokens) for the different datasets. FWD and IBP tend to generate examples with outputs much longer than the inputs, while the BWD approach generates shorter outputs. As in the BWD case, ODE generators tend to produce solutions much shorter than their equations.
4.2 MODEL
For all our experiments, we train a seq2seq model to predict the solutions of given problems, i.e. to predict a primitive given a function, or predict a solution given a differential equation. We use a transformer model (Vaswani et al., 2017) with 8 attention heads, 6 layers, and a dimensionality of 512. In our experiments, using larger models did not improve the performance. We train our models with the Adam optimizer (Kingma & Ba, 2014), with a learning rate of $10^{-4}$. We remove expressions with more than 512 tokens, and train our model with 256 equations per batch.

At inference, expressions are generated by a beam search (Koehn, 2004; Sutskever et al., 2014), with early stopping. We normalize the log-likelihood scores of hypotheses in the beam by their sequence lengths. We report results with beam widths of 1 (i.e. greedy decoding), 10 and 50.

During decoding, nothing prevents the model from generating an invalid prefix expression, e.g. $[+\ 2\ \times\ 3]$. To address this issue, Dyer et al. (2016) use constraints during decoding, to ensure that generated sequences can always be converted to valid expression trees. In our case, we found that model generations are almost always valid, and we do not use any constraint. When an invalid expression is generated, we simply consider it as an incorrect solution and ignore it.

4.3 EVALUATION
At the end of each epoch, we evaluate the ability of the model to predict the solutions of given equations. In machine translation, hypotheses given by the model are compared to references written by human translators, typically with metrics like the BLEU score (Papineni et al., 2002) that measure the overlap between hypotheses and references. Evaluating the quality of translations is a very difficult problem, and many studies showed that a better BLEU score does not necessarily correlate with a better performance according to human evaluation. Here, however, we can easily verify the correctness of our model by simply comparing generated expressions to their reference solutions. For instance, for the given differential equation $xy' - y + x = 0$ with a reference solution $x \log(c / x)$ (where $c$ is a constant), our model may generate $x \log(c) - x \log(x)$. We can check that these two solutions are equal, although they are written differently, using a symbolic framework like SymPy (Meurer et al., 2017).

However, our model may also generate $xc - x \log(x)$, which is also a valid solution, actually equivalent to the previous one for a different choice of constant $c$. In that case, we replace $y$ in the differential equation by the model hypothesis. If $xy' - y + x = 0$, we conclude that the hypothesis is a valid solution. In the case of integral computation, we simply differentiate the model hypothesis, and compare it with the function to integrate. For the three problems, we measure the accuracy of our model on equations from the test set.

Since we can easily verify the correctness of generated expressions, we consider all hypotheses in the beam, and not only the one with the highest score. We verify the correctness of each hypothesis, and consider that the model successfully solved the input equation if one of them is correct. As a result, results with "Beam size 10" indicate that at least one of the 10 hypotheses in the beam was correct.

4.4 RESULTS
Table 2 reports the accuracy of our model for function integration and differential equations. For integration, the model achieves close to 100% performance on a held-out test set, even with greedy decoding (beam size 1). This performance is consistent over the three integration datasets (FWD, BWD, and IBP). Greedy decoding (beam size 1) does not work as well for differential equations. In particular, we observe an improvement in accuracy of almost 40% when using a large beam size of 50 for second order differential equations. Unlike in machine translation, where increasing the beam size does not necessarily increase the performance (Ott et al., 2018), we always observe significant improvements with wider beams. Typically, using a beam size of 50 provides an improvement of 8% accuracy compared to a beam size of 10. This makes sense, as increasing the beam size provides more hypotheses, although a wider beam may displace a valid hypothesis in favor of invalid ones with better log-probabilities.
               Integration (FWD)   Integration (BWD)   Integration (IBP)   ODE (order 1)   ODE (order 2)
Beam size 1           –                   –                   –                  –               –
Beam size 10          –                   –                   –                  –               –
Beam size 50          –                   –                   –                  –               –

Table 2: Accuracy of our models on integration and differential equation solving. Results are reported on a held-out test set of equations. For differential equations, using beam search decoding significantly improves the accuracy of the model.
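The correctness checks behind these accuracies (Section 4.3) are easy to reproduce for the reference example $xy' - y + x = 0$; a sketch with SymPy (the framework the paper uses for verification), with variable names of our choosing:

```python
import sympy as sp

x, c = sp.symbols("x c", positive=True)

reference = x * sp.log(c / x)                   # reference solution
hypothesis = x * sp.log(c) - x * sp.log(x)      # same solution, written differently

# (a) direct equivalence check: expand the logs and compare
assert sp.expand(sp.expand_log(reference) - hypothesis) == 0

# (b) a hypothesis with a different constant: substitute it back
#     into the differential equation  x*y' - y + x = 0  instead
other = x * c - x * sp.log(x)
assert sp.expand(x * sp.diff(other, x) - other + x) == 0

# for integration, one would simply differentiate the hypothesis
# and compare it with the function to integrate
```

Check (b) is what makes it safe to count every beam hypothesis: validity is decidable even when the hypothesis is not syntactically equal to the reference.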
4.5 COMPARISON WITH MATHEMATICAL FRAMEWORKS
We compare our model with three popular mathematical frameworks: Mathematica (Wolfram Research, 2019), Maple and Matlab (MathWorks, 2019). Prefix sequences in our test set are converted back to their infix representations and given as input to these frameworks, on the BWD test set. By construction, the FWD data only consists of integrals generated by computer algebra systems, which makes the comparison uninteresting.

In Table 3, we present accuracy for our model with different beam sizes, and for Mathematica with a timeout delay of 30 seconds. Table 8 in the appendix provides detailed results for different values of the timeout, and explains our choice of 30 seconds. In particular, we find that with 30 seconds, only 20% of failures are due to timeouts, and only 10% when the timeout is set to 3 minutes. Even with timeout limits, evaluation would take too long on our full test sets, so we only evaluate on a smaller test subset of 500 equations, on which we also re-evaluate our model.

All experiments were run with Mathematica 12.0.0.0, Maple 2019 and Matlab R2019a.
                      Integration (BWD)   ODE (order 1)   ODE (order 2)
Mathematica (30s)            –                  –               –
Matlab                       –                  -               -
Maple                        –                  -               -
Beam size 1                  –                 81.2             –
Beam size 10                 –                  –               –
Beam size 50                 –                 97.0             –

Table 3: Comparison of our model with Mathematica, Maple and Matlab on a test set of 500 equations. For Mathematica, we report results with a timeout of 30 seconds per equation. On a given equation, our model typically finds the solution in less than a second.
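The rewriting of the first Table 4 problem discussed below can be verified symbolically (SymPy sketch, symbol names ours):

```python
import sympy as sp

x = sp.symbols("x")

inner = 4 * x**4 - 14 * x**3 + x**2
denom = -16 * x**8 + 112 * x**7 - 204 * x**6 + 28 * x**5 - x**4 + 1

# the denominator is  1 - inner^2 ...
assert sp.expand(1 - inner**2) == sp.expand(denom)
# ... and the numerator is the derivative of `inner`:
assert sp.expand(sp.diff(inner, x)) == 16 * x**3 - 42 * x**2 + 2 * x

# so the integrand has the form  u'/sqrt(1 - u^2), whose primitive is asin(u)
y = sp.asin(inner)
assert sp.simplify(sp.diff(y, x) - sp.diff(inner, x) / sp.sqrt(1 - inner**2)) == 0
```

This is exactly the $u' (1 - u^2)^{-1/2}$ pattern from the introduction: once the pattern is detected, the integral follows immediately.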
On all tasks, we observe that our model significantly outperforms Mathematica. On function integration, our model obtains close to 100% accuracy, while Mathematica barely reaches 85%. On first order differential equations, Mathematica is on par with our model when it uses a beam size of 1, i.e. with greedy decoding. However, using a beam search of size 50, our model accuracy goes from 81.2% to 97.0%, largely surpassing Mathematica. Similar observations can be made for second order differential equations, where beam search is even more critical since the number of equivalent solutions is larger. On average, Matlab and Maple have slightly lower performance than Mathematica on the problems we tested.

Table 4 shows examples of functions that our model was able to solve, on which Mathematica and Matlab did not find a solution. The denominator of the function to integrate, $-16x^8 + 112x^7 - 204x^6 + 28x^5 - x^4 + 1$, can be rewritten as $1 - (4x^4 - 14x^3 + x^2)^2$. With the simplified input:

$$(16x^3 - 42x^2 + 2x)\left(1 - (4x^4 - 14x^3 + x^2)^2\right)^{-1/2}$$

integration becomes easier and Mathematica is able to find the solution.

Equation: $y' = \dfrac{16x^3 - 42x^2 + 2x}{\left(-16x^8 + 112x^7 - 204x^6 + 28x^5 - x^4 + 1\right)^{1/2}}$
Solution: $y = \sin^{-1}\!\left(4x^4 - 14x^3 + x^2\right)$

Equation: $3xy\cos(x) - \sqrt{9x^2\sin^2(x) + 1}\; y' + 3y\sin(x) = 0$
Solution: $y = c\, \exp\!\left(\sinh^{-1}(3x\sin(x))\right)$

Equation: $4x^4 y y'' - 8x^4 y'^2 - 8x^3 y y' - 3x^3 y'' - 8x^2 y^2 - 6x^2 y' - 3x^2 y'' - 9xy' - 3y = 0$
Solution: $y = \dfrac{c_1 + 3x + 3\log(x)}{x(c_2 + 4x)}$

Table 4: Examples of problems that our model is able to solve, on which Mathematica and Matlab were not able to find a solution. For each equation, our model finds a valid solution with greedy decoding.
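Similarly, the second equation of Table 4 can be checked by substituting the model's solution back into it (SymPy sketch, symbol names ours):

```python
import sympy as sp

x, c = sp.symbols("x c")

# model solution:  y = c * exp(asinh(3x sin(x)))
u = 3 * x * sp.sin(x)
y = c * sp.exp(sp.asinh(u))

# substitute into  3xy cos(x) - sqrt(9x^2 sin(x)^2 + 1) y' + 3y sin(x) = 0
residual = (3 * x * y * sp.cos(x)
            - sp.sqrt(9 * x**2 * sp.sin(x)**2 + 1) * sp.diff(y, x)
            + 3 * y * sp.sin(x))
assert sp.simplify(residual) == 0
print("the hypothesis solves the ODE for every constant c")
```

The residual vanishes identically, so the solution is valid for every choice of the constant $c$.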
4.6 EQUIVALENT SOLUTIONS
An interesting property of our model is that it is able to generate solutions that are exactly equivalent, but written in different ways. For instance, we consider the following first order differential equation, along with one of its solutions:

$$162\, x \log(x)\, y' + 2 y^3 \log(x)^2 - 81 y \log(x) + 81 y = 0, \qquad y = \frac{9\sqrt{x}}{\sqrt{\log(x)}\,\sqrt{c + 2x}}$$

In Table 5, we report the top 10 hypotheses returned by our model for this equation. We observe that all generations are actually valid solutions, although they are expressed very differently. They are however not all equal: merging the square roots within the first and third equations would give the same expression, except that the third one would contain a factor 2 in front of the constant $c$; but up to a change of variable, these two solutions are actually equivalent. The ability of the model to recover equivalent expressions, without having been trained to do so, is very intriguing.

Hypothesis                                                          Score     Hypothesis                                                          Score
$\frac{9\sqrt{x}}{\sqrt{\log(x)}\sqrt{c + 2x}}$                     −0.047    $\frac{9}{\sqrt{c\log(x)/x + 2\log(x)}}$                            –
$\frac{9\sqrt{x}}{\sqrt{c + 2x}\sqrt{\log(x)}}$                     −0.056    $\frac{9\sqrt{x}}{\sqrt{c\log(x) + 2x\log(x)}}$                     –
$\frac{9\sqrt{2}\sqrt{x}}{\sqrt{\log(x)}\sqrt{c + 4x}}$             −0.115    $\frac{9}{\sqrt{c/x + 2}\,\sqrt{\log(x)}}$                          –
$\frac{9\sqrt{x}}{\sqrt{c\log(x) + 2x\log(x)}}$                     −0.117    $\frac{9}{\sqrt{(c/x + 2)\log(x)}}$                                 –
$\frac{9\sqrt{2}\sqrt{x}}{\sqrt{c + 4x}\sqrt{\log(x)}}$             −0.124    $\frac{9\sqrt{x}}{\sqrt{c\log(x) + 2x\log(x) + \log(x)}}$           –

Table 5: Top 10 generations of our model for the first order differential equation $162\, x \log(x)\, y' + 2 y^3 \log(x)^2 - 81 y \log(x) + 81 y = 0$, generated with a beam search. All hypotheses are valid solutions, and are equivalent up to a change of the variable $c$. Scores are log-probabilities normalized by sequence lengths.

4.7 GENERALIZATION ACROSS GENERATORS
Models for integration achieve close to 100% performance on held-out test samples generated with the same method as their training data. In Table 6, we compare the accuracy on the FWD, BWD and IBP test sets for 4 models trained using different combinations of training data. When the test set is generated with the same generator as the training set, the model performs extremely well. For instance, the three models trained either on BWD, BWD + IBP or BWD + IBP + FWD achieve 99.7% accuracy on the BWD test set with a beam size of 50.

On the other hand, even with a beam size of 50, a FWD-trained model only achieves 17.2% accuracy on the BWD test set, and a BWD-trained model achieves 27.5% on the FWD test set. This results from the very different structure of the FWD and BWD data sets (cf. Table 1 and the discussion in Section E of the appendix). Overall, a model trained on BWD samples learns that integration tends to shorten expressions, a property that does not hold for FWD samples. Adding diversity to the training set improves the results. For instance, adding IBP-generated examples to the BWD-trained model raises the FWD test accuracy from 27.5% to 56.1%, and with additional FWD training data the model reaches 94.3% accuracy. Generalization is further discussed in Section E of the appendix.

[Table 6 body: for each training set (FWD; BWD; BWD + IBP; BWD + IBP + FWD), accuracy on the Forward (FWD), Backward (BWD) and Integration by parts (IBP) test sets, with beam sizes 1, 10 and 50; the numeric cells were lost in extraction.]

Table 6:
Accuracy of our models on function integration.
We report the accuracy of our model on the three integration datasets: forward (FWD), backward (BWD), and integration by parts (IBP), for four models trained with different combinations of training data. We observe that a FWD-trained model performs poorly when it tries to integrate functions from the BWD dataset. Similarly, a BWD-trained model only obtains 27.5% accuracy on the FWD dataset, as it fails to integrate simple functions like x sin(x). On the other hand, training on both BWD and IBP data allows the model to reach up to 56.1% accuracy on FWD. Training on all datasets allows the model to perform well on the three distributions.
GENERALIZATION BEYOND THE GENERATOR - SYMPY

Our forward generator, FWD, generates a set of pairs (f, F) of functions with their integrals. It relies on an external symbolic framework, SymPy (Meurer et al., 2017), to compute the integrals of randomly generated functions. SymPy is not perfect, and fails to compute the integral of many integrable functions. In particular, we found that the accuracy of SymPy on the BWD test set is only 30%. Our FWD-trained model only obtains an accuracy of 17.2% on BWD. However, we observed that the FWD-trained model is sometimes able to compute the integral of functions that SymPy cannot compute. This means that by only training on functions that SymPy can integrate, the model was able to generalize to functions that SymPy cannot integrate. Table 7 presents examples of such functions with their integrals.

[Table 7 body: function/integral pairs involving tan(x), sin(2x), asinh, asin(x), log(x cos(x)) and nested radicals; the exact expressions were lost in extraction.]

Table 7:
Examples of functions / integrals that the FWD-trained model can integrate, but that SymPy cannot. Although the FWD model was only trained on a subset of functions that SymPy can integrate, it learned to generalize to functions that SymPy cannot integrate.

RELATED WORK
Computers have been used for symbolic mathematics since the late 1960s (Moses, 1974). Computer algebra systems (CAS), such as Matlab, Mathematica, Maple, PARI and SAGE, are used for a variety of mathematical tasks (Gathen & Gerhard, 2013). Modern methods for symbolic integration are based on the Risch algorithm (Risch, 1970). Implementations can be found in Bronstein (2005) and Geddes et al. (1992). However, the complete description of the Risch algorithm takes more than 100 pages, and is not fully implemented in current mathematical frameworks.

Deep learning networks have been used to simplify tree-like expressions. Zaremba et al. (2014) use recursive neural networks to simplify complex symbolic expressions. They use tree representations for expressions, but provide the model with problem-related information: possible rules for simplification. The neural network is trained to select the best rule. Allamanis et al. (2017) propose a framework called neural equivalence networks to learn semantic representations of algebraic expressions. Typically, a model is trained to map different but equivalent expressions (like the 10 expressions proposed in Table 5) to the same representation. However, they only consider Boolean and polynomial expressions. More recently, Arabshahi et al. (2018a;b) used tree-structured neural networks to verify the correctness of given symbolic entities, and to predict missing entries in incomplete mathematical equations. They also showed that these networks could be used to predict whether an expression is a valid solution of a given differential equation.

Most attempts to use deep networks for mathematics have focused on arithmetic over integers (sometimes over polynomials with integer coefficients). For instance, Kaiser & Sutskever (2015) proposed the Neural-GPU architecture, and train networks to perform additions and multiplications of numbers given in their binary representations. They show that a model trained on numbers with up to 20 bits can be applied to much larger numbers at test time, while preserving perfect accuracy. Freivalds & Liepins (2017) proposed an improved version of the Neural-GPU by using hard non-linear activation functions, and a diagonal gating mechanism.

Saxton et al. (2019) use LSTMs (Hochreiter & Schmidhuber, 1997) and transformers on a wide range of problems, from arithmetic to simplification of formal expressions. However, they only consider polynomial functions, and the task of differentiation, which is significantly easier than integration. Trask et al. (2018) propose the Neural arithmetic logic units, a new module designed to learn systematic numerical computation, which can be used within any neural network. Like Kaiser & Sutskever (2015), they show that at inference their model can extrapolate to numbers orders of magnitude larger than the ones seen during training.
CONCLUSION
In this paper, we show that standard seq2seq models can be applied to difficult tasks like function integration, or solving differential equations. We propose an approach to generate arbitrarily large datasets of equations, with their associated solutions. We show that a simple transformer model trained on these datasets can perform extremely well both at computing function integrals, and solving differential equations, outperforming state-of-the-art mathematical frameworks like Matlab or Mathematica that rely on a large number of algorithms and heuristics, and a complex implementation (Risch, 1970). Results also show that the model is able to write identical expressions in very different ways.

These results are surprising given the difficulty neural models have with simpler tasks like integer addition or multiplication. However, proposed hypotheses are sometimes incorrect, and considering multiple beam hypotheses is often necessary to obtain a valid solution. The validity of a solution itself is not provided by the model, but by an external symbolic framework (Meurer et al., 2017). These results suggest that in the future, standard mathematical frameworks may benefit from integrating neural components in their solvers.
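As noted above, the validity of a generated solution is established by an external symbolic framework rather than by the model itself. A minimal sketch of such a check with SymPy (the framework cited in the paper; the example functions below are our own, not taken from the paper's datasets): an antiderivative candidate F is accepted when F' - f simplifies to zero.

```python
import sympy as sp

x = sp.symbols("x")

def is_valid_integral(f, F):
    """Accept F as an antiderivative of f when simplify(F' - f) == 0."""
    return sp.simplify(sp.diff(F, x) - f) == 0

f = x * sp.cos(x)
good = x * sp.sin(x) + sp.cos(x)   # correct: d/dx -> x*cos(x)
bad = x * sp.sin(x)                # wrong:   d/dx -> x*cos(x) + sin(x)

print(is_valid_integral(f, good))  # True
print(is_valid_integral(f, bad))   # False
```

Simplification to zero is sufficient but not necessary; when simplification fails, comparing both sides numerically at random points is a common fallback.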
REFERENCES
Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton. Learning continuous semantic representations of symbolic expressions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 80-88. JMLR.org, 2017.

Forough Arabshahi, Sameer Singh, and Animashree Anandkumar. Combining symbolic expressions and black-box function evaluations for training neural programs. In International Conference on Learning Representations, 2018a.

Forough Arabshahi, Sameer Singh, and Animashree Anandkumar. Towards solving differential equations through neural programming. 2018b.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

M. Bronstein. Symbolic Integration I: Transcendental Functions. Algorithms and Combinatorics. Springer, 2005. ISBN 978-3-540-21493-9.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 199-209, 2016.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 72-78, 2017.

Philippe Flajolet and Andrew M. Odlyzko. Singularity analysis of generating functions. SIAM J. Discrete Math., 3(2):216-240, 1990.

Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, New York, NY, USA, 1st edition, 2009. ISBN 9780521898065.

Karlis Freivalds and Renars Liepins. Improving the Neural GPU architecture for algorithm learning. ArXiv, abs/1702.08727, 2017.

Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 3rd edition, 2013. ISBN 9781107039032.

Keith O. Geddes, Stephen R. Czapor, and George Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers, Norwell, MA, USA, 1992. ISBN 0-7923-9259-0.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. CoRR, abs/1511.08228, 2015.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1997. ISBN 0-201-89683-4.

Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas, pp. 115-124. Springer, 2004.

Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017.

MathWorks. Matlab optimization toolbox (R2019a), 2019. The MathWorks, Natick, MA, USA.

Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, January 2017. ISSN 2376-5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.

Joel Moses. Macsyma - the fifth year. SIGSAM Bull., 8(3):105-110, August 1974. ISSN 0163-5824.

Myle Ott, Michael Auli, David Grangier, et al. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pp. 3953-3962, 2018.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Robert H. Risch. The solution of the problem of integration in finite terms. Bull. Amer. Math. Soc., 76(3):605-608, 1970.

David E. Rumelhart, James L. McClelland, and the PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, USA, 1986. ISBN 0-262-68053-X.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019.

N. J. A. Sloane. The Encyclopedia of Integer Sequences, 1996.

Richard P. Stanley. Enumerative Combinatorics: Volume 1. Cambridge University Press, New York, NY, USA, 2nd edition, 2011. ISBN 9781107602625.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556-1566, 2015.

Andrew Trask, Felix Hill, Scott E. Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pp. 8035-8044, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000-6010, 2017.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773-2781, 2015.

H. S. Wilf. generatingfunctionology: Third Edition. CRC Press, 2005. ISBN 978-1-4398-6439-5.

Wolfram Research. Mathematica, version 12.0, 2019. Champaign, IL.

Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.

Wojciech Zaremba, Karol Kurach, and Rob Fergus. Learning to discover efficient mathematical identities. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pp. 1278-1286, Cambridge, MA, USA, 2014. MIT Press.

A SYNTAX FOR MATHEMATICAL EXPRESSIONS
We represent mathematical expressions as trees, with operators as internal nodes, and numbers, constants or variables as leaves. By enumerating nodes in prefix order, we transform trees into sequences suitable for seq2seq architectures.

For this representation to be efficient, we want expressions, trees and sequences to be in a one-to-one correspondence. Different expressions will always result in different trees and sequences, but for the reverse to hold, we need to take care of a few special cases.

First, expressions like sums and products may correspond to several trees. For instance, the expression 2 + 3 + 5 can be represented by any of the following: a left-leaning tree ((2 + 3) + 5), a single + node with three children, or a right-leaning tree (2 + (3 + 5)).

We will assume that all operators have at most two operands, and that, in case of doubt, they are associative to the right: 2 + 3 + 5 would then correspond to the rightmost tree, 2 + (3 + 5).

Second, the distinction between internal nodes (operators) and leaves (mathematical primitive objects) is somewhat arbitrary. For instance, a negative number such as -2 could be represented as a basic object, or as a unary minus operator applied to the number 2. Similarly, there are several ways to represent square roots, powers of x, or the function log. For simplicity, we only consider numbers, constants and variables as possible leaves, and avoid using a unary minus. In particular, expressions like -x are represented as -1 × x. [The corresponding tree diagrams did not survive extraction.]
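The prefix serialization described above can be made concrete with a short sketch (our own illustration, not the paper's released code): since every operator has a fixed arity, a pre-order listing of the nodes determines the tree uniquely, with no need for parentheses.

```python
# Minimal sketch of prefix (Polish) serialization of expression trees.
# A tree is either a leaf token (str) or a tuple (operator, [children]).
ARITY = {"add": 2, "mul": 2, "pow": 2, "ln": 1}

def to_prefix(tree):
    """Pre-order traversal: operator first, then each child's serialization."""
    if isinstance(tree, str):
        return [tree]
    op, children = tree
    out = [op]
    for child in children:
        out += to_prefix(child)
    return out

def from_prefix(tokens):
    """Inverse of to_prefix; consumes tokens left to right, guided by arities."""
    def parse(i):
        tok = tokens[i]
        if tok not in ARITY:
            return tok, i + 1
        children, j = [], i + 1
        for _ in range(ARITY[tok]):
            child, j = parse(j)
            children.append(child)
        return (tok, children), j
    tree, end = parse(0)
    assert end == len(tokens)      # every token must be consumed exactly once
    return tree

# 2 + (3 * x)  ->  ["add", "2", "mul", "3", "x"]  ->  same tree back
tree = ("add", ["2", ("mul", ["3", "x"])])
print(to_prefix(tree))             # ['add', '2', 'mul', '3', 'x']
print(from_prefix(to_prefix(tree)) == tree)  # True
```

The round trip illustrates the one-to-one correspondence between trees and sequences that the section requires.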
Integers are represented in positional notation, as a sign followed by a sequence of digits (from 0 to 9 in base 10). For instance, 2354 is represented as the sequence [+ 2 3 5 4]. For zero, a unique representation is chosen (rather than allowing both +0 and -0).

B MATHEMATICAL DERIVATIONS OF THE PROBLEM SPACE SIZE
In this section, we investigate the size of the problem space by computing the number of expressions with n internal nodes. We first deal with the simpler case where we only have binary operators (p1 = 0), then consider trees and expressions composed of unary and binary operators. In each case, we calculate a generating function (Flajolet & Sedgewick, 2009; Wilf, 2005) from which we derive a closed formula or a recurrence on the number of expressions, and an asymptotic expansion.

B.1 BINARY TREES AND EXPRESSIONS
The main part of this derivation follows (Knuth, 1997) (pages 388-389).
Generating function
Let b_n be the number of binary trees with n internal nodes. We have b_0 = 1 and b_1 = 1. Any binary tree with n internal nodes can be generated by concatenating a left and a right subtree with k and n - 1 - k internal nodes respectively. Summing over all possible values of k, we have:

b_n = b_0 b_{n-1} + b_1 b_{n-2} + ... + b_{n-2} b_1 + b_{n-1} b_0

Let B(z) be the generating function of b_n: B(z) = b_0 + b_1 z + b_2 z^2 + b_3 z^3 + ...

B(z)^2 = b_0^2 + (b_0 b_1 + b_1 b_0) z + (b_0 b_2 + b_1 b_1 + b_2 b_0) z^2 + ...
       = b_1 + b_2 z + b_3 z^2 + ...
       = (B(z) - b_0) / z

So z B(z)^2 - B(z) + 1 = 0. Solving for B(z) gives

B(z) = (1 ± sqrt(1 - 4z)) / (2z)

and since B(0) = b_0 = 1, we derive the generating function of the sequence b_n:

B(z) = (1 - sqrt(1 - 4z)) / (2z)

We now derive a closed formula for b_n. By the binomial theorem (writing C(m, k) for the binomial coefficient),

B(z) = (1/(2z)) (1 - Σ_{k≥0} C(1/2, k) (-4z)^k) = Σ_{k≥0} (1/(k+1)) C(2k, k) z^k

Therefore

b_n = (1/(n+1)) C(2n, n) = (2n)! / ((n+1)! n!)

These are the Catalan numbers, a closed formula for the number of binary trees with n internal nodes. We now observe that a binary tree with n internal nodes has exactly n + 1 leaves. Since each internal node can represent one of p2 operators, and each leaf can take one of L values, a tree with n internal nodes yields p2^n L^(n+1) possible combinations of operators and leaves. As a result, the number of binary expressions with n operators is given by:

E_n = ((2n)! / ((n+1)! n!)) p2^n L^(n+1)

Asymptotic estimate
To derive an asymptotic approximation of b_n, we apply the Stirling formula:

n! ≈ sqrt(2πn) (n/e)^n,  so that  C(2n, n) ≈ 4^n / sqrt(πn)  and  b_n ≈ 4^n / (n sqrt(πn))

Finally, we have the following formula for the number of binary expressions with n internal nodes:

E_n ≈ (1 / (n sqrt(πn))) (4 p2)^n L^(n+1)
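The closed formula, the recurrence, and the expression count above can be cross-checked numerically. A small sketch (the helper names are our own):

```python
from math import comb

def catalan(n):
    # b_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

def catalan_rec(n):
    # b_n = sum_k b_k * b_{n-1-k}, with b_0 = 1 (left/right subtree split)
    b = [1]
    for m in range(1, n + 1):
        b.append(sum(b[k] * b[m - 1 - k] for k in range(m)))
    return b[n]

def num_binary_expressions(n, p2, L):
    # E_n = b_n * p2^n * L^(n+1): pick an operator per node, a value per leaf
    return catalan(n) * p2**n * L**(n + 1)

print([catalan(n) for n in range(6)])          # [1, 1, 2, 5, 14, 42]
print(all(catalan(n) == catalan_rec(n) for n in range(20)))  # True
print(num_binary_expressions(1, 2, 3))         # 18: one node, 2 operators, 3^2 leaf pairs
```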
Generating function
Let s_n be the number of unary-binary trees (i.e. trees where internal nodes can have one or two children) with n internal nodes. We have s_0 = 1 and s_1 = 2 (the only internal node is either unary or binary).

Any tree with n internal nodes is obtained either by adding a unary internal node at the root of a tree with n - 1 internal nodes, or by concatenating with a binary operator a left and a right subtree with k and n - 1 - k internal nodes respectively. Summing up as before, we have:

s_n = s_{n-1} + s_0 s_{n-1} + s_1 s_{n-2} + ... + s_{n-1} s_0

Let S(z) be the generating function of the s_n. The above formula translates into

S(z)^2 = (S(z) - s_0)/z - S(z),  i.e.  z S(z)^2 + (z - 1) S(z) + 1 = 0

Solving and taking into account the fact that S(0) = 1, we obtain the generating function of the s_n:

S(z) = (1 - z - sqrt(1 - 6z + z^2)) / (2z)

The numbers s_n generated by S(z) are known as the Schroeder numbers (OEIS A006318) (Sloane, 1996). They appear in different combinatorial problems (Stanley, 2011). Notably, they correspond to the number of paths from (0, 0) to (n, n) of an n × n grid, moving north, east, or northeast, and never rising above the diagonal.

Calculation
Schroeder numbers do not have a simple closed formula, but a recurrence allowing for their calculation can be derived from their generating function. Rewriting S(z) as

2z S(z) + z - 1 = -sqrt(1 - 6z + z^2)

and differentiating, we have

2z S'(z) + 2 S(z) + 1 = (3 - z)/sqrt(1 - 6z + z^2) = (3 - z)(1 - z - 2z S(z))/(1 - 6z + z^2)

Multiplying out and simplifying yields the differential equation

z (1 - 6z + z^2) S'(z) + (1 - 3z) S(z) = 1 + z

Replacing S(z) and S'(z) with their n-th coefficients yields, for n > 1,

n s_n - 6(n - 1) s_{n-1} + (n - 2) s_{n-2} + s_n - 3 s_{n-1} = 0
(n + 1) s_n = 3(2n - 1) s_{n-1} - (n - 2) s_{n-2}

Together with s_0 = 1 and s_1 = 2, this allows for fast (O(n)) calculation of the Schroeder numbers.
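The recurrence gives an O(n) computation, which can be checked against the first Schroeder numbers. A quick sketch:

```python
def schroeder(n):
    # (n + 1) s_n = 3(2n - 1) s_{n-1} - (n - 2) s_{n-2}, with s_0 = 1, s_1 = 2
    s = [1, 2]
    for m in range(2, n + 1):
        s.append((3 * (2 * m - 1) * s[m - 1] - (m - 2) * s[m - 2]) // (m + 1))
    return s[n]

# First terms of OEIS A006318: 1, 2, 6, 22, 90, 394, 1806
print([schroeder(n) for n in range(7)])  # [1, 2, 6, 22, 90, 394, 1806]
```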
To derive an asymptotic formula for s_n, we develop the generating function around its smallest singularity (Flajolet & Odlyzko, 1990), i.e. the radius of convergence of the power series. Since

1 - 6z + z^2 = (1 - (3 + 2 sqrt(2)) z)(1 - (3 - 2 sqrt(2)) z)

the smallest singularity is

r = 1 / (3 + 2 sqrt(2))

and the asymptotic formula will have the exponential term

r^(-n) = (3 + 2 sqrt(2))^n = (1 + sqrt(2))^(2n)

In a neighborhood of r, the generating function can be rewritten as

S(z) ≈ (1 + sqrt(2)) (1 - 2^(1/4) sqrt(1 - (3 + 2 sqrt(2)) z)) + O((1 - (3 + 2 sqrt(2)) z)^(3/2))

Since [z^n] sqrt(1 - az) ≈ -a^n / (2 sqrt(π n^3)), where [z^n] F(z) denotes the n-th coefficient in the formal series of F, we have

s_n ≈ 2^(1/4) (1 + sqrt(2)) (3 + 2 sqrt(2))^n / (2 sqrt(π n^3)) = (1 + sqrt(2))^(2n+1) / (2^(3/4) sqrt(π n^3))

Comparing with the number of binary trees, we have s_n ≈ 1.44 (1.46)^n b_n.

B.3 UNARY-BINARY EXPRESSIONS
In the binary case, the number of expressions can be derived from the number of trees. This cannot be done in the unary-binary case, as the number of leaves in a tree with n internal nodes depends on the number of binary operators: a tree with n2 binary internal nodes has n2 + 1 leaves.
The number of trees with n internal nodes, of which n2 are binary, can be derived from the following observation: any unary-binary tree with n2 binary internal nodes can be generated from a binary tree by adding unary internal nodes. Each node in the binary tree can receive one or several unary parents.

Since the binary tree has 2 n2 + 1 nodes and the number of unary internal nodes to be added is n - n2, the number of unary-binary trees that can be created from a specific binary tree is the number of multisets with n - n2 elements on 2 n2 + 1 symbols, that is

C(n + n2, n - n2) = C(n + n2, 2 n2)

If b_q denotes the q-th Catalan number, the number of trees with n2 binary operators among their n internal nodes is

C(n + n2, 2 n2) b_{n2}

Since such trees have n2 + 1 leaves, with L leaves, p2 binary and p1 unary operators to choose from, the number of expressions is

E(n, n2) = C(n + n2, 2 n2) b_{n2} p1^(n - n2) p2^(n2) L^(n2 + 1)

Summing over all values of n2 (from 0 to n) yields the number of different expressions with n internal nodes:

E_n = Σ_{n2=0}^{n} C(n + n2, 2 n2) b_{n2} p1^(n - n2) p2^(n2) L^(n2 + 1)

Let E(z) = Σ_n E_n z^n be the corresponding generating function.
E(z) = Σ_{n≥0} E_n z^n
     = L Σ_{n≥0} Σ_{n2=0}^{n} C(n + n2, 2 n2) b_{n2} (L p2 / p1)^(n2) (p1 z)^n

Since C(n + n2, 2 n2) = 0 when n2 > n, the inner sum can be extended to all n2 ≥ 0 and the order of summation exchanged:

E(z) = L Σ_{n2≥0} b_{n2} (L p2 / p1)^(n2) Σ_{n≥0} C(n + n2, 2 n2) (p1 z)^n

Applying the binomial series Σ_{n≥n2} C(n + n2, 2 n2) x^n = x^(n2) / (1 - x)^(2 n2 + 1) with x = p1 z:

E(z) = (L / (1 - p1 z)) Σ_{n2≥0} b_{n2} (L p2 z / (1 - p1 z)^2)^(n2)

Applying the generating function for binary trees, B(u) = (1 - sqrt(1 - 4u)) / (2u), with u = L p2 z / (1 - p1 z)^2:

E(z) = ((1 - p1 z) / (2 p2 z)) (1 - sqrt(1 - 4 L p2 z / (1 - p1 z)^2))

Reducing, we have

E(z) = (1 - p1 z - sqrt((1 - p1 z)^2 - 4 L p2 z)) / (2 p2 z)
     = (1 - p1 z - sqrt(1 - 2(p1 + 2 p2 L) z + p1^2 z^2)) / (2 p2 z)

Calculation
As before, there is no simple closed formula for E_n, but a recurrence can be derived by differentiating the generating function, rewritten as

2 p2 z E(z) + p1 z - 1 = -sqrt(1 - 2(p1 + 2 p2 L) z + p1^2 z^2)

Differentiating and simplifying as in the Schroeder case yields the differential equation

z (1 - 2(p1 + 2 p2 L) z + p1^2 z^2) E'(z) + (1 - (p1 + 2 p2 L) z) E(z) = L + p1 L z

Replacing E(z) and E'(z) with their coefficients yields, for n > 1,

(n + 1) E_n = (p1 + 2 p2 L)(2n - 1) E_{n-1} - p1^2 (n - 2) E_{n-2}

which, together with E_0 = L and E_1 = (p1 + p2 L) L, provides a formula for calculating E_n.

Asymptotic estimate
As before, approximations of E_n for large n can be found by developing E(z) in the neighbourhood of the smallest root of

1 - 2(p1 + 2 p2 L) z + p1^2 z^2

The roots are

r1 = (p1 + 2 p2 L - δ) / p1^2  and  r2 = (p1 + 2 p2 L + δ) / p1^2,  where  δ = sqrt((p1 + 2 p2 L)^2 - p1^2)

Both are positive, and the smallest one is r1. Since r1 r2 = 1 / p1^2, we also have r1 = 1 / (p1 + 2 p2 L + δ). Developing E(z) near r1 and extracting coefficients as before yields

E_n ≈ (1 / (2 p2)) sqrt(δ / (2π n^3)) (p1 + 2 p2 L + δ)^(n + 1/2)
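The recurrence for E_n is easy to tabulate, and it carries a built-in sanity check: with p1 = p2 = L = 1, expressions coincide with unary-binary trees, so E_n must reduce to the Schroeder numbers. A small sketch (our own helper, not the paper's code):

```python
def count_expressions(n, p1, p2, L):
    # (n + 1) E_n = (p1 + 2 p2 L)(2n - 1) E_{n-1} - p1^2 (n - 2) E_{n-2},
    # with E_0 = L and E_1 = (p1 + p2 L) L.
    E = [L, (p1 + p2 * L) * L]
    for m in range(2, n + 1):
        num = (p1 + 2 * p2 * L) * (2 * m - 1) * E[m - 1] - p1**2 * (m - 2) * E[m - 2]
        E.append(num // (m + 1))
    return E[n]

# p1 = p2 = L = 1 gives the Schroeder numbers 1, 2, 6, 22, 90, ...
print([count_expressions(n, 1, 1, 1) for n in range(5)])  # [1, 2, 6, 22, 90]

# Direct check for n = 2, p1 = 2, p2 = 3, L = 5:
# p1^2 L + 3 p1 p2 L^2 + 2 p2^2 L^3 = 20 + 450 + 2250 = 2720
print(count_expressions(2, 2, 3, 5))  # 2720
```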
C GENERATING RANDOM EXPRESSIONS
In this section we present algorithms to generate random expressions with n internal nodes. We achieve this by generating random trees, and randomly selecting their nodes and leaves. We begin with the simpler binary case (p1 = 0).

C.1 BINARY TREES
To generate a random binary tree with n internal nodes, we use the following one-pass procedure. Starting with an empty root node, we determine at each step the position of the next internal node among the empty nodes, and repeat until all internal nodes are allocated.

Start with an empty node, set e = 1;
while n > 0 do
    Sample a position k from K(e, n);
    Sample the k next empty nodes as leaves;
    Sample an operator, create two empty children;
    Set e = e - k + 1 and n = n - 1;
end
Algorithm 1: Generate a random binary tree

We denote by e the number of empty nodes, by n > 0 the number of operators yet to be generated, and by K(e, n) the probability distribution of the position (0-indexed) of the next internal node to allocate.

To calculate K(e, n), let us define D(e, n), the number of different binary subtrees that can be generated from e empty nodes, with n internal nodes to generate. We have

D(0, n) = 0 for n > 0
D(e, 0) = 1
D(e, n) = D(e - 1, n) + D(e + 1, n - 1)

The first equation states that no tree can be generated with zero empty nodes and n > 0 operators. The second says that if no operator is to be allocated, empty nodes must all be leaves and there is only one possible tree. The last equation states that if we have e > 0 empty nodes, the first one is either a leaf (and there are D(e - 1, n) such trees) or an internal node (D(e + 1, n - 1) trees). This allows us to compute D(e, n) for all e and n.

To calculate the distribution K(e, n), observe that among the D(e, n) trees with e empty nodes and n operators, D(e + 1, n - 1) have a binary node in their first position. Therefore

P(K(e, n) = 0) = D(e + 1, n - 1) / D(e, n)

Of the remaining D(e - 1, n) trees, D(e, n - 1) have a binary node in their first position (same argument for e - 1), that is

P(K(e, n) = 1) = D(e, n - 1) / D(e, n)

By induction over k, we have the general formula

P(K(e, n) = k) = D(e - k + 1, n - 1) / D(e, n)
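A compact sketch of Algorithm 1 (our own illustration): tabulate D(e, n), then repeatedly sample the leaf run k from K(e, n). As a sanity check, D(1, n) counts complete binary trees with n internal nodes, so it must equal the Catalan numbers.

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def D(e, n):
    # D(e, n): number of binary subtrees reachable from e empty nodes
    # with n internal nodes left to allocate.
    if e == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 1
    return D(e - 1, n) + D(e + 1, n - 1)

def sample_binary_tree(n, rng=random):
    """Return the tree as a prefix token list: 'op' for internal nodes, 'leaf' for leaves."""
    tokens, e = [], 1
    while n > 0:
        # P(K(e, n) = k) = D(e - k + 1, n - 1) / D(e, n)
        weights = [D(e - k + 1, n - 1) for k in range(e)]
        k = rng.choices(range(e), weights=weights)[0]
        tokens += ["leaf"] * k + ["op"]
        e = e - k + 1
        n -= 1
    return tokens + ["leaf"] * e   # remaining empty nodes are all leaves

print([D(1, n) for n in range(6)])  # [1, 1, 2, 5, 14, 42]: the Catalan numbers
print(sample_binary_tree(3))        # e.g. ['op', 'leaf', 'op', 'op', 'leaf', 'leaf', 'leaf']
```

Every sample has exactly n 'op' tokens and n + 1 'leaf' tokens, as required for binary trees.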
C.2 UNARY-BINARY TREES
In the general case, internal nodes can be of two types: unary or binary. We adapt the previous algorithm by considering the two-dimensional probability distribution L(e, n) of the position (0-indexed) and arity of the next internal node (i.e. P(L(e, n) = (k, a)) is the probability that the next internal node is in position k and has arity a).

Start with an empty node, set e = 1;
while n > 0 do
    Sample a position k and arity a from L(e, n) (if a = 1 the next internal node is unary);
    Sample the k next empty nodes as leaves;
    if a = 1 then
        Sample a unary operator;
        Create one empty child;
        Set e = e - k;
    else
        Sample a binary operator;
        Create two empty children;
        Set e = e - k + 1;
    end
    Set n = n - 1;
end
Algorithm 2: Generate a random unary-binary tree

To compute L(e, n), we derive D(e, n), the number of subtrees with n internal nodes that can be generated from e empty nodes. We have, for all n > 0 and e:

D(0, n) = 0
D(e, 0) = 1
D(e, n) = D(e - 1, n) + D(e, n - 1) + D(e + 1, n - 1)

The first equation states that no tree can be generated with zero empty nodes and n > 0 operators. The second says that if no operator is to be allocated, empty nodes must all be leaves and there is only one possible tree. The third equation states that with e > 0 empty nodes, the first one will either be a leaf (D(e - 1, n) possible trees), a unary operator (D(e, n - 1) trees), or a binary operator (D(e + 1, n - 1) trees).

To derive L(e, n), we observe that among the D(e, n) subtrees with e empty nodes and n internal nodes to be generated, D(e, n - 1) have a unary operator in position zero, and D(e + 1, n - 1) have a binary operator in position zero. As a result, we have

P(L(e, n) = (0, 1)) = D(e, n - 1) / D(e, n)  and  P(L(e, n) = (0, 2)) = D(e + 1, n - 1) / D(e, n)

As in the binary case, we can generalize these probabilities to all positions k in {0 ... e - 1}:

P(L(e, n) = (k, 1)) = D(e - k, n - 1) / D(e, n)  and  P(L(e, n) = (k, 2)) = D(e - k + 1, n - 1) / D(e, n)
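The same tabulation extends directly to the unary-binary case; now D(1, n) counts unary-binary trees, i.e. the Schroeder numbers. A sketch of Algorithm 2 (our own illustration), sampling position and arity jointly from L(e, n):

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def D(e, n):
    # The front empty node becomes a leaf, a unary node, or a binary node.
    if e == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 1
    return D(e - 1, n) + D(e, n - 1) + D(e + 1, n - 1)

def sample_unary_binary_tree(n, rng=random):
    """Prefix token list with 'op1' (unary), 'op2' (binary) and 'leaf' tokens."""
    tokens, e = [], 1
    while n > 0:
        # P(L(e,n) = (k, 1)) ∝ D(e - k, n - 1);  P(L(e,n) = (k, 2)) ∝ D(e - k + 1, n - 1)
        options = [(k, a) for k in range(e) for a in (1, 2)]
        weights = [D(e - k + (a - 1), n - 1) for k, a in options]
        k, a = rng.choices(options, weights=weights)[0]
        tokens += ["leaf"] * k + ["op%d" % a]
        e = e - k + (a - 1)   # unary: e - k;  binary: e - k + 1
        n -= 1
    return tokens + ["leaf"] * e

print([D(1, n) for n in range(5)])  # [1, 2, 6, 22, 90]: the Schroeder numbers
```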
C.3 SAMPLING EXPRESSIONS
To generate expressions, we sample random trees (binary, or unary-binary), which we "decorate" by randomly selecting their internal nodes and leaves from a list of possible operators or mathematical entities (integers, variables, constants).

Nodes and leaves can be selected uniformly, or according to a prior probability. For instance, integers between -a and a could be sampled so that small absolute values are more frequent than large ones. For operators, addition and multiplication could be more common than subtraction and division.

If all L leaves and all p1 unary and p2 binary operators are equiprobable, an alternative approach to generation can be defined by computing D(e, n) as

D(0, n) = 0
D(e, 0) = L^e
D(e, n) = L D(e - 1, n) + p1 D(e, n - 1) + p2 D(e + 1, n - 1)

and normalizing the probabilities P(L(e, n)) as

P(L(e, n) = (k, 1)) = L^k p1 D(e - k, n - 1) / D(e, n)  and  P(L(e, n) = (k, 2)) = L^k p2 D(e - k + 1, n - 1) / D(e, n)

Samples then become dependent on the number of possible leaves and operators.
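A sketch of the weighted variant (our own helper names): because the recurrence for D(e, n) unrolls exactly over the choices of position and arity, the normalized probabilities sum to one, and D(1, n) directly counts whole decorated expressions.

```python
from functools import lru_cache

def make_counter(L, p1, p2):
    """Weighted subtree count: L leaf values, p1 unary and p2 binary operators."""
    @lru_cache(maxsize=None)
    def D(e, n):
        if e == 0:
            return 1 if n == 0 else 0
        if n == 0:
            return L**e   # every remaining empty node picks one of L leaves
        return L * D(e - 1, n) + p1 * D(e, n - 1) + p2 * D(e + 1, n - 1)
    return D

def position_arity_distribution(D, e, n, L, p1, p2):
    # P(L(e,n) = (k, 1)) = L^k p1 D(e - k, n - 1) / D(e, n)
    # P(L(e,n) = (k, 2)) = L^k p2 D(e - k + 1, n - 1) / D(e, n)
    probs = {}
    for k in range(e):
        probs[(k, 1)] = L**k * p1 * D(e - k, n - 1) / D(e, n)
        probs[(k, 2)] = L**k * p2 * D(e - k + 1, n - 1) / D(e, n)
    return probs

D = make_counter(L=3, p1=2, p2=4)
# D(1, 1) counts expressions with one internal node: p1*L unary + p2*L^2 binary.
print(D(1, 1))  # 42
# The probabilities over (position, arity) sum to 1.
print(sum(position_arity_distribution(D, 2, 2, 3, 2, 4).values()))  # 1.0
```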
D IMPACT OF TIMEOUT ON MATHEMATICA
In the case of Mathematica, we use the function DSolve to solve differential equations, and the function Integrate to integrate functions. Since computations can take a long time, we set a finite timeout to limit the time spent on each equation. Table 8 shows the impact of the timeout value on the accuracy of Mathematica. Increasing the timeout delay naturally improves the accuracy. With a timeout of 30 seconds, Mathematica times out on 20% of unsolved equations. With a limit of 3 minutes, timeouts represent about 10% of failed equations. This indicates that even in the ideal scenario where Mathematica would succeed on all equations where it times out, the accuracy would not exceed 86.2%.

[Table 8 body: success / failure / timeout percentages for each timeout value; the numeric cells were lost in extraction.]

Table 8:
Accuracy of Mathematica on 500 functions to integrate, for different timeout values.
As the timeout delay increases, the percentage of failures due to timeouts decreases. With a limit of 3 minutes, timeouts only represent 10% of failures. As a result, the accuracy without timeout would not exceed 86.2%.
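The timeout mechanism can be sketched as below. The wolframscript invocation shown in the comment is an assumption for illustration, not the paper's actual harness; the demo substitutes a deliberately slow subprocess.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Classify a computation as success / failure / timeout, as in Table 8."""
    try:
        res = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "success" if res.returncode == 0 else "failure"

# For Mathematica, one would call something like (hypothetical invocation):
#   run_with_timeout(["wolframscript", "-code", "Integrate[x*Cos[x], x]"], 30)
# Demo with a slow subprocess standing in for a hard integral:
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"], 1))  # prints "timeout"
```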
E GENERALIZATION ACROSS GENERATORS
On the integration problem, we achieve near-perfect performance when the training and test data are generated by the same method (either FWD, BWD, or IBP), c.f. Table 6. Given the relatively small size of the training set compared to the problem space, the model cannot overfit to the entire problem space. This shows that:

• Our model generalizes well to functions created by the training generator.
• This property holds for the three considered generators, FWD, BWD, and IBP.

Table 6 also measures the ability of our model to generalize across generators. A FWD-trained model achieves a low performance (17.2% with beam 50) on a BWD-generated test set. A BWD-trained model does a little better on the FWD test set (27.5%), but accuracy remains low. On the other hand, FWD-trained models achieve very good accuracy on an IBP-generated test set (88.9%), and BWD-trained models stand in the middle (59.2%).

Figure 2 provides an explanation for these results. The input/output pairs produced by FWD and BWD have very different distributions: integration tends to shorten BWD-generated expressions, and to expand FWD-generated expressions. As a result, a model trained on BWD-generated data will learn this shortening feature of integration, which will prove wrong on a FWD test set. Similar problems will happen with a FWD-trained model on a BWD test set. Since IBP keeps average expression lengths unchanged, BWD- and FWD-trained models will generalize better to IBP test sets (with FWD-trained models being more accurate, since their input length distributions are closer).
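The length asymmetry discussed above is easy to see in a toy version of the backward generator. The sketch below (a minimal illustration over a tiny operator set, not the paper's implementation) samples a random expression f, differentiates it symbolically, and emits (f′, f) as an (integrand, primitive) pair; the product and chain rules typically make f′ longer than f.

```python
import random

def random_expr(depth=0):
    """Sample a small random expression tree over {+, *, sin}, leaves in {x, 1..5}."""
    if depth >= 3 or random.random() < 0.3:
        return random.choice(["x", random.randint(1, 5)])
    op = random.choice(["+", "*", "sin"])
    if op == "sin":
        return ("sin", random_expr(depth + 1))
    return (op, random_expr(depth + 1), random_expr(depth + 1))

def diff(e):
    """Symbolic derivative of e with respect to x."""
    if e == "x":
        return 1
    if isinstance(e, int):
        return 0
    if e[0] == "+":
        return ("+", diff(e[1]), diff(e[2]))
    if e[0] == "*":  # product rule
        return ("+", ("*", diff(e[1]), e[2]), ("*", e[1], diff(e[2])))
    if e[0] == "sin":  # chain rule; "cos" nodes only appear in derivatives here
        return ("*", ("cos", e[1]), diff(e[1]))
    raise ValueError(e)

def bwd_pair():
    """Backward generation: the derivative is the problem, f is the solution."""
    f = random_expr()
    return diff(f), f
```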
Figure 2: Distribution of input and output lengths for different integration datasets. [Density plots of the number of tokens for the length of derivatives and the length of integrals, for the Forward, Backward, and Integration by parts generators.] The FWD generator produces short problems with long solutions. Conversely, the BWD generator creates long problems, with short solutions. The IBP approach stands in the middle, and generates short problems with short solutions.
This suggests that what looks at first glance like a generalization problem (bad accuracy of BWD-trained models on FWD-generated sets, and the converse) is in fact a consequence of data generation. The BWD and FWD methods generate training sets with specific properties, which our model will learn. But this can be addressed by adding IBP or FWD data to the BWD dataset, as shown in the last two lines of Table 6. In practice, a better approach could be implemented with self-supervised learning, where new training examples are generated by the model itself.

Table 9: Examples of functions with their integrals, generated by our FWD, BWD and IBP approaches. We observe that the FWD and IBP approaches tend to generate short functions, with long integrals, while the BWD approach generates short functions with long derivatives.