Combining Symbolic Expressions and Black-box Function Evaluations in Neural Programs
Published as a conference paper at ICLR 2018
Forough Arabshahi
University of California, Irvine, CA
[email protected]
Sameer Singh
University of California, Irvine, CA
[email protected]
Animashree Anandkumar
California Institute of Technology, Pasadena, CA
[email protected]

ABSTRACT
Neural programming involves training neural networks to learn programs, mathematics, or logic from data. Previous works have failed to achieve good generalization performance, especially on problems and programs with high complexity or on large domains. This is because they mostly rely either on black-box function evaluations that do not capture the structure of the program, or on detailed execution traces that are expensive to obtain, and hence the training data has poor coverage of the domain under consideration. We present a novel framework that utilizes black-box function evaluations, in conjunction with symbolic expressions that define relationships between the given functions. We employ tree LSTMs to incorporate the structure of the symbolic expression trees. We use tree encoding for numbers present in function evaluation data, based on their decimal representation. We present an evaluation benchmark for this task to demonstrate that our proposed model combines symbolic reasoning and function evaluation in a fruitful manner, obtaining high accuracies in our experiments. Our framework generalizes significantly better to expressions of higher depth and is able to fill partial equations with valid completions.
1 INTRODUCTION
Human beings possess impressive abilities for abstract mathematical and logical thinking. It has long been the dream of computer scientists to design machines with such capabilities: machines that can automatically learn and reason, thereby removing the need to manually program them. Neural programming, where neural networks are used to learn programs, mathematics, or logic, has recently shown promise towards this goal. Examples of neural programming include neural theorem provers, neural Turing machines, and neural program inducers, e.g. Loos et al. (2017); Graves et al. (2014); Neelakantan et al. (2015); Bošnjak et al. (2017); Allamanis et al. (2017). They aim to solve tasks such as learning functions in logic, mathematics, or computer programs (e.g. logical or, addition, and sorting), proving theorems, and synthesizing programs.

Most works on neural programming either rely only on black-box function evaluations (Graves et al., 2014; Balog et al., 2017) or on the availability of detailed program execution traces, where entire program runs are recorded under different input conditions (Reed & De Freitas, 2016; Cai et al., 2017). Black-box function evaluations are easy to obtain, since we only need to generate inputs and outputs to various functions in the domain. However, by themselves, they do not result in powerful generalizable models, since they do not have sufficient information about the underlying structure of the domain. On the other hand, execution traces capture the underlying structure but are generally harder to obtain under many different input conditions; even if they are available, the computational complexity of incorporating them is significant.
Due to the lack of good coverage, these approaches fail to generalize to programs of higher complexity and to domains with a large number of functions. Moreover, the performance of these frameworks is severely dependent on the nature of execution traces: more efficient programs lead to a drastic improvement in performance (Cai et al., 2017), but such programs may not be readily available.

In many problem domains, in addition to function evaluations, one typically has access to more information, such as symbolic representations that encode the relationships between the given variables and functions in a succinct manner. For instance, in physical systems such as fluid dynamics or robotics, the physical model of the world imposes constraints on the values that different variables can take. Mathematics and logic are other domains in which expressions are inherently symbolic. In the domain of programming languages, declarative languages explicitly declare variables in the program; examples include database query languages (e.g., SQL), regular expressions, and functional programming. Declarative programs greatly simplify parallel programming through the generation of symbolic computation graphs, and have thus been used in modern deep learning packages such as Theano, TensorFlow, and MXNet. Therefore, rich symbolic expression data is available for many domains. We will show in this paper that incorporating this type of information, as well as black-box function evaluations, results in models that are more generalizable.

Summary of Results:
We introduce a flexible and scalable neural programming framework that combines the knowledge of symbolic expressions with black-box function evaluations. To our knowledge, we are the first to consider such a combined framework. We demonstrate that this approach outperforms existing methods by a significant margin, using only a small amount of training data. The paper has three main contributions. (1) We design a neural architecture to incorporate both symbolic expressions and black-box function evaluation data. (2) We evaluate it on tasks such as equation verification and completion in the domain of mathematical equation modeling. (3) We propose a data generation strategy for both symbolic expressions and black-box function evaluations that results in good balance and coverage.

We consider learning mathematical equations and functions as a case study, since it has been used extensively in previous neural programming works, e.g. Zaremba et al. (2014); Allamanis et al. (2017); Loos et al. (2017). We employ tree LSTMs to incorporate the symbolic expression tree, with one LSTM cell for each mathematical function. The parameters of the LSTM cells are shared across different expressions wherever the same function is used. This weight sharing allows us to learn a large number of mathematical functions simultaneously, whereas most previous works aim at learning only one or a few mathematical functions. We then extend tree LSTMs to accept not only symbolic expression input, but also numerical data from black-box function evaluations. We employ a tree encoding for numbers that appear in function evaluations, based on their decimal representation (see Fig. 1c). This allows our model to generalize to unseen numbers, which has so far been a struggle for neural programming researchers. We show that such a recursive neural architecture is able to generalize to unseen numbers as well as to unseen symbolic expressions.

We evaluate our framework on two tasks: equation verification and completion.
Under equation verification, we further consider two sub-categories: verifying the correctness of a given symbolic identity as a whole, or verifying evaluations of symbolic expressions under given numerical inputs. Equation completion involves predicting the missing entry in a mathematical equation; this is employed in applications such as mathematical question answering (QA). We establish that our framework outperforms existing approaches on these tasks by a significant margin, especially in terms of generalization to equations of higher depth and on domains with a large number of functions.

We propose a novel dataset generation strategy to obtain a balanced dataset of correct and incorrect symbolic mathematical expressions and their numerical function evaluations. Previous methods do an exhaustive search of all possible parse trees and are therefore limited to symbolic trees of small depth (Allamanis et al., 2017). Our dataset generation strategy relies on dictionary look-up and sub-tree matching, and can be applied to any domain by providing a basic set of axioms as inputs. Our generated dataset has good coverage of the domain and is key to obtaining superior generalization performance. We are also able to scale up our coverage to include about 3.5× as many mathematical functions as previous works (Allamanis et al., 2017; Zaremba et al., 2014).

Related work:
Early work on automated programming used first-order logic in computer algebra systems such as Wolfram Mathematica and SymPy. However, these rule-based systems required extensive manual input and could not be generalized to new programs. Graves et al. (2014) introduced using memory in neural networks for learning functions such as grade-school addition and sorting. Since then, many works have extended it to tasks such as program synthesis, program induction, and automatic differentiation (Bošnjak et al., 2017; Tran et al., 2017; Balog et al., 2017; Parisotto et al., 2017; Reed & De Freitas, 2016; Mudigonda et al., 2017; Sajovic & Vuk, 2016; Piech et al., 2015).

Based on the type of data used to train the models, frameworks in neural programming are categorized under four different classes: (1) models that use black-box function evaluation data (Graves et al., 2014; Balog et al., 2017; Parisotto et al., 2017; Zaremba et al., 2014), (2) models that use program execution traces (Reed & De Freitas, 2016; Cai et al., 2017), (3) models that use a combination of black-box input-output data and weak supervision from program sketches (Bošnjak et al., 2017; Neelakantan et al., 2015), and finally (4) models that use symbolic data (Allamanis et al., 2017; Loos et al., 2017). Our work is an extension of models of category 3 which uses symbolic data instead of weak supervision. As stated in Section 1, an example of symbolic data is the computation graph of a program, which is different from the program execution traces used in models of category 2 such as (Reed & De Freitas, 2016). These high-level symbolic expressions summarize the behavior of the functions in the domain and apply to many groundings of different inputs, as opposed to the Neural Programmer-Interpreter (Reed & De Freitas, 2016). Therefore, we can obtain generalizable models that are capable of function evaluation.
Moreover, this combination allows us to scale up the domain and model more functions, as well as learn more complex structures.

One of the extensively studied applications of neural programming is reasoning with mathematical equations. These works include automated theorem provers (Loos et al., 2017; Rocktäschel & Riedel, 2016; Yuan; Alemi et al., 2016; Chojecki, 2017; Kaliszyk et al., 2017) and computer-algebra-like systems (Allamanis et al., 2017; Zaremba et al., 2014). Our work is closer to the latter under this categorization; however, the problem that we solve is different in nature. Allamanis et al. (2017) and Zaremba et al. (2014) aim at simplifying mathematical equations by defining equivalence classes of symbolic expressions that can be used in a symbolic solver. Our problem, on the other hand, is mathematical equation verification and completion, which has broader applicability; e.g., our proposed model can be used in mathematical question answering systems.

Recent advances in symbolic reasoning and natural language processing have indicated the significance of applying domain structure to the models to capture compositionality and semantics. Socher et al. (2011; 2012) proposed tree-structured neural networks for natural language parsing and neural image parsing. Cai et al. (2017) proposed using recursion for capturing the compositionality of computer programs. Both Zaremba et al. (2014) and Allamanis et al. (2017) used tree-structured neural networks for modeling mathematical equations. Tai et al. (2015) introduced the tree-structured LSTM for semantic relatedness in natural language processing. We will show that this powerful model outperforms other tree-structured neural networks for validating mathematical equations.
2 MATHEMATICAL EQUATION MODELING
We now address the problem of modeling mathematical equations. Our goal is to verify the correctness of a mathematical equation, which then enables us to perform equation completion. We limit ourselves to the domain of trigonometry and elementary algebra in this paper.

In this section, we first discuss the grammar that defines the domain under study. We later describe how we generate a dataset of correct and incorrect symbolic equations within our grammar, and how we combine this data with a few input-output examples to enable function evaluation. This dataset allows us to learn representations for the functions that capture their semantic properties, i.e. how they relate to each other, and how they transform the input when applied. We interchangeably use the word identity to refer to mathematical equations, and input-output data to refer to function evaluations.

2.1 GRAMMAR
Let us start by defining our domain of mathematical identities using context-free grammar notation. Identities (I), by definition, consist of two expressions whose relation we are trying to verify (Eq. (1)). A mathematical expression, represented by E in Eq. (2), is composed either of a terminal (T), such as a constant or a variable, a unary function applied to any expression (F1), or a binary function applied to two expression arguments (F2). Without loss of generality, functions that take more than two arguments, i.e. n-ary functions with n > 2, are omitted from our task description, since n-ary functions like addition can be represented as the composition of multiple binary addition functions. Therefore, this grammar covers the entire space of trigonometric and elementary algebraic identities. The trigonometry grammar rules are thus as follows:

    I → =(E, E), ≠(E, E)    (1)
    E → T, F1(E), F2(E, E)    (2)

[Figure 1: Identities and their Expression Trees, with (a) a symbolic expression, (b) a function evaluation, and (c) a number represented as the decimal tree (also part of the function evaluation data).]
Table 1: Symbols in our grammar, i.e. the functions, variables, and constants

    Unary functions, F1: sin, cos, csc, sec, tan, cot, arcsin, arccos, arccsc, arcsec, arctan, arccot, sinh, cosh, csch, sech, tanh, coth, arsinh, arcosh, arcsch, arsech, artanh, arcoth, exp
    Terminals, T: constants (e.g., π), variables (e.g., x), and numbers
    Binary functions, F2: +, ×, ∧

    F1 → sin, cos, tan, . . .    (3)
    F2 → +, ∧, ×, . . .    (4)
    T → π, x, y, . . . , any number of precision 2 in [−3.14, +3.14]    (5)

Table 1 presents the complete list of functions and symbols as well as examples of the terminals of the grammar. Note that we exclude subtraction and division because they can be represented with addition, multiplication, and power, respectively. Furthermore, the equations can have as many variables as needed.

The above formulation provides a parse tree for any symbolic and function evaluation expression, a crucial component for representing the equations in a model. Figure 1 illustrates three examples of an identity in our grammar in terms of its expression tree. It is worth noting that there is an implicit notion of depth of an identity in the expression tree. Since deeper equations are compositions of multiple simpler equations, validating higher-depth identities requires reasoning beyond what is required for identities with lower depths, and thus the depth of an equation is somewhat indicative of the complexity of the mathematical expression. However, depth is not sufficient; some higher-depth identities may be much easier to verify than others. Symbolic and function evaluation expressions are differentiated by the type of their terminals. Symbolic expressions have terminals of type constant or variable, whereas function evaluation expressions have constants and numbers as terminals. We will come back to this distinction in Section 3, where we define our model. As shown in Table 1, our domain includes 28 functions. This scales up the domain in comparison to the state-of-the-art methods that use up to 8 mathematical functions (Allamanis et al., 2017; Zaremba et al., 2014).
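The grammar above lends itself to a direct prototype with nested tuples. The sketch below (a hypothetical illustration, not the paper's code) represents an expression tree, checks it against the grammar's arities, and computes its depth, which is the complexity measure used throughout the paper.

```python
# Expression trees as nested tuples: a terminal is a string or number,
# a unary application is (f, arg), a binary application is (f, left, right).
# This mirrors the grammar rule E -> T, F1(E), F2(E, E).

UNARY = {"sin", "cos", "tan", "exp"}   # illustrative subset of the 25 unary functions
BINARY = {"+", "*", "^"}               # the 3 binary functions

def well_formed(expr):
    """Check an expression against the grammar's arities."""
    if not isinstance(expr, tuple):
        return True                    # terminal
    head, args = expr[0], expr[1:]
    arity_ok = (head in UNARY and len(args) == 1) or \
               (head in BINARY and len(args) == 2)
    return arity_ok and all(well_formed(a) for a in args)

def depth(expr):
    """Depth of an expression tree; terminals have depth 0."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(child) for child in expr[1:])

# sin^2(theta) + cos^2(theta), the left-hand side of the identity in Figure 1a
lhs = ("+", ("^", ("sin", "theta"), 2), ("^", ("cos", "theta"), 2))
print(depth(lhs), well_formed(lhs))  # -> 3 True
```

The depth of an equality tree is then one more than the deeper of its two sides, which is how equations of depths 1 to 4 are counted in the dataset.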
We will also show that our expressions are of higher complexity, as we consider equalities of depth up to 4, resulting in trees of size at most 31, compared to the state-of-the-art methods that use trees of size at most 13 (Allamanis et al., 2017).

Axioms:
We refer to a small set of basic trigonometric and algebraic identities as axioms. These axioms are gathered from the Wikipedia page on trigonometric identities (https://en.wikipedia.org/wiki/List_of_trigonometric_identities) as well as manually specified ones covering elementary algebra. This set consists of identities varying in depth from 1 to 7. Some examples of our axioms are (in ascending order of depth): x = x, x + y = y + x, x × (y × z) = (x × y) × z, sin²(θ) + cos²(θ) = 1, and sin(3θ) = −4 sin³(θ) + 3 sin(θ). These axioms represent the basic properties of the mathematical functions in trigonometry and algebra.

2.2 DATASET OF MATHEMATICAL EQUATIONS
In order to provide a challenging and accurate benchmark for our task, we need to create a large, varied collection of correct and incorrect identities, in a manner that can be extended easily to other domains in mathematics. Our approach is based on generating new mathematical identities by performing local random changes to known identities, starting with the axioms described above. These changes result in identities of similar or higher complexity (equal or larger depth), which may be correct or incorrect, but are always valid expressions within the grammar.
Generating Possible Identities:
To generate a new identity, we select an equation at random from the set of known equations and make local changes to it. In order to do this, we first randomly select a node in the expression tree, followed by randomly selecting one of the following actions to make the local change to the equation at the selected node:

• ShrinkNode:
Replace the node, if it is not a leaf, with one of its children, chosen randomly.

• ReplaceNode:
Replace the symbol at the node (i.e. the terminal or the function) with another compatible one, chosen randomly.

• GrowNode:
Provide the node as input to another randomly drawn function f, which then replaces the node. If f takes two inputs, the second input will be generated randomly from the set of terminals.

• GrowSides:
If the selected node is an equality, either add or multiply both sides by a randomly drawn number, or raise both sides to the power of a randomly drawn number.

At the end of this procedure we use a symbolic solver, SymPy (Meurer et al., 2017), to separate correct equations from incorrect ones. Since we are performing the above changes randomly, the number of generated incorrect equations is overwhelmingly larger than the number of correct identities. This makes the training data highly unbalanced, which is not desired. Therefore, we propose a method based on sub-tree matching to generate new correct identities.
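On the tuple representation of expression trees, the four local-change actions can be sketched as follows (a hypothetical re-implementation, not the authors' code; the symbol pools are illustrative subsets):

```python
import random

UNARY = ["sin", "cos", "tan"]          # illustrative subset of unary functions
BINARY = ["+", "*", "^"]               # the binary functions
TERMINALS = ["x", "y", "theta", 1, 2]  # illustrative terminals

def shrink_node(node):
    """ShrinkNode: replace a non-leaf node with one of its children."""
    return random.choice(node[1:]) if isinstance(node, tuple) else node

def replace_node(node):
    """ReplaceNode: swap the symbol at the node for a compatible one."""
    if not isinstance(node, tuple):
        return random.choice(TERMINALS)
    pool = UNARY if len(node) == 2 else BINARY
    return (random.choice(pool),) + node[1:]

def grow_node(node):
    """GrowNode: feed the node into a randomly drawn function f."""
    f = random.choice(UNARY + BINARY)
    if f in UNARY:
        return (f, node)
    return (f, node, random.choice(TERMINALS))  # second input: random terminal

def grow_sides(lhs, rhs):
    """GrowSides: apply the same operation with a random number to both sides."""
    op, num = random.choice(BINARY), random.choice([1, 2, 3])
    return (op, lhs, num), (op, rhs, num)
```

Repeatedly applying a random action to a random node of a random known equation, and then filtering with a symbolic solver, yields the mix of correct and (mostly) incorrect identities described above.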
Generating Additional Correct Identities:
In order to generate only correct identities, we follow the same intuition as above, but only replace structures with others that are equal. In particular, we maintain a dictionary of valid statements (mathDictionary) that maps a mathematical statement to another. For example, the dictionary key x + y has value y + x. We use this dictionary in our correct equation generation process, where we look up patterns from the dictionary. More specifically, we look for keys that match a subtree of the equation, then replace that subtree with the pattern of the value of the key. E.g., given the input equation sin²(θ) + cos²(θ) = 1, this subtree matching might produce the equality cos²(θ) + sin²(θ) = 1 by finding the key-value pair x + y : y + x.

The initial mathDictionary is constructed from the input list of axioms. At each step of the equation generation, we choose one equation at random from the list of correct equations so far, and choose a random node n of this equation tree for changing. We look for a subtree rooted at n that matches one or several dictionary keys. We randomly choose one of the matches and replace the subtree with the value of the key by looking up the mathDictionary.

We generate all possible equations at a particular depth before proceeding to a higher depth. In order to ensure this, we limit the depth of the final equation and only increase this limit if no new equations are added to the correct equations for a number of repeats. Some examples of correct and incorrect identities generated by our dataset generation method are given in Table 2.

Generating Function Evaluation Data:
We generate a few input-output examples for the functions in our domain, drawn from a specific range of numbers. For unary functions, we randomly draw floating point numbers of fixed precision in the range and evaluate the function on the randomly drawn number. For binary functions, we repeat the same with two randomly generated numbers. Note that function evaluation results in identities of depths 2 and 3.
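As a sketch, generating such input-output identities might look like the following (the sampling range, [−3.14, 3.14] with precision 2, comes from the grammar's terminal rule; the function names are our own hypothetical helpers):

```python
import math
import random

def sample_number():
    """Draw a precision-2 floating point number in [-3.14, 3.14]."""
    return round(random.uniform(-3.14, 3.14), 2)

def unary_example(name, fn):
    """Build an identity fn(x) = y, an equality tree of depth 2."""
    x = sample_number()
    return ("=", (name, x), round(fn(x), 2))

def binary_example(name, fn):
    """Build an identity fn(x, y) = z, an equality tree of depth 2."""
    x, y = sample_number(), sample_number()
    return ("=", (name, x, y), round(fn(x, y), 2))

print(unary_example("sin", math.sin))
print(binary_example("+", lambda a, b: a + b))
```

Feeding the numbers in such identities through the decimal-tree encoding described next yields the depth-2 and depth-3 function evaluation trees used for training.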
Generating Numerical Expression Trees:
It is important for our dataset to also have a generalizable representation of the numbers. We represent floating point numbers with their decimal expansion, which is representable in our grammar. To make this clear, consider the number 2.5. In order to represent this number, we expand it into its decimal representation 2 × 10^0 + 5 × 10^(−1) and feed this as one of the function evaluation expressions for training (Figure 1c). Therefore, we can represent floating point numbers of finite precision using integers in the range [−1, 10].

[Table 2: Examples of generated correct and incorrect identities.]

[Figure 2: Tree-structured recursive neural model, for the trees in Figure 1a (left) and 1b (right).]
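The decimal-tree encoding can be sketched as follows (a hypothetical helper, not the paper's code): a fixed-precision number is decomposed into digit × 10^power terms, each expressible with the grammar's small integers.

```python
def decimal_terms(x, precision=2):
    """Decompose x into (digit, power) terms, e.g. 2.5 -> [(2, 0), (5, -1)].

    Each term stands for digit * 10**power, so the list is the decimal
    expansion 2 * 10**0 + 5 * 10**(-1) that is fed to the model as a tree.
    """
    sign = -1 if x < 0 else 1
    s = f"{abs(x):.{precision}f}"          # fixed precision, e.g. "2.50"
    int_part, frac_part = s.split(".")
    terms = [(sign * int(d), p)
             for p, d in zip(range(len(int_part) - 1, -1, -1), int_part)]
    terms += [(sign * int(d), -(i + 1)) for i, d in enumerate(frac_part)]
    return [(d, p) for d, p in terms if d != 0]

def rebuild(terms):
    """Inverse of decimal_terms, for checking the encoding."""
    return sum(d * 10.0 ** p for d, p in terms)

print(decimal_terms(2.5))  # -> [(2, 0), (5, -1)]
```

Because every digit and every power in the expansion is drawn from a small fixed alphabet, a model trained on such trees can represent numbers it has never seen during training.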
3 TREE LSTM ARCHITECTURE FOR MODELING EQUATIONS
Analogous to how humans learn trigonometry and elementary algebra, we propose using basic axioms to learn the properties of mathematical functions. Moreover, we leverage the underlying structure of each mathematical identity to make predictions about its validity. Both Zaremba et al. (2014) and Allamanis et al. (2017) validate the effectiveness of using tree-structured neural networks for modeling equations. Tai et al. (2015) show that Tree LSTMs are powerful models for capturing the semantics of the data. Therefore, we use the Tree LSTM model to capture the compositionality of the equation, and show that it improves performance over simpler tree-structured (a.k.a. recursive) neural networks. We describe the details of the model and training setup in this section.
Tree LSTM Model for Symbolic Expressions and Function Evaluations
The structure of the Tree LSTM mirrors the parse tree of each input equation. As shown in Figure 1, the parse tree is inherent in each equation. As described in Section 2.1, an equation consists of terminals and binary and unary functions. Terminals are input to the Tree LSTM through the leaves, which embed their representation using vectors. Each function is associated with an LSTM block with its own weights, with the weights shared among all appearances of the function in different equations. We predict the validity of each equation at the root of the tree.

The architecture of the neural network is slightly different for symbolic expressions compared to function evaluation expressions. Recall from Section 2.1 that the two are distinguished by their terminal types. This directly reflects in the structure of the network used in the leaves for embedding. Moreover, we use different loss functions for each type of expression, as described below.

• Symbolic expressions:
These expressions consist of constants and symbols. These terminals are represented with their one-hot encoding and are passed through a single-layer neural network block (the symbol block). The validity of a symbolic expression is verified by computing the dot product of the left-hand-side and right-hand-side vector embeddings and applying the logistic function.

• Function evaluation expressions: In order to encode the terminals of function evaluation expressions, we train an autoencoder. The encoder side embeds the floating point numbers into a high-dimensional vector space; we call this the number block. The decoder of this autoencoder is trained to predict the floating point number given an input embedding; we call this the decoder block. We pass the output vector embeddings of the left-hand side and right-hand side to the decoder block. The validity of a function evaluation is then scored by minimizing the MSE loss between the decoder outputs of each side.

Figure 2 illustrates our Tree LSTM structure constructed from the parse trees of the equations in Figures 1a and 1b.

Baseline Models
We compare our proposed model with chain-structured neural networks, such as sequential Recurrent Neural Networks (RNNs) and LSTMs, as well as tree-structured neural networks (TreeNNs) consisting of fully connected layers (Socher et al., 2011; Zaremba et al., 2014). It should be noted that both of these papers discover equivalence classes in a dataset; since our data consists of many equivalence classes, especially for the function evaluation data, we do not use the EqNet model proposed in (Allamanis et al., 2017) as a baseline. Another baseline we use is SymPy. Given each equality, SymPy either returns True, returns False, or returns the input equality in its original form (indicating that SymPy is incapable of deciding whether the equality holds); let us call this the Unsure class. In the reported SymPy accuracies we treat the Unsure class as a misclassification. It should be noted, however, that since SymPy is used at data generation time to verify the correctness of the generated equations, its accuracy for predicting correct equations in our dataset is always 100%. Therefore, the degradation in SymPy's performance in Table 3 is due only to incorrect equations. It would be interesting to see SymPy's performance when another oracle is used for validating correct equalities.

As we will show in the experiments, the structure of the network is crucial for equation verification and equation completion. Moreover, by adding function evaluation data to the tree-structured models, we show that using this type of data not only broadens the applicability of the model to enable function evaluation, but also enhances the final accuracy on symbolic expressions compared to when no function evaluation data is used.

We demonstrate that Tree LSTMs outperform TreeNNs by a large margin, with or without function evaluation data, in all the experiments.
We attribute this to the fact that LSTM cells ameliorate vanishing and exploding gradients along paths in the tree, compared to the fully-connected blocks used in TreeNNs. This enables the model to reason about equations of higher depth, where reasoning is a more difficult task than for equations of lower depth. Therefore, it is important to use both a tree structure and a cell with memory, such as an LSTM cell, for modeling the properties of mathematical functions.
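The symbolic verification path can be sketched with a simplified recursive network, a TreeNN-style stand-in for the Tree LSTM with toy dimensions (all weights below are random placeholders rather than trained parameters, and every name is ours): each function owns a shared weight matrix, leaves own embeddings, and the two sides of an equality are scored by a dot product passed through the logistic function.

```python
import math
import random

DIM = 4
random.seed(0)

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]

def rand_mat(cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(DIM)]

# One embedding per terminal symbol, one weight matrix per function;
# weights are shared across every occurrence of the same function.
leaf_embedding = {sym: rand_vec() for sym in ["x", "theta", "1", "2"]}
weights = {f: rand_mat(2 * DIM) for f in ["sin", "cos", "+", "^"]}

def embed(expr):
    """Recursively embed an expression tree bottom-up."""
    if not isinstance(expr, tuple):
        return leaf_embedding[str(expr)]
    children = [embed(c) for c in expr[1:]]
    stacked = children[0] + (children[1] if len(children) == 2 else [0.0] * DIM)
    W = weights[expr[0]]
    return [math.tanh(sum(w * h for w, h in zip(row, stacked))) for row in W]

def verify(lhs, rhs):
    """Score an identity lhs = rhs: logistic of the embeddings' dot product."""
    a, b = embed(lhs), embed(rhs)
    return 1.0 / (1.0 + math.exp(-sum(x * y for x, y in zip(a, b))))

lhs = ("+", ("^", ("sin", "theta"), 2), ("^", ("cos", "theta"), 2))
score = verify(lhs, "1")
print(0.0 < score < 1.0)  # untrained weights still yield a probability in (0, 1)
```

In the paper's model, each function's block is an LSTM cell rather than a single tanh layer, and all weights are trained end-to-end on the verification loss; this sketch only shows the recursive weight sharing and the dot-product scoring.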
Implementation Details
Our neural networks are developed using MXNet (Chen et al., 2015). All the experiments and models are tuned over the same search space, and the reported results are the best achievable prediction accuracy for each method. We use L2 regularization as well as dropout to avoid overfitting, and train all the models for 100 epochs. We tuned the hidden dimension, the optimizer {SGD, NAG (Nesterov accelerated SGD), RMSProp, Adam, AdaGrad, AdaDelta, DCASGD, SGLD (Stochastic Gradient Riemannian Langevin Dynamics)}, the dropout rate, the learning rate, the regularization ratio, and the momentum. Most of the networks achieved their best performance using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and a small regularization ratio; the best hidden dimension and dropout rate vary across scenarios.

4 EXPERIMENTS AND RESULTS
We indicate the complexity of an identity by its depth. We set up the following experiments to evaluate the performance and the generalization capability of our proposed framework. (Our dataset generation method, proposed model, and data are available here: https://github.com/ForoughA/neuralMath)
Table 3: Generalization Results. The train and test sets contain equations of the same depths [1, 2, 3, 4]; results are on unseen equations. Sym refers to the accuracy on symbolic expressions and F Eval refers to the MSE on function evaluation expressions. The last four columns measure the accuracy on symbolic expressions of different depths.
    Approach        Sym    F Eval   depth 1   depth 2   depth 3   depth 4
    Test set size   3527   401      7         542       2416      563
    Majority Class  50.24  -        28.57     45.75     52.85     43.69
    Sympy           81.74  -        85.71     89.11     82.98     69.44
    RNN             66.37  -        57.14     62.93     65.13     72.32
    LSTM            81.71  -        85.71     79.49     80.81     83.86
    TreeNN          92.06  -

We investigate the behavior of the learned model on two different tasks: equation verification and equation completion. Under both tasks, we assess the results of the method on symbolic as well as function evaluation expressions. We compare each of the models with sequential Recurrent Neural Networks (RNNs), LSTMs, and recursive tree neural networks, also known as TreeNNs. Moreover, we show the effect of adding function evaluation data on the final accuracy on symbolic expressions. All the models train on the same dataset of symbolic expressions; the models Tree LSTM + data and Tree NN + data additionally use function evaluation data on top of the symbolic data.

Our dataset consists of symbolic equations of depths 1 to 4 (the equations of depths 1 and 2 have been maxed out in the data generation). We also add function evaluation equations and decimal expansion trees for numbers. Our numerical data includes numbers of precision 2 chosen at random in the range [−3.14, 3.14].

Equation Verification - Generalization to Unseen Identities:
In this experiment, we randomly split all of the generated data, which includes equations of depths 1 to 4, into train and test partitions with an 80%/20% split ratio. We evaluate the accuracy of the predictions on the held-out data. The results of this experiment are presented in Table 3. As can be seen, tree-structured networks make better predictions than chain-structured or flat networks; we are therefore leveraging the structure of the identities to capture information about their validity. Moreover, the superiority of Tree LSTM over TreeNN shows that it is important to incorporate cells that have memory. The prediction accuracy broken down by depth, and by symbolic versus function evaluation expressions, is also given in Table 3.
Equation Verification - Extrapolation to Unseen Depths:
Here we evaluate the generalization of the learned model to equations of higher and lower complexity. Generalization to equations of higher depth indicates that the network has been able to learn the properties of each mathematical function and is able to use them in more complex equations to verify correctness. The ability to generalize to lower complexity indicates whether the model can infer properties of simpler mathematical functions by observing their behavior in complex equations. For each setup, we hold out symbolic expressions of a certain depth and train on the remaining depths. Table 4 presents the results of both setups, which suggest that the Tree LSTM trained on a combination of symbolic and function evaluation data outperforms all other methods across all metrics. Comparing the symbolic accuracy of the tree models with and without the function evaluation data, we conclude that our models are able to utilize the patterns in the function evaluations to better model the symbolic expressions as well.
Equation Completion:
In this experiment, we evaluate the capability of the model to complete equations by filling in a blank in unseen identities, using the same models as reported in Table 3.

Table 4: Extrapolation evaluation, measuring the capability of the model to generalize to unseen depths on symbolic equations. For each approach, Accuracy, Precision, and Recall are reported under two setups: train on depths 1, 2, 3 and test on depth 4; train on depths 1, 3, 4 and test on depth 2. The Majority Class baseline has an accuracy of 55.22.

Figure 3: Evaluating equation completion. Figure 3a shows the top-k accuracy on the symbolic data for the different methods (RNN, LSTM, Tree-NN, Tree-NN + data, Tree-LSTM, Tree-LSTM + data), and Figure 3b illustrates the minimum MSE of the top-k predictions on the function evaluation data (Tree-NN + data and Tree-LSTM + data).

We take all the test equations and randomly choose a node of depth 1 or 2 in each equation, and replace it with all possible configurations of depth-1 and depth-2 expressions from our grammar. We then give this set of equations to the models and look at the top-k predictions for the blank, ranked by the model's confidence. We perform equation completion on both symbolic and function evaluation expressions. Figure 3a shows the accuracy of the top-k predictions vs. k for the symbolic expressions. We define the top-k accuracy as the percentage of samples for which there is at least one correct match for the blank among the top k predictions; the hardest task is therefore to achieve high accuracy at k = 1. The gap at k = 1 between models that use function evaluation data and models that do not indicates the importance of combining symbolic and function evaluation data for equation completion. We can also see that tree-structured models are substantially better than sequential models, indicating that it is important to capture the compositionality of the data in the structure of the model. Finally, Tree LSTM shows superior performance compared to Tree NN under both scenarios. Figure 3b evaluates equation completion on function evaluation expressions by measuring the top-k minimum MSE for different values of k. We define the top-k minimum MSE as the MSE between the true value of the blank and the closest prediction to the true value among the top-k predictions. Similar to the top-k accuracy, the hardest task is to achieve a low MSE at k = 1, since that requires the correct prediction to be the model's first prediction.
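The two ranking metrics defined above can be sketched in code; a minimal illustration, assuming candidates have already been ranked by model confidence (the toy data and helper names are hypothetical, not from the paper):

```python
def top_k_accuracy(ranked_candidates, correct_sets, k):
    """Percentage of samples whose top-k ranked candidates contain at
    least one correct completion. A symbolic blank may admit several
    correct fills, hence a set of valid answers per sample."""
    hits = sum(1 for cands, valid in zip(ranked_candidates, correct_sets)
               if any(c in valid for c in cands[:k]))
    return 100.0 * hits / len(ranked_candidates)

def top_k_min_mse(ranked_values, true_values, k):
    """Mean over samples of the squared error between the true value of
    the blank and the closest of the top-k predicted values."""
    errs = [min((v - t) ** 2 for v in vals[:k])
            for vals, t in zip(ranked_values, true_values)]
    return sum(errs) / len(errs)

# Toy symbolic completion: ranked candidate fills vs. sets of valid fills.
ranked = [['cos(x)', 'sin(x)', '1'], ['0', 'x', 'tan(x)']]
valid = [{'sin(x)'}, {'x'}]
print(top_k_accuracy(ranked, valid, k=1))  # 0.0: neither top-1 guess is valid
print(top_k_accuracy(ranked, valid, k=2))  # 100.0: both blanks hit by rank 2

# Toy function-evaluation completion: ranked numeric fills vs. true values.
preds = [[0.50, 0.71, 1.00], [-1.00, -0.99, 0.00]]
truth = [0.71, -0.99]
print(top_k_min_mse(preds, truth, k=2))  # 0.0: true value appears by rank 2
```

Both metrics improve monotonically with k, which is why the k = 1 point is the most discriminating comparison between models.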
We would like to note that, for function evaluation expressions, there is only one correct prediction for a blank, whereas for symbolic expressions there may be many correct candidates for a given blank. This evaluation is performed only for the models that use function evaluation data, and the figure shows that Tree LSTM's MSE is better than that of Tree NN across all values of k.

Figure 4: Examples of equation completion by the Tree LSTM + data model. Figures 4a and 4b show examples of equation completion from the test set, where the predictions are ranked by the model's confidence and the correct prediction is shown in boldface: (a) a symbolic equation, tanh(0) = □, with the predicted fills and their probabilities; (b) a function evaluation equation of the form cos(−□) = c for a given constant c, with the predicted fills and their model and true errors. Figure 4c depicts the predicted values of cos(x) (blue dots) for x in [−3.14, 3.14] in the test set.

We present examples of equations and generated candidates in Figures 4a and 4b for the Tree LSTM + data model. Figure 4a presents the results on a symbolic equation; the Tree LSTM is able to generate many candidates with high confidence, all of which are correct. Column prob in the figure is the output probability of the softmax, which indicates the model's confidence in its prediction. In Figure 4b, we show a function evaluation example, where the correct answer, rounded to precision 2, is among the top predictions. All the predicted values for the blank are listed in column pred, ranked by the model's prediction confidence. Column modelErr shows the model's confidence in the prediction, namely the squared error between the model's predicted value of cos(−pred) and the right-hand side of the equation; column trueErr is the squared error between the true value of cos(−pred), rounded to precision 2, and the right-hand side. As shown, the predicted candidates are close to the true value. It is worth noting that for the function evaluation task of Figure 4b there is only one correct answer, whereas for the task in Figure 4a there can be many correct solutions. We also present example predictions of our model for function evaluations by plotting the top predicted values for cos on samples of test data in Figure 4c.

CONCLUSIONS AND FUTURE WORK
In this paper we proposed combining black-box function evaluation data with symbolic expressions to improve the accuracy and broaden the applicability of previously proposed models in this domain. We applied this to the novel task of validating and completing mathematical equations, studying trigonometry and elementary algebra as a case study to validate the proposed model. We also proposed a novel approach for generating a dataset of mathematical identities, and generated identities in trigonometry and elementary algebra; as noted, our data generation technique is not limited to these domains. We show that, under various experimental setups, Tree LSTMs trained on a combination of symbolic expressions and black-box function evaluations achieve superior results compared to the state-of-the-art models.

In future work we will expand our testbed to include other mathematical domains, inequalities, and systems of equations. With multiple domains, it is interesting to investigate whether the representations learned for one domain can be transferred to other domains, and whether the embeddings of each domain cluster close to each other, similar to the way word embedding vectors behave. We are also interested in exploring recent neural models with addressable differentiable memory, in order to evaluate whether they can handle equations of much higher complexity.

ACKNOWLEDGMENTS
The authors would like to thank Amazon Inc. for the AWS credits. F. Arabshahi is supported by DARPA Award D17AP00002. A. Anandkumar is supported by a Microsoft Faculty Fellowship, NSF CAREER Award CCF-1254106, DARPA Award D17AP00002, and Air Force Award FA9550-15-1-0221. S. Singh would like to thank Adobe Research and FICO for supporting this research.

REFERENCES
Alex A. Alemi, François Chollet, Geoffrey Irving, Christian Szegedy, and Josef Urban. DeepMath - deep sequence models for premise selection. In Advances in Neural Information Processing Systems, 2016.

Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton. Learning continuous semantic representations of symbolic expressions. In International Conference on Machine Learning (ICML), 2017.

Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In International Conference on Learning Representations (ICLR), 2017.

Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel. Programming with a differentiable Forth interpreter. In International Conference on Machine Learning (ICML), 2017.

Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations (ICLR), 2017.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. 2015.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Cezary Kaliszyk, François Chollet, and Christian Szegedy. HolStep: A machine learning dataset for higher-order logic theorem proving. In International Conference on Learning Representations (ICLR), 2017.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2014.

Przemysław Chojecki. DeepAlgebra - an outline of a program. AITP 2017, 2017.

Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017.

Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017.

Pawan Mudigonda, R. Bunel, A. Desmaison, M. P. Kumar, P. Kohli, and P. H. S. Torr. Learning to superoptimize programs. 2017.

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.

Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations (ICLR), 2017.

Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas Guibas. Learning program embeddings to propagate feedback on student code. In International Conference on Machine Learning (ICML), 2015.

Scott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR), 2016.

Tim Rocktäschel and Sebastian Riedel. Learning knowledge base inference with neural theorem provers. In Proceedings of AKBC, pp. 45–50, 2016.

Žiga Sajovic and Martin Vuk. Operational calculus on programming spaces and generalized tensor networks. arXiv preprint arXiv:1610.07690, 2016.

Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), pp. 129–136, 2011.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1201–1211. Association for Computational Linguistics, 2012.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566, 2015.

Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. Deep probabilistic programming. In International Conference on Learning Representations (ICLR), 2017.

Arianna Yuan. Neural theorem prover.

Wojciech Zaremba, Karol Kurach, and Rob Fergus. Learning to discover efficient mathematical identities. In Advances in Neural Information Processing Systems, 2014.