Unit Dependency Graph and its Application to Arithmetic Word Problem Solving
Subhro Roy and
Dan Roth
University of Illinois, Urbana-Champaign
{sroy9, danr}@illinois.edu

Abstract
Math word problems provide a natural abstraction to a range of natural language understanding problems that involve reasoning about quantities, such as interpreting election results, news about casualties, and the financial section of a newspaper. Units associated with the quantities often provide information that is essential to support this reasoning. This paper proposes a principled way to capture and reason about units, and shows how it can benefit an arithmetic word problem solver. We present the concept of Unit Dependency Graphs (UDGs), which provide a compact representation of the dependencies between units of numbers mentioned in a given problem. Inducing the UDG alleviates the brittleness of the unit extraction system and allows for a natural way to leverage domain knowledge about unit compatibility for word problem solving. We introduce a decomposed model for inducing UDGs with minimal additional annotations, and use it to augment the expressions used in the arithmetic word problem solver of (Roy and Roth 2015) via a constrained inference framework. We show that the introduction of UDGs reduces the error of the solver by over 10%, surpassing all existing systems for solving arithmetic word problems. In addition, it also makes the system more robust to adaptation to new vocabulary and equation forms.
Introduction

Understanding election results, sport commentaries and financial news all require reasoning with respect to quantities. Math word problems provide a natural abstraction to these quantitative reasoning problems. As a result, there has been a growing interest in developing methods which automatically solve math word problems (Koncel-Kedziorski et al. 2015; Kushman et al. 2014; Roy and Roth 2015; Mitra and Baral 2016).

Units associated with numbers or the question often provide essential information to support the reasoning required in math word problems. Consider the arithmetic word problem in Example 1.

Example 1
Isabel picked 66 flowers for her friends' wedding. She was making bouquets with 8 flowers in each one. If 10 of the flowers wilted before the wedding, how many bouquets could she still make?

The units of "66" and "10" are both "flowers", which indicates they can be added or subtracted. Although the unit of "8" is also "flower", it is associated with a rate, indicating the number of flowers in each bouquet. As a result, "8" effectively has unit "flowers per bouquet". Detecting such rate units helps understand that "8" will more likely be multiplied or divided to arrive at the solution. Finally, the question asks for the number of "bouquets", indicating that "8" will likely be divided, and not multiplied. Knowing such interactions could help understand the situation and perform better quantitative reasoning. In addition, given that unit extraction is a noisy process, it can make extraction more robust via global reasoning.

In this paper, we introduce the concept of a unit dependency graph (UDG) for math word problems, to represent the relationships among the units of different numbers and the question being asked. We also introduce a strategy to extract annotations for unit dependency graphs with minimal additional annotation effort. In particular, we use the answers to math problems, along with rate annotations for a few selected problems, to generate complete annotations for unit dependency graphs. Finally, we develop a decomposed model to predict the UDG of an input math word problem.

We augment the arithmetic word problem solver of (Roy and Roth 2015) to predict a unit dependency graph along with the solution expression of the input arithmetic word problem. Forcing the solver to respect the dependencies of the unit dependency graph enables us to improve unit extractions, as well as leverage domain knowledge about unit dependencies in math reasoning. The introduction of unit dependency graphs reduced the error of the solver by over 10%, while also making it more robust to reduction in lexical and template overlap of the dataset.
Unit Dependency Graph

We first introduce the idea of a generalized rate, and its unit representation. We define a rate to be any quantity which is some measure corresponding to one unit of some other quantity. This includes explicit rates like "40 miles per hour", as well as implicit rates like the one in "Each student has 3 books". Consequently, units for rate quantities take the form "A per B", where A and B refer to different entities. We refer to A as the Num Unit (short for Numerator Unit), and B as the Den Unit (short for Denominator Unit). Table 1 shows examples of Num and Den Units for various rate mentions.

Mention | Num Unit | Den Unit
40 miles per hour | mile | hour
Each student has 3 books. | book | student

Table 1: Units of rate quantities

A unit dependency graph (UDG) of a math word problem is a graph representing the relations among quantity units and the question asked. Fig. 1 shows an example of a math word problem and its unit dependency graph. For each quantity mentioned in the problem, there exists a vertex in the unit dependency graph. In addition, there is also a vertex representing the question asked. Therefore, if a math problem mentions n quantities, its unit dependency graph will have n + 1 vertices. In the example in Fig. 1, there is one vertex corresponding to each of the quantities 66, 8 and 10, and one vertex representing the question part "how many bouquets could she still make?".

A vertex representing a number is labeled RATE if the corresponding quantity describes a rate relationship (according to the aforementioned definition). In Fig. 1, "8" is labeled as a RATE since it indicates the number of flowers in each bouquet. Similarly, a vertex corresponding to the question is marked RATE if the question asks for a rate.

Edges of a UDG can be directed as well as undirected. Each undirected edge has the label SAME UNIT, indicating that the connected vertices have the same unit. Each directed edge going from vertex u to vertex v can have one of the following labels:

1. NUM UNIT: Valid only for directed edges with source vertex u labeled as RATE; indicates that the Num Unit of u matches the unit of the destination vertex v.

2. DEN UNIT: Valid only for directed edges with source vertex u labeled as RATE; indicates that the Den Unit of u matches the unit of the destination vertex v.

If no edge exists between a pair of vertices, they have unrelated units.

Several dependencies exist between the vertex and edge labels of the unit dependency graph of a problem and its solution expression. Section 4 discusses these dependencies and how they can be leveraged to improve math problem solving.

Predicting a UDG for a math word problem is essentially a structured prediction problem. However, since we have limited training data, we develop a decomposed model to predict parts of the structure independently, and then perform joint inference to enforce coherent predictions. This has been shown to be an effective method for structured prediction in the presence of limited data (Punyakanok et al. 2005; Sutton and McCallum 2007). Empirically, we found our decomposed model to be superior to jointly trained alternatives (see Section 5).

Our decomposed model for UDG prediction uses the following two classifiers.
1. Vertex Classifier: This is a binary classifier, which takes a vertex of the UDG as input, and decides whether it denotes a rate.

2. Edge Classifier: This is a multiclass classifier, which takes as input a pair of nodes of the UDG, and predicts the properties of the edge connecting those nodes.

Finally, a constrained inference module combines the output of the two classifiers to construct a UDG. We provide details of the components in the following subsections.
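To make the structures the classifiers operate on concrete, here is a minimal illustrative sketch in Python. The vertex indices, label strings, and score values are ours, not the paper's implementation; a real system would obtain the scores from the trained Vertex and Edge classifiers.

```python
from dataclasses import dataclass, field

RATE = "RATE"
SAME_UNIT, NUM_UNIT, DEN_UNIT = "SAME_UNIT", "NUM_UNIT", "DEN_UNIT"

@dataclass
class UDG:
    n_quantities: int                          # vertices 0..n-1 are numbers; vertex n is the question
    rate_vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # (u, v) -> edge label

# UDG for Example 1: quantities 66, 8, 10; vertex 3 is the question.
g = UDG(n_quantities=3)
g.rate_vertices.add(1)        # "8" is a rate (flowers per bouquet)
g.edges[(0, 2)] = SAME_UNIT   # 66 and 10 are both "flowers"
g.edges[(1, 0)] = NUM_UNIT    # Num Unit of "8" (flower) matches unit of "66"
g.edges[(1, 3)] = DEN_UNIT    # Den Unit of "8" (bouquet) matches the question's unit

# Made-up classifier scores for the chosen labels.
vertex_score = {1: 0.9}
edge_score = {(0, 2): 0.8, (1, 0): 0.6, (1, 3): 0.7}

def score(g, vertex_score, edge_score, lam=1.0):
    """Additive graph score: RATE vertex scores plus scaled edge-label scores."""
    s = sum(vertex_score[v] for v in g.rate_vertices)
    s += lam * sum(edge_score[e] for e in g.edges)
    return s

print(round(score(g, vertex_score, edge_score), 2))  # 3.0
```

The inference module described next searches over candidate graphs of this form for the one with the highest such score.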
Vertex Classifier
In order to detect rate quantities, we train a binary classifier. Given problem text P and a vertex v of the UDG, the classifier predicts whether v represents a rate. It predicts one of two labels: RATE or NOT RATE. The vertex v is either a quantity mentioned in P, or the question of P. The features used for the classification are as follows:

1. Context Features: We add unigrams, bigrams, part-of-speech tags, and their conjunctions from the neighborhood of v.

2. Rule-based Extraction Features: We add a feature indicating whether a rule-based approach can detect v as a rate.

Edge Classifier
We train a multiclass classifier to determine the properties of the edges of the UDG. Given problem text P and a pair of vertices v_i and v_j (i < j), the classifier predicts one of six labels:

1. SAME UNIT: Indicates that v_i and v_j should be connected by an undirected edge labeled SAME UNIT.

2. NO RELATION: Indicates no edge exists between v_i and v_j.

3. RATE→Num: Indicates that v_i is a rate, and the Num Unit of v_i matches the unit of v_j.

4. RATE←Num: Indicates that v_j is a rate, and the Num Unit of v_j matches the unit of v_i.

5. We similarly define RATE→Den and RATE←Den.

The features used for the classification are:

1. Context Features: For each vertex v in the query, we add the context features described for the Vertex classifier.

2. Rule-based Extraction Features: We add a feature indicating whether each of the queried vertices is detected as a rate by the rule-based system. In addition, we also add features denoting whether there are common tokens in the units of v_i and v_j.

Constrained Inference
Our constrained inference module takes the scores of the Vertex and Edge classifiers, and combines them to find the most probable unit dependency graph for a problem. We define VERTEX(v, l) to be the score predicted by the Vertex classifier for labeling vertex v of a UDG with label l, where l ∈ {RATE, NOT RATE}. Similarly, we define EDGE(v_i, v_j, l) to be the score predicted by the Edge classifier for the assignment of label l to the edge between v_i and v_j. Here the label l is one of the six labels defined for the edge classifier.

Figure 1: An arithmetic word problem, its UDG, and a tree representation of the solution (66 − 10) ÷ 8. Several dependencies exist between the UDG and the final solution of a problem. Here, "66" and "10" are connected via a SAME UNIT edge, hence they can be added or subtracted; "8" is connected by DEN UNIT to the question, indicating that some expression will be divided by "8" to get the answer's unit.

Let G be a UDG with vertex set V. We define the score for G as follows:

SCORE(G) = Σ_{v ∈ V : LABEL(G, v) = RATE} VERTEX(v, RATE) + λ × Σ_{v_i, v_j ∈ V, i < j} EDGE(v_i, v_j, LABEL(G, v_i, v_j))

where LABEL(G, v) maps to RATE if v is a rate, and otherwise maps to NOT RATE. Similarly, if no edge exists between v_i and v_j, LABEL(G, v_i, v_j) maps to NO RELATION; if the Num Unit of v_i matches the unit of v_j, LABEL(G, v_i, v_j) maps to RATE→Num; and so on. Finally, the inference problem has the following form:

arg max_{G ∈ GRAPHS} SCORE(G)

where GRAPHS is the set of all valid unit dependency graphs for the input problem.

In this section, we describe our joint inference procedure to predict both a UDG and the solution of an input arithmetic word problem. Our model is built on the arithmetic word problem solver of (Roy and Roth 2015), which we briefly describe in the following sections. We first describe the concept of expression trees, and next describe the solver, which leverages the expression tree representation of solutions.

Monotonic Expression Tree

An expression tree is a binary tree representation of a mathematical expression, where leaves represent numbers, and all non-leaf nodes represent operations. Fig. 1 shows an example of an arithmetic word problem and the expression tree of its solution expression.
A monotonic expression tree is a normalized expression tree representation for math expressions, which restricts the order of combination of addition and subtraction nodes, and of multiplication and division nodes. The expression tree in Fig. 1 is monotonic.

Arithmetic Word Problem Solver

We now describe the solver pipeline of (Roy and Roth 2015). Given a problem P with quantities q_1, q_2, ..., q_n, the solver uses the following two classifiers.

1. Irrelevance Classifier: Given as input a problem P and a quantity q_i mentioned in P, the classifier decides whether q_i is irrelevant to the solution. The score of this classifier is denoted as IRR(q_i).

2. LCA Operation Classifier: Given as input a problem P and a pair of quantities q_i and q_j (i < j), the classifier predicts the operation at the lowest common ancestor (LCA) node of q_i and q_j in the solution expression tree of problem P. The set of possible operations is +, −, −r, ×, ÷ and ÷r (the subscript r indicates reverse order). Considering only monotonic expression trees for the solution makes this operation unique for any pair of quantities. The score of this classifier for operation o is denoted as LCA(q_i, q_j, o).

The above classifiers are used to gather irrelevance scores for each number, and LCA operation scores for each pair of numbers. Finally, a constrained inference procedure combines these scores to generate the solution expression tree.

Let I(T) be the set of all quantities in P which are not used in expression tree T, and let λ_IRR be a scaling parameter. The score SCORE(T) of an expression tree T is defined as:

SCORE(T) = λ_IRR × Σ_{q ∈ I(T)} IRR(q) + Σ_{q_i, q_j ∉ I(T)} LCA(q_i, q_j, ⊙_LCA(q_i, q_j, T))

where ⊙_LCA(q_i, q_j, T) denotes the operation at the lowest common ancestor node of q_i and q_j in the monotonic expression tree T.
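To make the tree machinery concrete, here is a small illustrative sketch (ours, not the paper's code) that encodes the Fig. 1 solution tree as nested tuples, evaluates it, and looks up the operation at the lowest common ancestor of two quantities, as the LCA operation classifier is trained to predict:

```python
def evaluate(node):
    """Evaluate a nested-tuple expression tree bottom-up."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b if op == "-" \
        else a * b if op == "*" else a / b

def lca_op(node, qi, qj):
    """Operation at the lowest common ancestor of leaves qi and qj."""
    if not isinstance(node, tuple):
        return None
    op, left, right = node
    def leaves(n):
        return {n} if not isinstance(n, tuple) else leaves(n[1]) | leaves(n[2])
    # qi and qj fall on different sides of this node: it is the LCA.
    if (qi in leaves(left) and qj in leaves(right)) or \
       (qi in leaves(right) and qj in leaves(left)):
        return op
    return lca_op(left, qi, qj) or lca_op(right, qi, qj)

tree = ("/", ("-", 66, 10), 8)   # the solution (66 - 10) / 8 from Fig. 1
print(evaluate(tree))            # 7.0
print(lca_op(tree, 66, 10))      # -
print(lca_op(tree, 66, 8))       # /
```

The sketch omits the reverse-order operations −r and ÷r, which in the solver simply swap the operand order at the LCA node.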
Let TREES be the set of valid expressions that can be formed using the quantities in a problem P, and which also give positive solutions. The inference algorithm now becomes:

arg max_{T ∈ TREES} SCORE(T)

Joint Inference

We combine the scoring functions of UDG prediction with those from the solver of (Roy and Roth 2015), so that we can jointly predict the UDG and the solution of the problem. For an input arithmetic word problem P, we score tuples (G, T) (where G is a candidate UDG for P, and T is a candidate solution expression tree of P) as follows:

SCORE(G, T) = λ_IRR × Σ_{q ∈ I(T)} IRR(q) + Σ_{q_i, q_j ∉ I(T)} LCA(q_i, q_j, ⊙_LCA(q_i, q_j, T)) + λ_VERTEX × Σ_{v ∈ V : LABEL(G, v) = RATE} VERTEX(v, RATE) + λ_EDGE × Σ_{v_i, v_j ∈ V, i < j} EDGE(v_i, v_j, LABEL(G, v_i, v_j))

We have a set of conditions to check whether G is a consistent UDG for a monotonic tree T. Most of these conditions are expressed in terms of PATH(T, v_i, v_j), which takes as input a pair of vertices v_i, v_j of the UDG G, and a monotonic expression tree T, and returns the following.

1. If both v_i and v_j are numbers, and their corresponding leaf nodes in T are n_i and n_j respectively, it returns the nodes in the path connecting n_i and n_j in T.

2. If only v_i denotes a number (implying v_j represents the question), the function returns the nodes in the path from n_i to the root of T, where n_i is the corresponding leaf node for v_i.

For the unit dependency graph and solution tree T of Fig. 1, PATH(T, 66, 8) is {−, ÷}, whereas PATH(T, 8, question) is {÷}. Finally, the conditions for consistency between a UDG G and an expression tree T are as follows:

1. If v_i is the only vertex labeled RATE and it is the question, there should not exist a path from some leaf n to the root of T which has only addition and subtraction nodes.
If such a path exists, it implies that n can be added or subtracted to get the answer, that is, the corresponding vertex for n in G has the same unit as the question, and should have been labeled RATE.

2. If v_i is labeled RATE and the question is not, the path from n_i (the corresponding leaf node for v_i) to the root of T cannot have only addition and subtraction nodes. Otherwise, the question would have the same rate unit as v_i.

3. We also check whether the edge labels are consistent with the vertex labels using Algorithm 1, which computes the edge labels of a UDG given the expression tree T and the vertex labels. It uses heuristics such as: if a rate r is being multiplied by a non-rate number n, the Den Unit of r should match the unit of n, etc.

Algorithm 1 EDGE LABEL
Input: Monotonic expression tree T, vertex pair v_i, v_j, and their corresponding vertex labels
Output: Label of edge between v_i and v_j
  path ← PATH(T, v_i, v_j)
  CountMulDiv ← number of multiplication and division nodes in path
  if v_i and v_j have the same vertex label, and CountMulDiv = 0 then
    return SAME UNIT
  end if
  if v_i and v_j have different vertex labels, and CountMulDiv = 1 then
    if path contains × and v_i is RATE then return RATE→Den end if
    if path contains × and v_j is RATE then return RATE←Den end if
    if path contains ÷ and v_i is RATE then return RATE→Num end if
    if path contains ÷r and v_j is RATE then return RATE←Num end if
  end if
  return Cannot determine edge label

These consistency conditions prevent the inference procedure from considering any inconsistent tuples. They help the solver get rid of erroneous solutions which involve operations inconsistent with all high-scoring UDGs.

Finally, in order to find the highest scoring consistent tuple, we have to enumerate the members of TUPLES and score them. The size of TUPLES, however, is exponential in the number of quantities in the problem. As a result, we perform beam search to get the highest scoring tuple.
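The branch structure of Algorithm 1 can be transcribed directly. The sketch below is illustrative and covers only the four cases listed above; the operation symbols, label strings, and list-based path encoding are ours:

```python
def edge_label(path_ops, vi_is_rate, vj_is_rate):
    """Infer the edge label between v_i and v_j from the operations on
    PATH(T, v_i, v_j) and the two vertex labels (Algorithm 1 sketch)."""
    count_mul_div = sum(1 for op in path_ops if op in ("*", "/", "/r"))
    if vi_is_rate == vj_is_rate and count_mul_div == 0:
        return "SAME_UNIT"
    if vi_is_rate != vj_is_rate and count_mul_div == 1:
        if "*" in path_ops and vi_is_rate:
            return "RATE->Den"
        if "*" in path_ops and vj_is_rate:
            return "RATE<-Den"
        if "/" in path_ops and vi_is_rate:
            return "RATE->Num"
        if "/r" in path_ops and vj_is_rate:
            return "RATE<-Num"
    return None  # cannot determine the edge label

# PATH(T, 66, 10) in Fig. 1 is {-}: both non-rates, no mul/div node.
print(edge_label(["-"], False, False))  # SAME_UNIT
# A rate multiplied by a non-rate: its Den Unit matches the other's unit.
print(edge_label(["*"], True, False))   # RATE->Den
```

During data acquisition (described later), the same routine is used in the other direction: given gold vertex labels and the gold tree, it produces edge annotations.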
We first enumerate the members of TREES, and then, for each member of TREES, we enumerate its consistent UDGs.

Dataset

Existing evaluation of arithmetic word problem solvers has several drawbacks. The evaluation of (Roy and Roth 2015) was done separately on different types of arithmetic problems. This does not capture how well the systems can distinguish between these different problem types. Datasets released by (Roy and Roth 2015) and (Koncel-Kedziorski et al. 2015) mention irrelevant quantities in words, while only the relevant quantities are mentioned in digits. This removes the challenge of detecting extraneous quantities.

In order to address the aforementioned issues, we pooled arithmetic word problems from all available datasets (Hosseini et al. 2014; Roy and Roth 2015; Koncel-Kedziorski et al. 2015), and normalized all mentions of quantities to digits. We next pruned problems so that no problem pair exceeds a threshold of unigram and bigram overlap; the threshold was chosen manually, by determining that problems around that level of overlap are sufficiently different. We refer to the resulting dataset as AllArith.

We also create subsets of AllArith using the MAWPS system (Koncel-Kedziorski et al. 2016). MAWPS can generate subsets of word problems based on lexical and template overlap. Lexical overlap is a measure of reuse of lexemes among problems in a dataset. High lexeme reuse allows for spurious associations between the problem text and a correct solution (Koncel-Kedziorski et al. 2015). Evaluating on a low lexical overlap subset of the dataset can show the robustness of solvers to the absence of such spurious associations. Template overlap is a measure of reuse of similar equation templates across the dataset. Several systems solve problems under the assumption that similar equation templates have been seen at training time. Evaluating on a low template overlap subset can show the reliance of systems on the reuse of equation templates.
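The pruning criterion can be sketched as a simple n-gram overlap measure. This version is illustrative only: the paper's exact overlap statistic, normalization, and threshold are not reproduced here, and a Jaccard coefficient over unigrams plus bigrams is our stand-in:

```python
def ngram_overlap(text_a, text_b):
    """Jaccard similarity over the union of unigrams and bigrams."""
    def grams(text):
        toks = text.lower().split()
        return set(toks) | set(zip(toks, toks[1:]))
    ga, gb = grams(text_a), grams(text_b)
    return len(ga & gb) / max(1, len(ga | gb))

# Two near-duplicate problem texts differ only in one number.
a = "Isabel picked 66 flowers for her friends wedding ."
b = "Isabel picked 70 flowers for her friends wedding ."
print(round(ngram_overlap(a, b), 2))  # 0.7
```

Pairs whose overlap exceeds the chosen threshold would have one member dropped from the pool.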
We create two subsets of problems: one with low lexical overlap, called AllArithLex, and one with low template overlap, called AllArithTmpl.

We report random cross-validation results on all these datasets. For each fold, we hold out a portion of the training data as a development set, and tune the scaling parameters on this set. Once the parameters are set, we retrain all the models on the entire training data. We use a fixed beam size in all our experiments.

Table 3: Performance in predicting UDGs (comparing DECOMPOSE and JOINT on AllArith, AllArithLex and AllArithTmpl)

Data Acquisition

In order to learn the classifiers for predicting vertex and edge labels of UDGs, we need annotated data. However, gathering vertex and edge labels for the UDGs of problems can be expensive. In this section, we show that vertex labels for a subset of problems, along with annotations for solution expressions, can be sufficient to gather high quality annotations for vertex and edge labels of UDGs.

Given an arithmetic word problem P, annotated with the monotonic expression tree T of its solution expression, we try to acquire annotations for the UDG of P. First, we try to determine the labels for the vertices, and next the edges of the graph.

We check whether T has any multiplication or division node. If no such node is present, we know that all the numbers in the leaves of T have been combined via addition or subtraction, and hence, none of them describes a rate in terms of the units of other numbers. This determines that none of T's leaves is a rate, and also that the question does not ask for a rate. If a multiplication or division node is present in T, we gather annotations for the numbers in the leaves of T as well as for the question of P. Annotators were asked to mark whether each number represents a rate relationship, and whether the question in P asks for a rate. This process determines the labels for the vertices of the UDG.
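The first step of this harvesting procedure, checking the solution tree for multiplication or division nodes, can be sketched as follows (the nested-tuple tree encoding is our illustration):

```python
def has_mul_div(node):
    """True if any internal node of the expression tree is * or /."""
    if not isinstance(node, tuple):
        return False
    op, left, right = node
    return op in ("*", "/") or has_mul_div(left) or has_mul_div(right)

print(has_mul_div(("+", ("-", 66, 10), 5)))  # False: purely additive, no rates
print(has_mul_div(("/", ("-", 66, 10), 8)))  # True: rate annotation needed
```

Problems whose trees are purely additive are labeled automatically (no RATE vertices); only the remaining problems are sent to annotators.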
Two annotators performed these annotations, with an agreement of 0.94 (kappa). Once we have the labels for the vertices of the UDG, we try to infer the labels for the edges using Algorithm 1. When the algorithm is unable to infer the label for a particular edge, we heuristically label that edge as NO RELATION.

The above process allowed us to extract high quality annotations for UDGs with minimal manual annotation. In particular, we only had to annotate vertex labels for a subset of the problems in AllArith. Obviously, some of the extracted NO RELATION edge labels are noisy; this could be remedied by collecting annotations for these cases. However, in this work, we did not use any manual annotations for edge labels.

UDG Prediction

Table 2 shows the performance of the classifiers and the contribution of each feature type. The results indicate that rule-based techniques are not sufficient for robust extraction; there is a need to take context into account. Table 3 shows the performance of our decomposed model (DECOMPOSE) in correctly predicting UDGs, as well as the contribution of the constraints in the inference procedure. Having explicit constraints on the graph structure provides a 3-5% improvement in correct UDG prediction.

We also compare against a jointly trained model (JOINT), which learns to predict all vertex and edge labels together. Note that JOINT uses the same set of constraints as DECOMPOSE in the inference procedure, to ensure that it only predicts valid unit dependency graphs. We found that JOINT does not outperform DECOMPOSE, while taking significantly more time to train. The worse performance of joint learning is due to: (1) the search space being too large for the joint model to do well given our relatively small dataset size, and (2) our independent classifiers being good enough to support better joint inference. This tradeoff is strongly supported in the literature (Punyakanok et al. 2005; Sutton and McCallum 2007).

Note that all these evaluations are based on noisy edge annotations. This was done to reduce further annotation effort. Also, less than 15% of the labels were noisy (indicated by the fraction of NO RELATION labels), which makes this evaluation reasonable.

Solving Arithmetic Word Problems

Here we evaluate the accuracy of our system in correctly solving arithmetic word problems. We refer to our system as UNITDEP. We compare against the following systems:
Setting either λ V ERTEX = 0 or λ E DGE = 0 leadsto a drop in performance, indicating that both vertex andedge information of UDGs assist in math problem solving.Note that setting both λ V ERTEX and λ E DGE to , is equivalentto LCA++. S INGLE E Q performs worse than other systems,since it does not handle irrelevant quantities in a problem.In general, reduction of lexical overlap adversely affectsthe performance of most systems. The reduction of templateoverlap does not affect performance as much. This is due tothe limited number of equation templates found in arithmeticproblems. Introduction of UDGs make the system more ro-bust to reduction of both lexical and template overlap. Inparticular, they provide an absolute improvement of inboth AllArithLex and allArithTmpl datasets (indicated bydifference of LCA++ and U NIT D EP results).For the sake of completeness, we also ran our system onthe previously used datasets, achieving % and % abso-lute improvements over LCA++, in the Illinois dataset (Roy,Vieira, and Roth 2015) and the Commoncore dataset (Royand Roth 2015) respectively. Discussion Most of gains of U NIT D EP over LCA++ came from prob-lems where LCA++ was predicting an operation or an ex-pression that was inconsistent with the units. A small gain( ) also comes from problems where UDGs help detectcertain irrelevant quantities, which LCA++ cannot recog-nize. Table 5 lists some of the examples which U NIT D EP gets correct but LCA++ does not.Most of the mistakes of U NIT D EP were due to extrane-ous quantity detection (around ). This was followed byerrors due to the lack of math understanding (around ).This includes comparison questions like “How many morepennies does John have?”. There has been a recent interest in automatically solvingmath word problems. (Hosseini et al. 
2014; Mitra and Baral2016) focus on addition-subtraction problems, (Roy, Vieira,and Roth 2015) look at single operation problems, (Roy andRoth 2015) as well as our work look at arithmetic problemswith each number in question used at most once in the an-swer, (Koncel-Kedziorski et al. 2015) focus on single equa-tion problems, and finally (Kushman et al. 2014) focus onalgebra word problems. None of them explicitly model therelations between rates, units and the question asked. In con-trast, we model these relations via unit dependency graphs.Learning to predict these graphs enables us to gain robust-ness over rule-based extractions. Other than those related tomath word problems, there has been some work in extract-ing units and rates of quantities (Roy, Vieira, and Roth 2015;Kuehne 2004a; Kuehne 2004b). All of them employ rulebased systems to extract units, rates and their relations. In this paper, we introduced the concept of unit dependencygraphs, to model the dependencies among units of numbersmentioned in a math word problem, and the question asked.The dependencies of UDGs help improve performance ofan existing arithmetic word problem solver, while also mak-ing it more robust to low lexical and template overlap ofthe dataset. We believe a similar strategy can be used to in-corporate various kinds of domain knowledge in math wordproblem solving. Our future directions will revolve aroundthis, particularly to incorporate knowledge of entities, trans-fers and math concepts. Code and dataset are available at http://cogcomp.cs.illinois.edu/page/publication view/804 . roblem LCA++ U NIT D EP At lunch a waiter had 10 customers and 5 of them didn’t leave a tip. If he got$3.0 each from the ones who did tip, how much money did he earn? 10.0-(5.0/3.0) 3.0*(10.0-5.0)The schools debate team had 26 boys and 46 girls on it. If they were split intogroups of 9, how many groups could they make? 9*(26+46) (26+46)/9Melanie picked 7 plums and 4 oranges from the orchard . 
She gave 3 plums toSam . How many plums does she have now ? (7+4)-3 (7-3)Isabellas hair is 18.0 inches long. By the end of the year her hair is 24.0 incheslong. How much hair did she grow? (18.0*24.0) (24.0-18.0) Table 5: Examples of problems which U NIT D EP gets correct, but LCA++ does not. Acknowledgements This work is funded by DARPA under agreement numberFA8750-13-2-0008, and a grant from the Allen Institute forArtificial Intelligence (allenai.org). References [Hosseini et al. 2014] Hosseini, M. J.; Hajishirzi, H.; Et-zioni, O.; and Kushman, N. 2014. Learning to solve arith-metic word problems with verb categorization. In EMNLP .[Koncel-Kedziorski et al. 2015] Koncel-Kedziorski, R.; Ha-jishirzi, H.; Sabharwal, A.; Etzioni, O.; and Ang, S. 2015.Parsing Algebraic Word Problems into Equations. TACL .[Koncel-Kedziorski et al. 2016] Koncel-Kedziorski, R.; Roy,S.; Amini, A.; Kushman, N.; and Hajishirzi, H. 2016.Mawps: A math word problem repository. In NAACL .[Kuehne 2004a] Kuehne, S. 2004a. On the representation ofphysical quantities in natural language text. In Proceedingsof Twenty-sixth Annual Meeting of the Cognitive Science So-ciety .[Kuehne 2004b] Kuehne, S. 2004b. Understanding naturallanguage descriptions of physical phenomena . Ph.D. Dis-sertation, Northwestern University, Evanston, Illinois.[Kushman et al. 2014] Kushman, N.; Zettlemoyer, L.; Barzi-lay, R.; and Artzi, Y. 2014. Learning to automatically solvealgebra word problems. In ACL .[Mitra and Baral 2016] Mitra, A., and Baral, C. 2016. Learn-ing to use formulas to solve simple arithmetic problems. In ACL .[Punyakanok et al. 2005] Punyakanok, V.; Roth, D.; Yih, W.;and Zimak, D. 2005. Learning and inference over con-strained output. In Proc. of the International Joint Confer-ence on Artificial Intelligence (IJCAI) , 1124–1129.[Roy and Roth 2015] Roy, S., and Roth, D. 2015. Solvinggeneral arithmetic word problems. In Proc. 
of the Confer-ence on Empirical Methods in Natural Language Processing(EMNLP) .[Roy, Vieira, and Roth 2015] Roy, S.; Vieira, T.; and Roth,D. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguis-tics