Learning to Match Mathematical Statements with Proofs
Maximin Coavoux∗ Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG [email protected]
Shay B. Cohen
ILCC, School of Informatics, University of Edinburgh [email protected]
Abstract
We introduce a novel task consisting in assigning a proof to a given mathematical statement. The task is designed to improve the processing of research-level mathematical texts. Applying Natural Language Processing (NLP) tools to research-level mathematical articles is both challenging, since it is a highly specialized domain which mixes natural language and mathematical formulae, and an important requirement for developing tools for mathematical information retrieval and computer-assisted theorem proving (Mathematical Sciences, 2014). We release a dataset for the task, consisting of over 180k statement-proof pairs extracted from mathematical research articles. We carry out preliminary experiments to assess the difficulty of the task. We first experiment with two bag-of-words baselines. We show that considering the assignment problem globally and using weighted bipartite matching algorithms helps a lot in tackling the task. Finally, we introduce a self-attention-based model that can be trained either locally or globally and outperforms baselines by a wide margin.
Research-level mathematical discourse is a challenging domain for Natural Language Processing (NLP). Indeed, mathematical articles switch frequently between natural language and mathematical formulae. A semantic analysis of mathematical text needs to resolve relationships (e.g. coreference) between mathematical symbols and concepts. Moreover, mathematical writing follows a lot of conventions, such as variable naming or typography, that are implicit, and may differ from one subfield to another.

However, mathematical research can benefit from NLP (Mathematical Sciences, 2014), in particular as concerns bibliographical research: researchers need tools to find work relevant for their research. Indeed, prior NLP work on mathematical research articles focused on Mathematical Information Retrieval (MIR) and related tools or data (Zanibbi et al., 2016; Stathopoulos and Teufel, 2016, 2015).

∗ Work mostly done at the University of Edinburgh.

Figure 1: Example of a statement-proof pair (a short theorem, "Theorem 1.3", stating that X is factorial under a bound on |Sing(S)|, paired with a two-line proof reducing it to Theorem 1.1).

In this paper, we introduce a task aimed at improving the processing of research-level mathematical articles and at making a step towards the modeling of mathematical reasoning. Given a collection of mathematical statements and a collection of mathematical proofs of the same size, the task consists in finding and assigning a proof to each mathematical statement. We construct and release a dataset for the task, by collecting over 180k statement-proof pairs from mathematical research articles (an example is given in Figure 1).

There are multiple motivations for the design of the task. We believe it may help MIR by serving as a proxy for the search for the existence of a mathematical result, or for theorems and proofs related to one another (e.g. using the same proof technique), an important search tool for any digital mathematical library (Mathematical Sciences, 2014). Learning to match statements and proofs would also benefit computer-assisted theorem proving, as it is akin to tasks such as premise selection, also recently addressed with NLP methods (Piotrowski and Urban, 2019). More generally, finding supporting information for or against a given statement is integral to tasks such as question answering or fact-checking (Vlachos and Riedel, 2014).
Our mathematical statement-proof assignment task can be thought of as the transposition of such a problem to the very specific domain of mathematical research articles.

We provide preliminary results on our proposed task with (i) two bag-of-words baselines and (ii) a neural model based on a self-attentive encoder and a bilinear similarity function. Though the neural model outperforms the baselines when using local decoding, i.e. assigning the best-scoring proof to each statement, we found that it performs even better with global decoding, i.e. finding the best bipartite matching between the sets of statements and proofs. Therefore we also design a global training procedure with a structured max-margin objective. Such an architecture may have applications to other NLP problems that can be cast as maximum bipartite matching problems, which is the case, for example, for some alignment problems (Taskar et al., 2005b; Padó and Lapata, 2006).

In summary, our contributions are three-fold:
• The definition of a mathematical statement-proof matching task;
• The construction and release of a corresponding dataset;
• A self-attention-based model for maximum weighted bipartite matching problems, that can be trained either locally or globally.

Processing mathematical articles
Most NLP work on mathematical discourse focuses on improving Mathematical Information Retrieval (Zanibbi et al., 2016, MIR) by establishing connections between mathematical formulae and natural language text in order to improve the representation of formulae.

The interpretation of variables is highly dependent on the context. For example, the symbol E could denote an expectation in a statistics article, or the energy in a physics article. Some studies use the surrounding context of a formula to assign a definition or a type to the whole formula, or to specific variables. Nghiem Quoc et al. (2010) focus on identifying coreferences between mathematical formulae and mathematical concepts in Wikipedia articles. Kristianto et al. (2012) extract definitions of mathematical expressions. Grigore et al. (2009), Wolska et al. (2011) and Schubotz et al. (2016) disambiguate mathematical identifiers, such as variables, using the surrounding textual context. Stathopoulos et al. (2018) infer the type of a variable in a formula from the textual context of the formula.

Another line of work focuses on identifying specialized terms or concepts to improve MIR (Stathopoulos and Teufel, 2015, 2016).

Some work adapts standard NLP tools to the specificity of mathematical discourse, e.g. POS taggers (Schöneberg and Sperber, 2014), with the objective of using linguistic features to improve the search for definitions of mathematical expressions (Pagel and Schubotz, 2014).

Maximum bipartite matching in NLP
Global models for maximum weighted bipartite matching problems have been explored in NLP for the task of word alignment, a traditional component of machine translation systems (Matusov et al., 2004; Taskar et al., 2005b; Bhagwani et al., 2012; Wang and Lepage, 2016), or for assigning arguments to predicates (Lluís et al., 2013). In particular, Taskar et al. (2005b) introduced a discriminative global model with a max-margin objective.

In these articles, the bipartite graph is usually formed by two sentences. In contrast, we predict matchings on graphs that are an order of magnitude larger, and each node in our bipartite graph is a complete text (a statement or a proof), i.e. a highly structured object, from which we learn fixed-size vector representations.
Given a collection of mathematical statements {s_i}_{i≤N} and a separate, equal-size collection of mathematical proofs {p_i}_{i≤N}, we are interested in the problem of assigning a proof to each statement.

Evaluation
We use two evaluation metrics. Assuming that a system predicts a ranking of proofs, instead of providing only a single proof, we evaluate its output with the Mean Reciprocal Rank (MRR) measure:

MRR({r̂_i}_{i ∈ {1, ..., N}}) = (1/N) Σ_{i=1}^{N} 1/r̂_i,

where N is the number of examples and r̂_i is the rank of the gold proof for statement number i, as predicted by the system.

As a second evaluation metric, we use a simple accuracy, i.e. the proportion of statements whose first-ranked proof is correct.

By construction (see Section 4), it is possible though unlikely that the same mathematical statement occurs several times in the dataset. It is all the more unlikely that several occurrences have exactly the same formulation and use the same variable names. Therefore, we consider a match to be correct if and only if it is associated with its original proof.

Task variation
We propose three variations of the task, depending on the input of the system:
1. Natural language text and mathematical formulae;
2. Natural language text only;
3. Mathematical formulae only.
The comparison of these settings is meant to provide insight into which type of information is crucial to the task.
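For concreteness, the two evaluation metrics defined above (MRR and accuracy) can be computed from a square score matrix as follows; this is a minimal NumPy sketch, and the function and variable names are ours:

```python
import numpy as np

def mrr_and_accuracy(scores: np.ndarray):
    """Compute MRR and accuracy from an n x n score matrix.

    scores[i, j] is the model's score for pairing statement i with
    proof j; by construction, the gold proof for statement i is
    proof i (the diagonal).
    """
    n = scores.shape[0]
    gold = scores[np.arange(n), np.arange(n)]
    # A proof's rank is 1 + the number of proofs scored strictly higher.
    ranks = 1 + (scores > gold[:, None]).sum(axis=1)
    mrr = float(np.mean(1.0 / ranks))
    accuracy = float(np.mean(ranks == 1))
    return mrr, accuracy

scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.1, 0.7],
                   [0.1, 0.3, 0.8]])
mrr, acc = mrr_and_accuracy(scores)
# gold ranks: statement 0 -> 1, statement 1 -> 3, statement 2 -> 1,
# so MRR = (1 + 1/3 + 1) / 3 and accuracy = 2/3
```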
This section describes the construction of a dataset of statement-proof pairs (see Figure 1).

Source corpus
We use the MREC corpus (Líška et al., 2011) as a source. The MREC corpus contains around 450k articles from ArxMLiV (Stamerjohanns et al., 2010), an on-going project aiming at converting the arXiv repository from LaTeX to XML, a format more suited to machine processing. In this collection, mathematical formulae are represented in the MathML format, a markup language.

Statement-proof identification
For each XML document (corresponding to a single arXiv article), we extract pairs of consecutive statements and proofs.
Mathematical formulae in the XML documents are enclosed in a <math> element.
• Font information. In mathematical discourse, fonts play an important role. Their semantics depend on conventions shared by researchers. If the same letter appears in an article in two distinct fonts, the two occurrences most likely represent different mathematical objects, e.g. a scalar and a vector. Therefore, we use distinct symbols for tokens that are in distinct fonts.
• Math-English ambiguity. Some symbols can be used both in natural language text and in formulae. For example, 'a' can be a determiner in English, or a variable name in a formula. To avoid increasing ambiguity when linearizing formulae, we type each symbol (as math or text) to make the mathematical vocabulary completely disjoint from the text vocabulary.
Both these preprocessing steps had a beneficial effect on the baselines in preliminary experiments.
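As an illustration, the two preprocessing steps above amount to a simple token-typing pass; the tagging scheme below is our own sketch, not the exact format of the released dataset:

```python
def type_tokens(tokens):
    """Map raw tokens to typed symbols so that the math and text
    vocabularies are disjoint and font distinctions are preserved.

    Each token is a (surface, kind, font) triple, where kind is
    'math' or 'text' and font is e.g. 'italic' or 'bold' (None for
    text tokens).
    """
    typed = []
    for surface, kind, font in tokens:
        if kind == "math":
            # Distinct fonts yield distinct symbols
            # (e.g. a scalar x vs a vector x).
            typed.append(f"MATH:{font}:{surface}")
        else:
            typed.append(f"TEXT:{surface}")
    return typed

# 'a' as an English determiner and 'a' as an italic variable are
# mapped to different vocabulary entries:
toks = [("a", "text", None), ("a", "math", "italic"), ("x", "math", "bold")]
# type_tokens(toks) -> ['TEXT:a', 'MATH:italic:a', 'MATH:bold:x']
```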
Statistics
We report in Table 1 some statistics about the dataset. The extracted articles were from a diverse set of mathematical subdomains, and connected domains, such as computer science (746 articles from 30 subcategories) and mathematical physics (2562 from 31 subcategories). There are on average 6.6 statement-proof pairs per article.

We report statistics about the size of statements and proofs in number of tokens in Table 2. We report the number of tokens in formulae (math), in the text itself (text) and in both (text+math). On average, proofs are much longer than statements. Statements and proofs have approximately the same proportion of text and math. Overall, the variation in number of tokens across statements and proofs is extremely high, as illustrated by the standard deviation (SD) of all presented metrics.
https://github.com/fred-wang/webextension-content-mathml-polyfill

Number of documents in the MREC corpus: 439,423
Extracted documents with statement-proof pairs: 27,841
Total number of statement-proof pairs: 184,094
Number of (primary) categories: (120) 135
Average number of categories per article: 1.7

Most represented primary categories | Num. articles | Num. pairs
math.AG Algebraic Geometry | 2848 | 22029
math.DG Differential Geometry | 2030 | 12440
math.CO Combinatorics | 1705 | 10548
math.GT Geometric Topology | 1539 | 9234
math.NT Number Theory | 1454 | 9521
math.PR Probability | 1422 | 7660
math.AP Analysis of Partial Differential Equations | 1386 | 6981
math-ph Mathematical Physics | 1249 | 6491
math.FA Functional Analysis | 1143 | 8011
math.GR Group Theory | 970 | 7806
math.DS Dynamical Systems | 961 | 6424
math.QA Quantum Algebra | 944 | 8074
math.OA Operator Algebras | 923 | 8050

Table 1: Statistics about the dataset and categories of mathematical articles.

Statements | Min | Max | Mean
Text+math | 20 | 500 | 80

Table 2: Number of tokens in the dataset. We report for statements and proofs the minimum, maximum and average number of tokens, broken down by type ('math' for tokens extracted from formulae and 'text' for the others). A value of 0 for, e.g., the 'math only' row means that the statement or proof does not contain mathematical symbols or formulae.

We propose a system based on a self-attentive encoder (Vaswani et al., 2017) that constructs fixed-size vector representations for statements and proofs, and a similarity function that scores the relatedness of a statement-proof pair.
Self-attentive encoder
We encode each text with a token-level self-attentive encoder. We first project a text to a sequence of token embeddings of dimension w. Then we run ℓ self-attention layers (Vaswani et al., 2017) to obtain a contextualized embedding for each token. Finally, we construct a vector representation for the text with a max-pooling layer over the contextualized embeddings of the last self-attention layer.

The hyperparameters of the encoder are the dimension of the token embeddings w, the number of self-attentive layers ℓ, the dimension of the encoder d (size of contextualized embeddings), the number of heads for each self-attentive layer h, and the dimension of the query and key vectors d_k.

Trainable bilinear similarity function
Given the encoded representations of a statement s = enc(s) and a proof p = enc(p), we compute an association score with the following bilinear form:

score(s, p) = s^T · W · p + b,

where W and b are parameters that are learned together with the self-attentive module parameters.

Local decoding
For a collection of n statements and proofs, we first score all possible pairs (s, p) and construct a matrix M = (m_ij) ∈ R^{n×n}, with

m_ij = score(s^(i), p^(j)),

where s^(i) and p^(j) are the encoded representations of, respectively, the i-th statement and the j-th proof. Then we can straightforwardly sort each row in decreasing order and assign the resulting proof ranking to the corresponding statement. The best-ranking proof p̂_i for statement i satisfies:

p̂_i = argmax_j m_ij.

We call this decoding method 'local', since it does not take into account dependencies between assignments. In particular, several statements may have the same highest-ranking proof.
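A minimal PyTorch sketch of the architecture described so far (self-attentive encoder with max-pooling, bilinear scorer, and local decoding over the score matrix); the module names and hyperparameter values are ours, and details such as padding masks and positional information are omitted:

```python
import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    def __init__(self, vocab_size, d=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Bilinear form: score(s, p) = s^T W p + b
        self.W = nn.Parameter(torch.randn(d, d) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def encode(self, token_ids):
        # (batch, seq) -> contextualized embeddings -> max-pool to (batch, d)
        h = self.encoder(self.embed(token_ids))
        return h.max(dim=1).values

    def score_matrix(self, statements, proofs):
        s = self.encode(statements)          # (n, d)
        p = self.encode(proofs)              # (n, d)
        return s @ self.W @ p.T + self.b     # (n, n): entry (i, j) = score(s_i, p_j)

model = MatchingModel(vocab_size=100)
stmts = torch.randint(0, 100, (5, 12))   # 5 statements of 12 tokens
proofs = torch.randint(0, 100, (5, 30))  # 5 proofs of 30 tokens
M = model.score_matrix(stmts, proofs)
local_assignment = M.argmax(dim=1)       # local decoding: best proof per statement
```

Note that local decoding is a simple row-wise argmax, so nothing prevents two statements from receiving the same proof, which motivates the global decoding method below.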
Global decoding
The local decoding method overlooks a crucial piece of information: a proof should correspond to a single statement. In a worst-case situation, a small number of proofs may score high with most statements and be systematically assigned as highest-ranking proof by the local decoding method.

During preliminary experiments, we analysed the output of our system with local decoding on the development set, focusing on the distribution of the single highest-ranking proof for each statement. It turned out that a sizeable proportion of proofs were assigned to at least two different statements, whereas many other proofs were assigned to no statement at all (Table 3).

Table 3: Cumulative distribution of proofs in the development set, by number of statements to which they are assigned with the local decoding method.

We propose a second decoding method based on a global constraint on the output: a proof can be assigned only to a single statement. Intuitively, the constraint models the fact that if a proof is assigned by the system to a certain statement with high confidence, we can rule it out as a candidate for other statements. Under this constraint, the decoding problem reduces to a classical maximum weighted bipartite matching problem, or equivalently, a Linear Assignment Problem (LAP). In more realistic scenarios (e.g. if the input sets of statements and proofs do not have the same size), the method would require some adaptation.

Formally, we define an assignment A as a boolean matrix A = (a_ij) ∈ {0, 1}^{n×n} with the following constraints:

∀i, Σ_j a_ij = 1 and ∀j, Σ_i a_ij = 1,

i.e. each row and each column of A contains a single non-zero coefficient. The score of an assignment A is the sum of the scores of the chosen edges:

score(A, M) = Σ_i Σ_j a_ij · m_ij.

Finally, global decoding consists in solving the following LAP:

Â(M) = argmax_{A ∈ {0,1}^{n×n} s.t. ∀i, Σ_j a_ij = 1 and ∀j, Σ_i a_ij = 1} score(A, M).

The LAP is solved in polynomial time by the Hungarian algorithm (Kuhn, 1955), the LAP-Jonker-Volgenant algorithm (LAP-JV; Jonker and Volgenant, 1987), or the push-relabel algorithm (Goldberg and Kennedy, 1995). These methods have an O(n³) time complexity, where n is the number of pairs, and an O(n²) memory complexity. This is too expensive in our case, due to the size of our datasets (more than 18,000 pairs in the development set).

To remedy this limitation, when we perform decoding on a large set, we only consider the k best-scoring proofs (i.e. outgoing edges in the bipartite graph) for each statement, which makes the number of edges linear in the number of pairs n (considering k fixed). Moreover, we use a modification of the LAP-JV algorithm specifically designed for sparse matrices (LAP-MOD; Volgenant, 1996).
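As an illustration, global decoding on a small (dense) score matrix reduces to a standard linear assignment call. The sketch below uses SciPy's `linear_sum_assignment` (a Jonker-Volgenant-style solver) instead of the `lap` package used in the paper, and the top-k pruning is a simplified stand-in for the sparse LAP-MOD setup:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_decode(M, k=None):
    """Return, for each statement i, the proof assigned to it under the
    one-proof-per-statement constraint (maximum weighted bipartite
    matching). If k is given, only the k best-scoring proofs per
    statement keep their scores; other edges get a large negative
    score, approximating pruning of the bipartite graph.
    """
    if k is not None:
        pruned = np.full_like(M, -1e9)
        topk = np.argpartition(-M, k - 1, axis=1)[:, :k]
        row_idx = np.arange(M.shape[0])[:, None]
        pruned[row_idx, topk] = M[row_idx, topk]
        M = pruned
    rows, cols = linear_sum_assignment(M, maximize=True)
    return cols  # cols[i] is the proof assigned to statement i

# One proof (column 2) dominates every statement locally, but the
# global constraint forces a one-to-one assignment:
M = np.array([[5.0, 1.0, 9.0],
              [1.0, 3.0, 8.0],
              [0.0, 1.0, 7.0]])
# local decoding assigns proof 2 to all three statements;
# global decoding yields the permutation maximizing the total score.
```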
We propose two training methods for the similarity model introduced above: a local training method that only considers statements in isolation (Section 6.1), and a global model trained to predict a bipartite matching (Section 6.2), with a hybrid global and local objective.
We would like to train our model to assign a high similarity to the gold statement-proof pair, and a low similarity to all other statement-proof pairs. This corresponds to the following objective, for a single statement s and its gold proof p:

L_LOC(s, p, P; θ) = −log P(p | s; θ) = −log [ e^{score(s, p)} / Σ_{p' ∈ P} e^{score(s, p')} ],

where P is the set of proofs, and θ are the parameters of the model. Directly optimizing this loss function requires the computation of p = enc(p) for every proof in the dataset, for a single optimization step. This is not realistic considering memory limitations, the size of the train set, and the fact that the self-attentive encoder is the most computationally expensive part of the network.

Instead, we sample minibatches of b pairs and optimize the following proxy loss for the sequence S' = (s_1, ..., s_b) of statements and the sequence P' = (p_1, ..., p_b) of corresponding proofs:

L'_LOC(S', P'; θ) = Σ_{i=1}^{b} L_LOC(s_i, p_i, P'; θ).

In practice, we sample uniformly and without replacement b pairs from the training set at each stochastic step. We also experimented with a Noise-Contrastive Estimation approach (Gutmann and Hyvärinen, 2012); however, it exhibited a much slower convergence rate.

The local training method only considers statements in isolation. Even though we expect a locally trained model to perform better with global decoding, we hypothesize that a model that is trained to predict the full structure (a bipartite matching) will be even better.

For a collection of n proofs and n statements, the size of the search space (i.e. the number of bipartite matchings) is n!, since each matching corresponds to a permutation of proofs. As a result, the use of a globally normalized model is impractical.
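With in-batch negatives, the proxy loss L'_LOC amounts to a softmax cross-entropy over each row of the b×b in-batch score matrix, with the diagonal as gold labels; a minimal PyTorch sketch (names are ours):

```python
import torch
import torch.nn.functional as F

def local_loss(M: torch.Tensor) -> torch.Tensor:
    """Batched local loss.

    M is the b x b matrix of scores for a minibatch of b
    statement-proof pairs, with M[i, j] = score(s_i, p_j); the gold
    proof for statement i is p_i, so the target of row i is column i.
    Row-wise cross-entropy implements -log softmax normalized over
    the in-batch proofs.
    """
    b = M.shape[0]
    targets = torch.arange(b, device=M.device)
    # Sum (not mean) over the batch, matching the summed objective.
    return F.cross_entropy(M, targets, reduction="sum")

# A score matrix whose diagonal dominates yields a small loss:
M_good = torch.tensor([[5.0, 0.0], [0.0, 5.0]])
M_bad = torch.tensor([[0.0, 5.0], [5.0, 0.0]])
# local_loss(M_good) < local_loss(M_bad)
```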
We turn to a max-margin model that does not require normalization over the full search space. We use the following max-margin objective, for a set B of n pairs corresponding to matrix M:

L_GLOBAL(B; θ) = max(0, Δ(Â, I) + score(Â, M) − score(I, M)),

where θ is the set of all parameters, Â is the predicted assignment, and I is the gold assignment, i.e. the identity matrix. The structured cost

Δ(Â, I) = Σ_ij max(0, (Â − I)_ij)

aims at enforcing a margin for each individual assignment. In order to compute the loss during training, we perform decoding on a matrix M' which directly incorporates the cost of wrong assignments (Taskar et al., 2005a):

M' = M + (1 − I),

where 1 denotes the all-ones matrix. The computation of this loss requires exact decoding for each optimization step. Since exact decoding is only feasible for a small n, and since we need to keep track of all intermediary vectors to compute the backpropagation step (in particular, the computation graph needs to conserve all encoding layers for the n texts involved), we perform each stochastic optimization step on a minibatch of pairs of size b. Since this global objective had a very slow convergence rate (see Section 7.1), in practice we optimize a hybrid local-global objective: L'_LOC + L_GLOBAL.

Our experiments address several questions. First, we assess the difficulty of the task and provide preliminary results with baseline systems. Secondly, we evaluate the performance of our neural model in several settings: global or local training, global or local decoding. In particular, we are interested in assessing whether global decoding improves accuracy when training is only local, and how the more complex global training method fares with respect to local training. Finally, we are interested in the informativeness of different types of input: text, mathematical formulae, or both.

We describe the experimental protocol (Section 7.1) before discussing results (Section 7.2).
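Before turning to the experiments, the cost-augmented max-margin objective above can be sketched as follows. The LAP solver runs on a detached copy of the score matrix, and gradients flow through the selected entries of M; we use SciPy's solver in place of the paper's `lap` package, and the names are ours:

```python
import torch
from scipy.optimize import linear_sum_assignment

def global_loss(M: torch.Tensor) -> torch.Tensor:
    """Structured max-margin loss for one minibatch.

    M[i, j] = score(s_i, p_j); the gold assignment is the identity.
    Decoding runs on M' = M + (1 - I), which adds a margin of 1 for
    every wrong cell (cost-augmented decoding), and the loss is
    max(0, Delta(A_hat, I) + score(A_hat, M) - score(I, M)).
    """
    b = M.shape[0]
    eye = torch.eye(b, dtype=M.dtype, device=M.device)
    M_aug = (M + (1.0 - eye)).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(M_aug, maximize=True)
    ri, ci = torch.as_tensor(rows), torch.as_tensor(cols)
    pred_score = M[ri, ci].sum()
    gold_score = M.diagonal().sum()
    # Structured cost: number of wrong assignments in A_hat.
    delta = float((rows != cols).sum())
    return torch.clamp(delta + pred_score - gold_score, min=0.0)

# If the diagonal wins by a margin larger than 1, the loss is 0;
# if the decoder prefers a wrong permutation, the loss is positive.
M_sep = torch.tensor([[5.0, 0.0], [0.0, 5.0]])
M_swap = torch.tensor([[0.0, 5.0], [5.0, 0.0]])
```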
We use the dataset whose construction is described in Section 4. We shuffle the collection of statement-proof pairs before performing an 80/10/10 train-development-test split, corresponding to 147,276 pairs for the training set and 18,409 pairs each for the development and test sets. Due to the shuffling, pairs from a single article may be distributed across the three sections.

Baselines
We provide two baseline systems that rank proofs according to their similarity to the statement, using classical similarity measures. The first baseline computes cosine similarities between TF-IDF representations of statements and proofs. The second baseline uses Dice's similarity measure computed over bag-of-words representations of statements and proofs:
Dice(s, p) = 2|s ∩ p| / (|s| + |p|),

where s and p are the word multiset representations of, respectively, a statement and a proof.

Both baselines are implemented using the scikit-learn Python package (Pedregosa et al., 2011) with default parameters. We estimate the IDF metric on the training set only.

Neural model
We implemented the neural network in PyTorch (Paszke et al., 2017). Token embeddings have w = 300 dimensions, and we use ℓ = 2 self-attentive layers with 4 heads to obtain contextualized embeddings of dimension d = 300. The query and key vectors have size d_k = 128.

We trained each model on a single GPU using the PyTorch implementation of the Averaged Stochastic Gradient Descent algorithm (ASGD; Polyak and Juditsky, 1992), with an exponential learning rate scheduler (the learning rate is multiplied by a constant factor smaller than 1 after each epoch).

Hyperparameters
For training a local model, we perform 400 epochs over the whole training set, assuming an epoch consists of N/b stochastic steps (where N is the total number of training pairs and b is the number of pairs in each minibatch). We evaluate the model's performance on the development set every 20 epochs and select the best model among these intermediate models. We use batches of size b = 60 based on preliminary experiments.

For global training, we perform 400 epochs (around 3 days with a single GPU) and use the same model selection method as in the local training experiments. We observed in initial experiments that training only with the global objective required a very long time and had a very slow convergence rate. Therefore, we used the following global-local objective, L'_LOC + L_GLOBAL, that we optimized by alternating one stochastic step for each loss. We use batches of size 60 for both the local loss and the global loss. Although the global model might benefit from larger batches, 60 was the maximum possible size given our memory resources.
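The alternating local/global optimization can be sketched as a training loop that takes one stochastic step per objective in turn; this is a schematic illustration with placeholder loss functions, not the paper's training code:

```python
import torch

def train_alternating(model_params, local_loss_fn, global_loss_fn,
                      batches, lr=0.1):
    """One epoch of the hybrid objective L'_LOC + L_GLOBAL, optimized
    by alternating one stochastic step on each loss (schematic)."""
    opt = torch.optim.ASGD(model_params, lr=lr)  # Averaged SGD, as in the paper
    for batch in batches:
        for loss_fn in (local_loss_fn, global_loss_fn):
            opt.zero_grad()
            loss = loss_fn(batch)
            loss.backward()
            opt.step()

# Toy check: both placeholder "losses" pull a scalar parameter
# toward zero, so alternating steps should drive it close to 0.
w = torch.nn.Parameter(torch.tensor(5.0))
f = lambda _: w ** 2
g = lambda _: (w - 0.0) ** 2
train_alternating([w], f, g, batches=range(50))
```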
Global decoding
Recall that exact global decoding is only feasible for a small subset of pairs. During global training, we chose a batch size small enough to perform exact decoding. However, it is not feasible to perform exact decoding on the whole development and test corpora. Therefore, we prune the search space by keeping only the 500 best candidate proofs for each statement, and use the LAP-MOD algorithm designed for sparse matrices. In practice, we used the implementations of the LAP-JV and LAP-MOD algorithms from the lap Python package (https://github.com/gatagat/lap), for respectively exact decoding on minibatches during global training and decoding on whole datasets during evaluation.

                  Local decoding          Global decoding
Input | Method | MRR | Accuracy | Accuracy
Dev
Both | Dice | 16.6 | 12.7 | 25.2
Both | TF-IDF | 29.9 | 23.8 | —
Text | Dice | 10.4 | 7.8 | 16.2
Text | TF-IDF | 27.9 | 22.7 | 26.3
Math | Dice | 13.3 | 10.0 | 10.4
Math | TF-IDF | 12.1 | 9.1 | 9.5
Test
Both | Dice | 16.8 | 12.9 | 25.4
Both | TF-IDF | 31.2 | 25.0 | 35.6
Text | Dice | 10.7 | 8.0 | 17.3
Text | TF-IDF | 27.8 | 22.4 | 26.4
Math | Dice | 13.6 | 10.2 | 11.1
Math | TF-IDF | 12.2 | 9.3 | 9.7

Table 4: Baseline results with the TF-IDF system and the word-overlap system (Dice), with either global or local decoding. The input to the systems is either only the textual parts, only the mathematical formulae, or both.

We report baseline results in Table 4. The best baseline is the TF-IDF model considering both text and mathematical formulae as input; it achieves an MRR of 29.9 and an accuracy of 23.8 (dev set). These results suggest that the task is not trivial, and that bag-of-words models are insufficiently expressive to solve it. In contrast, our best self-attentive model (Table 5) outperforms all baselines by a wide margin, obtaining an MRR of 64.5 and an accuracy of 57.8 (dev set, local decoding). However, the neural model fails to improve over the baselines in the text-only setting, perhaps due to the fact that the limitation in this setting is the lack of sufficient information, which cannot be compensated by higher model expressiveness.
Global decoding with local training
In all settings, the use of global decoding substantially improves accuracy. This improvement is also observed with the baselines.
Global training
We obtain a substantial improvement over local training when incorporating the global loss. However, the improvement is much larger for models that already obtain high results (i.e. the math-only and math-text settings).
Effect of input type
For the baselines, we observe that using both mathematical formulae and text gives the best results. The baseline models using only text outperform the neural models using the same input, as well as the baselines in the math-only setting. The pattern is different for neural models: the models using only math input are the best and slightly outperform models with both text and math input. This result suggests that mathematical formulae are crucial to solve the task and best used with an expressive neural model.
Qualitative analysis
Upon inspection of our global model's incorrect predictions ('both' setting) on the development set, we found that a common source of confusion is due to the proof often introducing discourse-new concepts and new variables, while not necessarily repeating discourse-given concepts that occur in the statement. As a result, the set of variables and concepts in a proof might better match those of another statement. We provide examples of the model's output in Appendix A (supplementary material). Finally, incorrectly predicted proofs often contain highly polysemous words (linearly, components) that also occur in the statement.

Training:              Local              Global
Decoding:              Local    Global    Global
Input | MRR | Accuracy | Accuracy | Accuracy
Dev
Both | 63.2 | 56.1 | 61.4 | 65.6
Text | 21.0 | 15.3 | 16.4 | 18.3
Math | 64.5 | 57.8 | — | —
Test
Both | 63.5 | 56.2 | 61.6 | 66.2
Text | 21.6 | 15.8 | 16.6 | 18.1
Math | 64.4 | 57.7 | 62.8 | 67.8

Table 5: Self-attentive model results for each setting: local or global training, local or global decoding.

We have introduced a new task focusing on the domain of mathematical research articles. The task consists in assigning a proof to a mathematical statement. We have constructed a dataset made of 184k statement-proof pairs for the task and assessed its difficulty with two classical bag-of-words baselines. Finally, we have introduced a global neural model for addressing the structured prediction problem of maximum weighted bipartite matching. The model is based on a self-attentive encoder and a bilinear similarity function. Our experiments show that bag-of-words baselines are insufficient to solve the task, and are outperformed by our proposed model by a wide margin. We found that global decoding is crucial to achieve high results, and is further enhanced by a global training loss. Finally, our results show that mathematical formulae are the most informative source of information for the task, but are best taken into account with the self-attentive neural model.

References
Sumit Bhagwani, Shrutiranjan Satapathy, and Harish Karnick. 2012. sranjans: Semantic textual similarity using maximal weighted bipartite graph matching. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 579–585. Association for Computational Linguistics.

Andrew V. Goldberg and Robert Kennedy. 1995. An efficient cost scaling algorithm for the assignment problem. Mathematical Programming, 71(2):153–177.

Mihai Grigore, Magdalena Wolska, and Michael Kohlhase. 2009. Towards context-based disambiguation of mathematical expressions. In The joint conference of ASCM 2009 and MACIS 2009: 9th international conference on Asian symposium on computer mathematics and 3rd international conference on mathematical aspects of computer and information sciences, Fukuoka, Japan, December 14–17, 2009. Selected papers, pages 262–271. Fukuoka: Kyushu University, Faculty of Mathematics.

Michael Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361.

R. Jonker and A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340.

Giovanni Yoko Kristianto, Minh Quoc Nghiem, Yuichiroh Matsubayashi, and Akiko Aizawa. 2012. Extracting definitions of mathematical expressions in scientific papers. In JSAI.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.

Martin Líška, Petr Sojka, Michal Růžička, and Petr Mravec. 2011. Web Interface and Collection for Mathematical Retrieval: WebMIaS and MREC. In Towards a Digital Mathematics Library, pages 77–84, Bertinoro, Italy. Masaryk University.

Xavier Lluís, Xavier Carreras, and Lluís Màrquez. 2013. Joint arc-factored parsing of syntactic and semantic dependencies. Transactions of the Association for Computational Linguistics, 1:219–230.

Committee on Planning a Library of the Mathematical Sciences. 2014. Developing a 21st Century Global Library for Mathematics Research. ArXiv e-prints, abs/1404.1905.

Evgeny Matusov, Richard Zens, and Hermann Ney. 2004. Symmetric word alignments for statistical machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics.

Minh Nghiem Quoc, Keisuke Yokoi, Yuichiroh Matsubayashi, and Akiko Aizawa. 2010. Mining coreference relations between formulas and text using Wikipedia. In Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010), pages 69–74. Coling 2010 Organizing Committee.

Sebastian Padó and Mirella Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1161–1168, Sydney, Australia. Association for Computational Linguistics.

Robert Pagel and Moritz Schubotz. 2014. Mathematical language processing project. CoRR, abs/1407.0167.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Bartosz Piotrowski and Josef Urban. 2019. Guiding theorem proving by recurrent neural networks. CoRR, abs/1905.07961.

B. T. Polyak and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Ulf Schöneberg and Wolfram Sperber. 2014. POS tagging and its applications for mathematics. CoRR, abs/1406.2880.

Moritz Schubotz, Alexey Grigorev, Marcus Leich, Howard S. Cohl, Norman Meuschke, Bela Gipp, Abdou S. Youssef, and Volker Markl. 2016. Semantification of identifiers in mathematics for better math information retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 135–144, New York, NY, USA. ACM.

Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev, Catalin David, and Bruce Miller. 2010. Transforming large collections of scientific publications to XML. Mathematics in Computer Science, 3(3):299–307.

Yiannos Stathopoulos, Simon Baker, Marek Rei, and Simone Teufel. 2018. Variable typing: Assigning meaning to variables in mathematical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 303–312. Association for Computational Linguistics.

Yiannos Stathopoulos and Simone Teufel. 2015. Retrieval of research-level mathematical information needs: A test collection and technical terminology experiment. In
Proceedings of the 53rd AnnualMeeting of the Association for Computational Lin-guistics and the 7th International Joint Conferenceon Natural Language Processing (Volume 2: ShortPapers) , pages 334–340. Association for Computa-tional Linguistics.Yiannos Stathopoulos and Simone Teufel. 2016. Math-ematical information retrieval based on type embed-dings and query expansion. In
Proceedings of COL-ING 2016, the 26th International Conference onComputational Linguistics: Technical Papers , pages2344–2355. The COLING 2016 Organizing Com-mittee.Ben Taskar, Vassil Chatalbashev, Daphne Koller, andCarlos Guestrin. 2005a. Learning structured pre-diction models: A large margin approach. In
Pro-ceedings of the 22Nd International Conference onMachine Learning , ICML ’05, pages 896–903, NewYork, NY, USA. ACM.Ben Taskar, Simon Lacoste-Julien, and Dan Klein.2005b. A discriminative matching approach to wordalignment. In
Proceedings of Human LanguageTechnology Conference and Conference on Empiri-cal Methods in Natural Language Processing .Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, Ł ukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In I. Guyon, U. V. Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-nett, editors,
Advances in Neural Information Pro-cessing Systems 30 , pages 5998–6008. Curran Asso-ciates, Inc.Andreas Vlachos and Sebastian Riedel. 2014. Factchecking: Task definition and dataset construction.In
Proceedings of the ACL 2014 Workshop on Lan-guage Technologies and Computational Social Sci-ence , pages 18–22, Baltimore, MD, USA. Associa-tion for Computational Linguistics.A. Volgenant. 1996. Linear and semi-assignment prob-lems: A core oriented approach.
Computers & Op-erations Research , 23(10):917 – 932.Hao Wang and Yves Lepage. 2016. Yet another sym-metrical and real-time word alignment method: Hi-erarchical sub-sentential alignment using f-measure.In
Proceedings of the 30th Pacific Asia Conferenceon Language, Information and Computation: OralPapers , pages 143–152. Magdalena Wolska, Mihai Grigore, and MichaelKohlhase. 2011. Using discourse context to inter-pret object-denoting mathematical expressions. In
DML 2011 - Towards a Digital Mathematics Library,Proceedings .Richard Zanibbi, Akiko Aizawa, Michael Kohlhase,Iadh Ounis, Goran Topic, and Kenny Davila. 2016.NTCIR-12 mathir task overview. In
Proceedings ofthe 12th NTCIR Conference on Evaluation of Infor-mation Access Technologies, National Center of Sci-ences, Tokyo, Japan, June 7-10, 2016 . A Output Examples
We provide examples of incorrect outputs by the globally trained model ('both' setting) in Figures 2 and 3. In both cases, the predicted proof contains variable names or concepts from the statement that do not occur in the gold proof.

Figure 2: Example of a wrong prediction; word overlaps are highlighted in orange (present in both gold and predicted proof), red (only in predicted proof), and blue (only in gold proof).

Statement (https://arxiv.org/pdf/math/0511162.pdf): Corollary 6.13. If G is a compact connected Lie group, then for any maximal abelian connected closed subgroup of H

Gold proof: Proof. The only thing to observe is that we do not need any definability assumptions. The definability comes for free since any compact Lie group G is isomorphic to a compact subgroup K of GL(n, R) for some n (see [4, Ch. 3, Thm. 4.1, p. 136]) and any such K is a (real) algebraic subgroup of GL(n, R) [5, Prop. 2, p. 230], hence it is definable in the o-minimal structure (R, <, +, ·).

Predicted proof (https://arxiv.org/pdf/math/0611764.pdf): Proof. By passing to the universal covering of G, we may assume that G is simply connected. Theorem 3.18.12 in [12] states that in this case every analytic subgroup of G is closed and simply connected. Then the result follows from the proof of Theorem 3.

Figure 3:

Statement (https://arxiv.org/pdf/math/9902050.pdf): Let G◦ be the identity component of G, let H be a closed, connected subgroup of G, and let Γ be a discrete subgroup of G. Then:
1. Γ acts properly on G/H if and only if Γ ∩ G◦ acts properly on G◦/H.
2. Γ\G/H is compact if and only if (Γ ∩ G◦)\G◦/H is compact.

Gold proof: Proof. (1) Because every element of the Weyl group of G has a representative in G◦ [BT1, Cor. 14.6], we see that G and G◦ have the same positive Weyl chamber A+, and the Cartan projection G◦ → A+ is the restriction of the Cartan projection G → A+. Thus, the desired conclusion is immediate from Corollary 3.9. (2) This is an easy consequence of the fact that G/G◦ is finite [Mo2, Appendix].

Predicted proof (https://arxiv.org/pdf/math/0209275.pdf): Proof. Because G is reductive, there is a subgroup H of G which is semi-simple and such that the quotient G/H is an extension of a finite group by a torus. Note that the quotient group G/H acts on the ring of invariants S^H for the semi-simple group: ḡ ∈ G/H acts on f ∈ S^H by g · h where g is any lifting to G of ḡ. It is easy to verify that S^G = (S^H)^{G/H}. Because H is semi-simple, the ring S^H is Gorenstein. Thus by the preceding lemma, it is strongly F-regular. On the other hand, G/H is linearly reductive and thus the inclusion (S^H)^{G/H} ↪ S^H is split by the Reynolds operator. This splitting descends to characteristic p for all p > 0. Therefore, because S^H is strongly F-regular in almost all fibers, so is its direct summand S^G = (S^H)^{G/H}.
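The globally trained model decodes all statement–proof assignments jointly as a maximum-weight bipartite matching problem. As a rough illustration only (not the paper's actual implementation), the decoding step can be sketched with SciPy's `linear_sum_assignment`, which implements a variant of the Jonker–Volgenant algorithm cited above; the 3×3 score matrix below is a hypothetical toy example standing in for model scores.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical scores: scores[i, j] is a model's compatibility score for
# pairing statement i with proof j (toy values, not outputs of a real model).
scores = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.4, 0.7],
])

# linear_sum_assignment minimises total cost, so negate the scores
# to obtain the maximum-weight matching.
rows, cols = linear_sum_assignment(-scores)
assignment = dict(zip(rows.tolist(), cols.tolist()))
print(assignment)  # each statement is assigned a distinct proof
```

A local (greedy) decoder would instead take the argmax of each row independently, which can assign the same proof to several statements; the global matching constraint rules this out.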