Why Machines Cannot Learn Mathematics, Yet
André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William Grosky, Bela Gipp

University of Wuppertal, Wuppertal, Germany ({last}@uni-wuppertal.de)
University of Michigan-Dearborn, Dearborn, USA ({truas,wgrosky}@umich.edu)
National Institute of Informatics, Tokyo, Japan ([email protected])

Preprint of: A. Greiner-Petter et al. "Why Machines Cannot Learn Mathematics, Yet". In: Submitted to the 4th BIRNDL workshop at the 42nd SIGIR, Paris, France, July 2019.
Abstract.
Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed through less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we investigate the missing aspects that would allow mathematics to be learned by computers.
Keywords:
Mathematical Information Retrieval, Machine Learning, Word Embeddings, Math Embeddings, Mathematical Objects of Interest
Mathematics is capable of explaining complex concepts and relations in a compact, precise, and accurate way. Learning this idiom takes time and is often difficult, even for humans. The general applicability of mathematics allows a certain level of ambiguity in its expressions. This ambiguity is regularly mitigated by short explanations following or preceding the expressions, which serve as context for the reader. Along with context dependency, inherent issues of linguistics (e.g., ambiguity, non-formality) make it even more challenging for computers to understand mathematical expressions. That said, a system capable of automatically capturing the semantics of mathematical expressions would be suitable for several applications, from improving search engines to recommender systems.

During our evaluations of MathMLBen [31], a benchmark for converting mathematical LaTeX expressions into MathML, we noticed several fundamental problems that generally affect prominent ML approaches for learning the semantics of mathematical expressions. For instance, the first entry of the benchmark,

    W(2, k) > 2^k / k^ε    (1)

is extracted from the English Wikipedia page about Van der Waerden's theorem (https://en.wikipedia.org/wiki/Van_der_Waerden’s_theorem). Without further explanation, the symbols W, k, and ε might have several possible meanings. Depending on which one is considered, even the structure of the formula may differ. If we consider W as a variable instead of a function, the interpretation of W(2, k) changes to a multiplication. Learning connections, such as the one between W and the entity 'Van der Waerden's number', requires a large, specifically labeled scientific database that contains these mathematical objects.
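The structural consequence of this ambiguity can be made concrete in a few lines of Python. This is a hypothetical illustration with values of our own choosing, not part of the benchmark: the same expression string W(2, k) admits two different readings depending on the role assigned to W.

```python
# Hypothetical illustration: the same expression admits two parses,
# depending on whether W is read as a function or as a variable.
def as_function(W, k):
    # W applied to the pair (2, k)
    return W(2, k)

def as_product(W, k):
    # juxtaposition read as multiplication: a scalar reading
    # multiplies the variable W by 2 and by k
    return W * 2 * k

print(as_function(lambda r, k: r ** k, 3))  # 2^3 = 8
print(as_product(5, 3))                     # 5 * 2 * 3 = 30
```

The two readings produce structurally different parse trees, which is exactly why a learner must resolve the role of W before any semantics can be extracted.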
Furthermore, a fundamental understanding of the mathematical expression would increase the performance of the learning process, e.g., knowing that W(2, k) and W(n, k) contain the same function.

Word embedding techniques have received significant attention over the last years in the Natural Language Processing (NLP) community, especially after the publication of word2vec [18]. Recently, more and more projects try to adapt this knowledge to solve Mathematical Information Retrieval (MIR) tasks [4, 12, 36, 34]. While all of these projects follow similar approaches and obtain promising results, all of them fail to understand mathematical expressions because of the same fundamental issues. In this paper, we explore some of the main aspects that we believe are necessary for computer systems to learn mathematics. Based on our evaluations of word embedding techniques on the arXMLiv 2018 [5] dataset, we explain why current ML approaches are not applicable to MIR tasks, yet.

Understanding mathematical expressions essentially means comprehending the semantic value of their internal components, which can be accomplished by linking their elements to the corresponding mathematical definitions. Current MIR approaches [11, 32, 30] try to extract textual descriptors of the parts that compose mathematical equations. Intuitively, two questions arise from this scenario: (i) how to determine which parts have their own descriptors, and (ii) how to identify correct descriptors over others. Answers to (i) are mostly concerned with choosing correct definitions of which parts of a mathematical expression should be considered as one mathematical object [10, 35, 31]. Current definitions, such as the content MathML 3.0 specification, are often imprecise.
For example, content MathML 3.0 uses csymbol elements for functions and specifies them as expressions that refer to a specific, mathematically-defined concept with an external definition. However, it is not clear whether W or the sequence W(2, k) (Equation (1)) should be declared as a csymbol. Another example is content identifiers, which MathML specifies as mathematical variables which have properties, but no fixed value. While content identifiers are allowed to have complex rendered structures (e.g., β_i), it is not permitted to enclose identifiers within other identifiers. Let us consider α_i, where α is a vector and α_i its i-th element. In this case, α_i should be considered a composition of three content identifiers, each one carrying its own individual semantic information, namely the vector α, the element α_i of the vector, and the index i. However, with the current specification, the definition of these identifiers would not be canonical. One possible workaround to represent such expressions in content MathML is to use a structure of four nodes, interpreting α_i as a function with the csymbol vector-selector. However, ML algorithms and MIR approaches would benefit from more precise definitions and a unified answer to (i). Most of the related work relies on these relatively vague definitions and on the analysis of content identifiers, focusing their efforts on (ii).

In [32], an approach is presented for scoring pairs of identifiers and definiens by the number of words between them. The approach is based on the assumption that correct definiens appear close to the identifier and to the complex mathematical expression that contains this same identifier. Kristianto et al. [11] introduce an ML approach in which they train a Support Vector Machine (SVM) that considers sentence patterns and other characteristics as features (e.g., part-of-speech (POS) tags, parse trees).
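The distance-based assumption behind [32] can be sketched as follows. This is a simplified toy scorer with a hypothetical 1/(1 + distance) weighting of our own, not the exact formula of [32]:

```python
def score_definiens_candidates(sentence_tokens, identifier, candidates):
    """Score candidate definiens by word distance to the identifier.

    Hypothetical scoring: closer candidates get higher scores
    (1 / (1 + distance)), loosely following the distance-based
    assumption described above.
    """
    positions = [i for i, tok in enumerate(sentence_tokens) if tok == identifier]
    scores = {}
    for cand in candidates:
        cand_positions = [i for i, tok in enumerate(sentence_tokens) if tok == cand]
        if not positions or not cand_positions:
            continue
        # smallest distance between any occurrence of the identifier
        # and any occurrence of the candidate
        dist = min(abs(p - c) for p in positions for c in cand_positions)
        scores[cand] = 1.0 / (1.0 + dist)
    return scores

tokens = "the Van_der_Waerden_number W ( 2 , k ) grows quickly".split()
print(score_definiens_candidates(tokens, "W", ["Van_der_Waerden_number", "quickly"]))
```

Under this assumption, the adjacent phrase scores higher than a distant word, which mirrors the intuition that correct definiens appear close to their identifier.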
Later, [30] combine the aforementioned approaches and use pattern recognition based on the POS tags of common identifier-definiens pairs, distance measurements, and an SVM, reporting a precision of 48.60% and a recall of 28.06%. (A definiens is a phrase that defines an identifier or mathematical object; considering Equation (1), the correct definiens for W is the phrase 'Van der Waerden's number'.) These results can be considered a baseline for MIR tasks.

More recently, some projects try to use embedding techniques to learn patterns of the correlations between context and mathematics. In the work of [4], they embed single symbols and train a model that is able to discover similarities between mathematical symbols. Similarly, Krstovski and Blei [12] use a variation of word embeddings (briefly discussed in Section 3) to represent complex mathematical expressions as single unit tokens for IR. In 2019, M. Yasunaga et al. [34] explore an embedding technique based on recurrent neural networks to improve topic models by considering mathematical expressions. They state that their approach outperforms topic models that do not consider mathematics in text and report a topic coherence improvement over the LDA (Latent Dirichlet Allocation) baseline. What all these embedding projects have in common is that they show promising examples and suppose a high potential, but do not evaluate their results for MIR. (Note that OpenMath is another specification specifically designed for encoding the semantics of mathematics. However, content MathML is an encoding of OpenMath, and the inherent problems of content MathML also apply to OpenMath.)

Questions (i), (ii), and other pragmatic issues are already under discussion in a bigger context, as data production continues to rise and digital repositories seem to be the future of any archive structure.
The National Research Council is making efforts to establish what they call the Digital Mathematics Library (DML), a project under the International Mathematical Union. The goal of this future project is to take advantage of new technologies and help solve the inability to search, relate, and aggregate information about mathematical expressions in documents on the web.

The word2vec [18] technique computes real-valued vectors for words in a document using two main approaches: skip-gram and continuous bag-of-words (CBOW). Both produce a fixed-length n-dimensional vector representation for each word in a corpus. In the skip-gram training model, one tries to predict the context of a given word, while CBOW predicts a target word given its context. In word2vec, context is defined as the adjacent neighboring words in a defined range, called a sliding window. The main idea is that the numerical vectors representing similar words should have close values if the words appear in similar contexts, often illustrated by the king-queen relationship (v_king − v_man ≈ v_queen − v_woman).

Extending word2vec's approaches, Le and Mikolov [13] propose Paragraph Vectors (PV), a framework that learns continuous distributed vector representations for any size of text segment (e.g., sentences, paragraphs, documents). This technique alleviates the inability of word2vec to embed documents as one single entity. It also comes in two distinct variations: Distributed Memory (DM) and Distributed Bag-of-Words (DBOW), which are analogous to the CBOW and skip-gram training models, respectively. However, in both approaches, an extra feature vector representing the text segment, named paragraph-id, is included as another word.
This paragraph-id is updated throughout the entire document, based on the currently evaluated context window for each word, and is used to represent the whole text segment.

Recently, researchers have been trying to improve their semantic representations by producing multiple vectors (multi-sense embeddings) based on a word's sense, context, and distribution in the corpus [7, 28]. Another concern with traditional techniques is that they often neglect lexical structures with valuable prior knowledge about semantic relations, such as WordNet [19], ConceptNet [15], and BabelNet [20]. These lexical structures offer a rich semantic environment that illustrates word senses, their use, and how they relate to each other. Some publications take advantage of the robustness provided by word embedding approaches and lexical structures, combining them into multi-sense representations and improving the overall performance in many NLP downstream tasks [16, 29, 25].

The lack of solid references and applications providing the same semantic structure of natural language for mathematical identifiers makes their disambiguation process even more challenging. In natural texts, one can try to infer the most suitable sense of a word based on the lemma itself (the canonical or dictionary form of a word), the adjacent words, dictionaries, thesauri, and so on. However, in the mathematical arena, the scarcity of resources and the flexibility of redefining identifiers make this issue even more delicate. The text preceding or following a mathematical equation is essential for its understanding.

More recently, [12] propose a variation of word embeddings for mathematical expressions. Their main idea relies on the construction of a distributed representation of equations, considering the word context vector of an observed word and its word-equation context window.
They treat equations as single-unit words (EqEmb), which eventually appear in the context of different words. They also explore the effects of considering the elements of mathematical expressions separately (EqEmb-U). In this scenario, mathematical equations are represented using a Symbol Layout Tree (SLT) [37], which contains the spatial relationships between symbols. While they present some interesting findings for retrieving entire equations, little is said about the vectors representing equation units and how they are described in their model. Word embedding techniques seem to have potential for semantic distance measures between complex mathematical expressions. However, they are not appropriate for extracting the semantics of identifiers separately. This is an indication that the problems of representing mathematical identifiers are tied to more fundamental issues, which are explained in Section 5.

The overall performance of word embedding algorithms has shown superior results in many different NLP tasks, such as machine translation [18], relation similarity [9], word sense disambiguation [2], word similarity [21, 29], and topic categorization [26]. In the same direction, we explore how well mathematical tokens can be embedded according to their semantic information. However, mathematical formulae are highly ambiguous, and if not properly processed, their representation is jeopardized.

There are two main standard formats for representing mathematics in science: LaTeX and MathML. The former is used by humans for writing scientific documents. The latter, on the other hand, is popular in web representations of mathematics due to its machine readability and XML structure. There has been a major effort to automatically convert LaTeX expressions to MathML [31]. However, neither LaTeX nor MathML is a practical format for embeddings. Considering the equation embedding techniques in [12], we devise three main types of mathematical embeddings.
Mathematical Expressions as Single Tokens:
EqEmb [12] uses entire mathematical expressions as one token. In this type, the inner structure of the mathematical expression is not taken into account. For example, Equation (1) is represented as one single token. Any other expression, such as W(2, k) in the text surrounding (1), is an entirely independent token. Therefore, this approach does not learn any connections between W(2, k) and (1). While this approach seems to hold interesting results for comparing mathematical expressions, it fails to represent the semantic aspects of the inner elements of mathematical equations.

Stream of Tokens:
Instead of embedding mathematical expressions as a single token, we can represent them through the sequence of their inner tokens. For example, considering only the identifiers in Equation (1), we would have a stream of three tokens: W, k, and ε. This approach has the advantage of learning all mathematical tokens. However, it also has some drawbacks. Complex mathematical expressions may lead to long chains of elements, which can be especially problematic when the window size of the training model is too small. Naturally, there are approaches to reduce the length of these chains. In Section 5, we present our own model, which uses a stream of mathematical identifiers and cuts out all other expressions. In [4], L. Gao et al. use CBOW and embed all mathematical symbols, including identifiers and operands, such as +, −, or variations of equalities =. In [34], they do not cut out symbols and train their model on the entire sequence of tokens that the LaTeX tokenizer generates. Considering Equation (1), this would result in a stream of 13 tokens. They use a long short-term memory (LSTM) architecture to handle longer chains of tokens and also to limit their length. Usually, in word embeddings, such behaviour is not preferred, since it increases the noise in the data, i.e., the data consists of many uninteresting tokens that negatively affect the trained model. We will see later in the paper (Section 3.1) that a typically trained model of mathematical embeddings is able to detect similarities between mathematical objects, but does not perform well in detecting connections to word descriptors. Therefore, we consider close relations of mathematical symbols to other mathematical symbols as noise. To mitigate this issue, we only work with mathematical identifiers and not any other symbols or structures.

Semantic Groups of Tokens:
The third approach to embedding mathematics is only theoretical, and concerns the aforementioned problems related to the vague definitions of identifiers and functions in a standardized format (e.g., MathML). As previously discussed, current MIR and ML approaches would benefit from a basic structural knowledge of mathematical expressions, such that variations of function calls (e.g., W(r, k) and W(2, k)) can be recognized as the same function. Instead of relying on a unified standard, current techniques use their own ad-hoc interpretations of structural connections, e.g., that α_i is one identifier rather than three [31, 30]. We assume that an embedding technique would benefit from a system that is able to detect the parts of interest in mathematical expressions prior to any training process. However, such a system does not yet exist.
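To make the stream-of-identifiers reduction described above concrete, the following hypothetical tokenizer keeps single-letter identifiers and a small whitelist of Greek-letter commands while dropping numbers, operators, and structure. The regex and whitelist are our own simplifications, not the tokenizer used in [34]:

```python
import re

# Match either a TeX command (e.g. \varepsilon) or a single Latin letter.
IDENTIFIER = re.compile(r"\\[a-zA-Z]+|[a-zA-Z]")

# Hypothetical whitelist of Greek-letter commands treated as identifiers;
# extend as needed for a real corpus.
GREEK = {"\\alpha", "\\beta", "\\varepsilon", "\\pi"}

def identifier_stream(latex):
    """Reduce a LaTeX formula to the stream of identifiers it contains."""
    tokens = IDENTIFIER.findall(latex)
    # keep plain letters and whitelisted Greek commands, drop other commands
    return [t for t in tokens if not t.startswith("\\") or t in GREEK]

print(identifier_stream(r"W(2,k) > 2^k / k^\varepsilon"))
```

Applied to Equation (1), this yields the identifier stream W, k, k, k, ε; collapsing repeated occurrences gives the three-token stream discussed above.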
The examples illustrated in [4, 12, 34] seem to be feasible as a new approach to distance calculations between complex mathematical expressions. While comparing mathematical expressions is essentially practical for search engines or automatic plagiarism detection systems, these approaches do not seem to capture the components of complex structures separately, which is necessary for other applications, such as automated reasoning. Another aspect to be considered is that [12] do not train mathematical identifiers, preventing their system from learning connections between identifiers and definiens (e.g., W(2, k) and the definiens 'Van der Waerden number'). Additionally, the connection between entire equations and definiens is, at some level, questionable: entire equations are rarely explicitly named, although exceptions are common for groundbreaking findings, such as Pythagoras' theorem or the energy-mass equivalence. However, in the extension EqEmb-U [12], they use an SLT representation to tokenize mathematical equations and to obtain specific unit vectors, which is similar to our identifiers-as-tokens approach.

In order to investigate the discussed approaches, we apply variations of a word2vec implementation to extract mathematical relations from the arXMLiv 2018 [5] dataset, an HTML collection of the arXiv.org preprint archive (https://arxiv.org/), which is used as our training corpus. We only consider the subsets that do not report errors during document conversion (i.e., no_problem and warning), which represent 70% of the archive. There are other approaches that also produce word embeddings given a training corpus as input, such as fastText [1], ELMo [23], and GloVe [22]. The choice of word2vec is justified by its implementation, general applicability, and robustness in several NLP tasks [9, 8, 14, 16, 25, 29]. Additionally, fastText learns word representations as a sum of the n-grams of their constituent characters (sub-words). This would add a certain noise to our experiments.
In ELMo, word vectors are computed as the average of their character representations, which are obtained through a two-layer bidirectional language model (biLM). This brings even more granularity than fastText, as each character in a word has its own n-dimensional vector representation. Another factor that prevents us from using ELMo, for now, is its expensive training process (https://github.com/allenai/bilm-tf). Closer to the word2vec technique, GloVe [22] is also considered, but its co-occurrence matrix would escalate the memory usage, making training on arXiv not possible at the moment. We also examine the recently published Universal Sentence Encoder (USE) [3] from Google, but its implementation does not allow one to use a new training corpus, only to access its pre-calculated vectors.

As a pre-processing step, mathematical expressions are represented using MathML notation (the source TeX file has to use mathematical environments for its expressions). Firstly, we replace every mathematical expression by the sequence of identifiers it contains, i.e., W(2, k) is replaced by 'W k'. Secondly, we remove all common English stopwords from the training corpus. Finally, we train a word2vec model (skip-gram) using the following hyperparameter configuration: a vector size of 300 dimensions, a window size of 15, a minimum word count of 10, and negative sampling (hyperparameters not mentioned here are used with their default values, as described in the Gensim API [27]).

The trained model is able to partially capture the semantics of mathematical identifiers. For instance, the closest vectors to the mathematical identifier f are mathematical identifiers themselves, and the fourth-closest noun vector to f is v_function. Inspired by the classic king-queen example, we explore which tokens perform best to model a known relation. Consider an approximation v_variable − v_a ≈ v_* − v_f, where v_variable represents the word variable, v_a the identifier a, and v_f the identifier f. We are looking for the v_* that best fits the approximation. We call this measure the semantic distance to f with respect to a given relation between two vectors. Table 1 shows the top 10 semantically closest results to f with respect to the relation between v_a and v_variable.

Table 1. Top 10 semantically closest results to f with respect to the relation between v_a and v_variable (by cosine distance): variables, function, appropriate, independent, instead, defined, namely, continuous, depends, represents.

We also perform an extensive evaluation on the first 100 entries of the MathMLBen benchmark [31] (the same entries used in [30]). We evaluate the average of the semantic distances with respect to the relations between v_variable and v_x, v_variable and v_a, and v_function and v_f. In addition, we only consider results above a minimum cosine-similarity threshold to maintain a minimum quality in our results. The overall results were poor in terms of both precision and recall. For the identifier W (Equation (1)), the evaluation presents four semantically close results: functions, variables, form, and the mathematical identifier q. Even though expected, the scale of these results is astonishing.

Additionally, we also try the Distributed Bag-of-Words version of Paragraph Vectors (DBOW-PV) [13] in combination with the approach of [30]. In [30], all occurrences of mathematical identifiers are analyzed, considering the entire article at once. We assume this prevents the algorithm from finding the right descriptor in the text, since later or earlier occurrences of an identifier might appear in a different context and therefore potentially introduce different meanings. Instead of using the entire document, we apply the algorithm of [30] only to the input paragraph and similar paragraphs given by our DBOW-PV model. Unfortunately, the variance within the paragraphs introduces a high number of false positives into the list of candidates, which negatively affects our performance. We also experiment with other hyperparameters when training our word embedding model to see whether the overall results can be improved.
However, while the performance decreases, no drastic structural changes appear in the model. Figure 1 illustrates a t-SNE plot of the model trained with 400 dimensions, a window size of 25, and a minimum count of 10 words, without any filters applied to the text. The plot is similar to the visualized model presented in [4], even though they use a different embedding technique. Compared to [4], we provide a bigger picture of the model, revealing dense clusters, e.g., for numbers (with the math token for invisible times nearby), for equation abbreviations such as eq1, and for logical operators. We highlight mathematical tokens in red and word tokens in blue. The plot in Figure 1 illustrates that mathematical tokens are close to each other.

Based on the presented results, one can still argue that more settings should be explored for the embedding phase (e.g., different embedding techniques or parameters) and that different pre-processing steps (e.g., stemming and lemmatization) should be adopted. This would probably solve some minor problems, such as removing the inaccurate first hit in Table 1. Nevertheless, the overall results would not improve to a point of being comparable to the findings of [30], which report a precision of 48.60%. The main reason is that mathematics, as a language, is highly customizable. Many of the defined relations between mathematical concepts and their descriptors are only valid in a local scope. Consider, for example, an author who denotes his algorithm by π. This does not change the general meaning of π, even though it affects the meaning within the scope of that article. Current ML approaches only learn patterns of the most frequently used combinations, e.g., between f and function, as seen in Table 1. Furthermore, we assume this is a general problem that different embedding techniques and tweaks of settings, such as those illustrated in Section 2, would not solve.
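The semantic-distance search described above can be illustrated with hand-crafted toy vectors. Real embeddings are learned and 300-dimensional; the vectors and vocabulary below are our own, chosen only so that the analogy is visible:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def closest_by_relation(vectors, a, b, f):
    """Find the token t that best fits  v_b - v_a ≈ v_t - v_f,
    i.e. the t maximising cosine(v_t, v_f + v_b - v_a)."""
    target = [vf + vb - va for vf, vb, va in zip(vectors[f], vectors[b], vectors[a])]
    candidates = [t for t in vectors if t not in (a, b, f)]
    return max(candidates, key=lambda t: cosine(vectors[t], target))

# Toy vectors, hand-crafted for illustration only.
vectors = {
    "a":        [0.9, 0.1, 0.1],
    "variable": [0.8, 0.2, 0.1],
    "f":        [0.1, 0.9, 0.1],
    "function": [0.1, 0.8, 0.2],
    "banana":   [0.1, 0.1, 0.9],
}
print(closest_by_relation(vectors, "a", "variable", "f"))  # function
```

With well-separated vectors, the relation v_variable − v_a carries f to function; the point of the evaluation above is that vectors trained on arXiv are not separated this cleanly.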
Therefore, in the following section, we present some concepts that we believe can help ML algorithms to better understand mathematics.

Fig. 1. t-SNE plot of the top 1000 closest vectors of the identifier f. For this plot, we used a perplexity of 80 and the default values of the t-SNE Python package for the other settings. (Note that t-SNE plots may misleadingly create clusters that do not exist in the model. To overcome this issue, we created several plots with different settings; the results remained similar to the plot shown, which is an indication that the visual clusters also exist in the model.)

A case study [33] has shown that 70% of mathematical symbols are explicitly declared in the context. Only four reasons cause an explicit declaration in the context: (a) a new mathematical symbol is defined, (b) a known notation is changed, (c) used symbols are present in other contexts and require specification to be properly interpreted, or (d) the author's declaration was redundant (e.g., for improving readability). We assume (d) is a rare scenario compared to (a-c), except in educational literature. Current math-embedding techniques can learn semantic connections only in those 70% of cases where the definiens are available. Besides (d), the algorithm would learn either rare notations (in case (a)) or ambiguous notations (in cases (b-c)). The flexibility that mathematical documents allow in (re)defining notations further adds to the complexity of learning mathematics.

How would it be possible for machines to learn mathematics? One of the major problems is the ambiguity of mathematical expressions (a-c). Natural languages also consist of ambiguous and context-sensitive words. A typical workaround for this problem is the use of lexical databases (e.g., WordNet [19]) to identify the most suitable sense for a given word.
However, mathematics lacks such a system, which makes this approach not feasible at the moment. In [35], the use of tags is proposed, similar to POS tags in linguistics, but for tagging mathematical TeX tokens, bringing more information to the tokens considered. As a result, a lexicon containing several meanings for a large set of mathematical symbols is developed. Such lexicons might enable the disambiguation approaches of linguistics to be used in mathematical embeddings in the near future.

Furthermore, learning algorithms would benefit from literature focused on (a) and (d) instead of (b-c). Similar to students who start to learn mathematics, ML algorithms have to consider the structure of the content they learn. It is hard to learn mathematics considering only arXiv documents, without prior or complementary knowledge. Usually, these documents present state-of-the-art findings containing new and unusual notations, and lack extensive explanations (e.g., due to page limitations). In contrast, educational books carefully and extensively explain new concepts. We assume better results can be obtained if ML algorithms are trained in multiple stages. A basic model trained on educational literature should capture standard relations between mathematical concepts and descriptors. This model should also be able to capture patterns independently of how new or unusual the notations in the literature are. In 2014, Matsuzaki et al. [17] presented some promising results for automatically answering mathematical questions from Japanese university entrance exams. While the approach involves many manual adjustments and analyses, the promising results illustrate the different levels of knowledge required for understanding arXiv documents versus university entrance-level exams. A well-structured digital mathematical library that distinguishes the different levels of progress in articles (e.g., introductions vs. state-of-the-art publications) would also benefit mathematical machine learning tasks.

Another problem in recent publications is the lack of standards for properly evaluating MIR algorithms, leading to several publications that present promising results without an extensive evaluation [4, 12, 34]. While ML algorithms in NLP benefit from extensive available training and testing datasets, ongoing discussions about interpretations of mathematical expressions [31] and imprecise standards thwart research progress in MIR. A common standard for interpreting the semantic structures of mathematics would help to overcome the issues of different evaluation techniques. Numerous applications (e.g., search engines) would benefit directly from unified interpretations and representations of mathematical semantics. Therefore, we introduce Mathematical Objects of Interest (MOI).

Currently, there are three approaches to tokenize and interpret the semantics of mathematical expressions: (1) tokenize the mathematical TeX string and tag the tokens with semantic information [35], (2) analyze the elements of presentational MathML [37], and (3) analyze the elements of content MathML [30, 31]. The Part-of-Math (POM) tagger [35] proposes a multi-scan approach for mathematical TeX strings that incorporates more semantic information into a parse tree with each iteration. The available first scan creates a parse tree and tags each node with further information about its symbol (similar to POS tags in linguistics). In [37], custom tree representations of presentational MathML (SLTs) are generated that allow wildcards at certain positions in the trees and improve search engine results. Applications of (3) are discussed in Section 2.

The goal of MOIs is to combine the advantages of concepts (1-3) and propose a unified solution for interpreting mathematical expressions. We suggest MOIs as a recursive tree structure in which each node is an MOI itself.
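A minimal sketch of such a recursive MOI structure could look as follows. The class and field names are hypothetical, since MOI is a proposal rather than an implemented standard; the example encodes α_i with its two sub-objects α and i:

```python
from dataclasses import dataclass, field

# Hypothetical minimal rendering of the MOI idea: a recursive tree
# whose every node is itself an MOI carrying its own semantic annotation.
@dataclass
class MOI:
    latex: str
    meaning: str
    children: list = field(default_factory=list)

alpha_i = MOI(r"\alpha_i", "the i-th element of the vector alpha", [
    MOI(r"\alpha", "a vector"),
    MOI("i", "an index"),
])

def count_nodes(moi):
    """Count all MOIs in the tree, including the root."""
    return 1 + sum(count_nodes(c) for c in moi.children)

print(count_nodes(alpha_i))  # 3
```

Each of the three nodes carries its own semantic annotation, which is exactly the property the current content MathML workaround cannot express.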
The current workaround for the problematic example of α_i as an element of the vector α in content MathML is vague and inappropriate for content-specific tasks. As an MOI, this expression would contain three nodes, with α_i as the parent node of the two leaves α and i. While it may first seem non-intuitive that α, as the vector, is a child node of one of its own elements, this structure is able to incorporate all three components of semantic information in the expression. Hence, an MOI structure should not be misinterpreted as a logical network explaining semantic connections between its elements, but as a highly flexible and lightweight structure for incorporating the semantic information of mathematical expressions.

We believe that a new standard, such as MOIs, has to be extensively studied and discussed by specialists in the field before its acceptance. Therefore, the introduction of MOIs in this paper is fairly broad, and they still need to be investigated further. However, we suggest that the use of MOIs would help ML algorithms by simplifying and unifying evaluations in MathIR projects.

In this paper, we explored how text embedding techniques (e.g., word2vec, PV) are unable to represent mathematical expressions adequately. After experimenting with popular mathematical representations in MIR, we exposed fundamental problems that prevent ML algorithms from learning mathematics. We also discovered the same problems in several related research projects. Many of these projects show promising examples without extensive evaluations, motivating more researchers to follow the same idea with equivalent fundamental issues. We presented concepts for enabling ML algorithms to learn mathematical expressions. Some of these concepts are generally time-consuming, such as a lexical database for mathematics. As a concrete contribution, we proposed MOIs, a unified solution for interpreting mathematical expressions semantically.
We think this is a necessary first step to enable unified evaluations and libraries.

In future work, we plan to explore the element distributions of mathematical expressions and compare them with known distributions in linguistics [24] in order to leverage MOIs. Preliminary results have shown that the distributions of mathematical objects also follow Zipf's law, similar to word distributions in natural language. Thus, we are currently analyzing term frequencies and inverse document frequencies, commonly known as tf-idf, of mathematics. This research should help to discover meaningful mathematical structures that are already in use in scientific publications. Such structures should be interpreted as MOIs. Therefore, this research will build a base for constructively discussing MOIs.
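The planned analysis can be sketched as follows: count math-token frequencies across a corpus of formula documents, rank them to inspect the Zipf-like shape, and score tokens per document with tf-idf. The toy corpus and tokenization below are invented for illustration; the actual analysis runs over the arXiv collection.

```python
import math
from collections import Counter

# Toy "corpus": each document is a pre-tokenized list of math tokens.
docs = [
    ["W", "(", "2", ",", "k", ")", ">", "2", "^", "k"],
    ["\\alpha", "_", "i", "=", "\\beta", "_", "i"],
    ["k", "^", "\\varepsilon", ">", "2"],
]

# Document frequency: in how many documents does a token occur?
df = Counter()
for doc in docs:
    df.update(set(doc))

def tf_idf(term: str, doc) -> float:
    """Standard tf-idf: relative term frequency times log inverse
    document frequency."""
    tf = doc.count(term) / len(doc)
    idf = math.log(len(docs) / df[term])
    return tf * idf

# Rank-frequency table over the whole corpus; under Zipf's law the
# frequency of the r-th most common token is roughly proportional to 1/r.
freq = Counter(tok for doc in docs for tok in doc)
ranked = freq.most_common()
```

Tokens with high tf-idf in many documents are candidates for recurring, meaningful structures, i.e., potential MOIs.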
Acknowledgments
This work was supported by the German Research Foundation (DFG grant GI-1259-1). We thank Howard Cohl, who provided insights and expertise.

References

[1] P. Bojanowski et al. "Enriching Word Vectors with Subword Information". In: Transactions of the Association for Computational Linguistics.
[2] Proc. 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China. ACL, 2015.
[3] D. Cer et al. "Universal Sentence Encoder for English". In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP). Ed. by E. Blanco and W. Lu. Association for Computational Linguistics, 2018.
[4] L. Gao et al. "Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?" In: CoRR abs/1707.05154 (2017).
[5] D. Ginev. arXMLiv:08.2018 dataset, an HTML5 conversion of arXiv.org. SIGMathLing – Special Interest Group on Math Linguistics. 2018. url: https://sigmathling.kwarc.info/resources/arxmliv/.
[7] E. H. Huang et al. "Improving Word Representations via Global Context and Multiple Word Prototypes". In: Proc. 50th Annual Meeting of the Association for Computational Linguistics (ACL) Vol. 1. Jeju Island, Korea: ACL, 2012.
[8] I. Iacobacci et al. "Embeddings for Word Sense Disambiguation: An Evaluation Study". In: Proc. 54th Annual Meeting of the Association for Computational Linguistics (ACL) Vol. 1, Berlin, Germany. ACL, 2016.
[9] I. Iacobacci et al. "SensEmbed: Learning Sense Embeddings for Word and Relational Similarity". In: Proc. 53rd Annual Meeting of the Association for Computational Linguistics (ACL) Vol. 1, Beijing, China. ACL, 2015.
[10] M. Kohlhase. "Math Object Identifiers – Towards Research Data in Mathematics". In: Lernen, Wissen, Daten, Analysen (LWDA) Conference Proceedings, Rostock, Germany, September 11-13, 2017. Ed. by M. Leyer. Vol. 1917. CEUR-WS.org, 2017.
[11] G. Y. Kristianto et al. "Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers". In: D-Lib Magazine.
[12] CoRR abs/1803.09123 (2018).
[13] Q. Le and T. Mikolov. "Distributed Representations of Sentences and Documents". In: Proceedings of the 31st International Conference on Machine Learning, ICML. Beijing, China: JMLR.org, 2014.
[14] J. Li and D. Jurafsky. "Do Multi-Sense Embeddings Improve Natural Language Understanding?" In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal. Association for Computational Linguistics, 2015.
[15] H. Liu and P. Singh. "ConceptNet – A Practical Commonsense Reasoning Tool-Kit". In: BT Technology Journal.
[16] Proc. 21st Conference on Computational Natural Language Learning (CoNLL), Vancouver, Canada. Association for Computational Linguistics, 2017.
[17] T. Matsuzaki et al. "The Most Uncreative Examinee: A First Step toward Wide Coverage Natural Language Math Problem Solving". In: Proc. Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada. Ed. by C. E. Brodley and P. Stone. AAAI Press, 2014.
[18] T. Mikolov et al. "Distributed Representations of Words and Phrases and Their Compositionality". In: Proceedings of the 26th International Conference on Neural Information Processing Systems, Volume 2. Lake Tahoe, Nevada: Curran Associates Inc., 2013.
[19] G. A. Miller. "WordNet: A Lexical Database for English". In: Commun. ACM.
[20] Artificial Intelligence 193 (2012).
[21] A. Neelakantan et al. "Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space". In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. Association for Computational Linguistics, 2014.
[22] J. Pennington et al. "GloVe: Global Vectors for Word Representation". In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. Vol. 14. Association for Computational Linguistics, 2014.
[23] M. Peters et al. "Deep Contextualized Word Representations". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018.
[24] S. T. Piantadosi. "Zipf's word frequency law in natural language: A critical review and future directions". In: Psychonomic Bulletin & Review.
[25] Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP). The Association for Computational Linguistics, 2016.
[26] M. T. Pilehvar et al. "Towards a Seamless Integration of Word Senses into Downstream NLP Applications". In: CoRR abs/1710.06632 (2017).
[27] R. Řehůřek and P. Sojka. "Software Framework for Topic Modelling with Large Corpora". In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. http://is.muni.cz/publication/884893/en. Valletta, Malta: ELRA, May 2010.
[28] J. Reisinger and R. J. Mooney. "Multi-prototype Vector-space Models of Word Meaning". In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California: Association for Computational Linguistics, 2010.
[29] T. Ruas et al. "Multi-sense Embeddings through a Word Sense Disambiguation Process". Pre-print.
[30] M. Schubotz et al. "Evaluating and Improving the Extraction of Mathematical Identifier Definitions". In: Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings. Ed. by G. J. F. Jones et al. Vol. 10456. Springer, 2017.
[31] M. Schubotz et al. "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context". In: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018. Ed. by J. Chen et al. ACM, 2018.
[32] M. Schubotz et al. "Semantification of Identifiers in Mathematics for Better Math Information Retrieval". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. Pisa, Italy: ACM, 2016.
[33] M. Wolska and M. Grigore. "Symbol Declarations in Mathematical Writing". In: Towards a Digital Mathematics Library. Ed. by P. Sojka. Paris, France: Masaryk University Press, 2010.
[34] M. Yasunaga and J. Lafferty. "TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts". In: CoRR abs/1902.06034 (2019).
[35] A. Youssef. "Part-of-Math Tagging and Applications". In: Proc. CICM. Ed. by H. Geuvers et al. Cham: Springer International Publishing, 2017.
[36] A. Youssef and B. R. Miller. "Deep Learning for Math Knowledge Processing". In: Proc. CICM. Ed. by F. Rabe et al. Vol. 11006. Springer, 2018.
[37] R. Zanibbi et al. "Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale". In: Proc. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy. Ed. by R. Perego et al. ACM, 2016.

Listing 1.1. Use the following BibTeX code to cite this article:

@inproceedings{GreinerPetter2019,
  author    = {Greiner-Petter, Andr\'{e} and Ruas, Terry and Schubotz, Moritz and Aizawa, Akiko and Grosky, William and Gipp, Bela},
  location  = {Paris, France},
  booktitle = {Submitted to 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries colocated at the 42nd International ACM SIGIR Conference},
  date      = {2019-07},
  title     = {Why Machines Cannot Learn Mathematics, Yet},
}