MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

Abstract

Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed and quality evaluation results. We show that MIaS is both efficient and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.


Petr Sojka
Masaryk University, Faculty of Informatics, Brno, Czech Republic

Michal Růžička
Masaryk University, Faculty of Informatics, Brno, Czech Republic

Vít Novotný
Masaryk University, Faculty of Informatics, Brno, Czech Republic


KEYWORDS

Math Information Retrieval, Digital Mathematical Libraries

ACM Reference Format:

Petr Sojka, Michal Růžička, and Vít Novotný. 2018. MIaS: Math-Aware Retrieval in Digital Mathematical Libraries. In The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3269206.3269233

© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy, https://doi.org/10.1145/3269206.3269233.

In mathematical discourse, formulae are often more important than text for understanding. As a result, digital mathematical libraries (DMLs) require math information retrieval (MIR) systems that recognize both text and math in documents and queries. Conventional IR systems represent both text and formulae using the bag-of-words vector-space model (VSM). However, the VSM captures neither the structural nor the semantic similarity between mathematical formulae, which makes it ill-suited for MIR.

To fill the gap, new math-aware IR systems started to appear after the pioneering workshop on DMLs [18]. Springer’s LaTeX Search system takes formulae from papers with available LaTeX sources and hashes the formulae to obtain a text representation. Zentralblatt Math (https://zbmath.org/formulae/) uses the MathWebSearch system [8], which represents formulae with substitution trees. We have developed and open-sourced the MIaS (Math Indexer and Searcher) system [16, 14] (https://github.com/MIR-MU/MIaS) using the robust, highly scalable full-text search engine Apache Lucene [5] and our own set of tools for the preprocessing of mathematical formulae. Since 2012, MIaS has been deployed in the European Digital Mathematical Library (EuDML), making it historically the first system to be deployed in a DML.

MIaS processes text and math separately. The text is tokenized and stemmed to unify inflected word forms. Math is expected to be in the MathML format. Open tools such as Tralics and LaTeXML convert documents in the popular math authoring language of LaTeX to MathML. Other tools such as InftyReader [21] and MaxTract [4] convert raster and vector PDF documents, respectively, to MathML. The math is then canonicalized, ordered, tokenized, and unified (see Figure 1). We describe each of these processing steps in the following paragraphs.

Canonicalization.

As explained above, MathML can originate from multiple sources, and each can encode equivalent mathematical formulae a little differently. To obtain a single canonical representation, we initially used the third-party MathML canonicalizer from the UMCL library, which converts math to a subset of MathML called Canonical MathML [3]. However, since the conversion speed and accuracy did not match our expectations, we have developed and open-sourced our own MathML canonicalizer [7].

Ordering.

MathML canonicalization only affects the encoding of mathematical formulae and does not result in any syntactic manipulation. We go a step further and reorder the operands of commutative operators alphabetically. For example, we convert the formulae $a + b$ and $b + a$ to a single canonical form $a + b$.

Tokenization.

A user of our system may not know the precise form of a formula they are searching for. To enable partial matches, we index not only the original formula, but also all its subformulae, which correspond to all the XML subtrees of the original formula XML tree. To penalize partial matches, the weight of subformulae is inversely proportional to their depth in the XML tree [19].

A user is likely interested in documents that contain either the query formula itself, or larger formulae with the query formula as a subformula. On the other hand, a user is unlikely to be interested in documents that contain only small parts of the query formula, such as isolated numbers and symbols. For that reason, we only tokenize formulae in indexed documents, not in user queries.

Footnote URLs: EuDML search, https://eudml.org/search; LaTeXML, https://dlmf.nist.gov/LaTeXML/; MathML canonicalizer, https://github.com/MIR-MU/MathMLCan.

[Diagram: input documents and queries are split by a document handler into text and math. Math passes through canonicalization, ordering, tokenization, variables unification, constants unification, and weighting before indexing and searching in Lucene.]

Figure 1: The preprocessing of mathematical formulae in indexed and query documents
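The math processing steps of Figure 1, together with the unification steps detailed in the paragraphs that follow, can be sketched on small expression trees. This is a minimal sketch under stated assumptions, not the MIaS implementation: formulae are modelled as nested Python tuples rather than MathML, the function names, the set of commutative operators, and the weighting formula `1 / (depth + 1)` are all hypothetical, and the example formulae are illustrative.

```python
from typing import Dict, Iterator, List, Optional, Tuple, Union

# A formula is a leaf symbol (str) or a tuple (operator, operand, ...).
Node = Union[str, tuple]

COMMUTATIVE = {"+", "*"}  # assumption: operators treated as commutative
HOLE = "◍"                # unifying identifier for structural unification


def order(node: Node) -> Node:
    """Sort the operands of commutative operators into one canonical order."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [order(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def tokenize(node: Node, depth: int = 0) -> Iterator[Tuple[Node, float]]:
    """Yield every subformula with a weight that shrinks with its depth."""
    yield node, 1.0 / (depth + 1)  # assumed weighting, not the MIaS formula
    if not isinstance(node, str):
        for child in node[1:]:
            yield from tokenize(child, depth + 1)


def unify_variables(node: Node, names: Optional[Dict[str, str]] = None) -> Node:
    """Replace each distinct variable with a numbered identifier."""
    names = {} if names is None else names
    if isinstance(node, str):
        if node.isalpha():
            return names.setdefault(node, f"id{len(names) + 1}")
        return node
    op, *args = node
    return (op, *[unify_variables(arg, names) for arg in args])


def unify_constants(node: Node) -> Node:
    """Replace each numeric constant with the symbol const."""
    if isinstance(node, str):
        return "const" if node.replace(".", "", 1).isdigit() else node
    op, *args = node
    return (op, *[unify_constants(arg) for arg in args])


def height(node: Node) -> int:
    """Depth of the deepest leaf below this node."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1:])


def replace_at_depth(node: Node, target: int, depth: int = 0) -> Node:
    """Replace every subformula rooted at the target depth with HOLE."""
    if depth == target:
        return HOLE
    if isinstance(node, str):
        return node
    op, *args = node
    return (op, *[replace_at_depth(arg, target, depth + 1) for arg in args])


def unify_structure(node: Node) -> List[Node]:
    """Structurally unified forms, starting with the deepest subformulae."""
    return [replace_at_depth(node, d) for d in range(height(node), 0, -1)]


if __name__ == "__main__":
    formula = order(("+", ("^", "b", "a"), "a"))
    print(formula)
    print(unify_variables(formula))
    for token, weight in tokenize(formula):
        print(weight, token)
```

Keeping each step as a pure function on trees mirrors the pipeline in Figure 1: the original formula and each unified variant can be indexed side by side, each with its own weight.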

Unification.

In theory, the naming of variables does not affect the meaning of formulae. To match formulae in different notations, we replace each variable with a numbered identifier. For example, we convert the formulae $a + b^a$ and $x + y^x$ to a single unified form $\mathrm{id}_1 + \mathrm{id}_2^{\mathrm{id}_1}$. In practice, many fields have an established notation and variable names are meaningful. To encourage precise matches, we keep the original formulae in addition to the unified formulae.

Two formulae that only differ in numeric constants are often related. For example, two polynomials in $x$ that differ only in their numeric coefficients are both converted to a single unified form in which every number is replaced by the symbol const. To encourage precise matches, we keep the original formulae in addition to the unified formulae.

In predicate logic, a variable can represent an arbitrary formula. For example, the formulae $a^2 + \frac{\sqrt{b}}{c}$ and $a^2 + \frac{x}{y}$ are equivalent if $x$ equals $\sqrt{b}$. Starting with the deepest subformulae, we replace all subformulae at a given depth with a unifying identifier [15]. For example, we convert the formula $a^2 + \frac{\sqrt{b}}{c}$ to a sequence of structurally unified formulae $a^2 + \frac{\sqrt{◍}}{c}$, $◍^◍ + \frac{◍}{◍}$, and $◍ + ◍$, and the formula $a^2 + \frac{x}{y}$ to a sequence of structurally unified formulae $◍^◍ + \frac{◍}{◍}$ and $◍ + ◍$. To penalize partial matches, the weight of the formulae is proportional to the depth of replacement. To encourage precise matches, we keep the original formulae in addition to the unified formulae. We have open-sourced the MathML structural unificator (https://github.com/MIR-MU/MathMLUnificator).

After preprocessing, a query consists of a weighted set of terms and formulae. Since we search for documents that match at least one term and at least one formula from the query, ill-posed terms and formulae will negatively impact the recall of our system. To overcome this problem, we remove selected terms and formulae to produce a set of subqueries. Figure 2 shows an example strategy for producing subqueries. Líška, Sojka, and Růžička [10] describe other strategies that we use. We then submit the subqueries to Apache Lucene and receive ranked lists of resulting documents. Since the scores of the resulting documents are incomparable between subqueries, we cannot merge and rerank the individual result lists. Instead, we interleave them to obtain the final search results that we present to the user.

Subquery 1: $f_1 f_2 t_1 t_2 t_3$
Subquery 2: $f_1 f_2 t_1 t_2$
Subquery 3: $f_1 f_2 t_1$
Subquery 4: $f_1 f_2$
Subquery 5: $f_1 t_1 t_2 t_3$
Subquery 6: $t_1 t_2 t_3$

Figure 2: The subqueries produced from the original query $f_1 f_2 t_1 t_2 t_3$ with mathematical formulae $f_1$ and $f_2$ and terms $t_1$, $t_2$, and $t_3$ using the Leave Rightmost Out (LRO) strategy.

To provide a web user interface to MIaS, we have developed and open-sourced WebMIaS (https://mir.fi.muni.cz/webmias/, https://github.com/MIR-MU/WebMIaS) [16, 11]. Users can input their query in a combination of text and math with native support for LaTeX provided by Tralics and MathJax [6]. Matches are conveniently highlighted in the search results. The user interface of WebMIaS is shown in Figure 3. We have deployed a demo of the latest development version of WebMIaS (https://mir.fi.muni.cz/webmias-demo/) using the Apache Tomcat (https://tomcat.apache.org/) implementation of the Java Servlet. The demo uses an index built from a subset of the arXMLiv dataset [20] made available to the NTCIR-12 conference participants and will serve as the basis for our live demonstration at the conference.

Figure 3: The user interface of WebMIaS. Users can input their query in a combination of text and math with native support for LaTeX provided by Tralics and MathJax. Matches are conveniently highlighted in the search results.

Table 1: Speed evaluation results on the MREC dataset using 448 GB of RAM and eight Intel Xeon™ X7560 2.26 GHz CPUs.

Docs      Input (sub)formulae   Indexed (sub)formulae   Real time (min)   CPU time (min)
10,000              3,406,068              64,008,762             35.75            35.05
50,000             18,037,842             333,716,261            189.71           181.19
100,000            36,328,126             670,335,243            384.44           366.54
200,000            72,030,095           1,326,514,082            769.06           733.44
300,000           108,786,856           2,005,488,153          1,197.75         1,116.64
350,000           125,974,221           2,318,482,748          1,386.66         1,298.10
439,423           158,106,118           2,910,314,146          1,747.16         1,623.22

Table 2: Speed evaluation results on the NTCIR-11 Math-2 dataset using the same computer as above.

Docs        Input (sub)formulae   Indexed (sub)formulae   Real time (min)   CPU time (min)
8,301,545            59,647,566           3,021,865,236          1,940.07         3,413.55
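The Leave Rightmost Out strategy of Figure 2 and the interleaving of subquery results described earlier can be sketched as follows. This is a minimal sketch, not the MIaS code: the function names are hypothetical, the generalization of LRO beyond the two-formula, three-term example of Figure 2 is an assumption, and round-robin interleaving with duplicate removal is one plausible reading of "interleave" (the strategies actually used are described in [10]).

```python
from itertools import zip_longest
from typing import List, Sequence


def lro_subqueries(formulae: List[str], terms: List[str]) -> List[List[str]]:
    """Produce subqueries by leaving out the rightmost terms, then formulae."""
    subqueries: List[List[str]] = []
    # First keep every formula and drop terms from the right, one at a time.
    for kept in range(len(terms), -1, -1):
        subqueries.append(formulae + terms[:kept])
    # Then restore all terms and drop formulae from the right, one at a time.
    for kept in range(len(formulae) - 1, -1, -1):
        subqueries.append(formulae[:kept] + terms)
    # Discard empty and repeated subqueries while preserving the order.
    unique: List[List[str]] = []
    for subquery in subqueries:
        if subquery and subquery not in unique:
            unique.append(subquery)
    return unique


def interleave(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Merge ranked lists round-robin, keeping the first hit of each document."""
    seen = set()
    merged: List[str] = []
    for tier in zip_longest(*result_lists):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


if __name__ == "__main__":
    for subquery in lro_subqueries(["f1", "f2"], ["t1", "t2", "t3"]):
        print(" ".join(subquery))
    print(interleave([["d1", "d2"], ["d2", "d3"], ["d4"]]))
```

Interleaving by rank avoids comparing scores across subqueries, which the text identifies as incomparable; only the position of a document within each list is used.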

We performed a speed evaluation of MIaS on the MREC dataset of 439,423 documents [13] (see Table 1), a quality and speed evaluation on the NTCIR-10 Math [1, 12] dataset of 100,000 documents, and a quality and speed evaluation on the NTCIR-11 Math-2 [2, 16] (see Tables 2 and 3) and NTCIR-12 MathIR [22, 15] datasets of 105,120 documents that were split into 8,301,578 paragraphs. Speed evaluation shows that the indexing time of our system is linear in the number of indexed documents and that the average query time is 469 ms. With respect to quality evaluation, MIaS has notably won the NTCIR-11 Math-2 task.
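As a rough check of the linearity claim, the real indexing times of Table 1 can be normalized to minutes per 10,000 documents. The sketch below uses only the Docs and Real columns of Table 1; the names `TABLE_1` and `minutes_per_10k` are hypothetical. The normalized time stays between roughly 36 and 40 minutes over the whole range, which is consistent with near-linear scaling.

```python
from typing import List, Tuple

# (number of documents, real indexing time in minutes), copied from Table 1.
TABLE_1: List[Tuple[int, float]] = [
    (10_000, 35.75),
    (50_000, 189.71),
    (100_000, 384.44),
    (200_000, 769.06),
    (300_000, 1197.75),
    (350_000, 1386.66),
    (439_423, 1747.16),
]


def minutes_per_10k(rows: List[Tuple[int, float]]) -> List[float]:
    """Real indexing time normalized to 10,000 documents for each row."""
    return [real / docs * 10_000 for docs, real in rows]


if __name__ == "__main__":
    for (docs, _), rate in zip(TABLE_1, minutes_per_10k(TABLE_1)):
        print(f"{docs:>8,} docs: {rate:.2f} min per 10,000 docs")
```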

Table 3: Quality evaluation results on the NTCIR-11 Math-2 dataset. The mean average precision (MAP) and precisions at ten (P@10) and five (P@5) are reported for queries formulated using Presentation MathML (PMath), Content MathML (CMath), a combination of both (PCMath), and LaTeX. Two relevance judgement levels, partially relevant and relevant, were used to compute the measures. The number between slashes (/·/) is our rank among all teams.

Measure   Level   PMath    CMath   PCMath   LaTeX
MAP       3       0.3073   …

With the growing importance of DMLs, there is a growing demand for effective MIR systems. The evaluation shows that our open-source MIaS system is both efficient and effective.

The idea of indexing structures rather than terms can be generalized from mathematical formulae to semi-structured text. Reordering the operands of commutative operators is only a simple transformation. For example, to convert $\sqrt[n]{a}$ and $a^{1/n}$ to a single canonical representation, a general computer algebra system (CAS) can be used. We experiment [17] with improving the vector space representations of document passages, aiming to add support for mathematics in the future. Embeddings can now also be computed for equations [9], which presents new possibilities of using language modeling for the semantic segmentation of STEM articles and for weighting the segments [17]. Grasping the meaning of mathematical formulae is crucial: content is king.

Acknowledgements

We gratefully acknowledge the support by the European Union under the FP7-CIP program, project 250503 (EuDML), and by the ASCR under the Information Society R&D program, project 1ET200190513 (DML-CZ). We also sincerely thank three anonymous reviewers for their insightful comments.

REFERENCES

[1] Akiko Aizawa, Michael Kohlhase, and Iadh Ounis. 2013. NTCIR-10 Math Pilot Task Overview. In Proc. of the 10th NTCIR Conference. NII, Tokyo, Japan, 654–661.
[2] Akiko Aizawa, Michael Kohlhase, Iadh Ounis, and Moritz Schubotz. 2014. NTCIR-11 Math-2 Task Overview. In Proc. of the 11th NTCIR Conference on Evaluation of Information Access Technologies. Noriko Kando and Kazuaki Kishida, (Eds.) NII, Tokyo, Japan, 88–98.
[3] Dominique Archambault and Victor Moço. 2006. Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations. In Computers Helping People with Special Needs. Lecture Notes in Computer Science. Vol. 4061. Klaus Miesenberger, Joachim Klaus, Wolfgang Zagler, and Arthur Karshmer, (Eds.) Springer Berlin / Heidelberg, 1191–1198. doi: 10.1007/11788713_172.
[4] Josef B. Baker, Alan P. Sexton, and Volker Sorge. 2012. MaxTract: Converting PDF to LaTeX, MathML and Text. In AISC/DML/MKM/Calculemus (Lecture Notes in Computer Science). Johan Jeuring et al., (Eds.) Vol. 7362. Springer, 422–426. isbn: 978-3-642-31373-8. doi: 10.1007/978-3-642-31374-5_29.
[5] Andrzej Białecki, Robert Muir, and Grant Ingersoll. 2012. Apache Lucene 4. In SIGIR 2012 Workshop on Open Source Information Retrieval, 17.
[6] Davide Cervone. 2012. MathJax: a platform for mathematics on the Web. Notices of the AMS, 59, 2, 312–316.
[7] David Formánek, Martin Líška, Michal Růžička, and Petr Sojka. 2012. Normalization of Digital Mathematics Library Content. In Joint Proc. of the 24th OpenMath Workshop, the 7th Workshop on Mathematical User Interfaces (MathUI), and the Work in Progress Section of the Conference on Intelligent Computer Mathematics (CEUR Workshop Proceedings) number 921. (Bremen, Germany, July 9–13, 2012). James Davenport, Johan Jeuring, Christoph Lange, and Paul Libbrecht, (Eds.) http://ceur-ws.org/Vol-921/wip-05.pdf. Aachen, 91–103.
[8] Michael Kohlhase et al. 2008. MathWebSearch 0.4, a semantic search engine for mathematics. Manuscript at http://mathweb.org/projects/mws/pubs/mkm08.pdf.
[9] Kriste Krstovski and David M. Blei. 2018. Equation Embeddings. ArXiv e-prints, (Mar. 2018). arXiv: 1803.09123 [stat.ML].
[10] Martin Líška, Petr Sojka, and Michal Růžička. 2015. Combining Text and Formula Queries in Math Information Retrieval: Evaluation of Query Results Merging Strategies. In Proceedings of the First International Workshop on Novel Web Search Interfaces and Systems (NWSearch ’15). ACM, Melbourne, Australia, 7–9. isbn: 978-1-4503-3789-2. doi: 10.1145/2810355.2810359. http://doi.acm.org/10.1145/2810355.2810359.
[11] Martin Líška, Petr Sojka, and Michal Růžička. 2014. Math Indexer and Searcher Web Interface: Towards Fulfillment of Mathematicians’ Information Needs. In Intelligent Computer Mathematics CICM 2014. Proceedings of Calculemus, DML, MKM, and Systems and Projects. Stephen M. Watt et al., (Eds.) Springer International Publishing Switzerland, Zurich, 444–448. isbn: 978-3-319-08434-3. doi: 10.1007/978-3-319-08434-3_36.
[12] Martin Líška, Petr Sojka, and Michal Růžička. 2013. Similarity Search for Mathematics: Masaryk University team at the NTCIR-10 Math Task. In Proc. of the 10th NTCIR Conference on Evaluation of Information Access Technologies. Noriko Kando and Kazuaki Kishida, (Eds.) NII, Tokyo, Japan, 686–691. isbn: 978-4-86049-062-1.
[13] Martin Líška, Petr Sojka, Michal Růžička, and Petr Mravec. 2011. Web Interface and Collection for Mathematical Retrieval: WebMIaS and MREC. In Towards a Digital Mathematics Library. Bertinoro, Italy, July 20–21st, 2011. Petr Sojka and Thierry Bouche, (Eds.) http://hdl.handle.net/10338.dmlcz/702604. Masaryk University, Bertinoro, Italy, (July 2011), 77–84. isbn: 978-80-210-5542-1.
[14] Michal Růžička. 2017. Math Information Retrieval for Digital Libraries. Dissertation. Masaryk University, Faculty of Informatics, Brno, CZ. https://is.muni.cz/th/pxz4q/?lang=en.
[15] Michal Růžička, Petr Sojka, and Martin Líška. 2016. Math Indexer and Searcher under the Hood: Fine-tuning Query Expansion and Unification Strategies. In Proc. of the 12th NTCIR Conference on Evaluation of Information Access Technologies. Noriko Kando, Tetsuya Sakai, and Mark Sanderson, (Eds.) NII Tokyo, 331–337. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/ntcir/MathIR/05-NTCIR12-MathIR-RuzickaM.pdf.
[16] Michal Růžička, Petr Sojka, and Martin Líška. 2014. Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy. In Proc. of the 11th NTCIR Conference on Evaluation of Information Access Technologies. Hideo Joho and Kazuaki Kishida, (Eds.) https://is.muni.cz/auth/publication/1201956/en. NII, Tokyo, Japan, (Dec. 2014), 127–134.
[17] Jan Rygl, Petr Sojka, Michal Růžička, and Radim Řehůřek. 2016. ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches: Digging for Nuggets of Wisdom in Text. In Proc. of the 10th Workshop on Recent Advances in Slavonic NLP, RASLAN 2016. Aleš Horák, Pavel Rychlý, and Adam Rambousek, (Eds.) Tribun EU, Brno, 79–87. isbn: 978-80-263-1095-2. https://nlp.fi.muni.cz/raslan/2016/paper08-Rygl_Sojka_etal.pdf.
[18] Petr Sojka, (Ed.) Towards a Digital Mathematics Library. Birmingham, UK, (July 2008). Masaryk University. isbn: 978-80-210-4658-0. http://dml.cz/dmlcz/702564.
[19] Petr Sojka and Martin Líška. 2011. Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011 (Lecture Notes in Artificial Intelligence, LNAI). James H. Davenport, William M. Farmer, Josef Urban, and Florian Rabe, (Eds.) Vol. 6824. Springer-Verlag, Bertinoro, Italy, (July 2011), 228–243. doi: 10.1007/978-3-642-22673-1_16.
[20] Heinrich Stamerjohanns et al. 2010. Transforming Large Collections of Scientific Publications to XML. Mathematics in Computer Science, 3, 3, 299–307. issn: 1661-8270. doi: 10.1007/s11786-010-0024-7.
[21] Masakazu Suzuki et al. 2003. INFTY: An Integrated OCR System for Mathematical Documents. In Proc. of ACM Symposium on Document Engineering 2003. C. Vanoirbeek, C. Roisin, and E. Munson, (Eds.) ACM, Grenoble, France, 95–104.
[22] Richard Zanibbi et al. 2016. NTCIR-12 MathIR task overview. In