Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations
Andre Greiner-Petter, Moritz Schubotz, Fabian Mueller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, Bela Gipp
Preprint from:
André Greiner-Petter et al. Discovering Mathematical Objects of Interest - A Study of Mathematical Notations. In: Proceedings of The Web Conference 2020 (WWW’20), April 20–24, 2020, Taipei, Taiwan.
Affiliations: University of Wuppertal, Germany ([email protected], {last}@uni-wuppertal.de); FIZ-Karlsruhe, Germany ({first.last}@fiz-karlsruhe.de); National Institute of Standards and Technology, Mission Viejo, California, U.S.A. ([email protected]); National Institute of Informatics, Japan ({last}@nii.ac.jp); University of Konstanz, Germany ({first.last}@uni-konstanz.de)
February 20, 2020
Abstract
Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today’s systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases, for example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems.

The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_n^{(\alpha,\beta)}(x)$ with ‘Jacobi polynomial’); (3) we extend zbMATH’s search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data.

Introduction
Taking into account mathematical notation in the literature leads to a better understanding of scientific literature on the Web and allows one to make use of semantic information in specialized Information Retrieval (IR) systems. Nowadays, applications in Math Information Retrieval (MathIR) [21], such as search engines [4, 8, 10, 11, 13, 22, 25], semantic extraction systems [23, 28, 30], recent efforts in math embeddings [26, 35, 37, 43], and semantic tagging of math formulae [16, 31], either consider an entire equation as one entity or only focus on single symbols. Since math expressions often contain meaningful and important subexpressions, these applications could benefit from an approach that lies between the extremes of examining only individual symbols or considering an entire equation as one entity. Consider, for example, the explicit definition for Jacobi polynomials [46, (18.5.7)]

$$ P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{x-1}{2}\right)^{m}. \tag{1} $$

The interesting components in this equation are $P_n^{(\alpha,\beta)}(x)$ on the left-hand side, and the appearance of the gamma function $\Gamma(s)$ on the right-hand side, implying a direct relationship between Jacobi polynomials and the gamma function. Considering the entire expression as a single object misses this important relationship. On the other hand, focusing on single symbols can result in the misleading interpretation of $\Gamma$ as a variable and $\Gamma(\alpha+n+1)$ as a multiplication between $\Gamma$ and $(\alpha+n+1)$. A system capable of identifying the important components, such as $P_n^{(\alpha,\beta)}(x)$ or $\Gamma(\alpha+n+1)$, is therefore desirable. Hereafter, we define these components as Mathematical Objects of Interest (MOIs) [37].

The importance of math objects is a somewhat imprecise description and thus difficult to measure. Currently, not much effort has been made in identifying meaningful subexpressions. Kristianto et al.
[28] introduced dependency graphs between formulae. With this approach, they were able to build dependency graphs of mathematical expressions, but only if the expressions appeared as single expressions in the context. For example, if $\Gamma(\alpha+n+1)$ appears as a stand-alone expression in the context, the algorithm will declare a dependency with Equation (1). However, it is more likely that different forms, such as $\Gamma(s)$, appear in the context. Since this expression does not match any subexpression in Equation (1), the approach cannot establish a connection with $\Gamma(s)$. Kohlhase et al. studied in [27, 33, 34] another approach to identify essential components in formulae. They performed eye-tracking studies to identify important areas in rendered mathematical formulae. While this is an interesting approach that allows one to learn more about how humans read and understand math, it is impractical for extensive studies.

This paper presents the first extensive frequency distribution study of mathematical equations in two large scientific corpora, the e-Print archive arXiv.org (https://arxiv.org/, accessed Sep. 1, 2019; hereafter referred to as arXiv) and the international reviewing service for pure and applied mathematics zbMATH. We will show that math expressions, similar to words in natural language corpora, also obey Zipf’s law [15] and therefore follow a Zipfian distribution. Related research projects observed a relation to Zipf’s law for single math symbols [16, 23]. In the context of quantitative linguistics, Zipf’s law states that, given a text corpus, the frequency of any word is inversely proportional to its rank in the frequency table. Motivated by the similarity to linguistic properties, we will present a novel approach for ranking formulae by their relevance via a customized version of the ranking function BM25 [7]. We will present results that can be easily embedded in other systems in order to distinguish between common and uncommon notations within formulae.
Our results lay a foundation for future research projects in MathIR.

Fundamental knowledge on frequency distributions of math formulae is beneficial for numerous applications in MathIR, ranging from educational purposes [3] to math recommendation systems, search engines [22, 25], and even automatic plagiarism detection systems [29, 39, 41]. For example, students can search for the conventions to write certain quantities in formulae; document preparation systems can integrate an auto-completion or auto-correction service for math inputs; search or recommendation engines can adjust their ranking scores with respect to standard notations; and plagiarism detection systems can estimate whether two identical formulae indicate potential plagiarism or are just using the conventional notations in a particular subject area. To exemplify the applicability of our findings, we present a textual search approach to retrieve mathematical formulae. Further, we will extend zbMATH’s faceted search by providing facets of mathematical formulae according to a given textual search query. Lastly, we present a simple auto-completion system for math inputs as a contribution towards advancing mathematical recommendation systems. Further, we show that the results provide useful insights for plagiarism detection algorithms. We provide access to the source code, the results, and extended versions of all of the figures appearing in this paper at https://github.com/ag-gipp/FormulaCloudData.

Related Work:
Today, mathematical search engines index formulae in a database. Much effort has been undertaken to make this process as efficient as possible in terms of precision and runtime performance [4, 8, 14, 24, 25]. The generated databases naturally contain the information required to examine the distributions of the indexed mathematical formulae. Yet, no in-depth studies of these distributions have been undertaken. Instead, math search engines focus on other aspects, such as devising novel similarity measures and improving runtime efficiency. This is because the goal of math search engines is to retrieve relevant (i.e., similar) formulae which correspond to a given search query that partially [13, 14, 22] or exclusively [8, 11, 25] contains formulae. However, for a fundamental study of distributions of mathematical expressions, neither similarity measures nor efficient lookup or indexing is required. Thus, we use the general-purpose query language XQuery and employ the BaseX implementation (http://basex.org/, accessed Sep. 2019; we used BaseX 9.2 for our experiments).

LaTeX is the de facto standard for the preparation of academic manuscripts in the fields of mathematics and physics [5]. Since LaTeX allows for advanced customizations and even computations, it is challenging to process. For this reason, LaTeX expressions are unsuitable for an extensive distribution analysis of mathematical notations. For mathematical expressions on the web, the XML formatted MathML is the current standard, as specified by the World Wide Web Consortium (W3C). The tree structure and the fixed standard, i.e., MathML tags, cannot be changed, thus making this data format reliable. Several available tools are able to convert from LaTeX to MathML [36] and various databases are able to index XML data. Thus, for this study, we have chosen to focus on MathML. In the following, we investigate the databases arXMLiv (08/2018) [32] and zbMATH [40].

The arXMLiv dataset (≈ […] LaTeXML [45]. LaTeXML converted all mathematical expressions into MathML with parallel markup, i.e., presentation and content MathML. In this study we only consider the subsets no-problem and warning, which generated no errors during the conversion process. Nonetheless, the MathML data generated still contains some errors or falsely annotated math. For example, we discovered several instances of affiliation and footnotes, SVG (Scalable Vector Graphics) and other unknown tags, encoded in MathML. Regarding the footnotes, we presumed that authors falsely used mathematical environments for generating footnote or affiliation marks. We used the TeX string, provided as an attribute in the MathML data, to filter out expressions that match the string ‘{}^{*}’, where ‘*’ indicates any possible expression. In addition, we filtered out SVG and other unknown tags. We assume that these expressions were generated by mistake due to limitations of LaTeXML. The final arXiv dataset consisted of 841,008 documents which contained at least one mathematical formula. The dataset contained a total of 294,151,288 mathematical expressions.

In addition to arXiv, we investigated zbMATH (https://zbmath.org/, accessed Sep. 1, 2019), an international reviewing service for pure and applied mathematics which contains abstracts and reviews of articles, hereafter uniformly called abstracts, mainly from the domains of pure and applied mathematics. The abstracts in zbMATH are formatted in TeX [40]. To be able to compare arXiv and zbMATH, we manually generated MathML via LaTeXML for each mathematical formula in zbMATH and performed the […]
Listing 1: MathML representation of $P_n^{(\alpha,\beta)}(x)$.

Since we focused on the frequency distributions of visual expressions, we only considered presentational MathML (pMML). Rather than normalizing the pMML data, e.g., via MathMLCan [9], which would also change the tree structure and visual core elements in pMML, we only eliminated the attributes. These attributes are used for minor visual changes, e.g., stretched parentheses or inline limits of sums and integrals. Thus, for this first study, we preserved the core structure of the pMML data, which might provide insightful statistics for the MathML community to further cultivate the standard. After extracting all MathML expressions, filtering out falsely annotated math and SVG tags, and eliminating unnecessary attributes and annotations, the datasets required 83GB of disk space for arXiv and 6GB for zbMATH, respectively.

Next, we indexed the data via BaseX. The indexed datasets required a disk space of 143.9GB in total (140GB for arXiv and 3.9GB for zbMATH). Due to the limitations of databases in BaseX (a detailed overview of these limitations can be found at http://docs.basex.org/wiki/Statistics, accessed Sep. 1, 2019), it was necessary to split our datasets into smaller subsets. We split the datasets according to the 20 major article categories of arXiv and classifications of zbMATH. The arXiv categories astro-ph (astro physics), cond-mat (condensed matter), and math (mathematics) were still too large for a single database. Thus, we split those categories into two equally sized parts. To increase performance, we use BaseX in a server-client environment. We experienced performance issues in BaseX when multiple clients repeatedly requested data from the same server in short intervals. We determined that the best workaround for this issue was to launch BaseX servers for each database, i.e., each category/classification.

Mathematical expressions often consist of multiple meaningful subexpressions, which we defined as MOIs. However, without further investigation of the […]
Since MathML is an XML data format (essentially a tree-structured format), we define subexpressions of equations as subtrees of its MathML format.

Listing 1 illustrates a Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ in pMML. The […] MathML expressions. Since we cut off all other elements besides pMML nodes, each […] MathML allows us to introduce a measure that reflects the complexity of mathematical expressions. More complex expressions usually consist of more extensively nested subtrees in the MathML data. Thus, we define the complexity of a mathematical expression by the maximum depth of the MathML tree. In XML, the content of a node and its attributes are commonly interpreted as children of the node. Thus, we define the depth of a single node as 1 rather than 0, i.e., single identifiers, such as […] MathML in BaseX. The algorithm for the extraction process is written in XQuery. The algorithm traverses recursively downwards from the root to the leaves. In each iteration, it checks whether there is an identifier, i.e., […] MathML tree, the XQuery will trigger database requests in every iteration. Hence, the downwards implementation performs better, since there is only one database request for every expression rather than for every subexpression.

Since we only minimize the pMML data rather than normalizing it, two identically rendered expressions may have different complexities. For instance, […] MathML normalization via MathMLCan [9] in future research to overcome these issues.
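The complexity measure and the subtree-based notion of subexpressions can be illustrated with a minimal Python sketch. It is not the XQuery/BaseX implementation described above: the pMML snippet for $f(x)$ is a hypothetical example, the function names are made up for illustration, and the sketch simply enumerates every element subtree, which may differ from the extraction rules used in the study.

```python
import xml.etree.ElementTree as ET


def strip_attributes(node):
    """Remove all attributes in place, keeping only tags and text."""
    for element in node.iter():
        element.attrib.clear()
    return node


def complexity(node):
    """Maximum depth of the tree rooted at node; a single node has depth 1."""
    return 1 + max((complexity(child) for child in node), default=0)


def subexpressions(node):
    """Yield (string representation, complexity) for every element subtree."""
    for element in node.iter():
        yield ET.tostring(element, encoding="unicode").strip(), complexity(element)


# Hypothetical pMML for f(x); U+2062 is the invisible-times operator.
PMML = (
    '<mrow><mi>f</mi><mo>\u2062</mo>'
    '<mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow>'
    '</mrow>'
)

root = strip_attributes(ET.fromstring(PMML))
subs = list(subexpressions(root))
for representation, depth in subs:
    print(depth, representation)
```

Under these assumptions the identifier x has complexity 1, the parenthesized (x) has complexity 2, and f(x) has complexity 3, which agrees with the complexity classes in which these expressions are reported later in the paper.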
By splitting each formula into subexpressions, we generated longer documents and a bias towards low complexities. Note that, hereafter, we only refer to the mathematical content of documents. Thus, the length of a document refers to the number of math formulae (here, the number of subexpressions) in the document. After splitting expressions into subexpressions, arXiv consists of 2.5B and zbMATH of 61M expressions, which raised the average document length to 2,982.87 for arXiv and 45.47 for zbMATH, respectively.

Table 1: Dataset overview. Average Document Length is defined as the average number of subexpressions per document.
Category                   arXiv            zbMATH
Documents                  841,008          1,349,297
Formulae                   294,151,288      11,747,860
Subexpressions             2,508,620,512    61,355,307
Unique Subexpressions      350,206,974      8,450,496
Average Document Length    2,982.87         45.47
Average Complexity         5.01             3.89
Maximum Complexity         218              26

Figure 1: Unique subexpressions for each complexity in arXiv and zbMATH.

For calculating frequency distributions, we merged two subexpressions if their string representations were identical. Remember, the string representation is unique for each MathML tree. After merging, arXiv consisted of 350,206,974 unique mathematical subexpressions with a maximum complexity of 218 and an average complexity of 5.01. For high complexities over 70, the formulae show some erroneous structures that might be generated from LaTeXML by mistake. For example, the expression with the highest complexity is a long sequence of a polynomial starting with ‘$P(t, t, t, t) =$’ followed by 690 summands. The complexity is caused by a high number of unnecessarily deeply nested […]. For zbMATH, the average complexity was 3.89 (see Table 1). One of the most complex expressions in zbMATH with a minimum document frequency of three was

$$ M_p(r, f) = \left( \frac{1}{2\pi} \int_0^{2\pi} \left| f\left(re^{i\theta}\right) \right|^p \, d\theta \right)^{1/p}. \tag{4} $$

As we expected, reviews and abstracts in zbMATH were generally shorter and consisted of less complex mathematical formulae. The dataset also appeared to contain fewer erroneous expressions, since expressions of complexity 25 are still readable and meaningful.

Figure 1 shows the ratio of unique subexpressions for each complexity in both datasets. The figure illustrates that both datasets share a peak at complexity four. Compared to zbMATH, the arXiv expressions are slightly more evenly distributed over the different levels of complexities. Interestingly, complexities one and two are not dominant in either of the two datasets. Single identifiers only make up 0.03% in arXiv and 0.12% in zbMATH, which is comparable to expressions of complexity 19 and 14, respectively. This finding illustrates the problem of capturing semantic meanings for single identifiers rather than for more complex expressions [30]. It also substantiates that entire expressions, if too complex, are not suitable either for capturing the semantic meanings [28]. Instead, a middle ground is desirable, since most unique expressions in both datasets have a complexity between 3 and 5. Table 1 summarizes the statistics of the examined datasets.
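The merging step and the per-complexity ratios shown in Figure 1 can be sketched as follows. This is an illustrative Python sketch on toy data, not the pipeline used for the study; the function names and the sample occurrences are assumptions made for the example.

```python
from collections import Counter


def merge_subexpressions(occurrences):
    """Merge occurrences with identical string representations.

    `occurrences` is an iterable of (string representation, complexity) pairs.
    Returns the frequency of each unique representation and its complexity.
    """
    frequency = Counter()
    complexity_of = {}
    for representation, complexity in occurrences:
        frequency[representation] += 1
        complexity_of[representation] = complexity
    return frequency, complexity_of


def unique_ratio_per_complexity(complexity_of):
    """Share of unique subexpressions that falls into each complexity level."""
    per_level = Counter(complexity_of.values())
    total = sum(per_level.values())
    return {level: count / total for level, count in sorted(per_level.items())}


# Toy occurrences (hypothetical): two identifiers, (x) twice, and f(x) once.
occurrences = [
    ("x", 1), ("x", 1), ("n", 1),
    ("(x)", 2), ("(x)", 2),
    ("f(x)", 3),
]
frequency, complexity_of = merge_subexpressions(occurrences)
ratios = unique_ratio_per_complexity(complexity_of)
```

Counting unique representations rather than occurrences is what makes single identifiers such a small share in Figure 1, even though each of them occurs very often.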
In linguistics, it is well known that word distributions follow Zipf’s Law [15], i.e., the $r$-th most frequent word has a frequency that scales to

$$ f(r) \propto \frac{1}{r^{\alpha}} \tag{5} $$

with $\alpha \approx 1$. A better approximation can be applied by a shifted distribution

$$ f(r) \propto \frac{1}{(r+\beta)^{\alpha}}, \tag{6} $$

where $\alpha \approx 1$ and $\beta \approx 2.7$. In a study on Zipf’s law, Piantadosi [15] illustrated that not only words in natural language corpora follow this law surprisingly accurately, but also many other human-created sets, for instance, in programming languages, in biological systems, and even in music. Since mathematical communication has developed over centuries of research, it would not be surprising if mathematical notations also followed Zipf’s law. The primary conclusion of the law is that there are some very common tokens against a large number of symbols which are not used frequently. Based on this assumption, we can postulate that a score based on frequencies might be able to measure the peculiarity of a token. The well-known TF-IDF ranking functions and their derivatives [2, 7] have performed well in linguistics for many years and are still widely used in retrieval systems [20]. However, since we split every expression into its subexpressions, we generated an anomalous bias towards shorter, i.e., less complex, formulae. Hence, distributions of subexpressions may not obey Zipf’s law.

Figure 2: (a) Frequency Distributions, (b) Complexity Distributions. Each figure illustrates the relationship between the frequency ranks ($x$-axis) and the normalized frequency ($y$-axis) in zbMATH (top) and arXiv (bottom). For arXiv, only the first 8 million entries are plotted to be comparable with zbMATH (≈ […] $\alpha$ and $\beta$ are provided in the plots. Subfigure (b) shades the bins from blue to red according to the maximum complexity in each bin.

Figure 2 visualizes a comparison between Zipf’s law and the frequency distributions of mathematical subexpressions in arXiv and zbMATH. The dashed orange line visualizes the power law (6). The plots demonstrate that the distributions in both datasets obey this power law. Interestingly, there is not much difference in the distributions between both datasets. Both distributions seem to follow the same power law, with $\alpha = 1.[…]$ and $\beta = 15.82$. Moreover, we can observe that the developed complexity measure seems to be appropriate, since the complexity distributions for formulae are similar to the distributions for the length of words [15]. In other words, more complex formulae, as well as long words in natural languages, are generally more specialized and thus appear less frequently throughout the corpus. Note that colors of the bins for complexities fluctuate for rare expressions because the color represents the maximum rather than the average complexity in each bin.
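To make the power law in Equation (6) concrete, the following Python sketch estimates $\alpha$ by a least-squares fit in log-log space for a fixed shift $\beta$. It is only an illustration: the fitting procedure used to obtain the parameters reported in the plots is not specified in the text, and the synthetic counts below are generated from the law itself using the values $\alpha = 1$ and $\beta = 2.7$ quoted for natural language.

```python
import math


def zipf_shifted(rank, alpha, beta=0.0):
    """Unnormalized shifted Zipf law 1 / (rank + beta) ** alpha, Equation (6)."""
    return 1.0 / (rank + beta) ** alpha


def fit_alpha(frequencies, beta=0.0):
    """Estimate alpha as the negated least-squares slope of log f vs. log(rank + beta)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank + beta) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic frequencies that follow Equation (6) exactly (illustrative values).
counts = [1e6 * zipf_shifted(rank, alpha=1.0, beta=2.7) for rank in range(1, 1001)]
alpha_hat = fit_alpha(counts, beta=2.7)
print(round(alpha_hat, 6))
```

With $\beta = 0$ the same function fits the unshifted law of Equation (5). On real frequency tables the fit is sensitive to the long tail of rare expressions, so the estimate should be read as a rough summary rather than an exact parameter.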
Figure 3 shows in detail the most frequently used mathematical expressions in arXiv for the complexities 1 to 5. The orange dashed line visible in all graphs represents the normal Zipf’s law distribution from Equation (5). We explore the total frequency values without any normalization. Thus, Equation (5) was multiplied by the highest frequency for each complexity level to fit the distribution. The plots in Figure 3 demonstrate that even though the parameter $\alpha$ varies between 0.35 and 0.62, the distributions in each complexity class also obey Zipf’s law.

Figure 3: Overview of the most frequent mathematical expressions in arXiv for complexities 1-5. The color gradient from yellow to blue represents the frequency in the dataset. Zipf’s law (5) is represented by a dashed orange line.

The plots for each complexity class contain some interesting fluctuations. We can spot a set of five single identifiers that are most frequently used throughout arXiv: $n$, $i$, $x$, $t$, and $k$. Even though the distributions follow Zipf’s law accurately, we can observe that these five identifiers are proportionally more frequently used than other identifiers and clearly separate themselves above the rest (notice the large gap from $k$ to $a$). All of the five identifiers are known to be used in a large variety of scenarios. One might expect that common pairs of identifiers would share comparable frequencies in the plots. Surprisingly, typical pairs, such as $x$ and $y$, or $\alpha$ and $\beta$, show a large discrepancy.

The plot of complexity two also reveals that two expressions are proportionally more often used than others: $(x)$ and $(t)$. These two expressions appear more than three times as often in the corpus as any other expression of the same complexity. On the other hand, the quantitative difference between $(x)$ and $(t)$ is negligible. We may assume that arXiv’s primary domain, physics, causes the quantitative disparity between $(x)$, $(t)$, and the other tokens. The primary domain of the dataset becomes more clearly visible for higher complexities, such as $SU(2)$ (C3) or $\mathrm{km\,s^{-1}}$ (C4).

Another surprising property of arXiv is that symmetry groups, such as $SU(2)$, appear to play an essential role in the majority of articles on arXiv, see $SU(2)$ (C3), $SU(2)_L$ (C4), and $SU(2) \times SU(2)$ (C5), among others. The plots of higher complexities, which we do not show here, made this even more noticeable.
Given a complexity of six, for example, the most frequently used expression was $SU(2)_L \times SU(2)_R$, and for a complexity of seven it was $SU(3) \times SU(2) \times U(1)$. Given a complexity of eight, ten out of the top-12 expressions were from symmetry group calculations.

It is also worthwhile to compare expressions among different levels of complexities. For instance, $(x)$ and $(t)$ appeared almost six million times in the corpus, but $f(x)$ (at position three in C3) was the only expression which contained one of these most common expressions. Note that subexpressions of variations, such as $(x)$, $(t)$, or $(t - t)$, do not match the expression of complexity two. This may imply that $(x)$, and especially $(t)$, appear in many different scenarios. Further, we can observe that even though $(x)$ is a part of $f(x)$ in only approximately 3% of all cases, it is still the most likely combination. These results are especially useful for recommendation systems that make use of math as input. Moreover, plagiarism detection systems may also benefit from such a knowledge base. For instance, it might be evident that $f(x)$ is a very common expression, but for automatic systems that work on a large scale, it is not clear whether duplicate occurrences of $f(x)$ or $\Xi(x)$ should be scored differently, e.g., in the case of plagiarism detection.

Figure 3 shows only the most frequently occurring expressions in arXiv. Since we already observed a bias towards physics formulae in arXiv, it is worth comparing the expressions present within both datasets. Figure 4 compares the top-25 expressions for the complexities one to four. In zbMATH, we discovered that computer science and graph theory appeared as popular topics, see for example $G = (V, E)$ (in C3 at position 20) and the Bachmann-Landau notations in $O(\log n)$, $O(n)$, and $O(n)$ (C4 positions 2, 3, and 19).

From Figure 4, we can also deduce useful information for MathIR tasks which focus on semantic information.
Current semantic extraction tools [30] or LaTeX parsers [36] still have difficulties distinguishing multiplications from function calls. For example, as mentioned before, LaTeXML [45] adds an invisible times character between $f$ and $(x)$ rather than a function application. Investigating the most frequently used terms in zbMATH in Table 4 reveals that $u$ is most likely considered to be a function in the dataset: $u(t)$ (rank 8), $u(x)$ (rank 13), $u_{xx}$ (rank 16), $u(0)$ (rank 17), $|\nabla u|$ (rank 22). Manual investigations of extended lists reveal even more hits: $u(x)$ (rank 30), $-\Delta u$ (rank 32), and $u(x, t)$ (rank 33). Since all eight terms are among the most frequent 35 entries in zbMATH, it implies that $u$ can most likely be considered to imply a function in zbMATH. Of course, this does not imply that $u$ must always be a function in zbMATH (see $f(u)$ on rank 14 in C3), but this allows us to exploit probabilities for improving MathIR performance. For instance, if not stated otherwise, $u$ could be interpreted as a function by default, which could help increase the precision of the aforementioned tools.

Footnotes: We refer to a given complexity $n$ with C$n$, i.e., C3 refers to complexity 3. More plots showing higher complexities are available at https://github.com/ag-gipp/FormulaCloudData

Table 2: Top $s(t, D)$ scores, where $D$ is the set of all zbMATH documents with a minimum document frequency of 200, maximum document frequency of 500k, and a minimum complexity of 3. [Table body not recoverable from the source.]

Figure 4 also demonstrates that our two datasets diverge for increasing complexities. Hence, we can assume that frequencies of less complex formulae are more topic-independent. Conversely, the more complex a math formula is, the more context-specific it is. In the following, we will further investigate this assumption by applying TF-IDF rankings on the distributions.

Zipf’s law encourages the idea of scoring the relevance of words according to their number of occurrences in the corpus and in the documents. The family of BM25 ranking functions based on TF-IDF scores are still widely used in several retrieval systems [7, 20]. Since we demonstrated that mathematical formulae (and their subexpressions) obey Zipf’s law in large scientific corpora, it appears intuitive to also use TF-IDF rankings, such as a variant of BM25, to calculate their relevance. In its original form [7],
Okapi BM25 was calculated as follows

$$ \mathrm{bm25}(t, d) := \frac{(k+1)\,\mathrm{IDF}(t)\,\mathrm{TF}(t, d)}{\mathrm{TF}(t, d) + k\left(1 - b + \frac{b\,|d|}{\mathrm{AVG_{DL}}}\right)}, \tag{7} $$

where $\mathrm{TF}(t, d)$ is the term frequency of $t$ in the document $d$, $|d|$ the length of the document $d$ (in our case, the number of subexpressions), $\mathrm{AVG_{DL}}$ the average length of the documents in the corpus (see Table 1), and $\mathrm{IDF}(t)$ is the inverse document frequency of $t$, defined as

$$ \mathrm{IDF}(t) := \log \frac{N - n(t) + \frac{1}{2}}{n(t) + \frac{1}{2}}, \tag{8} $$

where $N$ is the number of documents in the corpus and $n(t)$ the number of documents which contain the term $t$. By adding $\frac{1}{2}$, we avoid $\log 0$ and division by 0. The parameters $k$ and $b$ are free, with $b$ controlling the influence of the normalized document length and $k$ controlling the influence of the term frequency on the final score. For our experiments, we chose the standard value $k = 1.[…]$ […] $b = 0.[…]$ […] $P_n^{(\alpha,\beta)}(x)$, i.e., the document had a length of one, would generate eight subexpressions, i.e., it results in a document length of eight. Thus, we modify the BM25 score in Equation (7) to emphasize higher complexities and longer documents. First, the average document length is divided by the average complexity $\mathrm{AVG_C}$ in the corpus that is used (see Table 1), and we calculate the reciprocal of the document length normalization to emphasize longer documents.

Moreover, in the scope of a single document, we want to emphasize expressions that do not appear frequently in this document, but are the most frequent among their level of complexity. Thus, less complex expressions are ranked more highly if the document overall is not very complex. To achieve this weighting, we normalize the term frequency of an expression $t$ according to its complexity $c(t)$ and introduce an inverse term frequency according to all expressions in the document

$$ \mathrm{ITF}(t, d) := \log \frac{|d| - \mathrm{TF}(t, d) + \frac{1}{2}}{\mathrm{TF}(t, d) + \frac{1}{2}}. \tag{9} $$

Finally, we define the score $s(t, d)$ of a term $t$ in a document $d$ as

$$ s(t, d) := \frac{(k+1)\,\mathrm{IDF}(t)\,\mathrm{ITF}(t, d)\,\mathrm{TF}(t, d)}{\max_{t' \in d|_{c(t)}} \mathrm{TF}(t', d) + k\left(1 - b + \frac{b\,\mathrm{AVG_{DL}}}{|d|\,\mathrm{AVG_C}}\right)}. \tag{10} $$

The TF-IDF ranking functions and the introduced $s(t, d)$ are used to retrieve relevant documents for a given search query. However, we want to retrieve relevant subexpressions over a set of documents. Thus, we define the score of a formula (mBM25) over a set of documents as the maximum score over all documents

$$ \mathrm{mBM25}(t, D) := \max_{d \in D} s(t, d), \tag{11} $$

where $D$ is a set of documents. We used Apache Flink [38] to count the expressions and process the calculations. Thus, our implemented system scales well for large corpora.

Figure 5: Top-20 ranked expressions retrieved from a topic-specific subset of documents $D_q$. The search query $q$ is given above the plots. Retrieved formulae are annotated by a domain expert with green dots for relevant and red dots for non-relevant hits. A line is drawn if a hit appears in both result sets. The line is colored in green when the hit was marked as relevant.

Table 2 shows the top-7 scored expressions, where $D$ is the entire zbMATH dataset. The retrieved expressions can be considered as meaningful and real-world examples of MOIs, since most expressions are known for specific mathematical concepts, such as $\mathrm{Gal}(\overline{\mathbb{Q}}/\mathbb{Q})$, which refers to the Galois group of $\overline{\mathbb{Q}}$ over $\mathbb{Q}$, or $L(\mathbb{R})$, which refers to the $L$-space (also known as Lebesgue space) over $\mathbb{R}$. However, a more topic-specific retrieval algorithm is desirable. To achieve this goal, we (i) retrieved a topic-specific subset of documents $D_q \subset D$ for a given textual search query $q$, and (ii) calculated the scores of all expressions in the retrieved documents.
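The scoring in Equations (8) to (11) can be sketched in a few lines of Python. This is a minimal illustration rather than the Apache Flink implementation mentioned above. Several points are assumptions of the sketch: the parameter values k = 1.2 and b = 0.95, the representation of a document as a Counter of subexpression strings, the occurrence-weighted definition of the average complexity, and the toy corpus.

```python
import math
from collections import Counter

K, B = 1.2, 0.95  # assumed parameter values, chosen only for illustration


def idf(n_docs, doc_freq):
    """Equation (8): inverse document frequency with 1/2 smoothing."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def itf(doc_len, tf):
    """Equation (9): inverse term frequency within one document."""
    return math.log((doc_len - tf + 0.5) / (tf + 0.5))


def score(term, doc, doc_freq, n_docs, avg_dl, avg_c, complexity, k=K, b=B):
    """Equation (10): s(t, d) for one document given as a Counter of terms."""
    tf = doc[term]
    if tf == 0:
        return 0.0
    doc_len = sum(doc.values())
    # Highest term frequency among terms with the same complexity as `term`.
    max_tf = max(f for t, f in doc.items() if complexity[t] == complexity[term])
    norm = k * (1 - b + b * avg_dl / (doc_len * avg_c))
    return (k + 1) * idf(n_docs, doc_freq[term]) * itf(doc_len, tf) * tf / (max_tf + norm)


def mbm25(term, docs, complexity, k=K, b=B):
    """Equation (11): maximum of s(t, d) over all documents in `docs`."""
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in d)  # documents containing t
    total_len = sum(sum(d.values()) for d in docs)
    avg_dl = total_len / n_docs
    # Assumption: average complexity weighted by the number of occurrences.
    avg_c = sum(complexity[t] * f for d in docs for t, f in d.items()) / total_len
    return max(
        score(term, d, doc_freq, n_docs, avg_dl, avg_c, complexity, k, b) for d in docs
    )


# Toy corpus (hypothetical): each document maps subexpressions to their counts.
docs = [
    Counter({"zeta(s)": 3, "s": 5, "x": 2}),
    Counter({"x": 4, "s": 1}),
    Counter({"n": 2, "x": 1}),
    Counter({"n": 1}),
]
complexity = {"zeta(s)": 3, "s": 1, "x": 1, "n": 1}
top = mbm25("zeta(s)", docs, complexity)
```

Note that both logarithms can become negative for terms that occur in most documents or dominate a document, so their product can turn positive again. Frequency limits, such as the minimum and maximum document frequencies used in the retrieval experiments, keep such terms out of the ranking.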
To generate $D_q$, we indexed the text sources of the documents from arXiv and zbMATH via Elasticsearch (ES; https://github.com/elastic/elasticsearch [Accessed Sep. 2019], we used version 7.0.0) and performed the pre-processing steps: filtering stop words, stemming, and ASCII-folding (non-ASCII characters are replaced by their ASCII counterparts or ignored if no such counterpart exists). Table 3 summarizes the settings we used to retrieve MOIs from a topic-specific subset of documents $D_q$. We also set a minimum hit frequency according to the number of retrieved documents an expression appears in. This requirement filters out uncommon notations.

                    arXiv    zbMATH
Retrieved Doc.         40       200
Min. Hit Freq.          7         7
Min. DF                50        10
Max. DF               10k       10k

Table 3: Settings for the retrieval experiments.

Figure 5 shows the results for five search queries. We asked a domain expert from the National Institute of Standards and Technology (NIST) to annotate the results as related (shown as green dots in Figure 5) or non-related (red dots). We found that the results range from good (e.g., for the Riemann zeta function) to poor (e.g., for the beta function). For instance, the results for the Riemann zeta function are surprisingly accurate, since we could discover that parts of Riemann's hypothesis were ranked highly throughout the results (e.g., $\zeta(\tfrac{1}{2} + it)$). (Riemann proposed that the real part of every non-trivial zero of the Riemann zeta function is $1/2$. If this hypothesis is correct, all the non-trivial zeros lie on the critical line consisting of the complex numbers $1/2 + it$.) On the other hand, for the beta function, we retrieved only a few related hits, of which only one had a strong connection to the beta function $B(x, y)$. We observed that the results were quite sensitive to the chosen settings (see Table 3). For instance, for the beta function, the minimum hit frequency has a strong effect on the results, since many expressions are shared among multiple documents. For arXiv, the expressions $B(\alpha, \beta)$ and $B(x, y)$ only appear in one document of the retrieved 40. However, decreasing the minimum hit frequency would increase noise in the results.

Even though we asked a domain expert to annotate the results as relevant or not, there is still plenty of room for discussion. For instance, $(x + y)$ (rank 15 in zbMATH, ‘Beta Function’) is the argument of the gamma function $\Gamma(x + y)$ that appears in the definition of the beta function [46, (5.12.1)] $B(x, y) := \Gamma(x)\Gamma(y)/\Gamma(x + y)$. However, this relation is weak at best, and thus might be considered as not related. Other examples are $\operatorname{Re} z$ and $\operatorname{Re}(s)$, which play a crucial role in the scenario of the Riemann hypothesis (all non-trivial zeroes have $\operatorname{Re}(s) = \tfrac{1}{2}$). Again, this connection is not obvious, and these expressions are often used in multiple scenarios. Thus, the domain expert did not mark the expressions as being related.

Considering the differences in the documents, it is promising to have observed a relatively high number of shared hits in the results. Further, we were able to retrieve some surprisingly good insights from the results, such as extracting the full definition of the Riemann zeta function [46, (25.2.1)] $\zeta(s) := \sum_{n=1}^{\infty} \frac{1}{n^s}$. Even though a high number of shared hits seems to substantiate the reliability of the system, there were several aspects that affected the outcome negatively, from the exact definition of the search queries to retrieve documents via ES, to the number of retrieved documents, the minimum hit frequency, and the parameters in mBM25.

Table 4: The top-5 frequent mathematical expressions in the result set of zbMATH for the search queries ‘Riemann Zeta Function’ (top) and ‘Eigenvalue’ (bottom) grouped by their complexities (left) and the hits reordered according to their relevance scores (right). The TF-IDF score was calculated with normalized term frequencies.

Auto-completion for ‘$E = m$’
Sug. Expression                    TF     DF
$E = mc^2$                        558    376
$E = m \cosh\theta$                23     23
$E = mv$
$E = m/\sqrt{1 - \dot{q}^2}$       12      6
$E = m/\sqrt{1 - \beta^2}$         10      6
$E = mc^2\gamma$

Suggestions for ‘$E = \{m, c\}$’
Sug. Expression                    TF     DF
$E = mc^2$                        558    376
$E = \gamma mc^2$                  39     38
$E = \gamma m_e c^2$               41     36
$E = m \cosh\theta$                23     23
$E = -mc^2$                        35     17
$E = \sqrt{m^2c^4 + p^2c^2}$       10      8

Table 5: Suggestions to complete ‘$E = m$’ and ‘$E = \{m, c\}$’ (the right-hand side contains $m$ and $c$) with term and document frequency based on the distributions of formulae in arXiv.

The presented results are beneficial for a variety of use-cases. In the following, we will demonstrate and discuss several of the applications that we propose.
Extension of zbMATH’s Search Engine:
Formula search engines are often counterintuitive when compared to textual search, since the user must know how the system operates to enter a search query properly (e.g., does the system support LaTeX inputs?). Additionally, mathematical concepts can be difficult to capture using only mathematical expressions. Consider, for example, someone who wants to search for mathematical expressions that are related to eigenvalues. A textual search query would only retrieve entire documents that require further investigation to find related expressions. A mathematical search engine, on the other hand, is impractical since it is not clear what would be a fitting search query (e.g., $Av = \lambda v$?). Moreover, formula and textual search systems for scientific corpora are separated from each other. Thus, a textual search engine capable of retrieving mathematical formulae can be beneficial. Also, many search engines allow for narrowing down relevant hits by suggesting filters based on the retrieved results. This technique is known as faceted search. The zbMATH search engine also provides faceted search, e.g., by authors or year. Adding facets for mathematical expressions allows users to narrow down the results more precisely to arrive at specific documents.

Our proposed system for extracting relevant expressions from scientific corpora via mBM25 scores can be used to search for formulae even with textual search queries, and to add more filters for faceted search implementations. Table 4 shows two examples of such an extension for zbMATH's search engine. Searching for ‘Riemann Zeta Function’ and ‘Eigenvalue’ retrieved 4,739 and 25,248 documents from zbMATH, respectively. Table 4 shows the most frequently used mathematical expressions in the set of retrieved documents. It also shows the formulae reordered according to a default TF-IDF score (with normalized term frequencies) and our proposed mBM25 score.
The results can be used to add filters for faceted search, e.g., show only the documents which contain $u \in W^{1,p}(\Omega)$. Additionally, the search system now provides more intuitive textual inputs even for retrieving mathematical formulae. The retrieved formulae are also interesting by themselves, since they provide insightful information on the retrieved publications. As already explored with our custom document search system in Figure 5, the Riemann hypothesis is also prominent in these retrieved documents.

The differences between the TF-IDF and mBM25 rankings illustrate how difficult an extensive evaluation of our system is. From a broader perspective, the hit $Ax = \lambda Bx$ is highly correlated with the input query ‘Eigenvalue’. On the other hand, the raw frequencies revealed a prominent role of $\operatorname{div}(|\nabla u|^{p-2} \nabla u)$. Therefore, the top results of the mBM25 ranking can also be considered as relevant.

Math Notation Analysis:
A faceted search system allows us to analyze mathematical notations in more detail. For instance, we can retrieve documents from a specific time period. This allows one to study the evolution of mathematical notation over time [1] or to identify trends in specific fields. Also, we can analyze standard notations for specific authors, since it is often assumed that authors prefer a specific notation style which may vary from the standard notation in a field.
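The expression facets and the time-period analysis described above could be sketched as follows. This is a minimal illustration under assumptions, not zbMATH's search interface: the `Doc` record, its fields, and both function names are hypothetical, and each document is assumed to come with a publication year and the set of expressions extracted from it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one retrieved document."""
    doc_id: str
    year: int
    expressions: FrozenSet[str]  # string forms of the expressions it contains


def filter_docs(docs: Iterable[Doc], contains: Optional[str] = None,
                year_from: Optional[int] = None,
                year_to: Optional[int] = None) -> List[Doc]:
    """Apply an expression facet and/or an inclusive year range."""
    hits = []
    for d in docs:
        if contains is not None and contains not in d.expressions:
            continue
        if year_from is not None and d.year < year_from:
            continue
        if year_to is not None and d.year > year_to:
            continue
        hits.append(d)
    return hits


def expression_facets(docs: Iterable[Doc], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n expressions by the number of documents containing them."""
    counts: Counter = Counter()
    for d in docs:
        counts.update(d.expressions)  # each expression counts once per document
    return counts.most_common(top_n)
```

Calling `expression_facets` on the output of `filter_docs` for two different year ranges gives a simple way to compare which notations dominate in each period.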
Math Recommendation Systems:
The frequency distributions of formulae can be used to realize effective math recommendation tasks, such as type hinting or error correction. Existing learning-based approaches require long training on large datasets, but may still generate meaningless results, such as $G_i = \{(x, y) \in \mathbb{R}^n : x_i = x_i\}$ [42]. We propose a simpler system which takes advantage of our frequency distributions. We retrieve entries from our result database, which contains all unique expressions and their frequencies. We implemented a simple prototype that retrieves the entries via pattern matching. Table 5 shows two examples. The left side of the table shows suggested autocompleted expressions for the query ‘$E = m$’. The right side shows suggestions for ‘$E =$’, where the right-hand side of the equation should contain $m$ and $c$ in any order. A combination using more advanced retrieval techniques, such as similarity measures based on symbol layout trees [24, 25], would enlarge the number of suggestions. This kind of autocomplete and error-correction type-hinting system would be beneficial for various use-cases, e.g., in educational software or for search engines as a pre-processing step of the input.

Figure 6: The top ranked expression for ‘Jacobi polynomial’ in arXiv and zbMATH. For arXiv, 30 documents were retrieved with a minimum hit frequency of 7.

Plagiarism Detection Systems:
As previously mentioned, plagiarism detection systems [29, 39, 41] would benefit from a system capable of distinguishing conventional from uncommon notations. The approaches described by Meuschke et al. [39] outperform existing approaches by considering frequency distributions of single identifiers (expressions of complexity one). Considering that single identifiers make up only 0.03% of all unique expressions in arXiv, we presume that better performance can be achieved by considering more complex expressions. The string representation we use also provides a simple format to embed complex expressions in existing learning algorithms.

Expressions with high complexities that are shared among multiple documents may provide further hints to investigate potential plagiarism. For instance, the most complex expression that was shared among three documents in arXiv was Equation (3). A complex expression being identical in multiple documents could indicate a higher likelihood of plagiarism. Further investigation revealed that similar expressions, e.g., with infinite sums, are frequently used among a larger set of documents. Thus, the expression seems to be a part of a standard notation that is commonly shared, rather than a good candidate for plagiarism detection. Through manual investigation, we identified the equation as part of a concept called the generalized Hardy-Littlewood inequality; Equation (3) appears in the three documents [12, 18, 17]. All three documents share one author. Thus, this case also demonstrates a correlation between complex mathematical notations and authorship.
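Both the pattern-matching auto-completion described under Math Recommendation Systems and the shared-expression heuristic for plagiarism detection reduce to lookups in a table of unique expressions with their frequencies. The sketch below illustrates this under several assumptions: the `Entry` record, the function names, the string encoding of expressions, the ranking by document frequency, and the thresholds are all hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class Entry:
    """Hypothetical row of the frequency table."""
    expr: str        # string form of the expression (encoding is an assumption)
    tf: int          # term frequency over the corpus
    df: int          # number of documents containing the expression
    complexity: int  # complexity of the expression, as defined earlier in the paper


def suggest(entries: Iterable[Entry], prefix: str,
            required: Sequence[str] = (), limit: int = 5) -> List[Entry]:
    """Complete `prefix`; the remainder must contain every symbol in `required`.

    Matching is plain substring search, so a required 'c' is also satisfied
    by a function name such as 'cosh'. Hits are ranked by DF, then TF.
    """
    hits = []
    for e in entries:
        if not e.expr.startswith(prefix):
            continue
        rest = e.expr[len(prefix):]
        if all(sym in rest for sym in required):
            hits.append(e)
    hits.sort(key=lambda e: (-e.df, -e.tf, e.expr))
    return hits[:limit]


def shared_complex(entries: Iterable[Entry], min_complexity: int,
                   min_df: int = 2, max_df: int = 5) -> List[Entry]:
    """Complex expressions shared by only a few documents.

    These are candidates for manual inspection only; as the Hardy-Littlewood
    example shows, a hit may simply reflect common authorship.
    """
    return [e for e in entries
            if e.complexity >= min_complexity and min_df <= e.df <= max_df]
```

With a table holding an entry such as `Entry("E = mc^2", 558, 376, 3)` (frequencies as in Table 5, complexity invented for the example), `suggest(table, "E = m")` would rank that entry first among the completions.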
Semantic Taggers and Extraction Systems:
We previously mentioned that semantic extraction systems [23, 28, 30] and semantic math taggers [16, 31] have difficulties in extracting the essential components (MOIs) from complex expressions. Considering the definition of the Jacobi polynomial in Equation (1), it would be beneficial to extract the groups of tokens that belong together, such as $P_n^{(\alpha,\beta)}(x)$ or $\Gamma(\alpha + m + 1)$. With our proposed search engine for retrieving MOIs, we are able to facilitate semantic extraction systems and semantic math taggers. Imagine such a system being capable of identifying the term ‘Jacobi polynomial’ from the textual context. Figure 6 shows the top relevant hits for the search query ‘Jacobi polynomial’ retrieved from zbMATH and arXiv. The results contain several relevant and related expressions, such as the constraints $\alpha, \beta > -1$ and the weight function $(1 - x)^{\alpha} (1 + x)^{\beta}$, which are essential properties of this orthogonal polynomial. Based on these retrieved MOIs, extraction systems can adjust their retrieved math elements to improve precision, and semantic taggers or a tokenizer could re-organize parse trees to more closely resemble expression trees.

In this study we showed that analyzing the frequency distributions of mathematical expressions in large scientific datasets can provide useful insights for a variety of applications. We demonstrated the versatility of our results by implementing prototypes of a type-hinting system for math recommendations, an extension of zbMATH's search engine, and a mathematical retrieval system to search for topic-specific MOIs. Additionally, we discussed the potential impact and suitability in other applications, such as math search engines, plagiarism detection systems, and semantic extraction approaches. We are confident that this project lays a foundation for future research in the field of MathIR.

We plan on developing a web application which would provide easy access to our frequency distributions, the MOI search engine, and the type-hinting recommendation system.
We hope that this will further expedite related future research projects. Moreover, we will use this web application for an online evaluation of our MOI retrieval system. Since the level of agreement among annotators will be predictably low, an evaluation by a large community is desired.

In this first study, we preserved the core structure of the MathML data, which provided insightful information for the MathML community. However, this makes it difficult to properly merge formulae. In future studies, we will normalize the MathML data via MathMLCan [9]. In addition to this normalization, we will include wildcards for investigating distributions of formula patterns rather than exact expressions. This will allow us to study connections between math objects, e.g., between $\Gamma(z)$ and $\Gamma(x + 1)$. This would further improve our recommendation system and would allow for the identification of regions for parameters and variables in complex expressions.

Acknowledgments

Discovering Mathematical Objects of Interest was supported by the German Research Foundation (DFG grant GI-1259-1).
References

[1] Florian Cajori. A History of Mathematical Notations. Vol. 1 & 2. London, UK: The Open Court Company, 1929.
[2] Akiko N. Aizawa. An information-theoretic perspective of tf-idf measures. In: Inf. Process. Manage. (2003), pp. 45–65.
[3] Glenn Gordon Smith and David Ferguson. Diagrams and math notation in e-learning: growing pains of a new generation. In: International Journal of Mathematical Education in Science and Technology (5 2004), pp. 681–695.
[4] Ashish Lohia, Kirti Sinha, Soujanya Vadapalli, and Kamalakar Karlapalem. An Architecture for Searching and Indexing Latex Equations in Scientific Literature. In: Proc. COMAD. Goa, India: Computer Society of India, 2005, pp. 122–130.
[5] Alex Gaudeul. Do Open Source Developers Respond to Competition?: The LaTeX Case Study. In: Review of Network Economics (2 June 2007), pp. 239–263.
[6] Christian Grün, Sebastian Gath, Alexander Holupirek, and Marc Scholl. XQuery Full Text Implementation in BaseX. In: Database and XML Technologies. Springer Berlin, 2009, pp. 114–128.
[7] Stephen E. Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. In: Foundations and Trends in Information Retrieval (2009), pp. 333–389.
[8] Shahab Kamali and Frank Wm. Tompa. A new mathematics retrieval system. In: Proc. ACM CIKM. Toronto, Ontario, Canada: ACM, 2010, pp. 1413–1416.
[9] David Formánek, Martin Líška, Michal Růžička, and Petr Sojka. Normalization of Digital Mathematics Library Content. In: Proc. of OpenMath/MathUI/CICM-WiP. CEUR Workshop Proceedings. Bremen, Germany, 2012, pp. 91–103.
[10] Michael Kohlhase, Bogdan A. Matican, and Corneliu-Claudiu Prodescu. MathWebSearch 0.5: Scaling an Open Formula Search Engine. In: Intelligent Computer Mathematics - 11th International Conference, AISC 2012, 19th Symposium, Calculemus 2012, 5th International Workshop, DML 2012, 11th International Conference, MKM 2012, Systems and Projects, Held as Part of CICM 2012, Bremen, Germany, July 8-13, 2012. Proceedings. Bremen, Germany: Springer Berlin Heidelberg, 2012, pp. 342–357.
[11] Shahab Kamali and Frank Wm. Tompa. Retrieving documents with mathematical content. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, Dublin, Ireland - July 28 - August 01, 2013. Dublin, Ireland: ACM, 2013, pp. 353–362.
[12] Gustavo Araujo and Daniel Pellegrino. On the constants of the Bohnenblust-Hille inequality and Hardy–Littlewood inequalities. In: CoRR (2014).
[13] Giovanni Yoko Kristianto, Goran Topic, Florence Ho, and Akiko Aizawa. The MCAT Math Retrieval System for NTCIR-11 Math Track. In: Proc. 11th NTCIR Conference on Evaluation of Information Access Technologies, National Center of Sciences. Tokyo, Japan: National Institute of Informatics (NII), 2014.
[14] Aldo Lipani, Linda Andersson, Florina Piroi, Mihai Lupu, and Allan Hanbury. TUW-IMP at the NTCIR-11 Math-2. In: Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-11, National Center of Sciences, Tokyo, Japan, December 9-12, 2014. Tokyo, Japan: National Institute of Informatics (NII), 2014.
[15] Steven T. Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions. In: Psychonomic Bulletin & Review (Mar. 2014), pp. 1112–1130.
[16] Pao-Yu Chien and Pu-Jen Cheng. Semantic Tagging of Mathematical Expressions. In: Proc. WWW'2015. Florence, Italy: ACM, 2015, pp. 195–204.
[17] Daniel Pellegrino. A short communication on the constants of the multilinear Hardy–Littlewood inequality. In: CoRR (2015).
[18] Jamilson R. Campos, Wasthenny Cavalcante, Vinícius V. Fávaro, Daniel Nuñez-Alarcón, Daniel Pellegrino, and Diana M. Serrano-Rodríguez. Polynomial and multilinear Hardy–Littlewood inequalities: analytical and numerical approaches. In: CoRR (2015).
[19] Leonard Wörteler, Michael Grossniklaus, Christian Grün, and Marc Scholl. Function inlining in XQuery 3.0 optimization. In: Proc. 15th DBLP. Pittsburgh, PA, USA: ACM, 2015, pp. 45–48.
[20] Jöran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. Research-paper recommender systems: a literature survey. In: Int. J. on Digital Libraries (2016), pp. 305–338.
[21] Ferruccio Guidi and Claudio Sacerdoti Coen. A Survey on Retrieval of Mathematical Knowledge. In: Mathematics in Computer Science (2016), pp. 409–427.
[22] Shunsuke Ohashi, Giovanni Yoko Kristianto, Goran Topic, and Akiko Aizawa. Efficient Algorithm for Math Formula Semantic Search. In: IEICE Transactions (2016), pp. 979–988.
[23] Moritz Schubotz, Alexey Grigorev, Marcus Leich, Howard S. Cohl, Norman Meuschke, Bela Gipp, Abdou S. Youssef, and Volker Markl. Semantification of Identifiers in Mathematics for Better Math Information Retrieval. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Full Paper. Pisa, Italy: ACM, 2016, pp. 135–144. doi: 10.1145/2911451.2911503.
[24] Richard Zanibbi, Kenny Davila, Andrew Kane, and Frank Wm. Tompa. Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, 2016, pp. 145–154.
[25] Kenny Davila and Richard Zanibbi. Layout and Semantics: Combining Representations for Mathematical Formula Search. In: Proc. ACM SIGIR. Shinjuku, Tokyo: ACM, 2017, pp. 1165–1168.
[26] Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, and Zhi Tang. Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? In: CoRR (2017).
[27] Andrea Kohlhase, Michael Kohlhase, and Michael Fürsich. Visual Structure in Mathematical Expressions. In: Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings. Lecture Notes in Computer Science. Edinburgh, UK: Springer, 2017, pp. 208–223.
[28] Giovanni Yoko Kristianto, Goran Topic, and Akiko Aizawa. Utilizing dependency relationships between math expressions in math IR. In: Information Retrieval Journal (2017), pp. 132–167. doi: 10.1007/s10791-017-9296-8.
[29] Norman Meuschke, Moritz Schubotz, Felix Hamborg, Tomás Skopal, and Bela Gipp. Analyzing Mathematical Content to Detect Academic Plagiarism. In: Proc. ACM CIKM. Singapore: ACM, 2017, pp. 2211–2214.
[30] Moritz Schubotz, Leonard Krämer, Norman Meuschke, Felix Hamborg, and Bela Gipp. Evaluating and Improving the Extraction of Mathematical Identifier Definitions. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings. Lecture Notes in Computer Science. Springer, 2017, pp. 82–94.
[31] Abdou Youssef. Part-of-Math Tagging and Applications. In: Intelligent Computer Mathematics. Cham: Springer International Publishing, 2017, pp. 356–374.
[32] Deyan Ginev. arXMLiv:08.2018 dataset, an HTML5 conversion of arXiv.org. SIGMathLing – Special Interest Group on Math Linguistics. 2018. url: https://sigmathling.kwarc.info/resources/arxmliv/.
[33] Andrea Kohlhase. Factors for Reading Mathematical Expressions. In: Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018. CEUR Workshop Proceedings. Mannheim, Germany: CEUR-WS.org, 2018, pp. 195–202.
[34] Andrea Kohlhase, Michael Kohlhase, and Taweechai Ouypornkochagorn. Discourse Phenomena in Mathematical Documents. In: Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings. Lecture Notes in Computer Science. Hagenberg, Austria: Springer, 2018, pp. 147–163.
[35] Kriste Krstovski and David M. Blei. Equation Embeddings. In: CoRR (2018).
[36] Moritz Schubotz, André Greiner-Petter, Philipp Scharpf, Norman Meuschke, Howard S. Cohl, and Bela Gipp. Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018. Fort Worth, USA: ACM, 2018, pp. 233–242.
[37] André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William I. Grosky, and Bela Gipp. Why Machines Cannot Learn Mathematics, Yet. In: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019. Paris, France: CEUR-WS.org, 2019, pp. 130–137.
[38] Fabian Hueske and Timo Walther. Apache Flink. In: Encyclopedia of Big Data Technologies. Springer, 2019.
[39] Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, and Bela Gipp. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Urbana-Champaign, USA, June 2019, pp. 120–129.
[40] Moritz Schubotz and Olaf Teschke. Four decades of TeX at zbMATH. In: Newsletter of the European Mathematical Society (EMS) (2019), pp. 50–52.
[41] Moritz Schubotz, Olaf Teschke, Vincent Stange, Norman Meuschke, and Bela Gipp. Forms of Plagiarism in Digital Mathematical Libraries. In: Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings. Lecture Notes in Computer Science. Prague, Czech Republic: Springer, 2019, pp. 258–274.
[42] Michihiro Yasunaga and John Lafferty. TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts. In: CoRR (2019).
[43] Abdou Youssef and Bruce R. Miller. Explorations into the Use of Word Embedding in Math Search and Math Semantics. In: Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings. Lecture Notes in Computer Science. Prague, Czech Republic: Springer, 2019, pp. 291–305.
[44] André Greiner-Petter, Moritz Schubotz, Fabian Müller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp. Discovering Mathematical Objects of Interest - A Study of Mathematical Notations. In: Proceedings of The Web Conference 2020 (WWW'20), April 20–24, 2020, Taipei, Taiwan. doi: 10.1145/3366423.3380218.
[45] Bruce R. Miller. LaTeXML: A LaTeX to XML/HTML/MathML Converter. http://dlmf.nist.gov/LaTeXML/. Accessed: 2019-09-01.
[46] NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.0.25 of 2019-12-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds.

Listing 2: Use the following BibTeX code to cite this article

@inproceedings{GreinerPetter2020,
  author = {Greiner-Petter, Andr{\'{e}} and Schubotz, Moritz and M\"{u}ller, Fabian and Breitinger, Corinna and Cohl, Howard S. and Aizawa, Akiko and Gipp, Bela},
  booktitle = {Proceedings of The Web Conference 2020 (WWW'20), April 20--24, 2020, Taipei, Taiwan},
  doi = {10.1145/3366423.3380218},
  title = {Discovering Mathematical Objects of Interest - A Study of Mathematical Notations},
}