Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations

Abstract

Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases. Examples include assisting semantic extraction systems, improving scientific search engines, and facilitating specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_n^{(\alpha,\beta)}(x)$ with 'Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data.


Preprint from:

André Greiner-Petter et al. Discovering Mathematical Objects of Interest - A Study of Mathematical Notations. In: Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan.

André Greiner-Petter, Moritz Schubotz, Fabian Müller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp

University of Wuppertal, Germany ([email protected], {last}@uni-wuppertal.de)
FIZ-Karlsruhe, Germany ({first.last}@fiz-karlsruhe.de)
National Institute of Standards and Technology, Mission Viejo, California, U.S.A. ([email protected])
National Institute of Informatics, Japan ({last}@nii.ac.jp)
University of Konstanz, Germany ({first.last}@uni-konstanz.de)

February 20, 2020

Introduction

Taking into account mathematical notation in the literature leads to a better understanding of scientific literature on the Web and allows one to make use of semantic information in specialized Information Retrieval (IR) systems. Nowadays, applications in Math Information Retrieval (MathIR) [21], such as search engines [4, 8, 10, 11, 13, 22, 25], semantic extraction systems [23, 28, 30], recent efforts in math embeddings [26, 35, 37, 43], and semantic tagging of math formulae [16, 31], either consider an entire equation as one entity or only focus on single symbols. Since math expressions often contain meaningful and important subexpressions, these applications could benefit from an approach that lies between the extremes of examining only individual symbols or considering an entire equation as one entity. Consider, for example, the explicit definition for Jacobi polynomials [46, (18.5.7)]

$$P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{x-1}{2}\right)^{m}. \qquad (1)$$

The interesting components in this equation are $P_n^{(\alpha,\beta)}(x)$ on the left-hand side, and the appearance of the gamma function $\Gamma(s)$ on the right-hand side, implying a direct relationship between Jacobi polynomials and the gamma function. Considering the entire expression as a single object misses this important relationship. On the other hand, focusing on single symbols can result in the misleading interpretation of $\Gamma$ as a variable and $\Gamma(\alpha+n+1)$ as a multiplication between $\Gamma$ and $(\alpha+n+1)$. A system capable of identifying the important components, such as $P_n^{(\alpha,\beta)}(x)$ or $\Gamma(\alpha+n+1)$, is therefore desirable. Hereafter, we define these components as Mathematical Objects of Interest (MOIs) [37].

The importance of math objects is a somewhat imprecise description and thus difficult to measure. Currently, not much effort has been made in identifying meaningful subexpressions. Kristianto et al. [28] introduced dependency graphs between formulae.
With this approach, they were able to build dependency graphs of mathematical expressions, but only if the expressions appeared as single expressions in the context. For example, if $\Gamma(\alpha+n+1)$ appears as a stand-alone expression in the context, the algorithm will declare a dependency with Equation (1). However, it is more likely that different forms, such as $\Gamma(s)$, appear in the context. Since this expression does not match any subexpression in Equation (1), the approach cannot establish a connection with $\Gamma(s)$. Kohlhase et al. studied in [27, 33, 34] another approach to identify essential components in formulae. They performed eye-tracking studies to identify important areas in rendered mathematical formulae. While this is an interesting approach that allows one to learn more about how humans read and understand math, it is not feasible for extensive studies.

This paper presents the first extensive frequency distribution study of mathematical equations in two large scientific corpora, the e-Print archive arXiv.org (hereafter referred to as arXiv; https://arxiv.org/, accessed Sep. 1, 2019) and the international reviewing service for pure and applied mathematics zbMATH. We will show that math expressions, similar to words in natural language corpora, also obey Zipf's law [15], and therefore follow a Zipfian distribution. Related research projects observed a relation to Zipf's law for single math symbols [16, 23]. In the context of quantitative linguistics, Zipf's law states that given a text corpus, the frequency of any word is inversely proportional to its rank in the frequency table. Motivated by the similarity to linguistic properties, we will present a novel approach for ranking formulae by their relevance via a customized version of the ranking function BM25 [7]. We will present results that can be easily embedded in other systems in order to distinguish between common and uncommon notations within formulae.
Our results lay a foundation for future research projects in MathIR.

Fundamental knowledge on frequency distributions of math formulae is beneficial for numerous applications in MathIR, ranging from educational purposes [3] to math recommendation systems, search engines [22, 25], and even automatic plagiarism detection systems [29, 39, 41]. For example, students can search for the conventions to write certain quantities in formulae; document preparation systems can integrate an auto-completion or auto-correction service for math inputs; search or recommendation engines can adjust their ranking scores with respect to standard notations; and plagiarism detection systems can estimate whether two identical formulae indicate potential plagiarism or are just using the conventional notations in a particular subject area. To exemplify the applicability of our findings, we present a textual search approach to retrieve mathematical formulae. Further, we will extend zbMATH's faceted search by providing facets of mathematical formulae according to a given textual search query. Lastly, we present a simple auto-completion system for math inputs as a contribution towards advancing mathematical recommendation systems. Further, we show that the results provide useful insights for plagiarism detection algorithms. We provide access to the source code, the results, and extended versions of all of the figures appearing in this paper at https://github.com/ag-gipp/FormulaCloudData.

Related Work:

Today, mathematical search engines index formulae in a database. Much effort has been undertaken to make this process as efficient as possible in terms of precision and runtime performance [4, 8, 14, 24, 25]. The generated databases naturally contain the information required to examine the distributions of the indexed mathematical formulae. Yet, no in-depth studies of these distributions have been undertaken. Instead, math search engines focus on other aspects, such as devising novel similarity measures and improving runtime efficiency. This is because the goal of math search engines is to retrieve relevant (i.e., similar) formulae which correspond to a given search query that partially [13, 14, 22] or exclusively [8, 11, 25] contains formulae. However, for a fundamental study of distributions of mathematical expressions, neither similarity measures nor efficient lookup or indexing is required. Thus, we use the general-purpose query language XQuery and employ the BaseX implementation (http://basex.org/, accessed Sep. 2019; we used BaseX 9.2 for our experiments). […]

LaTeX is the de facto standard for the preparation of academic manuscripts in the fields of mathematics and physics [5]. Since LaTeX allows for advanced customizations and even computations, it is challenging to process. For this reason, LaTeX expressions are unsuitable for an extensive distribution analysis of mathematical notations. For mathematical expressions on the web, the XML-formatted MathML is the current standard, as specified by the World Wide Web Consortium (W3C). The tree structure and the fixed standard, i.e., MathML tags, cannot be changed, thus making this data format reliable. Several available tools are able to convert from LaTeX to MathML [36] and various databases are able to index XML data. Thus, for this study, we have chosen to focus on MathML. In the following, we investigate the databases arXMLiv (08/2018) [32] and zbMATH [40].

The arXMLiv dataset (≈ […]) […] LaTeXML [45]. LaTeXML converted all mathematical expressions into MathML with parallel markup, i.e., presentation and content MathML. In this study we only consider the subsets no-problem and warning, which generated no errors during the conversion process. Nonetheless, the MathML data generated still contains some errors or falsely annotated math. For example, we discovered several instances of affiliation and footnotes, SVG (Scalable Vector Graphics) and other unknown tags, encoded in MathML. Regarding the footnotes, we presumed that authors falsely used mathematical environments for generating footnote or affiliation marks. We used the TeX string, provided as an attribute in the MathML data, to filter out expressions that match the string '{}^{*}', where '*' indicates any possible expression. In addition, we filtered out SVG and other unknown tags. We assume that these expressions were generated by mistake due to limitations of LaTeXML. The final arXiv dataset consisted of 841,008 documents which contained at least one mathematical formula. The dataset contained a total of 294,151,288 mathematical expressions.

In addition to arXiv, we investigated zbMATH (https://zbmath.org/, accessed Sep. 1, 2019), an international reviewing service for pure and applied mathematics which contains abstracts and reviews of articles, hereafter uniformly called abstracts, mainly from the domains of pure and applied mathematics. The abstracts in zbMATH are formatted in TeX [40]. To be able to compare arXiv and zbMATH, we manually generated MathML via LaTeXML for each mathematical formula in zbMATH and performed the […]

Listing 1: MathML representation of $P_n^{(\alpha,\beta)}(x)$.

Since we focused on the frequency distributions of visual expressions, we only considered presentational MathML (pMML). Rather than normalizing the pMML data, e.g., via MathMLCan [9], which would also change the tree structure and visual core elements in pMML, we only eliminated the attributes. These attributes are used for minor visual changes, e.g., stretched parentheses or inline limits of sums and integrals. Thus, for this first study, we preserved the core structure of the pMML data, which might provide insightful statistics for the MathML community to further cultivate the standard. After extracting all MathML expressions, filtering out falsely annotated math and SVG tags, and eliminating unnecessary attributes and annotations, the datasets required 83GB of disk space for arXiv and 6GB for zbMATH, respectively.

In the following, we indexed the data via BaseX. The indexed datasets required a disk space of 143.9GB in total (140GB for arXiv and 3.9GB for zbMATH). Due to the limitations of databases in BaseX (a detailed overview can be found at http://docs.basex.org/wiki/Statistics, accessed Sep. 1, 2019), it was necessary to split our datasets into smaller subsets. We split the datasets according to the 20 major article categories of arXiv and classifications of zbMATH. The arXiv categories astro-ph (astro physics), cond-mat (condensed matter), and math (mathematics) were still too large for a single database. Thus, we split those categories into two equally sized parts. To increase performance, we use BaseX in a server-client environment. We experienced performance issues in BaseX when multiple clients repeatedly requested data from the same server in short intervals. We determined that the best workaround for this issue was to launch BaseX servers for each database, i.e., each category/classification.

Mathematical expressions often consist of multiple meaningful subexpressions, which we defined as MOIs. However, without further investigation of the […]

Since MathML is an XML data format (essentially a tree-structured format), we define subexpressions of equations as subtrees of its MathML format.

Listing 1 illustrates a Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ in pMML. The <mo> element on line 14 contains the invisible times character. By definition, the <math> element is the root element of MathML expressions. Since we cut off all other elements besides pMML nodes, each <math> element has one and only one child element. Thus, we define the child element of the <math> element as the root of the expression. Starting from this root element, we explore all subexpressions. For this study, we presume that every meaningful mathematical object (i.e., MOI) must contain at least one identifier.

Hence, we only study subtrees which contain at least one <mi> node. Identifiers, in the sense of MathML, are 'symbolic names or arbitrary text', e.g., single Latin or Greek letters. Identifiers do not contain special characters (other than Greek letters) or numbers. As a consequence, arithmetic expressions, such as $(1+2)$, or sequences of special characters and numbers, such as $\{1, 2, \ldots\} \cap \{-1\}$, will not appear in our distributional analysis. However, if a sequence or arithmetic expression contains an identifier somewhere in the pMML tree (such as in $\{1, 2, \ldots\} \cap A$), the entire expression will be recognized. The Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$, therefore, consists of the following subexpressions: $P_n^{(\alpha,\beta)}$, $(\alpha,\beta)$, $(x)$, and the single identifiers $P$, $n$, $\alpha$, $\beta$, and $x$. The entire expression is also a mathematical object. Hence, we take entire expressions with an identifier into account for our analysis. In the following, the set of subexpressions will be understood to include the expression itself.

For our experiments, we also generated a string representation of the MathML data. The string is generated recursively by applying one of two rules for each node: (i) if the current node is a leaf, the node-tag and the content will be merged by a colon, e.g., <mi>x</mi> will be converted to mi:x; (ii) otherwise the node-tag wraps parentheses around its content and separates the children by a comma, e.g., <mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow> will be converted to mrow(mo:(,mi:x,mo:)). Furthermore, the special Unicode characters for invisible times (U+2062) and function application (U+2061) are replaced by ivt and fa, respectively.
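The two conversion rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (which runs as XQuery in BaseX); the function name `to_string` and the hand-written pMML input for Γ(x + 1) are assumptions made for the example, and the input is assumed to carry no XML namespace or attributes.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"       # replaced by "ivt"
FUNCTION_APPLICATION = "\u2061"  # replaced by "fa"


def to_string(node):
    """Convert a pMML subtree into its string representation.

    Rule (i):  a leaf becomes 'tag:content'.
    Rule (ii): an inner node becomes 'tag(child1,child2,...)'.
    """
    children = list(node)
    if not children:
        text = node.text or ""
        text = text.replace(INVISIBLE_TIMES, "ivt")
        text = text.replace(FUNCTION_APPLICATION, "fa")
        return f"{node.tag}:{text.strip()}"
    return f"{node.tag}({','.join(to_string(child) for child in children)})"


# Hypothetical pMML for the gamma function with argument x + 1,
# with an invisible times character between the symbol and its argument.
GAMMA_PMML = (
    "<mrow><mi>\u0393</mi><mo>\u2062</mo>"
    "<mrow><mo>(</mo><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow>"
    "</mrow>"
)

gamma_root = ET.fromstring(GAMMA_PMML)
print(to_string(gamma_root))
```

Because the string is built purely from tags and leaf contents, two subtrees yield the same string exactly when they have the same structure, which is what makes it usable as a key for counting.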
For example, the gamma function with argument $x+1$, $\Gamma(x+1)$, would be represented by

mrow(mi:Γ,mo:ivt,mrow(mo:(,mrow(mi:x,mo:+,mn:1),mo:))). (2)

Between $\Gamma$ and $(x+1)$, there would most likely be the special character for invisible times rather than for function application, because LaTeXML is not able to parse $\Gamma$ as a function. Note that this string conversion is a bijective mapping. (Sequences are always nested in an <mrow> element.)

Mathematical expressions can become complex and lengthy. The tree structure of MathML allows us to introduce a measure that reflects the complexity of mathematical expressions. More complex expressions usually consist of more extensively nested subtrees in the MathML data. Thus, we define the complexity of a mathematical expression by the maximum depth of the MathML tree. In XML the content of a node and its attributes are commonly interpreted as children of the node. Thus, we define the depth of a single node as 1 rather than 0, i.e., single identifiers, such as <mi>P</mi>, have a complexity of 1. The Jacobi polynomial from Listing 1 has a complexity of 4.

We perform the extraction of subexpressions from MathML in BaseX. The algorithm for the extraction process is written in XQuery. The algorithm traverses recursively downwards from the root to the leaves. In each iteration, it checks whether there is an identifier, i.e., an <mi> element, among the descendants of the current node. If there is no such element, the subtree will be ignored. It seems counterintuitive to start from the root and check if an identifier is among the descendants rather than starting at each identifier and traversing upwards to the root. If an XQuery requests a node in BaseX, BaseX loads the entire subtree of the requested node into the cache (up to a specified size). If the algorithm traverses upwards through the MathML tree, the XQuery will trigger database requests in every iteration.
Hence, the downwards implementation performs better, since there is only one database request for every expression rather than for every subexpression.

Since we only minimize the pMML data rather than normalizing it, two identically rendered expressions may have different complexities. For instance, <mrow><mi>x</mi></mrow> consists of two distinct subexpressions, but both of them are displayed the same. Another problem often appears for arrays or similar visually complicated structures. The extracted expressions are not necessarily logical subexpressions. We will consider applying more advanced embedding techniques such as special tokenizers [14], symbol layout trees [24, 25], and a MathML normalization via MathMLCan [9] in future research to overcome these issues.

By splitting each formula into subexpressions, we generated longer documents and a bias towards low complexities. Note that, hereafter, we only refer to the mathematical content of documents. Thus, the length of a document refers to the number of math formulae (here, the number of subexpressions) in the document. After splitting expressions into subexpressions, arXiv consists of 2.5B and zbMATH of 61M expressions, which raised the average document length to 2,982.87 for arXiv and 45.47 for zbMATH, respectively.

Table 1: Dataset overview. Average Document Length is defined as the average number of subexpressions per document.

Category                  arXiv            zbMATH
Documents                 841,008          1,349,297
Formulae                  294,151,288      11,747,860
Subexpressions            2,508,620,512    61,355,307
Unique Subexpressions     350,206,974      8,450,496
Average Document Length   2,982.87         45.47
Average Complexity        5.01             3.89
Maximum Complexity        218              26

Figure 1: Unique subexpressions for each complexity in arXiv and zbMATH.

For calculating frequency distributions, we merged two subexpressions if their string representations were identical. Remember, the string representation is unique for each MathML tree.
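The extraction, the complexity measure, and the merging step can be illustrated together. The following Python sketch is only an approximation of the described pipeline under stated assumptions: the authors use XQuery on BaseX, whereas this uses the standard library; the pMML snippets for the Jacobi polynomial and for f(x) are hand-written guesses at what LaTeXML might produce; and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def to_string(node):
    """String key of a pMML subtree ('tag:text' for leaves, 'tag(...)' otherwise)."""
    children = list(node)
    if not children:
        text = (node.text or "").replace("\u2062", "ivt").replace("\u2061", "fa")
        return f"{node.tag}:{text.strip()}"
    return f"{node.tag}({','.join(to_string(child) for child in children)})"


def complexity(node):
    """Maximum depth of the subtree, counting a single node as depth 1."""
    return 1 + max((complexity(child) for child in node), default=0)


def subexpressions(node):
    """Traverse downwards and yield every subtree that contains an <mi> node."""
    if not any(element.tag == "mi" for element in node.iter()):
        return  # no identifier in this subtree, so it is ignored
    yield node
    for child in node:
        yield from subexpressions(child)


# Hypothetical pMML inputs (attributes already removed).
JACOBI = (
    "<math><mrow><msubsup><mi>P</mi><mi>n</mi>"
    "<mrow><mo>(</mo><mi>\u03b1</mi><mo>,</mo><mi>\u03b2</mi><mo>)</mo></mrow>"
    "</msubsup><mo>\u2062</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math>"
)
F_OF_X = (
    "<math><mrow><mi>f</mi><mo>\u2062</mo>"
    "<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math>"
)

# The single child of <math> is the root of each expression.
jacobi_root = ET.fromstring(JACOBI)[0]
fx_root = ET.fromstring(F_OF_X)[0]

# Merge subexpressions whose string representations are identical.
counts = Counter(
    to_string(sub)
    for root in (jacobi_root, fx_root)
    for sub in subexpressions(root)
)

print(complexity(jacobi_root))
for key, frequency in counts.most_common(3):
    print(frequency, key)
```

In this toy corpus, `(x)` and `x` occur in both formulae and are therefore merged into single entries with frequency two, while an identifier-free tree such as `1 + 2` contributes nothing.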
After merging, arXiv consisted of 350,206,974 unique mathematical subexpressions with a maximum complexity of 218 and an average complexity of 5.01. For high complexities over 70, the formulae show some erroneous structures that might be generated from LaTeXML by mistake. For example, the expression with the highest complexity is a long sequence of a polynomial starting with '$P(t_1, t_2, t_3, t_4) =$' followed by 690 summands. The complexity is caused by a high number of unnecessarily deeply nested nodes. The highest complexity with a minimum document frequency of two is 39, which is a continued fraction. Since continued fractions are nested fractions, they naturally have a large complexity. One of the most complex expressions (complexity 20) with a minimum document frequency of three was the formula

$$\left( \sum_{j_1=1}^{n} \left( \sum_{j_2=1}^{n} \left( \cdots \left( \sum_{j_m=1}^{n} \left| T(e_{j_1}, \ldots, e_{j_m}) \right|^{q_m} \right)^{\frac{q_{m-1}}{q_m}} \cdots \right)^{\frac{q_2}{q_3}} \right)^{\frac{q_1}{q_2}} \right)^{\frac{1}{q_1}} \le C^{K}_{m,p,q} \, \| T \|. \qquad (3)$$

In contrast, zbMATH only consisted of 8,450,496 unique expressions with a maximum complexity of 26 and an average complexity of 3.89. One of the most complex expressions in zbMATH with a minimum document frequency of three was

$$M_p(r, f) = \left( \frac{1}{2\pi} \int_0^{2\pi} \left| f\left(re^{i\theta}\right) \right|^{p} d\theta \right)^{1/p}. \qquad (4)$$

As we expected, reviews and abstracts in zbMATH were generally shorter and consisted of less complex mathematical formulae. The dataset also appeared to contain fewer erroneous expressions, since expressions of complexity 25 are still readable and meaningful.

Figure 1 shows the ratio of unique subexpressions for each complexity in both datasets. The figure illustrates that both datasets share a peak at complexity four. Compared to zbMATH, the arXiv expressions are slightly more evenly distributed over the different levels of complexities. Interestingly, complexities one and two are not dominant in either of the two datasets. Single identifiers only make up 0.03% in arXiv and 0.12% in zbMATH, which is comparable to expressions of complexity 19 and 14, respectively. This finding illustrates the problem of capturing semantic meanings for single identifiers rather than for more complex expressions [30]. It also substantiates that entire expressions, if too complex, are not suitable either for capturing the semantic meanings [28]. Instead, a middle ground is desirable, since most unique expressions in both datasets have a complexity between 3 and 5. Table 1 summarizes the statistics of the examined datasets.

In linguistics, it is well known that word distributions follow Zipf's Law [15], i.e., the $r$-th most frequent word has a frequency that scales to

$$f(r) \propto \frac{1}{r^{\alpha}} \qquad (5)$$

with $\alpha \approx 1$. A better approximation can be applied by a shifted distribution

$$f(r) \propto \frac{1}{(r + \beta)^{\alpha}}, \qquad (6)$$

where $\alpha \approx 1$ and $\beta \approx 2.7$. In a study on Zipf's law, Piantadosi [15] illustrated that not only words in natural language corpora follow this law surprisingly accurately, but also many other human-created sets, for instance, in programming languages, in biological systems, and even in music.

Figure 2: (a) Frequency Distributions, (b) Complexity Distributions. Each figure illustrates the relationship between the frequency ranks (x-axis) and the normalized frequency (y-axis) in zbMATH (top) and arXiv (bottom). For arXiv, only the first 8 million entries are plotted to be comparable with zbMATH (≈ […]) […] α and β are provided in the plots. Subfigure (b) shades the bins from blue to red according to the maximum complexity in each bin.

Since mathematical communication has developed over centuries of research, it would not be surprising if mathematical notations also followed Zipf's law. The primary conclusion of the law illustrates that there are some very common tokens against a large number of symbols which are not used frequently. Based on this assumption, we can postulate that a score based on frequencies might be able to measure the peculiarity of a token.
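To make Equations (5) and (6) concrete, the following Python sketch evaluates the shifted power law and recovers the exponent α from a ranked frequency list by a least-squares fit in log-log space. The fitting procedure is an assumption chosen for illustration; the paper does not state here how its parameters were estimated. The function names and the synthetic data are hypothetical.

```python
import math


def shifted_power_law(rank, alpha, beta=0.0):
    """Unnormalized f(r) = 1 / (r + beta)^alpha; beta = 0 gives plain Zipf."""
    return 1.0 / (rank + beta) ** alpha


def fit_alpha(frequencies, beta=0.0):
    """Estimate alpha as the negative slope of log f against log(r + beta).

    The frequencies are ranked in descending order first, so rank 1 is the
    most frequent entry. beta is held fixed (an illustrative simplification).
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank + beta) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(frequency) for frequency in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic frequencies that follow Equation (6) exactly.
synthetic = [1000.0 * shifted_power_law(rank, alpha=1.0, beta=2.7) for rank in range(1, 201)]
estimated_alpha = fit_alpha(synthetic, beta=2.7)
print(round(estimated_alpha, 6))
```

On real counts the fit would only be approximate, and the choice of β matters mostly for the highest-ranked entries, where the shift flattens the head of the curve.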
The well-known TF-IDF ranking functions and their derivatives [2, 7] have performed well in linguistics for many years and are still widely used in retrieval systems [20]. However, since we split every expression into its subexpressions, we generated an anomalous bias towards shorter, i.e., less complex, formulae. Hence, distributions of subexpressions may not obey Zipf's law.

Figure 2 visualizes a comparison between Zipf's law and the frequency distributions of mathematical subexpressions in arXiv and zbMATH. The dashed orange line visualizes the power law (6). The plots demonstrate that the distributions in both datasets obey this power law. Interestingly, there is not much difference in the distributions between both datasets. Both distributions seem to follow the same power law, with $\alpha = 1.[…]$ and $\beta = 15.82$. Moreover, we can observe that the developed complexity measure seems to be appropriate, since the complexity distributions for formulae are similar to the distributions for the length of words [15]. In other words, more complex formulae, as well as long words in natural languages, are generally more specialized and thus appear less frequently throughout the corpus. Note that colors of the bins for complexities fluctuate for rare expressions because the color represents the maximum rather than the average complexity in each bin.

Figure 3 shows in detail the most frequently used mathematical expressions in arXiv for the complexities 1 to 5. The orange dashed line visible in all graphs represents the normal Zipf's law distribution from Equation (5). We explore the total frequency values without any normalization. Thus, Equation (5) was multiplied by the highest frequency for each complexity level to fit the distribution. The plots in Figure 3 demonstrate that even though the parameter $\alpha$ varies between 0.35 and 0.62, the distributions in each complexity class also obey Zipf's law.

The plots for each complexity class contain some interesting fluctuations.
We can spot a set of five single identifiers that are most frequently used throughout arXiv: $n$, $i$, $x$, $t$, and $k$. Even though the distributions follow Zipf's law accurately, we can observe that these five identifiers are proportionally more frequently used than other identifiers and clearly separate themselves above the rest (notice the large gap from $k$ to $a$). All of the five identifiers are known to be used in a large variety of scenarios. One might expect that common pairs of identifiers would share comparable frequencies in the plots. Surprisingly, however, typical pairs, such as $x$ and $y$, or $\alpha$ and $\beta$, show a large discrepancy.

The plot of complexity two also reveals that two expressions are proportionally more often used than others: $(x)$ and $(t)$. These two expressions appear more than three times as often in the corpus as any other expression of the same complexity. On the other hand, the quantitative difference between $(x)$ and $(t)$ is negligible. We may assume that arXiv's primary domain, physics, causes the quantitative disparity between $(x)$, $(t)$, and the other tokens. The primary domain of the dataset becomes more clearly visible for higher complexities, such as $SU(2)$ (C3) or $\mathrm{km\,s}^{-1}$ (C4).

Figure 3: Overview of the most frequent mathematical expressions in arXiv for complexities 1-5. The color gradient from yellow to blue represents the frequency in the dataset. Zipf's law (5) is represented by a dashed orange line.

Another surprising property of arXiv is that symmetry groups, such as $SU(2)$, appear to play an essential role in the majority of articles on arXiv, see $SU(2)$ (C3), $SU(2)_L$ (C4), and $SU(2) \times SU(2)$ (C5), among others. The plots of higher complexities, which we do not show here, made this even more noticeable. Given a complexity of six, for example, the most frequently used expression was $SU(2)_L \times SU(2)_R$, and for a complexity of seven it was $SU(3) \times SU(2) \times U(1)$.
Given a complexity of eight, ten out of the top-12 expressions were from symmetry group calculations.

It is also worthwhile to compare expressions among different levels of complexities. For instance, $(x)$ and $(t)$ appeared almost six million times in the corpus, but $f(x)$ (at position three in C3) was the only expression which contained one of these most common expressions. Note that subexpressions of variations, such as ( x ), ( t ), or ( t − t ), do not match the expression of complexity two. This may imply that $(x)$, and especially $(t)$, appear in many different scenarios. Further, we can observe that even though $(x)$ is a part of $f(x)$ in only approximately 3% of all cases, it is still the most likely combination. These results are especially useful for recommendation systems that make use of math as input. Moreover, plagiarism detection systems may also benefit from such a knowledge base. For instance, it might be evident that $f(x)$ is a very common expression, but for automatic systems that work on a large scale, it is not clear whether duplicate occurrences of $f(x)$ or $\Xi(x)$ should be scored differently, e.g., in the case of plagiarism detection.

Figure 3 shows only the most frequently occurring expressions in arXiv. Since we already observed a bias towards physics formulae in arXiv, it is worth comparing the expressions present within both datasets. Figure 4 compares the top-25 expressions for the complexities one to four. In zbMATH, we discovered that computer science and graph theory appeared as popular topics, see for example $G = (V, E)$ (in C3 at position 20) and the Bachmann-Landau notations in O (log n ), O ( n ), and O ( n ) (C4 positions 2, 3, and 19).

From Figure 4, we can also deduce useful information for MathIR tasks which focus on semantic information. Current semantic extraction tools [30] or LaTeX parsers [36] still have difficulties distinguishing multiplications from function calls.
For example, as mentioned before, LaTeXML [45] adds an invisible times character in $f(x)$ rather than a function application. Investigating the most frequently used terms in zbMATH in Figure 4 reveals that $u$ is most likely considered to be a function in the dataset: $u(t)$ (rank 8), $u(x)$ (rank 13), $u_{xx}$ (rank 16), $u(0)$ (rank 17), $|\nabla u|$ (rank 22). Manual investigations of extended lists reveal even more hits: u ( x ) (rank 30), $-\Delta u$ (rank 32), and $u(x, t)$ (rank 33). Since all eight terms are among the most frequent 35 entries in zbMATH, it implies that $u$ can most likely be considered to imply a function in zbMATH. Of course, this does not imply that $u$ must always be a function in zbMATH (see $f(u)$ on rank 14 in C3), but this allows us to exploit probabilities for improving MathIR performance. For instance, if not stated otherwise, $u$ could be interpreted as a function by default, which could help increase the precision of the aforementioned tools.

(We refer to a given complexity $n$ with C$n$, i.e., C3 refers to complexity 3. More plots showing higher complexities are available at https://github.com/ag-gipp/FormulaCloudData.)

Table 2: Top $s(t, D)$ scores, where $D$ is the set of all zbMATH documents with a minimum document frequency of 200, maximum document frequency of 500k, and a minimum complexity of 3.

Figure 4 also demonstrates that our two datasets diverge for increasing complexities. Hence, we can assume that frequencies of less complex formulae are more topic-independent.
Conversely, the more complex a math formula is, the more context-specific it is. In the following, we will further investigate this assumption by applying TF-IDF rankings on the distributions.

Zipf's law encourages the idea of scoring the relevance of words according to their number of occurrences in the corpus and in the documents. The family of BM25 ranking functions based on TF-IDF scores is still widely used in several retrieval systems [7, 20]. Since we demonstrated that mathematical formulae (and their subexpressions) obey Zipf's law in large scientific corpora, it appears intuitive to also use TF-IDF rankings, such as a variant of BM25, to calculate their relevance. In its original form [7], Okapi BM25 was calculated as follows

\[ \mathrm{bm25}(t, d) := \frac{(k + 1)\,\mathrm{IDF}(t)\,\mathrm{TF}(t, d)}{\mathrm{TF}(t, d) + k\left(1 - b + \frac{b\,|d|}{\mathrm{AVG}_{\mathrm{DL}}}\right)}, \tag{7} \]

where TF(t, d) is the term frequency of t in the document d, |d| the length of the document d (in our case, the number of subexpressions), $\mathrm{AVG}_{\mathrm{DL}}$ the average length of the documents in the corpus (see Table 1), and IDF(t) is the inverse document frequency of t, defined as

\[ \mathrm{IDF}(t) := \log \frac{N - n(t) + \frac{1}{2}}{n(t) + \frac{1}{2}}, \tag{8} \]

where N is the number of documents in the corpus and n(t) the number of documents which contain the term t. By adding 1/2, we avoid log 0 and division by 0. The parameters k and b are free, with b controlling the influence of the normalized document length and k controlling the influence of the term frequency on the final score. For our experiments, we chose the standard value k = 1.[…] and b = 0.[…] […] $P_n^{(\alpha,\beta)}(x)$, i.e., the document had a length of one, would generate eight subexpressions, i.e., it results in a document length of eight. Thus, we modify the BM25 score in Equation (7) to emphasize higher complexities and longer documents.
First, the average document length is divided by the average complexity $\mathrm{AVG}_{C}$ in the corpus that is used (see Table 1), and we calculate the reciprocal of the document length normalization to emphasize longer documents.

Moreover, in the scope of a single document, we want to emphasize expressions that do not appear frequently in this document, but are the most frequent among their level of complexity. Thus, less complex expressions are ranked more highly if the document overall is not very complex. To achieve this weighting, we normalize the term frequency of an expression t according to its complexity c(t) and introduce an inverse term frequency according to all expressions in the document

\[ \mathrm{ITF}(t, d) := \log \frac{|d| - \mathrm{TF}(t, d) + \frac{1}{2}}{\mathrm{TF}(t, d) + \frac{1}{2}}. \tag{9} \]

Finally, we define the score s(t, d) of a term t in a document d as

\[ s(t, d) := \frac{(k + 1)\,\mathrm{IDF}(t)\,\mathrm{ITF}(t, d)\,\mathrm{TF}(t, d)}{\max_{t' \in d|_{c(t)}} \mathrm{TF}(t', d) + k\left(1 - b + \frac{b\,\mathrm{AVG}_{\mathrm{DL}}}{|d|\,\mathrm{AVG}_{C}}\right)}. \tag{10} \]

The TF-IDF ranking functions and the introduced s(t, d) are used to retrieve relevant documents for a given search query. However, we want to retrieve relevant subexpressions over a set of documents. Thus, we define the score of a formula (mBM25) over a set of documents as the maximum score over all documents

\[ \mathrm{mBM25}(t, D) := \max_{d \in D} s(t, d), \tag{11} \]

where D is a set of documents. We used Apache Flink [38] to count the expressions and process the calculations. Thus, our implemented system scales well for large corpora.

Figure 5: Top-20 ranked expressions retrieved from a topic-specific subset of documents $D_q$. The search query q is given above the plots. Retrieved formulae are annotated by a domain expert with green dots for relevant and red dots for non-relevant hits. A line is drawn if a hit appears in both result sets. The line is colored in green when the hit was marked as relevant.

Table 2 shows the top-7 scored expressions, where D is the entire zbMATH dataset.
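To make the scoring concrete, the following Python sketch implements one reading of Equations (8) to (11) on a toy corpus. Several things here are assumptions rather than the authors' implementation (which runs on Apache Flink): the smoothing constant 0.5 in IDF and ITF, the parameter values k = 1.2 and b = 0.75 (common BM25 defaults, not necessarily the values used in the experiments), the occurrence-weighted average complexity, all function and variable names, and the toy documents, each represented as a Counter of subexpression strings.

```python
import math
from collections import Counter


def idf(n_docs, doc_freq):
    """Inverse document frequency, Equation (8), with 0.5 smoothing (assumed)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def itf(doc_len, tf):
    """Inverse term frequency inside one document, Equation (9)."""
    return math.log((doc_len - tf + 0.5) / (tf + 0.5))


def corpus_stats(docs, complexity):
    """Collect N, n(t), AVG_DL and AVG_C from a list of Counters (hypothetical helper)."""
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # each document counts once per expression
    total_terms = sum(sum(doc.values()) for doc in docs)
    total_complexity = sum(complexity[t] * f for doc in docs for t, f in doc.items())
    return {
        "n_docs": len(docs),
        "df": df,
        "avg_dl": total_terms / len(docs),
        "avg_c": total_complexity / total_terms,  # assumption: weighted by occurrences
        "complexity": complexity,
    }


def s(term, doc, stats, k=1.2, b=0.75):
    """Score of `term` in `doc`, following Equation (10)."""
    tf = doc[term]
    doc_len = sum(doc.values())  # |d|: number of subexpressions in the document
    c = stats["complexity"][term]
    # Highest term frequency among expressions of the same complexity in this document.
    max_tf = max(f for t, f in doc.items() if stats["complexity"][t] == c)
    length_norm = 1 - b + b * stats["avg_dl"] / (doc_len * stats["avg_c"])
    numerator = (k + 1) * idf(stats["n_docs"], stats["df"][term]) * itf(doc_len, tf) * tf
    return numerator / (max_tf + k * length_norm)


def mbm25(term, docs, stats, **params):
    """Maximum score of `term` over all documents that contain it, Equation (11)."""
    return max(s(term, d, stats, **params) for d in docs if term in d)


# Toy data: complexities and documents are invented for illustration only.
complexity = {"x": 1, "s": 1, "(x)": 2, "f(x)": 3, "zeta(s)": 3}
docs = [
    Counter({"x": 5, "(x)": 4, "f(x)": 3}),
    Counter({"s": 4, "zeta(s)": 2, "x": 1}),
    Counter({"x": 2, "(x)": 2, "f(x)": 1}),
    Counter({"s": 1, "x": 3}),
]
stats = corpus_stats(docs, complexity)

ranking = sorted(complexity, key=lambda t: mbm25(t, docs, stats), reverse=True)
for term in ranking:
    print(f"{term:10s} {mbm25(term, docs, stats):8.3f}")
```

Note that, as with plain BM25, the IDF factor becomes negative for expressions that occur in more than half of the documents, so very common expressions such as x are pushed to the bottom of the ranking in this sketch.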
The retrieved expressions can be considered as meaningful and real-world examples of MOIs, since most expressions are known for specific mathematical concepts, such as $\mathrm{Gal}(\overline{\mathbb{Q}}/\mathbb{Q})$, which refers to the Galois group of $\overline{\mathbb{Q}}$ over $\mathbb{Q}$, or L(R), which refers to the L-space (also known as Lebesgue space) over R. However, a more topic-specific retrieval algorithm is desirable. To achieve this goal, we (i) retrieved a topic-specific subset of documents $D_q \subset D$ for a given textual search query q, and (ii) calculated the scores of all expressions in the retrieved documents. To generate $D_q$, we indexed the text sources of the documents from arXiv and zbMATH via elasticsearch (ES) and performed the pre-processing steps: filtering stop words, stemming, and ASCII-folding. Table 3 summarizes the settings we used to retrieve MOIs from a topic-specific subset of documents $D_q$. We also set a minimum hit frequency according to the number of retrieved documents an expression appears in. This requirement filters out uncommon notations.

                  arXiv   zbMATH
Retrieved Doc.    40      200
Min. Hit Freq.    7       7
Min. DF           50      10
Max. DF           10k     10k

Table 3: Settings for the retrieval experiments.

Figure 5 shows the results for five search queries. We asked a domain expert from the National Institute of Standards and Technology (NIST) to annotate the results as related (shown as green dots in Figure 5) or non-related (red dots). We found that the results range from good performances (e.g., for the Riemann zeta function) to bad performances (e.g., beta function). For instance, the results for the Riemann zeta function are surprisingly accurate, since we could discover that parts of Riemann's hypothesis were ranked highly throughout the results (e.g., $\zeta(\frac{1}{2} + it)$). On the other hand, for the beta function, we retrieved only a few related hits, of which only one had a strong connection to the beta function B(x, y). We observed that the results were quite sensitive to the chosen settings (see Table 3).
For instance, for the beta function, the minimum hit frequency has a strong effect on the results, since many expressions are shared among multiple documents. For arXiv, the expressions B(α, β) and B(x, y) only appear in one document of the retrieved 40. However, decreasing the minimum hit frequency would increase noise in the results.

Footnotes:
elasticsearch: https://github.com/elastic/elasticsearch [Accessed Sep. 2019]. We used version 7.0.0.
ASCII-folding: This means that non-ASCII characters are replaced by their ASCII counterparts or will be ignored if no such counterpart exists.
Riemann hypothesis: Riemann proposed that the real part of every non-trivial zero of the Riemann zeta function is 1/2. If this hypothesis is correct, all the non-trivial zeros lie on the critical line consisting of the complex numbers 1/2 + it.

Table 4: The top-5 frequent mathematical expressions in the result set of zbMATH for the search queries 'Riemann Zeta Function' (top) and 'Eigenvalue' (bottom) grouped by their complexities (left) and the hits reordered according to their relevance scores (right). The TF-IDF score was calculated with normalized term frequencies.

Auto-completion for 'E = m'
Expression                          TF     DF
$E = mc^2$                          558    376
$E = m \cosh\theta$                 23     23
E = mv[…]                           […]    […]
$E = m/\sqrt{1 - \dot{q}^2}$        12     6
$E = m/\sqrt{1 - \beta^2}$          10     6
$E = mc^2\gamma$                    […]    […]

Suggestions for 'E = {m, c}'
Expression                          TF     DF
$E = mc^2$                          558    376
$E = \gamma mc^2$                   39     38
$E = \gamma m_e c^2$                41     36
$E = m \cosh\theta$                 23     23
$E = -mc^2$                         35     17
$E = \sqrt{m^2c^4 + p^2c^2}$        10     8

Table 5: Suggestions to complete 'E = m' and 'E = {m, c}' (the right-hand side contains m and c) with term and document frequency based on the distributions of formulae in arXiv.

Even though we asked a domain expert to annotate the results as relevant or not, there is still plenty of room for discussion. For instance, (x + y) (rank 15 in zbMATH, 'Beta Function') is the argument of the gamma function Γ(x + y) that appears in the definition of the beta function [46, (5.12.1)] B(x, y) := Γ(x)Γ(y)/Γ(x + y). However, this relation is weak at best, and thus might be considered as not related. Other examples are Re z and Re(s), which play a crucial role in the scenario of the Riemann hypothesis (all non-trivial zeros have Re(s) = 1/2). Again, this connection is not obvious, and these expressions are often used in multiple scenarios. Thus, the domain expert did not mark the expressions as being related.

Considering the differences in the documents, it is promising to have observed a relatively high number of shared hits in the results. Further, we were able to retrieve some surprisingly good insights from the results, such as extracting the full definition of the Riemann zeta function [46, (25.2.1)] $\zeta(s) := \sum_{n=1}^{\infty} \frac{1}{n^s}$. Even though a high number of shared hits seems to substantiate the reliability of the system, there were several aspects that affected the outcome negatively, from the exact definition of the search queries to retrieve documents via ES, to the number of retrieved documents, the minimum hit frequency, and the parameters in mBM25.
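The interplay of the settings in Table 3 can be illustrated with a short Python sketch. It assumes that the retrieved documents $D_q$ are already available as Counters of expression strings and that corpus-wide document frequencies are known. The function name retrieve_mois, the toy data, and the stand-in scoring function (maximum term frequency, used here in place of mBM25) are hypothetical and only serve illustration; they are not the authors' implementation.

```python
from collections import Counter


def retrieve_mois(retrieved_docs, global_df, score,
                  min_hit_freq=7, min_df=10, max_df=10_000, top_k=20):
    """Rank expressions from a topic-specific document subset.

    retrieved_docs: list of Counters (expression -> term frequency), i.e. D_q
    global_df:      dict mapping expression -> document frequency in the whole corpus
    score:          callable(expression, retrieved_docs) -> float, e.g. mBM25
    """
    # Hit frequency: in how many of the retrieved documents an expression occurs.
    hit_freq = Counter()
    for doc in retrieved_docs:
        hit_freq.update(doc.keys())

    candidates = [
        t for t, hits in hit_freq.items()
        if hits >= min_hit_freq                        # drop uncommon notations
        and max_df >= global_df.get(t, 0) >= min_df    # drop too rare / too generic terms
    ]
    ranked = sorted(((t, score(t, retrieved_docs)) for t in candidates),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def max_tf(term, docs):
    """Stand-in score for illustration only: highest term frequency in any document."""
    return max(doc[term] for doc in docs)


# Toy result set for a query; all counts are invented.
docs_q = [
    Counter({"zeta(s)": 3, "B(x,y)": 1, "x": 9}),
    Counter({"zeta(s)": 1, "x": 4}),
    Counter({"zeta(s)": 2, "s": 1, "x": 2}),
]
global_df = {"zeta(s)": 120, "B(x,y)": 80, "x": 900_000, "s": 40_000}

strict = retrieve_mois(docs_q, global_df, max_tf, min_hit_freq=2)
loose = retrieve_mois(docs_q, global_df, max_tf, min_hit_freq=1)
print(strict)
print(loose)
```

With the stricter minimum hit frequency, B(x,y) is dropped because it occurs in only one retrieved document, which mirrors the beta function case discussed above; lowering the threshold admits it again, along with any other rarely shared expression.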
The presented results are beneficial for a variety of use-cases. In the following, we will demonstrate and discuss several of the applications that we propose.

Extension of zbMATH's Search Engine: Formula search engines are often counterintuitive when compared to textual search, since the user must know how the system operates to enter a search query properly (e.g., does the system support LaTeX inputs?). Additionally, mathematical concepts can be difficult to capture using only mathematical expressions. Consider, for example, someone who wants to search for mathematical expressions that are related to eigenvalues. A textual search query would only retrieve entire documents that require further investigation to find related expressions. A mathematical search engine, on the other hand, is impractical since it is not clear what would be a fitting search query (e.g., Av = λv?). Moreover, formula and textual search systems for scientific corpora are separated from each other. Thus, a textual search engine capable of retrieving mathematical formulae can be beneficial. Also, many search engines allow for narrowing down relevant hits by suggesting filters based on the retrieved results. This technique is known as faceted search. The zbMATH search engine also provides faceted search, e.g., by authors, or year. Adding facets for mathematical expressions allows users to narrow down the results more precisely to arrive at specific documents.

Our proposed system for extracting relevant expressions from scientific corpora via mBM25 scores can be used to search for formulae even with textual search queries, and to add more filters for faceted search implementations. Table 4 shows two examples of such an extension for zbMATH's search engine. Searching for 'Riemann Zeta Function' and 'Eigenvalue' retrieved 4,739 and 25,248 documents from zbMATH, respectively. Table 4 shows the most frequently used mathematical expressions in the set of retrieved documents.
It also shows the reordered formulae according to a default TF-IDF score (with normalized term frequencies) and our proposed mBM25 score. The results can be used to add filters for faceted search, e.g., show only the documents which contain $u \in W^{[\ldots],p}(\Omega)$. Additionally, the search system now provides more intuitive textual inputs even for retrieving mathematical formulae. The retrieved formulae are also interesting by themselves, since they provide insightful information on the retrieved publications. As already explored with our custom document search system in Figure 5, the Riemann hypothesis is also prominent in these retrieved documents.

The differences between the TF-IDF and mBM25 rankings illustrate the problem of an extensive evaluation of our system. From a broader perspective, the hit Ax = λBx is highly correlated with the input query 'Eigenvalue'. On the other hand, the raw frequencies revealed a prominent role of $\mathrm{div}(|\nabla u|^{p-2} \nabla u)$. Therefore, the top results of the mBM25 ranking can also be considered as relevant.

Math Notation Analysis: A faceted search system allows us to analyze mathematical notations in more detail. For instance, we can retrieve documents from a specific time period. This allows one to study the evolution of mathematical notation over time [1], or to identify trends in specific fields. Also, we can analyze standard notations for specific authors since it is often assumed that authors prefer a specific notation style which may vary from the standard notation in a field.

Math Recommendation Systems: The frequency distributions of formulae can be used to realize effective math recommendation tasks, such as type hinting or error-corrections. Existing approaches require long training on large datasets, but may still generate meaningless results, such as $G_i = \{(x, y) \in \mathbb{R}^n : x_i = x_i\}$ [42]. We propose a simpler system which takes advantage of our frequency distributions.
We retrieve entries from our result database, which contains all unique expressions and their frequencies. We implemented a simple prototype that retrieves the entries via pattern matching. Table 5 shows two examples. The left side of the table shows suggested autocompleted expressions for the query 'E = m'. The right side shows suggestions for 'E =', where the right-hand side of the equation should contain m and c in any order. A combination using more advanced retrieval techniques, such as similarity measures based on symbol layout trees [24, 25], would enlarge the number of suggestions. This kind of autocomplete and error-correction type-hinting system would be beneficial for various use-cases, e.g., in educational software or for search engines as a pre-processing step of the input.

Figure 6: The top ranked expression for 'Jacobi polynomial' in arXiv and zbMATH. For arXiv, 30 documents were retrieved with a minimum hit frequency of 7.

Plagiarism Detection Systems: As previously mentioned, plagiarism detection systems [29, 39, 41] would benefit from a system capable of distinguishing conventional from uncommon notations. The approaches described by Meuschke et al. [39] outperform existing approaches by considering frequency distributions of single identifiers (expressions of complexity one). Considering that single identifiers make up only 0.03% of all unique expressions in arXiv, we presume that better performance can be achieved by considering more complex expressions. The conferred string representation also provides a simple format to embed complex expressions in existing learning algorithms.

Expressions with high complexities that are shared among multiple documents may provide further hints to investigate potential plagiarisms. For instance, the most complex expression that was shared among three documents in arXiv was Equation (3). A complex expression being identical in multiple documents could indicate a higher likelihood of plagiarism.
Further investigation revealed that similar expressions, e.g., with infinite sums, are frequently used among a larger set of documents. Thus, the expression seems to be a part of a standard notation that is commonly shared, rather than a good candidate for plagiarism detection. Through manual investigation, we could identify the equation as part of a concept called the generalized Hardy-Littlewood inequality; Equation (3) appears in the three documents [12, 18, 17]. All three documents shared one author in common. Thus, this case also demonstrates a correlation between complex mathematical notations and authorship.

Semantic Taggers and Extraction Systems: We previously mentioned that semantic extraction systems [23, 28, 30] and semantic math taggers [16, 31] have difficulties in extracting the essential components (MOIs) from complex expressions. Considering the definition of the Jacobi polynomial in Equation (1), it would be beneficial to extract the groups of tokens that belong together, such as $P_n^{(\alpha,\beta)}(x)$ or Γ(α + m + 1). With our proposed search engine for retrieving MOIs, we are able to facilitate semantic extraction systems and semantic math taggers. Imagine such a system being capable of identifying the term 'Jacobi polynomial' from the textual context. Figure 6 shows the top relevant hits for the search query 'Jacobi polynomial' retrieved from zbMATH and arXiv. The results contain several relevant and related expressions, such as the constraints α, β > −1 and the weight function $(1 - x)^{\alpha}(1 + x)^{\beta}$, which are essential properties of this orthogonal polynomial. Based on these retrieved MOIs, extraction systems can adjust their retrieved math elements to improve precision, and semantic taggers or a tokenizer could re-organize parse trees to more closely resemble expression trees.

In this study we showed that analyzing the frequency distributions of mathematical expressions in large scientific datasets can provide useful insights for a variety of applications.
We demonstrated the versatility of our results by implementing prototypes of a type-hinting system for math recommendations, an extension of zbMATH's search engine, and a mathematical retrieval system to search for topic-specific MOIs. Additionally, we discussed the potential impact and suitability in other applications, such as math search engines, plagiarism detection systems, and semantic extraction approaches. We are confident that this project lays a foundation for future research in the field of MathIR.

We plan on developing a web application which would provide easy access to our frequency distributions, the MOI search engine, and the type-hinting recommendation system. We hope that this will further expedite related future research projects. Moreover, we will use this web application for an online evaluation of our MOI retrieval system. Since the level of agreement among annotators will be predictably low, an evaluation by a large community is desired.

In this first study, we preserved the core structure of the MathML data which provided insightful information for the MathML community. However, this makes it difficult to properly merge formulae. In future studies, we will normalize the MathML data via MathMLCan [9]. In addition to this normalization, we will include wildcards for investigating distributions of formula patterns rather than exact expressions. This will allow us to study connections between math objects, e.g., between Γ(z) and Γ(x + 1). This would further improve our recommendation system and would allow for the identification of regions for parameters and variables in complex expressions.

Acknowledgments: Discovering Mathematical Objects of Interest was supported by the German Research Foundation (DFG grant GI-1259-1).

References

[1] Florian Cajori. A History of Mathematical Notations. Vol. 1 & 2. London, UK: The Open Court Company, 1929.
[2] Akiko N. Aizawa. An information-theoretic perspective of tf-idf measures. In: Inf. Process. Manage. (2003), pp. 45–65.
[3] Glenn Gordon Smith and David Ferguson. Diagrams and math notation in e-learning: growing pains of a new generation. In: International Journal of Mathematical Education in Science and Technology (5 2004), pp. 681–695.
[4] Ashish Lohia, Kirti Sinha, Soujanya Vadapalli, and Kamalakar Karlapalem. An Architecture for Searching and Indexing Latex Equations in Scientific Literature. In: Proc. COMAD. Goa, India: Computer Society of India, 2005, pp. 122–130.
[5] Alex Gaudeul. Do Open Source Developers Respond to Competition?: The LaTeX Case Study. In: Review of Network Economics (2 June 2007), pp. 239–263.
[6] Christian Grün, Sebastian Gath, Alexander Holupirek, and Marc Scholl. XQuery Full Text Implementation in BaseX. In: Database and XML Technologies. Springer Berlin, 2009, pp. 114–128.
[7] Stephen E. Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. In: Foundations and Trends in Information Retrieval (2009), pp. 333–389.
[8] Shahab Kamali and Frank Wm. Tompa. A new mathematics retrieval system. In: Proc. ACM CIKM. Toronto, Ontario, Canada: ACM, 2010, pp. 1413–1416.
[9] David Formánek, Martin Líška, Michal Růžička, and Petr Sojka. Normalization of Digital Mathematics Library Content. In: Proc. of OpenMath/MathUI/CICM-WiP. CEUR Workshop Proceedings. Bremen, Germany, 2012, pp. 91–103.
[10] Michael Kohlhase, Bogdan A. Matican, and Corneliu-Claudiu Prodescu. MathWebSearch 0.5: Scaling an Open Formula Search Engine. In: Intelligent Computer Mathematics - 11th International Conference, AISC 2012, 19th Symposium, Calculemus 2012, 5th International Workshop, DML 2012, 11th International Conference, MKM 2012, Systems and Projects, Held as Part of CICM 2012, Bremen, Germany, July 8-13, 2012. Proceedings. Bremen, Germany: Springer Berlin Heidelberg, 2012, pp. 342–357.
[11] Shahab Kamali and Frank Wm. Tompa. Retrieving documents with mathematical content.
In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, Dublin, Ireland - July 28 - August 01, 2013. Dublin, Ireland: ACM, 2013, pp. 353–362.
[12] Gustavo Araujo and Daniel Pellegrino. On the constants of the Bohnenblust-Hille inequality and Hardy–Littlewood inequalities. In: CoRR (2014).
[13] Giovanni Yoko Kristianto, Goran Topic, Florence Ho, and Akiko Aizawa. The MCAT Math Retrieval System for NTCIR-11 Math Track. In: Proc. 11th NTCIR Conference on Evaluation of Information Access Technologies, National Center of Sciences, Tokyo, Japan: National Institute of Informatics (NII), 2014.
[14] Aldo Lipani, Linda Andersson, Florina Piroi, Mihai Lupu, and Allan Hanbury. TUW-IMP at the NTCIR-11 Math-2. In: Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-11, National Center of Sciences, Tokyo, Japan, December 9-12, 2014. Tokyo, Japan: National Institute of Informatics (NII), 2014.
[15] Steven T. Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions. In: Psychonomic Bulletin & Review (Mar. 2014), pp. 1112–1130.
[16] Pao-Yu Chien and Pu-Jen Cheng. Semantic Tagging of Mathematical Expressions. In: Proc. WWW'2015. Florence, Italy: ACM, 2015, pp. 195–204.
[17] Daniel Pellegrino. A short communication on the constants of the multilinear Hardy–Littlewood inequality. In: CoRR (2015).
[18] Jamilson R. Campos, Wasthenny Cavalcante, Vinícius V. Fávaro, Daniel Nuñez-Alarcón, Daniel Pellegrino, and Diana M. Serrano-Rodríguez. Polynomial and multilinear Hardy–Littlewood inequalities: analytical and numerical approaches. In: CoRR (2015).
[19] Leonard Wörteler, Michael Grossniklaus, Christian Grün, and Marc Scholl. Function inlining in XQuery 3.0 optimization. In: Proc. 15th DBLP. Pittsburgh, PA, USA: ACM, 2015, pp. 45–48.
[20] Jöran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. Research-paper recommender systems: a literature survey. In: Int. J. on Digital Libraries (2016), pp. 305–338.
[21] Ferruccio Guidi and Claudio Sacerdoti Coen. A Survey on Retrieval of Mathematical Knowledge. In: Mathematics in Computer Science (2016), pp. 409–427.
[22] Shunsuke Ohashi, Giovanni Yoko Kristianto, Goran Topic, and Akiko Aizawa. Efficient Algorithm for Math Formula Semantic Search. In: IEICE Transactions (2016), pp. 979–988.
[23] Moritz Schubotz, Alexey Grigorev, Marcus Leich, Howard S. Cohl, Norman Meuschke, Bela Gipp, Abdou S. Youssef, and Volker Markl. Semantification of Identifiers in Mathematics for Better Math Information Retrieval. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Full Paper. Pisa, Italy: ACM, 2016, pp. 135–144. doi: 10.1145/2911451.2911503.
[24] Richard Zanibbi, Kenny Davila, Andrew Kane, and Frank Wm. Tompa. Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, 2016, pp. 145–154.
[25] Kenny Davila and Richard Zanibbi. Layout and Semantics: Combining Representations for Mathematical Formula Search. In: Proc. ACM SIGIR. Shinjuku, Tokyo: ACM, 2017, pp. 1165–1168.
[26] Liangcai Gao, Zhuoren Jiang, Yue Yin, Ke Yuan, Zuoyu Yan, and Zhi Tang. Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language? In: CoRR (2017).
[27] Andrea Kohlhase, Michael Kohlhase, and Michael Fürsich. Visual Structure in Mathematical Expressions. In: Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings. Lecture Notes in Computer Science.
Edinburgh, UK: Springer, 2017, pp. 208–223.
[28] Giovanni Yoko Kristianto, Goran Topic, and Akiko Aizawa. Utilizing dependency relationships between math expressions in math IR. In: Information Retrieval Journal (2017), pp. 132–167. doi: 10.1007/s10791-017-9296-8.
[29] Norman Meuschke, Moritz Schubotz, Felix Hamborg, Tomás Skopal, and Bela Gipp. Analyzing Mathematical Content to Detect Academic Plagiarism. In: Proc. ACM CIKM. Singapore: ACM, 2017, pp. 2211–2214.
[30] Moritz Schubotz, Leonard Krämer, Norman Meuschke, Felix Hamborg, and Bela Gipp. Evaluating and Improving the Extraction of Mathematical Identifier Definitions. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings. Lecture Notes in Computer Science. Springer, 2017, pp. 82–94.
[31] Abdou Youssef. Part-of-Math Tagging and Applications. In: Intelligent Computer Mathematics. Cham: Springer International Publishing, 2017, pp. 356–374.
[32] Deyan Ginev. arXMLiv:08.2018 dataset, an HTML5 conversion of arXiv.org. SIGMathLing – Special Interest Group on Math Linguistics. 2018. url: https://sigmathling.kwarc.info/resources/arxmliv/.
[33] Andrea Kohlhase. Factors for Reading Mathematical Expressions. In: Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018. CEUR Workshop Proceedings. Mannheim, Germany: CEUR-WS.org, 2018, pp. 195–202.
[34] Andrea Kohlhase, Michael Kohlhase, and Taweechai Ouypornkochagorn. Discourse Phenomena in Mathematical Documents. In: Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings. Lecture Notes in Computer Science. Hagenberg, Austria: Springer, 2018, pp. 147–163.
[35] Kriste Krstovski and David M. Blei. Equation Embeddings. In: CoRR (2018).
[36] Moritz Schubotz, André Greiner-Petter, Philipp Scharpf, Norman Meuschke, Howard S. Cohl, and Bela Gipp. Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018. Fort Worth, USA: ACM, 2018, pp. 233–242.
[37] André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William I. Grosky, and Bela Gipp. Why Machines Cannot Learn Mathematics, Yet. In: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019. Paris, France: CEUR-WS.org, 2019, pp. 130–137.
[38] Fabian Hueske and Timo Walther. Apache Flink. In: Encyclopedia of Big Data Technologies. Springer, 2019.
[39] Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, and Bela Gipp. Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Urbana-Champaign, USA, June 2019, pp. 120–129.
[40] Moritz Schubotz and Olaf Teschke. Four decades of TeX at zbMATH. In: Newsletter of the European Mathematical Society (EMS) (2019), pp. 50–52.
[41] Moritz Schubotz, Olaf Teschke, Vincent Stange, Norman Meuschke, and Bela Gipp. Forms of Plagiarism in Digital Mathematical Libraries. In: Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings. Lecture Notes in Computer Science. Prague, Czech Republic: Springer, 2019, pp. 258–274.
[42] Michihiro Yasunaga and John Lafferty. TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts.
In: CoRR (2019).
[43] Abdou Youssef and Bruce R. Miller. Explorations into the Use of Word Embedding in Math Search and Math Semantics. In: Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings. Lecture Notes in Computer Science. Prague, Czech Republic: Springer, 2019, pp. 291–305.
[44] André Greiner-Petter, Moritz Schubotz, Fabian Müller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp. Discovering Mathematical Objects of Interest - A Study of Mathematical Notations. In: Proceedings of The Web Conference 2020 (WWW'20), April 20–24, 2020, Taipei, Taiwan. doi: 10.1145/3366423.3380218.
[45] Bruce R. Miller. LaTeXML: A LaTeX to XML/HTML/MathML Converter. http://dlmf.nist.gov/LaTeXML/. Accessed: 2019-09-01.
[46] NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.0.25 of 2019-12-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds.

Listing 2: Use the following BibTeX code to cite this article

@inproceedings{GreinerPetter2020,
  author    = {Greiner-Petter, Andr{\'{e}} and Schubotz, Moritz and M\"{u}ller, Fabian and Breitinger, Corinna and Cohl, Howard S. and Aizawa, Akiko and Gipp, Bela},
  booktitle = {Proceedings of The Web Conference 2020 (WWW'20), April 20--24, 2020, Taipei, Taiwan},
  doi       = {10.1145/3366423.3380218},
  title     = {Discovering Mathematical Objects of Interest - A Study of Mathematical Notations},
}
Dalmaijer Analysing the Requirements for an Open Research Knowledge Graph: Use Cases, Quality Requirements and Construction Strategies by Arthur Brack « 1234 » Submitted on 7 Feb 2020 (v1), last revised 22 Jun 2021 (this version, v3) Updated arXiv.org Original Source NASA ADS Google Scholar Semantic Scholar Decentralizing Knowledge (function(){if (!document.body) return;var js = "window['__CF$cv$params']={r:'881fb84d5d3c0ccc',t:'MTcxNTQwNDgyOS4xOTAwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})(); function toggleFullscreen(elem) {
elem = elem || document.documentElement;
if (!document.fullscreenElement && !document.mozFullScreenElement &&
!document.webkitFullscreenElement && !document.msFullscreenElement) {
if (elem.requestFullscreen) {
elem.requestFullscreen();
} else if (elem.msRequestFullscreen) {
elem.msRequestFullscreen();
} else if (elem.mozRequestFullScreen) {
elem.mozRequestFullScreen();
} else if (elem.webkitRequestFullscreen) {
elem.webkitRequestFullscreen(Element.ALLOW_KEYBOARD_INPUT);
}
} else {
if (document.exitFullscreen) {
document.exitFullscreen();
} else if (document.msExitFullscreen) {
document.msExitFullscreen();
} else if (document.mozCancelFullScreen) {
document.mozCancelFullScreen();
} else if (document.webkitExitFullscreen) {
document.webkitExitFullscreen();
}
}
}

document.getElementById('btnFullscreen').addEventListener('click', function() {
toggleFullscreen();
}); (function(){
var bp = document.createElement('script');
var curProtocol = window.location.protocol.split(':')[0];
if (curProtocol === 'https'){
bp.src = 'https://zz.bdstatic.com/linksubmit/push.js';
}
else{
bp.src = 'http://push.zhanzhang.baidu.com/push.js';
}
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(bp, s);
})();$