Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language
Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, Bela Gipp
Philipp Scharpf
University of Konstanz, [email protected]
Moritz Schubotz
University of Wuppertal and FIZ Karlsruhe, [email protected]
Abdou Youssef
George Washington University, United [email protected]
Felix Hamborg
University of Konstanz, [email protected]
Norman Meuschke
University of Wuppertal & University of Konstanz, [email protected]
Bela Gipp
University of Wuppertal & University of Konstanz, [email protected]
ABSTRACT
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
CCS CONCEPTS
• Information systems → Information retrieval.
KEYWORDS
Information Retrieval, Mathematical Information Retrieval, Machine Learning, Document Classification and Clustering
ACM Reference Format:
Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, and Bela Gipp. 2020. Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language. In Proceedings of ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3383583.3398529
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
JCDL ’20, August 1–5, 2020, Virtual Event, China. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7585-6/20/08...$15.00 https://doi.org/10.1145/3383583.3398529
The computational analysis of documents (e.g., for Recommender systems) from Science, Technology, Engineering and Mathematics (STEM) is particularly challenging since it involves both Natural Language Processing (NLP) and Mathematical Language Processing (MLP) to simultaneously investigate text and formulae. While NLP already relies heavily on Machine Learning (ML) techniques, their use in MLP is still being explored. In this paper, we show how methods of NLP and MLP can be combined to enable the use of ML in Information Retrieval (IR) applications on documents with mathematical content.
Machine Learning (ML) has been evolving since Alan Turing’s proposal of a Learning Machine [32]. It has decisively promoted fields such as Computer Vision, Speech Recognition, Natural Language Processing, and Information Retrieval, with a vast number of applications, e.g., in Medical Diagnosis, Financial Market Analysis, Fraud Detection, Recommender Systems, Object Recognition, and Machine Translation.
Natural Language Processing (NLP) is an interdisciplinary field involving both computer science and linguistics to develop methods that enable computers to process and analyze natural language data [17]. Originally evolving from automatic translation (Georgetown experiment [7]) and chatbots (ELIZA [34]), the discipline has made fast advancements in Part-of-speech (POS) tagging, Named Entity Recognition (NER), and Relationship extraction [35]. Recently, NLP has especially been enriched by enhanced Deep Learning capabilities [36].
The term Mathematical Language Processing (MLP) was coined and introduced by Pagel and Schubotz [19] for a discipline that is concerned with analyzing mathematical formulae, analogous to how NLP deals with natural language sentences. This comprises the semantic enrichment of mathematical formulae and their constituents to automatically infer their meaning from the context [29] (surrounding text, mathematical topic or discipline, etc.).
All these research fields and disciplines are joint contributors in the analysis of documents with both natural and mathematical language (containing text and formulae). This paper illustrates the synergy of NLP and MLP in ML applications. We employed a set of 4900 documents, 3500 sections, and 1400 abstracts from the arXiv preprint server (arxiv.org) that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings (doc2vec, tf-idf) of text and formulae. We evaluated the performance and runtimes of selected classification and clustering algorithms, observing classification accuracies up to 82.8% and cluster purities up to 69.4% for a fixed number of clusters and 99.9% for an unspecified number of clusters.
Since Machine Learning methods require a formal (abstract mathematical) representation, natural language has to be converted into word vectors. This is typically done via Word2Vec [15], trained with the Continuous Bag-of-Words (CBOW) or Skip-Gram (SG) architecture, or by encoding the term frequency (tf) or the term frequency-inverse document frequency (tf-idf). Whole documents or document sections can be represented, i.a., by Doc2Vec [9] features, which are learned using (Deep) Neural Networks [36]. Vectors of words or documents generated by Word2Vec or Doc2Vec were observed to be semantically close with respect to algebraic distance metrics and can be used to reconstruct linguistic contexts of words [14]. This enables comparisons of the semantic content of documents, e.g., for Recommender systems or text-topic classification.
The Mathematical Language Processing (MLP) project [19] was introduced as an attempt to disambiguate identifiers occurring in mathematical formulae. Retrieving the natural language definition of the identifiers using Part-Of-Speech (POS) tags combined with numerical statistics for the candidate ranking yielded high accuracies of around 90%.
The Part-Of-Math (POM) tagging project [37] also aims at math disambiguation and math semantics determination for the enrichment of math expressions. Scanning a math input document, definitive tags (operation, relation, etc.) and tentative features (alternative roles and meanings) are assigned to math expressions using a semantic database that was created for the project.
Furthermore, a comprehensive Math Knowledge Processing project was started [38] to explore sequence-to-sequence translation from LaTeX typesetting to MathML markup using Math2Vec encodings of the formulae. A dataset of 6000 papers has been collected to be used as training and testing data for the semantification of the identifiers.
Text Document Classification.
Automatic document classification (ADC) has increasingly gained interest due to the vast availability of documents needing to be rapidly categorized. Advantages over the knowledge engineering approach of "manual" labeling by domain experts are efficiency (saving time) and easy portability (general techniques). Compared to traditional methods, e.g., fuzzy logic, ML methods are less interpretable, but often more effective and thus widely used. One differentiates single-label vs. multi-label, as well as hard (top one) vs. ranking text categorization [30]. Applications of ADC include spam filtering, sentiment analysis, product categorization, speech categorization, author and text genre identification, essay grading, automatic document indexing, word sense disambiguation, and hierarchical categorization of web pages [16].
Mathematical Document Classification.
For the classification of mathematical documents, a mathematics-aware Part Of Speech (POS) tagger was developed to extend the dictionaries for keyphrase identification via noun phrases (NPs) by symbols and mathematical formulae [28]. The aim was to aid mathematicians in their search for relevant publications by classified tags, such as ‘named mathematical entities’, where, e.g., names of mathematicians indicate potential parts of the name of a special conjecture, theorem, approach, or method. The hierarchical Mathematics Subject Classification [5] scheme was employed.
Mathematical formulae (available as TeX code) were transformed to unique but random character sequences, e.g., $x'(t) = f(t, x(t))$ to "kqnompjyomsqomppsk". Prior to the classification, key NP candidates were extracted from the full text or abstract and evaluated by experts (editors or reviewers removing, changing, or adding phrases). The authors suggest that for scalable automatic extraction of key phrases, titles and abstracts are more accessible and suitable because they already summarize the publication content. The developed tools were tested for key phrase extraction and classification in the database zbMATH [40], obtaining the best results using an SVM sequential minimal optimization algorithm with a polynomial kernel. For 26 of the 63 top-level classes, the precision was higher than 0.75, and only for 4 classes was it smaller than 0.5. Controversial criteria such as quality, correctness, completeness, uncertainty, subjectivity, and reliability were discussed.
Text Document Clustering.
Motivated by the need for unstructured document organization, summarization, and knowledge discovery, clustering methods are increasingly used for efficient representation and visualization. Applications of clustering in science and business include search engines, recommender systems, duplicate and plagiarism detection, and topic modeling. Similar documents are grouped such that intra-class similarities are high, while inter-class similarities are low [31]. Being an unsupervised ML method, clustering does not need any prior knowledge about the class distribution, at the cost of the results potentially not being properly understandable or interpretable for humans. Distinctions are made between hard (disjoint) vs. soft (overlapping), as well as agglomerative (bottom-up) vs. divisive (top-down) clustering. In a survey on semantic document clustering [18], augmentation by synonyms and domain-specific ontologies is suggested to improve Latent Semantic Analysis (LSA) and Word Sense Disambiguation (WSD). Among the challenges of text clustering are the extraction and selection of appropriate features, similarity measure, clustering method and algorithm, efficient implementation, meaningful cluster labeling, and appropriate evaluation criteria. A review of the history and methods of document clustering can be found in [22].
Mathematical Formula Clustering.
First investigations of how clustering algorithms perform on mathematical formulae have been made by two groups [11, 1]. The first group compared three clustering algorithms (K-Means, Agglomerative Hierarchical Clustering (AHC), and Self Organizing Map (SOM)) on 20 training and 20 test samples, showing Top-5 and Top-10 accuracies between 82% and 99% with the discovery that all three achieved similarly high results. The second group aimed at the speedup of formula search, which is why they also discussed the runtimes of the algorithms, with the observation that K-means outperformed the other two (Self Organizing Map, Average-link) with 96% precision. For further details, the reader is referred to the respective publications.
Clustering-based retrieval of mathematical formulae can have several applications. It can speed up formula search, e.g., on the Digital Library of Mathematical Functions (DLMF) [10] or the arXiv.org e-Print archive [13], by grouping indexed formulae or documents. Given a formula search query, the closest cluster centroid is determined first, reducing the remaining search space by a factor that is the cluster parameter k. Furthermore, clustering possible solutions to mathematical exercises can help in automatic grading and feedback for learners' assignments [8].
In this section, we investigate how Machine Learning (ML) can combine Natural Language Processing (NLP) and Mathematical Language Processing (MLP) when classifying and clustering documents (docs), sections (secs), and abstracts (abs) containing both text and formulae.
Our research was driven by the following questions:
1) How does selecting and combining encodings of natural and mathematical language affect classification (accuracy) and clustering (purity) of documents with mathematical content?
2) Which encoding (content=text/math, method=2vec/tf-idf) or algorithm (classification/clustering) has the highest performance (accuracy/purity) and shortest runtime?
The following section starts by presenting the employed datasets, followed by a description of our data extraction pipeline and the encodings. We then report an examination of the correlation between text and math similarity, followed by our investigation to classify and cluster STEM docs, secs, and abs from the arXiv preprint server by their subject class (mathematics, computer science, physics, etc.) using the contained text and formulae. Finally, we compare the classification confusion of the computer to that of a human expert, summarize our findings, and outline some future directions and experiments.
We conducted experiments using 9800 samples (documents, sections, abstracts) with 400 different settings (encodings, methods, algorithms).
Our code is available at https://purl.org/class_clust_arxiv_code.
SigMathLing arXMLiv-08-2018.
Provided by the Special Interest Group on Maths Linguistics (sigmathling.kwarc.info), the arXMLiv-08-2018 dataset contains 137864 HTML document files (w3c.org/html). We selected an equal distribution of the first 350 documents from each of the following subject classes: ['hep-ph', 'astro-ph', 'quant-ph', 'physics', 'cond-mat', 'hep-ex', 'hep-lat', 'nucl-th', 'nucl-ex', 'hep-th', 'math', 'gr-qc', 'nlin', 'cs'], yielding a total of 14 × 350 = 4900 documents.
NTCIR-11/12 MathIR arXiv.
Analogously, our selections of sections and abstracts comprise 14 × 250 = 3500 sections and 14 × 100 = 1400 abstracts.
Text.
From the HTML documents and TEI section files, we retrieved the textual content using the nltk [12] RegexpTokenizer and the corpus English stopword set. We cleaned the raw text strings by lowercasing and removing stopwords, mathematical formulae, digits, and words with fewer than three characters. The cleaning increased classification accuracies by up to a factor of 3.35 for tf-idf encodings, while yielding smaller improvements for doc2vec encodings.
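The cleaning steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline of the paper: the stopword set is a tiny hypothetical stand-in for the nltk English stopword corpus, and the assumption that formulae appear as $...$ spans is made for the example only (the datasets are HTML and TEI files).

```python
import re

# Hypothetical, tiny stopword set for illustration only; the paper used the
# English stopword corpus shipped with nltk.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}

def clean_text(raw, stopwords=STOPWORDS, min_len=3):
    """Lowercase, then drop inline math, digits, stopwords, and short words."""
    text = raw.lower()
    # Assumption: formulae are marked as $...$ spans in the raw string.
    text = re.sub(r"\$[^$]*\$", " ", text)
    # Keep alphabetic tokens only, which also removes digits and punctuation.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

print(clean_text("The energy $E = mc^2$ of 3 particles is conserved in QFT."))
```

The function name clean_text and its parameters are illustrative choices and do not appear in the published code.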
Formulae.
We retrieved the mathematical content using the Pythonpackage BeautifulSoup [41]. We isolated the formulae from
We encoded the retrieved text and formulae using the TfidfVectorizer from the Python package Scikit-learn [20] and the Doc2Vec model [9] from the Python package Gensim [23]. After creating a LabeledLineSentence iterator for the vocabulary, the model was built with size=300, window=10, min_count=5, workers=11, alpha=0.025, min_alpha=0.025, iter=20 and trained for 10 epochs with model.alpha-=0.002.
Table 1 lists the encodings that deserve further explanation. The surroundings encoding uses text surroundings (within ±500 characters, excluding stopwords and letters) of single identifiers as their putative meanings.
Table 1: Special encodings of mathematical formula content with explanation.
Encoding             Explanation
Math_op              formula operators (+, -, etc.)
Math_id              formula identifiers (x, y, z, etc.)
Math_opid            formula operators and identifiers
Math_surroundings    text surroundings of identifiers
Summing up, we varied the following experimental parameters: 1) encoded data types or batch size (documents, sections, abstracts, summarizations), 2) encoded data features (text, math), 3) type of math encoding (op, id, opid, surroundings), 4) encoding method (doc2vec, tf-idf), 5) classification/clustering algorithm (performance, runtime).
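To make the tf-idf encoding method more concrete, the following sketch computes textbook tf-idf weights for token lists such as the operators of a Math_op encoding. It is an assumption-laden illustration: it uses the plain definition tf × log(N/df) rather than the smoothed and normalized variant of Scikit-learn's TfidfVectorizer that was used in the experiments, and the operator sequences are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document (a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times inverse document frequency.
            tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return vectors

# Hypothetical operator sequences (Math_op) of three documents.
docs_op = [["+", "=", "+"], ["=", "\\int"], ["=", "\\partial", "\\partial"]]
for vec in tfidf(docs_op):
    print(vec)
```

In this toy input, the operator "=" occurs in every document and therefore receives weight zero, which hints at why very common symbols contribute little to separating subject classes.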
First, we determined the correlation of the cosine similarity (normalized inner product) between text and math encodings. For each document, section, or abstract, we calculated the similarities with all the other docs/secs/abs in both text and math encodings (different vector spaces) separately. That is, we investigated whether two documents that are similar in their text encoding are also similar in their math encoding. Table 2 lists the results of our comparison. The low correlations indicate that, in principle, the independence of text and math encodings leaves a potential for improvement of ML algorithms by combining the two, which we explored. Besides, it suggests that in a Recommender System for STEM documents, it will be beneficial to provide the user with weighting parameters for text and math (if relatively uncorrelated) to customize the recommendations.
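The comparison described above can be sketched as follows, under two assumptions that the text does not spell out: the similarities are computed once per unordered pair of documents, and the correlation is the Pearson coefficient. All function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def similarity_correlation(text_vecs, math_vecs):
    """Correlate pairwise text similarities with pairwise math similarities."""
    pairs = list(combinations(range(len(text_vecs)), 2))
    text_sims = [cosine(text_vecs[i], text_vecs[j]) for i, j in pairs]
    math_sims = [cosine(math_vecs[i], math_vecs[j]) for i, j in pairs]
    return pearson(text_sims, math_sims)

# Toy vectors; text and math encodings may live in different vector spaces.
text_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.2]]
math_vecs = [[0.2, 0.9, 0.1], [0.8, 0.1, 0.3], [0.3, 0.3, 0.9], [0.1, 0.7, 0.2]]
print(similarity_correlation(text_vecs, math_vecs))
```

Because only similarities within each space are compared, the text and math vectors do not need to share a dimensionality.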
Table 2: Correlations between text and math (cosine) simi-larity of individual documents and sections.
Comparison/Domain x                      doc    sec
x2vecText - x2vecMath_op                 0.14   0.16
x2vecText - x2vecMath_id                 0.12   0.11
x2vecText - x2vecMath_opid               0.16   0.15
x2vecText - x2vecMath_surroundings       0.21   0.27
Using the selection of 4900 documents, 3500 sections, and 1400 abstracts from the arXiv, we compared the influence of text and formulae on the performance of a subject class ['math', 'physics', 'cs', etc.] classification. We subsequently clustered the doc/sec/abs vectors; for the KMeans, Agglomerative, and GaussianMixture clusterers, we fixed the number of clusters to 14 (the number of labeled classes), while for the Affinity, MeanShift, and HDBSCAN clusterers, no number of clusters was fixed. The encodings secText_tfidf and sec2vecMath_surroundings needed a PCA dimensionality reduction before the clustering with MeanShift and GaussianMixture was possible.
We used 10-fold cross-validation, while comparing the accuracy, purity, and relative runtimes of selected single or ensemble classifiers and clustering algorithms (with their respective default metrics) with or without fixed cluster number, provided by the Python package Scikit-learn [20].
For the classification, we calculated the accuracy as the number of correctly classified samples divided by the sample size and averaged over all splittings of the cross-validation.
For the clustering, we compared the clusters to the labeled classes, calculating the cluster purity as the number of data points of the class that makes up the largest fraction of the cluster divided by the cluster size and averaged over all clusters.
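The two evaluation measures defined above can be written down directly. The following sketch follows the wording of the definitions (accuracy averaged over cross-validation splits, purity as an unweighted average over clusters); the function names and the toy labels are hypothetical and not taken from the published code.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_cv_accuracy(fold_results):
    """Average accuracy over splits; fold_results holds (y_true, y_pred) pairs."""
    return sum(accuracy(t, p) for t, p in fold_results) / len(fold_results)

def cluster_purity(labels, clusters):
    """Per-cluster share of the majority class, averaged over all clusters."""
    members = {}
    for label, cluster in zip(labels, clusters):
        members.setdefault(cluster, []).append(label)
    purities = [
        Counter(m).most_common(1)[0][1] / len(m) for m in members.values()
    ]
    return sum(purities) / len(purities)

labels = ["math", "math", "cs", "cs", "physics", "physics"]
clusters = [0, 0, 0, 1, 1, 1]
print(cluster_purity(labels, clusters))
print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
```

Since every cluster counts equally in this average, many small and homogeneous clusters tend to raise the purity, which may be worth keeping in mind when reading the values for clusterers without a fixed cluster number.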
The results of the classification and clustering are shown in Tables 3 and 4.
In contrast to the classification, the clustering of math vectors yielded partly better results than the text clustering. A combination of both text and math yielded no significant improvement over the separate encodings.
Table 3 shows the classification accuracies of the individual classification algorithms using text or math encodings of the documents, sections, and abstracts.
The best encoding for docs, secs, and abs is always Text_tfidf. The most accurate algorithm is MLP (here: Multilayer Perceptron, with hidden layer size 500), except for docs, where LinSVC yields a slightly higher maximum value. The fastest algorithms are kNN and Random Forest. For kNN and DecTree, there is a high discrepancy between the values of the doc2vecText and docText_tfidf encodings. While for text the tf-idf encoding is better, for math it is the doc2vec encoding, with the exception of the surroundings encoding, which is effectively text even though it is connected to mathematical identifiers.
An overall comparison of x2vec and tf-idf including both text and math shows that the former outperforms the latter, with mean(doc2vecX, sec2vecX, abs2vecX) = (40.2, 37.0, 28.0) mostly greater than mean(doc_tfidf, sec_tfidf, abs_tfidf) = (35.2, 35.5, 30.0), and mean(2vec) = 35.1 > mean(tfidf) = 33.6 in summary. The mean of the means decreases from docs (38.1) to secs (36.4) to abs (29.0). The surroundings encoding (especially with tf-idf) is better than the math encodings of operators (op), identifiers (id), or both (opid). Given that mean(doc_text, doc_math, doc_textmath) = (63.4, 28.3, 51.8), and mean(sec_text, sec_math, sec_textmath) = (59.7, 27.7, 47.9), it is striking that for the classification, the text encodings yield better results than the standalone math encodings.
We tested some other algorithms that are not listed: Gaussian Naive Bayes yields accuracies of mostly less than 10%; Multinomial Naive Bayes could not be carried out on the text vectors due to the negative values of their continuous distribution.
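The summary values quoted above for the comparison of x2vec and tf-idf in the classification can be checked with a few lines of arithmetic. The sketch below only uses the per-domain means stated in the text; the variable names are hypothetical.

```python
# Per-domain mean accuracies (docs, secs, abs) as quoted in the text.
doc2vec_means = (40.2, 37.0, 28.0)
tfidf_means = (35.2, 35.5, 30.0)

def mean(values):
    return sum(values) / len(values)

# Overall means, rounded to one decimal as in the text.
print(round(mean(doc2vec_means), 1))  # 35.1
print(round(mean(tfidf_means), 1))    # 33.6

# "Mostly greater": x2vec wins for docs and secs, tf-idf wins for abs.
wins = sum(a > b for a, b in zip(doc2vec_means, tfidf_means))
print(wins, "of", len(doc2vec_means))  # 2 of 3
```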
Table 3: Classification accuracies of 4900 arXiv documents (above), 3500 sections (middle), and 1400 abstracts (below) into 14 subject classes using different classifiers (columns), and text or math encodings (rows). The highest mean/maximum is highlighted in yellow/red. It is orange if an encoding or classifier yields both the highest mean and maximum value. The shortest relative runtime is marked in green.
Columns (Encoding/Classifier): LogReg, LinSVC, RbfSVC, kNN, MLP, DecTree, RandForest, GradBoost, Mean, Max.
Rows (documents): doc2vecText, docText_tfidf, doc2vecMath_op, docMath_op_tfidf, doc2vecMath_id, docMath_id_tfidf, doc2vecMath_opid, docMath_opid_tfidf, doc2vecMath_surroundings, docMath_surroundings_tfidf, doc2vecTextMath_opid, doc2vecTextMath_surroundings, Mean, Max, Runtime [%].
Rows (sections): sec2vecText, secText_tfidf, sec2vecMath_op, secMath_op_tfidf, sec2vecMath_id, secMath_id_tfidf, sec2vecMath_opid, secMath_opid_tfidf, sec2vecMath_surroundings, secMath_surroundings_tfidf, sec2vecTextMath_opid, sec2vecTextMath_surroundings, Mean, Max, Runtime [%].
Rows (abstracts): abs2vecText, absText_tfidf, abs2vecMath_opid, absMath_opid_tfidf, Mean, Max, Runtime [%].
Table 4: Clustering purities of 4900 arXiv documents (above), 3500 sections (middle), and 1400 abstracts (below) with 14 subject classes using different clusterers (columns), and text or math encodings (rows). The highest mean/maximum is highlighted in yellow/red for the group of clusterers with specified cluster number (KMeans, Agglomerative, GaussianMixture) and unspecified (Affinity, MeanShift, HDBSCAN) respectively. It is orange if an encoding or clusterer yields both the highest mean and maximum. The shortest relative runtime is marked in green.
Columns (Encoding/Clusterer): KMeans, Affinity, Agglomerative, MeanShift, GaussianMixture, HDBSCAN, Mean, Max.
Rows (documents): doc2vecText, docText_tfidf, doc2vecMath_op, docMath_op_tfidf, doc2vecMath_id, docMath_id_tfidf, doc2vecMath_opid, docMath_opid_tfidf, doc2vecMath_surroundings, docMath_surroundings_tfidf, doc2vecTextMath_opid, doc2vecTextMath_surroundings, Mean, Max, Runtime [%].
Rows (sections): sec2vecText, secText_tfidf, sec2vecMath_op, secMath_op_tfidf, sec2vecMath_id, secMath_id_tfidf, sec2vecMath_opid, secMath_opid_tfidf, sec2vecMath_surroundings, secMath_surroundings_tfidf, sec2vecTextMath_opid, sec2vecTextMath_surroundings, Mean, Max, Runtime [%].
Rows (abstracts): abs2vecText, absText_tfidf, abs2vecMath_opid, absMath_opid_tfidf, Mean, Max, Runtime [%].
Table 4 shows the cluster purities of the individual clustering algorithms using text or math encodings of the documents, sections, and abstracts. The best encodings are doc/sec2vecMath_surroundings and abs2vecMath_opid. The most accurate algorithms are GaussianMixture (highest mean and maximum), MeanShift (highest maximum), and Affinity (highest maximum and mean). GaussianMixture is the best algorithm with a fixed cluster number, while Affinity and MeanShift are the best algorithms without a fixed cluster number. The fastest algorithm is HDBSCAN. Only for the mean of docs, text yields the highest value. For the other mean and maximum values, math encodings are better than text encodings.
A comparison of x2vec and tf-idf shows that the former outperforms the latter with mean(doc2vecX, sec2vecX, abs2vecX) = (56.3, 43.4, 53.1) > mean(doc_tfidf, sec_tfidf, abs_tfidf) = (51.5, 36.5, 47.3), and mean(2vec) = 50.9 > mean(tfidf) = 45.0 in summary. For abstracts, the math encodings yield better results than the text encodings with mean(abs_text, abs_math) = (44.8, 55.6). However, given that mean(text, math, textmath) = (51.2, 48.2, 52.0), all in all, also for the clustering, the text encodings yield better results than the math encodings, but the combination of text and math slightly outperforms both.
We tested some other algorithms that are not listed due to poor performance or exceedingly large runtimes (e.g., Spectral Clustering, DBSCAN).
Regarding the cross-validation, we observed that the split k had only a small impact on the result.
We carried out a human expert classification of 10 examples from each of the 14 subject classes for comparison to our algorithmic results. Of 140 examples in total, 85 (i.e., 60.7%) were correctly classified. Figure 1 shows a comparison of the classification confusion matrices.
The human classifier (left), with lower diagonal and higher off-diagonal values, is outperformed by the computer classifier (right). Both the computer and human classification confusion show that some categories like 'physics' should be dissolved and distributed among the respective specializations ('cond-mat', 'hep-ex', 'nucl-ex', etc.). As the human expert, we chose one physicist, able to separate all subject classes.
For the NTCIR and SigMathLing arXiv datasets, we could not find a baseline for our experiments. However, we were able to reproduce the results using the script of Schöneberg et al. [28] on 942337 (≈
In this paper, we discussed how methods of Natural Language Processing (NLP) and Mathematical Language Processing (MLP) can be combined to enable the use of Machine Learning (ML) in Information Retrieval (IR) applications on documents with mathematical content. We first provided a short review of MathIR, NLP, and MLP and the current state of research in text and math classification and clustering. Subsequently, we introduced the employed datasets of mathematical documents and described encodings for their text and math content. We investigated the correlation between text and math similarity. Finally, we presented and discussed the results of a classification and clustering of 4900 documents, 3500 sections, and 1400 abstracts from the arXiv preprint server (arxiv.org).
The correlations between text and math (cosine) similarity (Table 2) were relatively low (mean = 0.17, max = 0.27), motivating us to treat text and math encodings as separate features of a document. While for the classification the Text_tfidf encoding outperformed the others, for the clustering the doc/sec2vecMath_surroundings and abs2vecMath_opid encodings were the best. For both classification and clustering, the x2vec encodings yielded better results than the tf-idf encodings, and text outperformed math encodings. However, for the clustering, the combination of text and math slightly outperformed the separate encodings.
All in all, our research questions were answered as:
1) Combining text and math encodings does not improve the classification accuracy, but partly improves the cluster purity of selected ML algorithms working on documents, sections, and abstracts.
2) On the whole, the doc2vec encoding outperforms the tf-idf encoding. The most accurate classification algorithm is a Multilayer Perceptron (MLP), while for the clustering, the highest maximum and mean values of the purities are divided among GaussianMixture, MeanShift, and Affinity Propagation. The fastest algorithms are the k-Nearest Neighbors and Random Forest classifiers, and HDBSCAN clustering.
Why did the use of mathematical encodings not significantly improve classification accuracy? We suspect a low inter-class variance of the math encodings due to a large overlap of the formula identifier namespaces. For example, the identifier x occurs very often in many subject classes, but with different meanings. Documents from different subject classes often have similar sets of identifier symbols. Therefore, we expect that disambiguation of the identifier semantics by annotation would increase the vector distance between subject classes and possibly increase classification accuracy. There are two ways to tackle the identifier disambiguation: supervised, with the quality control of a human, or unsupervised, without it. In the following, we present our results of unsupervised semantification using three different sources. Furthermore, we briefly discuss our ongoing endeavors to additionally perform supervised annotation in the future work section.
In an attempt to increase the classification accuracy compared to the previously presented math encodings, we tested a conversion from math (identifier) symbols to text (semantics). We semantically enriched the text of the 14 × 350 = 4900 documents of the SigMathLing arXMLiv-08-2018 dataset by identifier name candidates provided from three different lists. These were previously extracted from the following sources:
Figure 1: Confusion matrix with percentages comparing the classification of a human expert (left) to the best performing combination of a LinSVC classifier on docText_tfidf encodings (right).
1) arXiv: Identifier candidate names for all lower- and upper-case Latin and Greek letter identifier symbols appearing in the NTCIR arXiv corpus that was created as part of the NTCIR MathIR Task [2]. The candidates were extracted from the surrounding text of 60 M formulae and ranked by the frequency of their occurrence;
2) Wikipedia: Identifier candidate names extracted from definitions in English mathematical articles, as provided by Physikerwelt;
3) Wikidata: Identifier candidate names retrieved via a SPARQL query for items with a defining formula containing the respective identifier symbol.
For each source, the candidates were extracted, ranked by the occurrence frequency of the respective identifier symbol/name mapping, and dumped to static lists. The encodings with semantic enrichment by the top 3 ranked identifier name candidates outperform all other mathematical encodings listed in Table 3. We will discuss an extension of the experiment to supervised semantification in our future work.
In this paper, we presented classification and clustering baselines on the SigMathLing arXMLiv-08-2018 and NTCIR-11/12 MathIR arXiv datasets. Since we have so far not been able to significantly outperform the text encodings with math encodings, we call for a Formula Encoding Challenge. The aim is to find a suitable math encoding that outperforms or enhances the text classification.
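As an illustration of the unsupervised semantification described in this section, the following sketch appends the top-ranked name candidates of each identifier to the text tokens of a document. It is a hypothetical reading of the procedure: the candidate lists, the function name enrich, and the choice to simply append the names are assumptions for the example and are not taken from the published code.

```python
# Hypothetical ranked candidate lists (most frequent first). The lists used in
# the experiments were extracted from arXiv, Wikipedia, and Wikidata.
CANDIDATES = {
    "E": ["energy", "electric field", "expectation", "edge set"],
    "m": ["mass", "integer", "magnetization"],
}

def enrich(text_tokens, identifiers, candidates=CANDIDATES, top_k=3):
    """Append the top_k name candidates of every known identifier."""
    enriched = list(text_tokens)
    for ident in identifiers:
        # Identifiers without any candidate are skipped silently.
        enriched.extend(candidates.get(ident, [])[:top_k])
    return enriched

print(enrich(["rest", "frame"], ["E", "m", "x"]))
```

The enriched token list could then be passed to the same text encoders (doc2vec, tf-idf) as the plain text.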
We now outline some future directions and experiments.
Deep Contextualized Encodings.
In the near future, we aim to test other recently developed encodings like Deep Bidirectional Transformers (BERT) [4] and Deep Contextualized Word Representations (ELMo) [21], which are computationally more expensive and memory consuming. Our most extensive selection of text from 4900 documents (docText) taken from the SigMathLing arXMLiv-08-2018 dataset contains 1 million sentences, 75 million words, and 1.65 billion tokens and is thus larger than other NLP benchmark datasets the encodings are usually tested on. As an example for ELMo, the Stanford Natural Language Inference (SNLI) Corpus [3] comprises 570 thousand sentences, and the CoNLL-2003 Shared NER Task [24] unannotated data consists of 17 million tokens.
Footnote URLs for the identifier sources: NTCIR data http://ntcir-math.nii.ac.jp/data/, Physikerwelt https://en.wikipedia.org/wiki/User:Physikerwelt, Wikidata query service https://query.wikidata.org
Supervised Formula Semantification.
To improve on the classification accuracy of the unsupervised semantification encodings, we plan to employ supervised semantification by human labeling. However, semantic enrichment by "manual" annotation will take a significant amount of time. To facilitate and speed up the process, we are currently working on a formula and identifier name annotation recommender system [26]. We aim to integrate the tool into the editing views of both Wikipedia (Wikitext documents) and Overleaf (LaTeX documents) to involve the mathematical research community in the semantification process.
Formula Clustering.
We propose another potential use case of formula clustering, namely formula occurrence retrieval. Given a large dataset of mathematical documents (e.g., from the arXiv), the task is to retrieve a ranking of the formulae that occur most often. One could hypothesize that, due to their popularity in research, the highest-ranked formulae are the most relevant candidates to have their underlying mathematical concepts seeded into encyclopedias and dictionaries such as Wikipedia, the semantic knowledge base Wikidata [33], or the NIST Digital Library of Mathematical Functions [10]. Since formulae often appear in a variety of formulations or equivalent representations, and it is a priori unknown how many different formula concepts [25, 27] will be discovered (the cluster parameter k), this is a very challenging problem that the authors are currently working on.
Comparing Encodings of Natural and Mathematical Language. JCDL ’20, August 1–5, 2020, Virtual Event, China
ACKNOWLEDGMENT
This work was supported by the German Research Foundation (DFG grant GI-1259-1). The authors would like to thank Christian Borgelt for his support.
REFERENCES
[1] M. Adeel, M. Sher, and M. S. H. Khiyal. “Efficient cluster-based information retrieval from mathematical markup documents”. In: World Applied Sciences Journal
[2] NTCIR. National Institute of Informatics (NII), 2014.
[3] S. R. Bowman et al. “A large annotated corpus for learning natural language inference”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. Ed. by L. Màrquez et al. The Association for Computational Linguistics, 2015, pp. 632–642.
[4] J. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805.
[5] Y. Dong. “NLP-Based Detection of Mathematics Subject Classification”. In: Mathematical Software - ICMS 2018 - 6th International Conference, South Bend, IN, USA, July 24-27, 2018, Proceedings. Ed. by J. H. Davenport et al. Vol. 10931. Lecture Notes in Computer Science. Springer, 2018, pp. 147–155. doi: 10.1007/978-3-319-96418-8_18.
[6] D. Ginev et al. “The LaTeXML Daemon: Editable Math on the Collaborative Web”. In: Intelligent Computer Mathematics - 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011, Bertinoro, Italy, July 18-23, 2011. Proceedings. Ed. by J. H. Davenport et al. Vol. 6824. Lecture Notes in Computer Science. Springer, 2011, pp. 292–294. doi: 10.1007/978-3-642-22673-1_25.
[7] W. J. Hutchins. “The Georgetown-IBM Experiment Demonstrated in January 1954”. In: Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28-October 2, 2004, Proceedings. Ed. by R. E. Frederking and K. Taylor. Vol. 3265. Lecture Notes in Computer Science. Springer, 2004, pp. 102–114. doi: 10.1007/978-3-540-30194-3_12.
[8] A. S. Lan et al. “Mathematical Language Processing: Automatic Grading and Feedback for Open Response Mathematical Questions”. In: Proceedings of the Second ACM Conference on Learning @ Scale, L@S 2015, Vancouver, BC, Canada, March 14 - 18, 2015. Ed. by G. Kiczales, D. M. Russell, and B. P. Woolf. ACM, 2015, pp. 167–176. doi: 10.1145/2724660.2724664.
[9] Q. V. Le and T. Mikolov. “Distributed Representations of Sentences and Documents”. In: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. Vol. 32. JMLR Workshop and Conference Proceedings. JMLR.org, 2014, pp. 1188–1196.
[10] D. W. Lozier. “NIST Digital Library of Mathematical Functions”. In: Ann. Math. Artif. Intell.
[11] Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on. IEEE. 2010, pp. 372–377.
[12] C. D. Manning et al. “The Stanford CoreNLP Natural Language Processing Toolkit”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, System Demonstrations. The Association for Computer Linguistics, 2014, pp. 55–60.
[13] G. McKiernan. “arXiv.org: the Los Alamos National Laboratory e-print server”. In: International Journal on Grey Literature
[14] NIPS. 2013, pp. 3111–3119.
[15] T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space”. In: CoRR abs/1301.3781 (2013). arXiv: 1301.3781.
[16] M. Mironczuk and J. Protasiewicz. “A recent overview of the state-of-the-art elements of text classification”. In: Expert Syst. Appl. 106 (2018), pp. 36–54. doi: 10.1016/j.eswa.2018.03.058.
[17] P. M. Nadkarni, L. Ohno-Machado, and W. W. Chapman. “Natural language processing: an introduction”. In: Journal of the American Medical Informatics Association
[18] Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on. IEEE. 2015, pp. 1–10.
[19] R. Pagel and M. Schubotz. “Mathematical Language Processing Project”. In: Joint Proceedings of the MathUI, OpenMath and ThEdu Workshops and Work in Progress track at CICM co-located with Conferences on Intelligent Computer Mathematics (CICM 2014), Coimbra, Portugal, July 7-11, 2014. Ed. by M. England et al. Vol. 1186. CEUR Workshop Proceedings. CEUR-WS.org, 2014.
[20] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[21] M. E. Peters et al. “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). Ed. by M. A. Walker, H. Ji, and A. Stent. Association for Computational Linguistics, 2018, pp. 2227–2237.
[22] K. Premalatha and A. Natarajan. “A literature review on document clustering”. In: Information Technology Journal
[24] Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003. Ed. by W. Daelemans and M. Osborne. ACL, 2003, pp. 142–147.
[25] P. Scharpf, M. Schubotz, and B. Gipp. “Representing Mathematical Formulae in Content MathML using Wikidata”. In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) 2018, co-located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), Ann Arbor, USA, July 12, 2018. Ed. by P. Mayr, M. K. Chandrasekaran, and K. Jaidka. Vol. 2132. CEUR Workshop Proceedings. CEUR-WS.org, 2018, pp. 46–59.
[26] P. Scharpf et al. “AnnoMathTeX - a formula identifier annotation recommender system for STEM documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019. Ed. by T. Bogers et al. ACM, 2019, pp. 532–533. doi: 10.1145/3298689.3347042.
[27] P. Scharpf et al. “Towards Formula Concept Discovery and Recognition”. In: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019. Ed. by M. K. Chandrasekaran and P. Mayr. Vol. 2414. CEUR Workshop Proceedings. CEUR-WS.org, 2019, pp. 108–115.
[28] U. Schöneberg and W. Sperber. “POS Tagging and Its Applications for Mathematics - Text Analysis in Mathematics”. In: Intelligent Computer Mathematics - International Conference, CICM 2014, Coimbra, Portugal, July 7-11, 2014. Proceedings. Ed. by S. M. Watt et al. Vol. 8543. Lecture Notes in Computer Science. Springer, 2014, pp. 213–223. doi: 10.1007/978-3-319-08434-3_16.
[29] M. Schubotz et al. “Semantification of Identifiers in Mathematics for Better Math Information Retrieval”. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016. Ed. by R. Perego et al. ACM, 2016, pp. 135–144. doi: 10.1145/2911451.2911503.
[30] F. Sebastiani. “Machine learning in Automated Text Categorization”. In: ACM Comput. Surv.
[31] International Journal of Applied Information Systems
[32] Mind LIX.236 (1950), pp. 433–460.
[33] D. Vrandecic and M. Krötzsch. “Wikidata: a free collaborative knowledgebase”. In: Commun. ACM
[34] Commun. ACM
[35] PAKDD. Vol. 5476. Lecture Notes in Computer Science. Springer, 2009, pp. 266–277.
[36] T. Young et al. “Recent Trends in Deep Learning Based Natural Language Processing [Review Article]”. In: IEEE Comp. Int. Mag.
[37] Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings. Ed. by H. Geuvers et al. Vol. 10383. Lecture Notes in Computer Science. Springer, 2017, pp. 356–374. doi: 10.1007/978-3-319-62075-6_25.
[38] A. Youssef and B. R. Miller. “Deep Learning for Math Knowledge Processing”. In: Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings. Ed. by F. Rabe et al. Vol. 11006. Lecture Notes in Computer Science. Springer, 2018, pp. 271–286. doi: 10.1007/978-3-319-96812-4_23.
[39] R. Zanibbi et al. “NTCIR-12 MathIR Task Overview”. In: Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies, National Center of Sciences, Tokyo, Japan, June 7-10, 2016. Ed. by N. Kando, T. Sakai, and M. Sanderson. National Institute of Informatics (NII), 2016.
[40] Zentralblatt MATH (zbMATH). https://zbmath.org. Accessed: 2020-05-10.
[41] C. Zheng, G. He, and Z. Peng. “A Study of Web Information Extraction Technology Based on Beautiful Soup”. In: