AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels
Moritz Schubotz, Philipp Scharpf, Olaf Teschke, Andreas Kuehnemund, Corinna Breitinger, Bela Gipp
Preprint of:
M. Schubotz et al. "AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels". In: Proceedings of the 13th Conference on Intelligent Computer Mathematics. 2020.

FIZ Karlsruhe, Germany ({first.last}@fiz-karlsruhe.de)
University of Wuppertal, Germany ({last}@uni-wuppertal.de)
University of Konstanz, Germany ({first.last}@uni-konstanz.de)

May 26, 2020

Abstract
Authors of research papers in mathematics and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital libraries in mathematics, as well as reviewing services such as zbMATH and Mathematical Reviews (MR), rely on these MSC labels in their workflows to organize the abstracting and reviewing process. In particular, the coarse-grained classification determines the subject editor who is responsible for the actual reviewing process.

In this paper, we investigate the feasibility of automatically assigning a coarse-grained primary classification using the MSC scheme by regarding the problem as a multi-class classification machine learning task. We find that our method achieves an F1-score of over 77%, which is remarkably close to the agreement of zbMATH and MR (F1-score of 81%). Moreover, we find that the method's confidence score allows for reducing the effort by 86% compared to the manual coarse-grained classification effort, while maintaining a precision of 81% for automatically classified articles.

Introduction

zbMATH (https://zbmath.org/) classified more than 135k articles in 2019 using the Mathematics Subject Classification (MSC) scheme [6]. With more than 6,600 MSC codes, this classification task requires significant in-depth knowledge of various sub-fields of mathematics to determine the fitting MSC codes for each article. In summary, the classification procedure of zbMATH and MR is two-fold. First, all articles are pre-classified into one of 63 primary subjects, spanning from general topics in mathematics (00), to integral equations (45), to mathematics education (97). In a second step, subject editors assign fine-grained MSC codes in their area of expertise, inter alia with the aim of matching potential reviewers.

The automated assignment of MSC labels was analyzed by Rehurek and Sojka [9] in 2008 on the DML-CZ [14] and NUMDAM [3] full-text corpora. They report a micro-averaged F1-score of 81% for their public corpus. In 2013, Barthel, Tönnies, and Balke performed automated subject classification for parts of the zbMATH corpus [2]. They criticized the micro-averaged F1-measure, especially when the average is applied only to the best performing classes; they report a micro-averaged F1-score of 67.1% for the zbMATH corpus. They suggested training classifiers for a precision of 95% and assigning MSC class labels in a semi-automated recommendation setup. Moreover, they suggested measuring the human baseline (inter-annotator agreement) for the classification task, and they found that combining mathematical expressions with textual features improves the F1-score for certain MSC classes substantially. In 2014, Schöneberg and Sperber [11] implemented a method that combined formulae and text using an adapted part-of-speech tagging approach. Their paper reported a precision of more than 75% but did not state the recall. The proposed method was implemented and is currently being used, especially to pre-classify general journals [7], with additional information such as references. For a majority of journals, coarse- and fine-grained codes can be found by statistically analyzing the MSC codes of referenced documents matched within the zbMATH corpus. The editors of zbMATH hypothesize that this reference-based method outperforms the algorithm developed by Schöneberg and Sperber; confirming or rejecting this hypothesis was one motivation for this project.

The positive effect of mathematical features is confirmed by Suzuki and Fujii [16], who measured the classification performance on an arXiv and a MathOverflow dataset. In contrast, Scharpf et al. [10] could not measure a significant improvement of classification accuracy for the arXiv dataset when incorporating mathematical identifiers. In their experiments, Scharpf et al. evaluated numerous machine learning methods, which extended [4, 15] in terms of accuracy and run-time performance, and found that complex, compute-intensive neural networks do not significantly improve the classification performance.

In this paper, we focus on the coarse-grained classification of the primary MSC subject number (pMSCn) and explore how current machine learning approaches can be employed to automate this process. In particular, we compare the current state-of-the-art technology [10] with a part-of-speech (POS) preprocessing based system customized for the application in zbMATH in 2014 [11].
Figure 1: Workflow overview (filter articles, match zbMATH and MR labels, split into training and test sets, train the model, and evaluate).

We define the following research questions:

1. Which evaluation metrics are most useful to assess the classifications?
2. Do mathematical formulae as part of the text improve the classifications?
3. Does POS preprocessing [11] improve the accuracy of classifications?
4. Which features are most important for accurate classification?
5. How well do automated methods perform in comparison to a human baseline?
To investigate these questions, we first created test and training datasets. We then investigated the different pMSCn encodings, trained our models, and evaluated the results, cf. Figure 1.
Filter current high-quality articles:
The zbMATH database has assigned MSC codes to more than 3.6M articles. However, the way in which mathematical articles are written has changed over the last century, and the classification of historic articles is not something we aim to investigate in this article. The first MSC was created in 1990 and has since been updated every ten years (2000, 2010, and 2020) [5]. With each update, automated rewrite rules are applied to map the codes from the old MSC version to the next, which entails a loss of accuracy of the class labels. To obtain a coherent, high-quality dataset for training and testing, we focused on the more recent articles from 2000 to 2019, which were classified using MSC version 2010, and we only considered selected journals (the list of selected journals is available from https://zbmath.org/?q=dt%3Aj+st%3Aj+py%3A2000-2019). Additionally, we restricted our selection to English articles and limited ourselves to abstracts rather than reviews of articles. To be able to compare methods based on references with methods using text and title, we only selected articles with at least one reference that could be matched to another article. In addition, we excluded articles that were not yet published and processed. The list of articles is available from our website: https://automsceval.formulasearchengine.com
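For illustration, this filter step can be sketched in Python with pandas. This is a minimal sketch, not the production query: the file name and the column names (year, language, has_abstract, matched_references, published) are hypothetical stand-ins for the actual zbMATH database fields.

import pandas as pd

# Hypothetical export of zbMATH metadata; the real selection ran on the database.
articles = pd.read_csv("zbmath_articles.csv")

selected = articles[
    articles["year"].between(2000, 2019)        # classified with MSC version 2010
    & (articles["language"] == "English")       # English articles only
    & articles["has_abstract"]                  # abstracts rather than reviews
    & (articles["matched_references"] >= 1)     # at least one matched reference
    & articles["published"]                     # already published and processed
]
selected.to_csv("filtered_articles.csv", index=False)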
Splitting into test and training sets:

After applying the filter criteria mentioned above, we split the resulting list of 442,382 articles into test and training sets. For the test set, we aimed to measure the bias of our zbMATH classification labels. Therefore, we used the articles for which we knew the classification labels of the MR service from a previous research project [1] as the test set. The resulting test set consisted of n = 32,230 articles, and the training set contained 410,152 articles. To ensure that this selection did not introduce additional bias, we also computed the standard ten-fold cross validation, cf. Section 3.
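A minimal sketch of this split, assuming a hypothetical boolean column has_mr_label that marks the articles matched to MR labels from [1]:

import pandas as pd

articles = pd.read_csv("filtered_articles.csv")  # the 442,382 filtered articles

# Articles with known MR labels form the test set (n = 32,230);
# the remaining 410,152 articles form the training set.
test = articles[articles["has_mr_label"]]
train = articles[~articles["has_mr_label"]]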
Definition of article data format:
To allow for reproducibility, we created a dedicated dataset from our article selection, which we aim to share with other researchers. Currently, however, legal restrictions apply, and the dataset cannot yet be provided for anonymous download; we can, however, grant access for research purposes, as done in the past [2]. Each of the 442,382 articles in the dataset contained the following fields:

de: An eight-digit ID of the document.
labels: The actual MSC codes.
title: The English title of the document, with LaTeX macros for mathematical language [12].
text: The text of the abstract, with LaTeX macros.
mscs: A comma-separated list of MSC codes generated from the references.

These five fields were provided as CSV files to the algorithms. Note that the fields de and labels must not be used as input to the classification algorithms. The mscs field was generated as follows: for each reference in the document, we looked up the MSC codes of the referenced article. For example, if a certain document contained the references A, B, C that are also documents in zbMATH, and the MSC codes of A, B, C are a1 and a2; b1; and c1, c2, and c3, respectively, then the field mscs will read "a1 a2, b1, c1 c2 c3".
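The following sketch illustrates the construction of the mscs field for the example above; the lookup table ref_codes is a hypothetical stand-in for the result of zbMATH's internal citation matching.

# MSC codes of the matched references (hypothetical lookup result)
ref_codes = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2", "c3"]}

def build_mscs(references):
    # Codes of one reference are space-separated; references are comma-separated.
    return ", ".join(" ".join(ref_codes[r]) for r in references if r in ref_codes)

print(build_mscs(["A", "B", "C"]))  # -> a1 a2, b1, c1 c2 c3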
After training, we required each of our tested algorithms to return the following fields in CSV format for the test sets:

de (integer): Eight-digit ID of the document.
method (char(5)): Five-letter ID of the run.
pos (integer): Position in the result list.
coarse (integer): Coarse-grained MSC subject number.
fine (char(5), optional): Fine-grained MSC code.
score (numeric, optional): Self-confidence of the algorithm about the result.

We ensured that the fields de, method, and pos form a primary key, i.e., no two entries in the result can have the same combination of values. Note that for the current multi-class classification problem, pos is always 1, since only the primary MSC subject number is considered.

While the assignment of all MSC codes to each article is a multi-label classification task, the assignment of the primary MSC subject, which we investigate in this paper, is only a multi-class classification problem. With $k = 63$ classes, the probability of randomly choosing the correct class of size $c_i$ is rather low: $P_i = c_i / n$. Moreover, the dataset is not balanced. In particular, the entropy $H = -\sum_{i=1}^{k} P_i \log P_i$ can be used to measure the imbalance after normalizing it to the maximum entropy $\log k$, i.e., $\hat{H} = H / \log k$. To take the imbalance of the dataset into account, we used weighted versions of precision $p$, recall $r$, and the F1-measure $f$. In particular, the weighted precision is $p = \frac{1}{n} \sum_{i=1}^{k} c_i p_i$ with the class precision $p_i$; $r$ and $f$ are defined analogously.

In the test set, no entries for the pMSCn 97 (Mathematics education) were included. Furthermore, we only considered classes with a minimum size of 200 test entries, which reduced the number of classes to $k = 37$ and slightly raised the normalized entropy $\hat{H}$. The chosen value of 200 can be interactively adjusted in the dynamic result figures we made available online at https://autoMSCeval.formulasearchengine.com. Additionally, the individual values of $P_i$ that were used to calculate $H$ are given in the column p of the table on that page. As one can see in the online version of the figures, the impact of the choice of the minimum class size is insignificant.
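As a sketch, the normalized entropy and the weighted precision defined above can be computed as follows; Scikit-learn's average="weighted" option implements the same size-weighted averaging for precision, recall, and F1.

import numpy as np

def normalized_entropy(class_sizes):
    # \hat{H} = H / log k with H = -sum_i P_i log P_i and P_i = c_i / n
    P = np.asarray(class_sizes, dtype=float)
    P /= P.sum()
    H = -np.sum(P * np.log(P))
    return H / np.log(len(P))

def weighted_precision(class_sizes, class_precisions):
    # p = (1/n) sum_i c_i p_i, i.e., class precisions weighted by class size
    c = np.asarray(class_sizes, dtype=float)
    return float(np.sum(c * np.asarray(class_precisions)) / c.sum())

# Equivalently, given predictions y_pred and manual labels y_true:
# from sklearn.metrics import precision_recall_fscore_support
# p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")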
Selection of methods to evaluate

In this paper, we compare 12 different methods for (automatically) determining the primary MSC subject in the test dataset:

zb1: Reference MSC subject numbers from zbMATH.

mr1: Reference MSC subject numbers from MR.

titer: Following recent research performed on the arXiv dataset [10], we chose a machine learning method with a good trade-off between speed and performance. We combined the title, abstract text, and reference mscs of the articles via string concatenation. We encoded these string sources using the TfidfVectorizer of the Scikit-learn Python package [8]. We did not alter the UTF-8 encoding and did not perform accent stripping or other character normalization methods, with the exception of lower-casing. Furthermore, we used the word analyzer without a custom stop word list, selecting tokens of two or more alphanumeric characters, processing unigrams, and ignoring punctuation. The resulting vectors consisted of float64 entries with l2-norm unit output rows. This data was passed to our encoder, which was trained on the training set to subsequently transform (vectorize) the sources from the test set. We chose a lightweight LogisticRegression classifier from the Scikit-learn package. We employed the l2 penalty norm with a $10^{-4}$ tolerance stopping criterion and a regularization strength of 1.0. Furthermore, we allowed intercept constant addition and scaling, but no class weights or custom random state seed. We fitted the classifier using the lbfgs (limited-memory BFGS) solver for 100 convergence iterations. These choices were made based on a previous study in which we clustered arXiv articles. A sketch of this pipeline is shown below, after the method list.

refs: Same as titer, but using only the mscs as input. (Each of these single sources was encoded and classified separately.)

titls: Same as titer, but using only the title as input.

texts: Same as titer, but using only the text as input.

tite: Same as titer, but without using the mscs as input.

tiref: Same as titer, but without using the abstract text as input.

teref: Same as titer, but without using the title as input.

ref1: We used a simple SQL script to suggest the most frequent primary MSC subject based on the mscs input. This method is currently used in production to estimate the primary MSC subject.
uT1: We adjusted the Java program posLingue [11] (https://swmath.org/software/8058) to read from the new training and test sets. However, we did not perform a new training and instead reused the model that was trained in 2014. For this run, we removed all mathematical formulae from the title and the abstract text to generate a baseline.

uM1: The same as uT1, but in this instance we included the formulae. We slightly adjusted the formula detection mechanism, since the way in which formulae are written in zbMATH had changed [12]. This method is currently used in production for articles that do not have references with resolvable mscs.

After executing each of the methods described in the previous section, we calculated the precision p, recall r, and F1-score f for each method, cf. Table 1.

Table 1: Precision p, recall r, and F1-measure f with regard to the baseline zb1 (left) and mr1 (right).

Overall, we find that the results are similar whether we used zbMATH or MR as the baseline in our evaluation. Therefore, we use zbMATH as the reference for the remainder of the paper. All data, including the test results using MR as the baseline, is available from https://automsceval.formulasearchengine.com.
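The titer run can be sketched as the following Scikit-learn pipeline. The vectorizer and classifier settings correspond to the defaults described above; the column names (title, text, mscs, coarse) and the assumption that the two-digit primary MSC subject numbers are precomputed in a coarse column are ours, not the authors' exact code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    # utf-8 input, lower-casing only, word unigrams of two or more
    # alphanumeric characters, float64 tf-idf vectors with l2-normalized rows
    TfidfVectorizer(lowercase=True, analyzer="word", ngram_range=(1, 1),
                    token_pattern=r"(?u)\b\w\w+\b", norm="l2"),
    # l2 penalty, 1e-4 tolerance, C=1.0, lbfgs solver, 100 iterations
    LogisticRegression(penalty="l2", tol=1e-4, C=1.0,
                       solver="lbfgs", max_iter=100),
)

# title, abstract text, and reference mscs concatenated per article
X_train = train["title"] + " " + train["text"] + " " + train["mscs"]
pipeline.fit(X_train, train["coarse"])  # primary MSC subject numbers
X_test = test["title"] + " " + test["text"] + " " + test["mscs"]
predicted = pipeline.predict(X_test)

The single-source and leave-one-out variants (refs, titls, texts, tite, tiref, teref) differ only in which of the three string sources enter the concatenation.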
Effect of mathematical expressions and part-of-speech tags:

By filtering out all mathematical expressions in the current production method uT1, in contrast to uM1, we obtained information on the impact of mathematical expressions on the classification quality. We found that the overall F1-score without mathematical expressions, f_uT1 = 64.5%, is slightly higher than the score with mathematical expressions, f_uM1 (also about 64%). Here, the main effect is an increase in recall from 63.9% to just over 64%. Additionally, a class-wise investigation showed that for most classes, uT1 outperformed uM1, cf. Figure 2. Exceptions are pMSCn 46 (Functional analysis) and 17 (Nonassociative rings and algebras), where the inclusion of math tags raised the F1-score slightly.

Figure 2: Mathematical expressions in title and abstract text do not improve the classification quality. Method uT1 = left bar; method uM1 = right bar.

We evaluated the effect of part-of-speech (POS) tagging by comparing tite with uM1: f_tite = 71.3% clearly outperformed f_uM1. This held true for all MSC subjects, cf. Figure 3. We also modified posLingue to output the POS-tagged text, used this text as input, and retrained the Scikit-learn classifier (tite2). However, this method did not lead to better results than tite.

Figure 3: Part-of-speech tagging for mathematics does not improve the classification quality. Method uM1 = left bar; method tite = right bar.
Effect of features and human baseline:

The newly developed combined method [10] works best in an approach that uses title, abstract text, and references (titer, f_titer = 77.2%). This method performs significantly better than methods that omit any one of these features. The best-performing single-feature method was refs (f_refs ≈ 74%), followed by texts (f_texts = 69.9%) and titls (f_titls ≈ 62%). When the approach excluded the title (teref, f ≈ 77%) or the abstract text (tiref, f ≈ 76%), the performance remained notably higher than when the approach excluded the reference mscs (tite, f_tite = 71.3%). Nevertheless, tite still outperformed the current production method ref1 (f_ref1 ≈ 65%), despite ignoring references. In conclusion, we can say that training a machine learning algorithm that weights all information from the fine-grained MSC codes of the references is clearly better than a majority vote of the references, cf. Figure 4.

Figure 4: The machine learning method (refs, left) clearly outperforms the current production method (ref1, right) when using references as the only source for classification.

Even the best-performing machine learning algorithm, titer with f_titer = 77.2%, does not reach the agreement of the human baseline mr1 with f_mr1 ≈ 81%. However, there is no ground truth that would allow us to determine which of the primary MSC subjects, from either MR or zbMATH, are truly correct. Assigning a two-digit label to mathematical research papers, which often cover overlapping themes and topics within mathematics, remains a challenge even to humans, who struggle to conclusively label publications as belonging to only a single class. While for some classes expert agreement is very high, e.g., for class 20 the agreement is 89.6% regarding the F1-score, for other classes it is considerably lower, cf. Figure 5. These discrepancies reflect the intrinsic problem that mathematics cannot be fully captured by a hierarchical classification system. The differences in classifications made by the two reviewing services likely also reflect an emphasis on different facets of evolving research, which often derives from differences in the reviewing culture.

Figure 5: For many pMSCn, the best automatic method (titer, right) gets close to the performance of the human baseline (mr1, left).

We also investigated the bias introduced by the non-random selection of the training set. Performing ten-fold cross validation on the entire dataset yielded an accuracy of f_titer,10 = 77.6% with a small standard deviation. Thus, the test set selection does not introduce a significant bias.

After having discussed the strengths and weaknesses of the individual methods tested, we now discuss how the currently best-performing method, titer, can be improved. One standard tool to analyze misclassifications is a confusion matrix, cf. Figure 6. In this matrix, off-diagonal elements indicate that two sets of classes are often mixed up by the classification algorithm; the x-axis shows the true labels, while the y-axis shows the predicted labels. The most frequent error of titer was that 68 (Computer science) was classified as 05 (Combinatorics). Moreover, 81 (Quantum theory) and 83 (Relativity and gravitational theory) were often mixed up. In general, however, the number of misclassifications was small, and there was no immediate action that one could take to avoid special cases of misclassification without involving a human expert.

Figure 6: Confusion matrix of titer.

Since titer outperforms both the text-based and the reference-based methods currently used at zbMATH, we decided to develop a RESTful API that wraps our trained model into a service. We use Python's FastAPI served by Uvicorn to handle higher loads. Our system is available as a Docker container and can thus be scaled on demand.
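A minimal sketch of such a wrapper is shown below; the endpoint name, model file, and response format are illustrative assumptions, not the exact production interface of the service.

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("titer_model.joblib")  # the trained pipeline from above

class Article(BaseModel):
    title: str
    text: str
    mscs: str

@app.post("/classify")
def classify(article: Article):
    # Concatenate the sources exactly as during training.
    source = f"{article.title} {article.text} {article.mscs}"
    scores = model.predict_proba([source])[0]
    ranked = sorted(zip(model.classes_, scores), key=lambda x: -x[1])
    # Return the most likely primary MSC subjects with confidence scores.
    return {"suggestions": [{"coarse": str(c), "score": float(s)}
                            for c, s in ranked[:5]]}

Started with, e.g., uvicorn automsc_service:app (the module name is hypothetical), the container can be replicated behind a load balancer to scale on demand.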
To simplify development and testing, we provide a static HTML page as a micro UI, which we call AutoMSC. This UI displays not only the most likely primary MSC subjects but also the less likely MSC subjects. We expect that our UI can support human experts, especially whenever the most likely MSC subject seems unsuitable. The result is displayed as a pie chart, cf. Figure 8, at https://automscbackend.formulasearchengine.com. To use the system in practice, an interface to the citation-matching component of zbMATH would be desirable, so that one can paste the actual references rather than the MSC subjects extracted from the references. Moreover, the precision-recall curve for titer (Figure 7) suggests that one can also select a threshold below which the system falls back to manual classification. For instance, if one requires a precision as high as that of the human classifications by MR, one would only need to consider suggestions with a score > 0.5. This would automatically classify 86.2% of the 135k articles being annually classified by subject experts at zbMATH/MR and would thus significantly reduce the number of articles that humans must manually examine, without a loss of classification quality. This is something we might develop in the future.

Figure 7: Precision-recall curve for titer.
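A sketch of this threshold selection: sort the test articles by the model's confidence score and take the lowest score at which the cumulative precision still meets the target (0.81 here corresponds to the human-baseline precision mentioned above; the arrays scores and correct are assumed to come from the evaluation run).

import numpy as np

def threshold_for_precision(scores, correct, target=0.81):
    # scores: confidence of the top suggestion per article (numpy array)
    # correct: whether that suggestion matched the manual label (boolean array)
    order = np.argsort(-scores)  # most confident first
    precision_at_k = np.cumsum(correct[order]) / np.arange(1, len(scores) + 1)
    ok = np.where(precision_at_k >= target)[0]
    if len(ok) == 0:
        return None, 0.0         # target precision unreachable
    k = ok[-1]                   # largest cutoff still meeting the target
    coverage = (k + 1) / len(scores)  # share classified automatically
    return scores[order][k], coverage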
Conclusion & Future Work
Returning to our research questions, we summarize our findings as follows.

First, we asked which metrics are best suited to assess classification quality. We demonstrated that the classification quality for the primary MSC subject can be evaluated with classical information retrieval metrics such as precision, recall, and F1-score. We share the observation of Barthel, Tönnies, and Balke [2] that the averages do not reflect the performance of outliers, cf. Figures 1-4. However, for our methods, the difference between the best- and worst-performing class was significantly smaller than reported by [2].

Second, we wanted to find out whether taking into account the mathematical formulae contained in publications could improve the accuracy of classifications. In accordance with [10], we did not find evidence that mathematical expressions improved pMSCn classification. However, we did not evaluate advanced encodings of mathematical formulae; this will be a subject of future work, cf. Figure 1.

Third, we evaluated the effect of POS preprocessing [11] and found that modern machine learning methods do not benefit from the POS-tagging-based model developed by [11], cf. Figure 2.

Fourth, we evaluated which features are most important for an accurate classification. We conclude that the references have the highest predictive power, followed by the abstract text and the title.

Finally, we evaluated the performance of automated methods in comparison to a human baseline. We found that our best-performing method achieves an F1-score of 77.2%. The manual classification is significantly better for most classes, cf. Figure 4. However, the self-reported confidence score can be used to reduce the manual classification effort by 86.2% without a loss in classification quality.

In the future, we plan to extend our automated methods to predict full MSC codes. Moreover, we would like to be able to assign pMSCn to document sections, since we realize that some research just does not fit into one of the classes. We also plan to extend the application domain to other mathematical research artifacts, such as blog posts, software, or dataset descriptions. As a next step, we plan to generate pMSCn from authors using the same method we applied to references. We speculate that authors will have a high impact on the classification, since authors often publish in the same field. For this purpose, we are leveraging our prior research on affiliation disambiguation, which could be used as a fallback method for junior authors who have not yet established a track record. Another extension is a better combination of the different features. Especially when performing research on full MSC code generation, we will need to use a different encoding for the MSC codes from references and authors. However, this new encoding requires more main memory for the training of the model and cannot be handled on a standard laptop. Thereafter, we will re-investigate the impact of mathematical formulae, since the inherently combined representation of text and formulae was not successful.

Our work represents a further step in the automation of Mathematics Subject Classification and can thus support reviewing services such as zbMATH and Mathematical Reviews. For accessible exploration, we have made the best-performing approaches available in our AutoMSC implementation and have shared our code on our website. We envision that other application domains requiring an accurate labeling of publications according to the Mathematics Subject Classification, for example research paper recommendation systems or reviewer recommendation systems, will also be able to benefit from this work. AutoMSC delivers results comparable to human experts in the first stage of MSC labeling, without requiring manual labor or trained experts. In the future, zbMATH will use our new method for all journals that previously employed the method by Schöneberg and Sperber [11] introduced in 2014.
Acknowledgments:
This work was supported by the German Research Foundation (DFG grant GI 1259-1). The authors would like to express their gratitude to Felix Hamborg and Terry Ruas for their advice on the most recent machine learning technology.
References

[1] A. Bannister et al. "Editorial: On the Road to MSC 2020". In: EMS Newsletter.
[2] S. Barthel, S. Tönnies, and W. Balke. "Large-Scale Experiments for Mathematical Document Classification". In: Proc. Digital Libraries: Social Media and Community Networks, ICADL 2013. Vol. 8279. Springer, 2013, pp. 83-92.
[3] T. Bouche and O. Labbe. "The New Numdam Platform". In: Proc. CICM. Ed. by H. Geuvers et al. Vol. 10383. Springer, 2017, pp. 70-82.
[4] I. Evans. "Semi-supervised topic models applied to mathematical document classification". PhD thesis. University of Bath, Somerset, UK, 2017.
[5] P. Ion and W. Sperber. "MSC 2010 in SKOS - the transition of the MSC to the semantic web". In: Eur. Math. Soc. Newsl. 84 (2012), pp. 55-57.
[6] A. Kühnemund. "The role of applications within the reviewing service zbMATH". In: PAMM. doi: 10.1002/pamm.201610459.
[7] H. Mihaljević-Brandt and O. Teschke. "Journal profiles and beyond: what makes a mathematics journal 'general'?" In: Eur. Math. Soc. Newsl. 91 (2014), pp. 55-56.
[8] F. Pedregosa et al. "Scikit-learn: machine learning in Python". In: J. Mach. Learn. Res. 12 (2011), pp. 2825-2830.
[9] R. Rehurek and P. Sojka. "Automated Classification and Categorization of Mathematical Knowledge". In: Proc. CICM. Ed. by S. Autexier et al. Vol. 5144. Springer, 2008, pp. 543-557.
[10] P. Scharpf et al. "Classification and Clustering of arXiv Documents, Sections, and Abstracts Comparing Encodings of Natural and Mathematical Language". In: Proc. ACM/IEEE JCDL. 2020.
[11] U. Schöneberg and W. Sperber. "POS Tagging and Its Applications for Mathematics - Text Analysis in Mathematics". In: Proc. CICM. 2014.
[12] M. Schubotz and O. Teschke. "Four decades of TeX at zbMATH". In: Eur. Math. Soc. Newsl. 112 (2019), pp. 50-52.
[14] P. Sojka and R. Rehurek. "Classification of Multilingual Mathematical Papers in DML-CZ". In: Proc. RASLAN 2007. Masaryk University, 2007, pp. 89-96.
[15] P. Sojka et al. "Quo Vadis, Math Information Retrieval". In: Proc. RASLAN 2019, Karlova Studanka, Czech Republic, December 6-8, 2019. Ed. by A. Horák, P. Rychlý, and A. Rambousek. Tribun EU, 2019, pp. 117-128.
[16] T. Suzuki and A. Fujii. "Mathematical Document Categorization with Structure of Mathematical Expressions". In: Proc. ACM/IEEE JCDL. IEEE Computer Society, 2017, pp. 119-128. doi: 10.1109/JCDL.2017.7991566.

Listing 1: Use the following BibTeX code to cite this article:

@inproceedings{Schubotz2020b,
  author    = {Moritz Schubotz and Philipp Scharpf and Olaf Teschke and
               Andreas K\"uhnemund and Corinna Breitinger and Bela Gipp},
  title     = {AutoMSC: Automatic Assignment of Mathematics Subject
               Classification Labels},
  booktitle = {Proceedings of the 13th Conference on Intelligent Computer
               Mathematics},
  date      = {2020},
}