EXmatcher: Combining Features Based on Reference Strings and Segments to Enhance Citation Matching
Behnam Ghavimi, Wolfgang Otto, and Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences [email protected]
Abstract.
Citation matching is a challenging task due to problems such as the variety of citation styles, mistakes in reference strings and the quality of identified reference segments. The classic citation matching configuration used in this paper is the combination of a blocking technique and a binary classifier. Three different inputs (reference strings, reference segments and a combination of reference strings and segments) were tested to find the most efficient strategy for citation matching. For the classification step, we describe the effect which the probabilities of reference segments can have on citation matching. Our evaluation on a manually curated gold standard showed that input data consisting of the combination of reference segments and reference strings leads to the best result. In addition, using the probabilities of the segmentation slightly improves the result.
Keywords:
Citation Matching · Blocking · Classification · Support Vector Machines · Random Forest · Evaluation
The need for data integration from various sources is growing in information systems in order to improve data quality and the reusability of data, e.g. for retrieval or data analysis. The procedure of finding records in a database that correspond to the same entity (e.g. files, publications, data sets, ...) across another data set is typically called record linkage. Record linkage has been used in different domains [4,17]. The application of record linkage in the domain of bibliographic data is known as citation matching or reference matching. High-quality citation data for research publications is the basis for areas like bibliometrics but also for integrated digital libraries (DL). Citation data are valuable since they show the linkage between publications. The extraction of reference information from full text is called citation extraction. One key challenge for the aforementioned tasks is to match the extracted reference information to given DLs. The process of mapping an extracted reference string to one entity of a given DL is called citation matching [24]. Proper citation matching is an essential step for every citation analysis [29], and the improvement of citation matching leads to a higher quality of bibliometric studies. In the DL context, citation data is one important source for effective information retrieval, recommendation systems and knowledge discovery processes [26].

Despite the widely acknowledged benefits of citation data, open access to them is still insufficient. Some commercial companies such as Clarivate Analytics, Elsevier or Google possess citation data at large scale and use them to provide services for their users. Recently, some initiatives and projects, e.g. the "OpenCitations" project or the "Initiative for Open Citations" (https://i4oc.org/), focus on publishing citation data openly. The "Extraction of Citations from PDF Documents" (EXCITE) project (http://excite.west.uni-koblenz.de/website/) is one of these projects.
The aim of EXCITE is extracting and matching citations from social science publications [22] and making more citation data available to researchers. With respect to this objective, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. The shortage of citation data for the international and German social sciences is well known to researchers in the field and has itself often been the subject of academic studies [29].

This paper is dedicated to the step of citation matching in the EXCITE pipeline; the algorithm responsible for this task is called EXmatcher (https://github.com/exciteproject/EXmatcher). For the matching task in EXCITE, different target databases/DLs are defined: a) sowiport [18], b) GESIS Search (https://search.gesis.org/) and c) Crossref (https://search.crossref.org). The matching target for the study in this paper is solely sowiport. Sowiport contains bibliographic metadata records of more than 9 million references on publications and research projects in the social sciences.

This paper makes the following contributions:
– Introduction of a gold standard for the citation matching task,
– Evaluation of the effect of different inputs in the citation matching steps, and
– Investigation of the effect of utilizing reference segmentation probabilities as features in the citation matching procedure.

The remainder of this paper is structured as follows. In section 2, we organize the related work around the concepts of the record linkage pipeline known from [4]. Section 3 describes the set-up of citation matching in the EXCITE project. Section 4 is about creating a citation matching gold standard corpus and the evaluation of our algorithm with different configurations. Finally, section 5 summarizes the key outcomes of our improvements on citation matching.
Christen et al. [4] suggested general steps for the matching process after reviewing different matching approaches: (1) input pre-processing, (2) blocking technique, (3) feature extraction and classification. EXmatcher also follows these steps for citation matching and considers different input configurations to investigate their effects. In the following, we organize the related work according to these steps.
As the first step, input data need to be pre-processed in a way that makes them suitable for the matching algorithm. To identify similar strings during all parts of the matching process, a common method to increase robustness is to normalize the input strings. A simple normalization is to lowercase the input string and remove punctuation and stop words.

If an algorithm depends on reference segments for matching, these data need to be extracted from the reference strings. PDFX [7], Exparser [1], GROBID [25] and ParsCit [8] are a few examples of tools that perform reference segmentation. Wellner et al. [36] investigated the effect of extraction probabilities on citation matching by considering different numbers of best Viterbi segmentations. EXmatcher considers only the best Viterbi segmentation and uses the probability of each segment in the feature vector provided to a binary classifier for the citation matching task.

Phonetic functions are another technique used in this step; the common idea behind all phonetic encoding functions is that they attempt to convert a string into a code based on pronunciation [3]. Phonetic algorithms are mainly used for name segments. Pre-processing functions can also be used in other steps. For example, data which have been prepared by phonetic functions can be used as blocking keys in the indexing step, since indexing brings similar values together. These techniques can also be used in the feature extraction step to generate vectors of features for classifiers. This encoding process is often language dependent. The Soundex algorithm was developed by Russell and Odell in 1918 [31] for English-language pronunciation. Phonex [23], NYSIIS [34], and the Cologne function [33] are some other examples of phonetic functions. Cologne phonetics is based on the Soundex phonetic algorithm and is optimized for the German language. We used the Cologne phonetic function in our implementation since our main focus was on German-language papers.
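The simple normalization described above (lowercasing, removing punctuation and stop words) can be sketched as follows; the stop-word list here is illustrative, not the one used in EXmatcher:

```python
import re

# Illustrative stop-word list; the actual list used in EXmatcher is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}

def normalize(text: str) -> str:
    """Lowercase a string, replace punctuation with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

Applying the same function to both the extracted reference and the database record before comparison makes string matching robust against casing and punctuation differences.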
The next step is a blocking technique to decrease the number of pairs that need to be compared. Imagine we need to match a set of n references extracted from publications against a bibliographic database with m entries. In a naive approach, every reference is compared with every entry in the database, which results in a complexity of n × m. Considering a set of 100,000 references and a database with 10 million bibliographic entries, this results in 10^12 comparisons. In the blocking approach, we split target and source into blocks of data depending on a common attribute or a combination of attributes. After finding corresponding blocks in source and target, we reduce the number of necessary comparisons to the number of combinations between corresponding blocks. For example, if a reference has "2001" as its publication year, it is not necessary to compare this reference with all records in the target database; we only need to compare it to the entries in the block of records published in 2001. In related work (e.g., [32]), blocks covering one year before and after the publication year are also used.

Several blocking or indexing techniques have been introduced so far [4]. As an example, D-Dupe [20] is a tool implemented for data matching and network visualization. D-Dupe implements an indexing technique based on standard blocking [12]. Hernandez et al. suggested a sorted neighborhood approach [15,16]. Instead of generating a key for each block, this technique sorts the data for matching based on a 'sorting key'. In suffix- or q-gram-based indexing approaches there is a higher chance of having correct matches in the same block, since these approaches are designed to handle different forms of entities and errors. In the citation matching field, Fedoryszak et al. presented a blocking method based on hash functions [10]. Another line of research deals with the identification of efficient blocking keys. Koo et al. tried to find the best combination of citation record fields [21] that helps increase citation matching performance.
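Year-based blocking with the ±1-year tolerance mentioned above (e.g., in [32]) can be sketched as follows; the record field names are illustrative:

```python
from collections import defaultdict

def build_year_blocks(records):
    """Index database records by publication year (the blocking key)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["year"]].append(rec)
    return blocks

def candidates_for(reference_year, blocks, tolerance=1):
    """Return all records whose year lies within +/- tolerance of the reference year."""
    candidates = []
    for year in range(reference_year - tolerance, reference_year + tolerance + 1):
        candidates.extend(blocks.get(year, []))
    return candidates
```

Only the candidates returned for a reference are passed to the pairwise comparison step, which reduces the n × m comparison space to the sizes of the corresponding blocks.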
In the third step of the citation matching process, each candidate record pair (i.e., a reference (string and related segments) and each item retrieved by blocking) is compared using a variety of attributes and comparison functions. The output of this step is a feature vector for each pair. In the final step, each compared candidate record pair is classified into one of the classes (i.e., match, non-match) using the related feature vector.

Comparison functions such as Jaro-Winkler, Jaccard, or Levenshtein are often used for analyzing textual values. As an example, the D-Dupe tool includes string comparison functions such as Levenshtein distance, Jaro, Jaccard, and Monge-Elkan [20].

For the classification step of citation matching, the reference can be represented by a reference string or by extracted segments. A combination of both is also possible, as we show in this work.

Foufoulas et al. [14] suggested an algorithm which matches reference strings without reference segmentation. Their approach first tries to detect the reference section by some heuristics and then attempts to identify the title of a record in the target repository within the reference section. Finally, it validates this match with more metadata of the record in the target repository. Their title detection and citation validation steps are mostly based on combinations of simple search and comparison functions. One classification approach is threshold-based: the similarity between the vectors of two items is calculated (e.g., using cosine similarity) and, if the similarity score is higher than a predefined threshold, the two items are matched.

A rule-based classification employs a set of rules for classification [6,16,30]. These rules consist of a combination of smaller parts, and the links between these parts are the logical "AND", "OR" and "NOT" operators. These rules define the similarity of pairs. In the optimal case, each rule in a set of rules should have high precision and recall [27]. Stricter or more specific rules usually have high precision, while general or simple rules often have low precision but high recall. The iterative, rule-based citation matching algorithm of CWTS (Center for Science and Technology Studies) [32] relies on a series of matching rules. These rules are applied iteratively in decreasing order of strictness. The citation matching algorithm starts with the most restrictive matching rules (e.g., exact match on first author, publication year, publication title, volume number, starting page number, and DOI). Afterwards, it proceeds with less restrictive matching rules (e.g., match on the Soundex encoding of the last name of the first author, publication year plus or minus one, volume number, and starting page number). The less restrictive matching rules allow for various types of inaccuracies in the bibliographic fields of cited references. In all rules, the Levenshtein distance is used to match the publication name of a cited reference to the publication name of a cited article.

Viewing probabilistic record linkage from a Bayesian perspective has also been discussed by Fortini et al. [13] and Herzog et al. [17]. If training data are available, a supervised classification approach can be employed. Many binary classification techniques have been introduced [27,28], and many of these techniques are used for matching. The decision tree is one of these supervised classification techniques [27]. As an example, Cochinwala et al. [5] built a training set and trained a regression tree (CART) classifier [2] for data matching. The TAILOR tool [9] for data matching uses, e.g., an ID3 decision tree.

The Support Vector Machine (SVM) classification algorithm [35] is based on the idea of mapping the input data of the classifier into a higher-dimensional vector space using a kernel function.
This is done to be able to separate samples of the target classes using a hyperplane even if this is not possible in the lower dimension. As a large-margin classifier, the SVM optimizes during training by maximizing the distance between the training samples and the hyperplane. Fedoryszak et al. [11] presented a citation matching solution using Apache Hadoop. Their algorithm is based on reference segments and also uses the SVM algorithm to confirm the status of item pairs (i.e., match or non-match) based on the created features.
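The threshold-based classification mentioned earlier in this section can be sketched as follows; the feature vectors and the threshold value are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_match(vec_ref, vec_candidate, threshold=0.8):
    """Classify a pair as a match when similarity exceeds a predefined threshold."""
    return cosine_similarity(vec_ref, vec_candidate) >= threshold
```

In contrast to a trained classifier, the threshold here has to be chosen by hand, which is why supervised approaches such as the SVM are preferred when training data are available.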
For matching we used two types of information from each reference: the raw reference string and structured information (i.e. segments). The segmentation is done with Exparser (https://github.com/exciteproject/Exparser), a CRF-based algorithm [1]. The output of Exparser includes a probability for each predicted segment. This is taken into account as additional information to enhance the results of the matching procedure. To enhance the results for the publication year information, we extract year mentions from the raw reference strings with a regular expression, independently from the parser. We also remove extra characters in the year segment (e.g. the b in 1989b). As a last pre-processing step we have combined volume and issue into one segment called number, because during parsing the issue was often recognized as a volume and vice versa.
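The year extraction and cleaning just described might look like the following sketch; the exact regular expression used in EXmatcher is not given in the paper, so the pattern here is an assumption:

```python
import re

def extract_year(reference_string):
    """Find a four-digit year mention, dropping a trailing letter (e.g. '1989b' -> '1989')."""
    m = re.search(r"\b(1[5-9]\d{2}|20\d{2})[a-z]?\b", reference_string)
    return m.group(1) if m else None
```

Running this independently of the parser means a usable publication year is available even when the segmenter mislabels the year token.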
We used the search platform Solr (http://lucene.apache.org/solr/) for blocking. For each reference, EXmatcher retrieves the corresponding block with the help of blocking queries. The whole blocking procedure is described in Algorithm 1.

Algorithm 1: Blocking step for matching in EXmatcher
  Data: pre-processed reference r, indexed bibliographic database D, cutoff parameter c
  Result: a set of suggested matching records S
  Generate a query set Q based on segments or the reference string;
  Initialize an empty suggestion set S;
  foreach query q in query set Q do
      Retrieve ranked result list Rl with query q;
      if size of result list Rl ≥ c then
          Cut off ranked result list at position c;
      end
      Join reduced list Rl to S;
  end

First, queries are formulated with the help of the parsed segments and the raw reference strings. For this we used the operators OR and AND from the Solr query syntax (https://lucene.apache.org/solr/guide/6_6/query-syntax-and-parsing.html). Additionally, we use fuzzy search (the ∼ operator), which performs a fuzzy string similarity search based on the Levenshtein distance.

The output of the blocking step is a ranked list of items retrieved from the target database. The items are ranked by the Lucene score based on tf-idf (https://lucene.apache.org/core/7_0_0/core/index.html?org/apache/lucene/search/similarities/TFIDFSimilarity.html). To get the best trade-off between retrieving all possible matching items and reducing the number of necessary comparisons in the following classification task, we identified two points of influence. One is varying the query and selecting the best query formulation. The other is the selection of a cut-off threshold which determines how many of the retrieved items per query are used for further processing.

As mentioned, queries are first generated from segment combinations. For six segments (i.e., 1-Author, 2-Title, 3-Year, 4-Page, 5-Number (Volume/Issue), 6-Source) this results in a maximum of 63 segment combinations. Each query generated from one of these combinations requires correct information about each of the queried segments. For example, if the year of publication and the authors' names are used, one of the author names has to be correct and so does the year of publication. For title and source we used a fuzzy query on the whole segment string. For numbers, at least one found number has to be in the volume or the issue field of the record in our database. To exclude poorly performing segment combinations for query generation, we measure the precision at one of the queries on our gold data. We only select segment combinations where at least 60% of the retrieved items are a correct match.
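Assuming a generic `search(query)` function that returns a ranked result list, the blocking loop of Algorithm 1 can be sketched as:

```python
def blocking(queries, search, cutoff):
    """Collect up to `cutoff` top-ranked records per query into one suggestion set."""
    suggestions = {}                      # record id -> record; deduplicates across queries
    for q in queries:
        ranked = search(q)                # ranked result list for query q
        for rec in ranked[:cutoff]:       # cut off the ranked list at position `cutoff`
            suggestions[rec["id"]] = rec
    return list(suggestions.values())
```

The `search` callable stands in for a Solr request; the sketch only illustrates how the per-query result lists are truncated and joined into the suggestion set.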
This reduces the maximum number of combinations we consider for query generation by 25%, to 48.

As an alternative strategy, we generated queries only from the reference strings, without using information from the segmentation. This strategy tries to deal with the problem that title information is often not correctly identified during segmentation. Since the title is the most effective field for matching, the following approach is used, which can act independently of the quality of the segmentation. We consider all tokens of the reference string as potentially containing title information. The idea is to formulate a bigram search over the whole reference string. The resulting query returns results which contain at least one bigram of the reference string in the title field; the more bigrams of the reference string are contained in the title, the more a result is preferred. Therefore a query based only on these bigrams of the reference string is added to the set of queries. In addition, to increase precision, a query based on the year and the bigrams of the reference string is also considered. For this, the year information extracted with a regular expression is taken into account.

The effect on blocking of the two query generation strategies, as well as a comparison with a mixture of both strategies, is described in section 4.2.

After retrieving match candidates with our blocking procedure, we need to decide which of the found candidates our system identifies as a match, i.e. whether a retrieved item is a match and hence the reference and the entry in the database represent the same entity. For this we train and evaluate a binary classifier which is able to judge a pair of reference and match candidate as match or non-match. It is worth noting that our approach is able to handle duplicates in our reference database. The crucial step for building this classifier is feature selection.
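The bigram-based query strategy described earlier in this section can be sketched as follows; tokenization by whitespace is an assumption:

```python
def reference_bigrams(reference_string):
    """All adjacent token pairs of a reference string, as candidate title bigrams."""
    tokens = reference_string.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
```

Each of these bigrams would then be searched against the title field, so that records containing more of the reference's bigrams in their title are ranked higher.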
We combine features generated from the raw reference string and from the segmentation. One novelty of our approach is to test the usefulness of utilizing the certainty of our parser about the detected segments as an additional input feature for our classifier. The output of Exparser contains, for each token of every segment, a probability value reflecting the certainty of the model. If we have a high probability for a segment, the chance of a wrongly predicted label is low. Therefore, we expect that the usage of features reflecting these probabilities will have a noticeable effect on the performance of citation matching.

The first group of features is based on the comparison of the reference segments and the items retrieved in the blocking step:
– Some examples of features based on the author segment:
  1. Levenshtein score (phono-code and exact),
  2. Segmentation probability of the first author (surname)
– Some examples of features based on title and source:
  1. Jaccard score (including segmentation probabilities),
  2. Levenshtein score (token and letter level)
– Some examples of features based on numbers, pages, and publication year:
  1. Jaccard score, and
  2. Segmentation probability

An example of the usage of the probability is the extended version of the Jaccard score for author names. The Jaccard similarity for last names is the size of the intersection of last names over the size of the union of the sets of last names in two records. If the size of the intersection of the last names of two records is 2 and the size of their union is 4, then the Jaccard score would be 0.5 ((1+1)/4). Our enhanced metric uses the extracted probabilities as weights in the intersection. If the probabilities of the items in the intersection of last names are 0.8 and 0.9, then the new Jaccard score would be 0.42 ((0.8+0.9)/4).

For the creation of the features of the second group, all information based on segmentation is excluded. These features are based only on the comparison of the raw reference string with the information of the retrieved record. Some examples of this group are:
1. Longest common substring of title and reference string, and
2. Occurrence in the reference string of the abbreviation of the source field (e.g., journal abbreviation) in the index.
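The probability-weighted Jaccard score from the example above (probabilities 0.8 and 0.9 over a union of size 4) can be sketched as:

```python
def weighted_jaccard(ref_names, ref_probs, cand_names):
    """Jaccard over last names, weighting intersection items by segmentation probability."""
    cand = set(cand_names)
    union_size = len(set(ref_names) | cand)
    if union_size == 0:
        return 0.0
    # Sum the segment probabilities of the names that also appear in the candidate record.
    weight = sum(p for name, p in zip(ref_names, ref_probs) if name in cand)
    return weight / union_size
```

With all probabilities equal to 1.0 the function reduces to the plain Jaccard score, so the weighting only lowers the score when the parser was uncertain about a name segment.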
The computation of offline evaluation metrics such as precision, recall and F-measure needs ground truths. A manually checked gold standard was generated to assess the performance of the algorithms. To create this gold standard, we applied a simple matching algorithm based on blocking to a randomly selected set of reference strings from the EXCITE corpus. The EXCITE corpus contains the SSOAR corpus (about 35k documents), the SOJ: Springer Online Journal Archives 1860-2001 corpus (about 80k), and sowiport papers (about 116k). We used queries based on different combinations of the title, author and publication year segments and considered the top hit in the retrieved blocks based on the Solr score. The result was a document id (from sowiport) for each reference for which the approach could find a match. In the second step, the ids detected by the matching algorithm were complemented with duplication information from sowiport to reach a list of all candidate match items for each reference.

Afterwards, a trained human assessor checked the results. If the previous step led to an empty result set, the assessor was asked to retrieve a record based on manually curated search queries. These manual queries used the information from the correct segments, extracted manually from the reference strings. If the corresponding item was found, it was added to the gold standard. It also happened that not only one match was found but also duplicates; in this case the duplicates were also added as matching items. When matching items were found in the previous step, the assessor checked this list to remove wrong items and add missing items. The result of this process is a corpus containing 816 reference strings, 517 of which have at least one matched item in sowiport. We published this corpus and a part of the sowiport data (18,590 bibliographic items) openly for interested researchers in our GitHub repository (https://github.com/exciteproject/EXgoldstandard/tree/master/Goldstandard_EXmatcher).
In this evaluation, three different configurations for the input of blocking (i.e., 1) using only reference strings, 2) using only reference segments, and 3) the combination of reference segments and strings) were examined. In addition, the effect of considering different numbers of top items from the blocking step was checked. Fig. 1 shows that the precision curve of blocking based on reference strings is higher than for the two other configurations. This is not a big surprise, because using only reference strings in our approach means focusing on the title and year fields (as explained in sections 3.1 and 3.2), and the usage of these two fields has a high precision in retrieving items.

On the one hand, by considering more items of the blocking list the precision decreases. On the other hand, the recall shown in Fig. 2 reaches a score higher than 0.9 after considering the top 4 items of blocking. The highest recall has been achieved using the combination of reference strings and segments. Surprisingly, the curve for reference strings comes closer to that of the combination of reference strings and segments as more top items are considered in blocking, and almost reaches it at 14 items. All three curves become almost steady after considering the top 11 retrieved items for each blocking query.

Since we have another step after blocking which improves precision, the important point in blocking is to keep the recall score high while at the same time shrinking the number of items for comparison. The precision of the three curves was not significantly different; therefore, the combination of reference strings and segments was picked in the blocking step to generate the input for the evaluation of the classification step.

Fig. 1. Precision of blocking

Fig. 2. Recall of blocking

For the number of top items in blocking which are used for further processing in our pipeline, five is selected, because considering more than five items does not lead to a higher recall value. The selected configuration leads to between 1 and 39 retrieved items per reference. The average number was 14 records, with a standard deviation of 6.5.

For the 816 references of the gold standard, 10,997 match candidates are generated with our configuration. For each pair of a reference and a corresponding match candidate in our reference database sowiport, we know whether it is a match or not based on our gold standard. Of these 10,997 pairs, 1,026 (9.3%) are correct matches and 9,971 (90.7%) are non-matches. After blocking, the number of reference strings which have at least one correct match is 507, and 302 references are without any correct pair. This means that only ten references (1.2%) which have at least one match in the gold standard could not pass blocking successfully, i.e. the blocking step could not suggest any correct match for them.
In this section we present the results of the classification task. We applied ten-fold cross validation for testing different classifier and feature combinations. The blocking results for 809 references were split into ten separate groups, and their related pairs were placed in the related group to form the ten folds for cross validation. Table 1 contains precision, recall and F-measure for the compared configurations. The results show that the SVM classifier using the combination of reference string and segment features, with consideration of the segmentation probability, has the highest F1 and precision scores. The second-highest F1 score belongs to the SVM classifier which uses the combined input features but this time without the segmentation probabilities. The interpretation of these data is that using the combination of inputs (reference segments and strings) has the main impact on the accuracy scores. The average number of references in the different folds, grouped by the number of correct predictions of the classifier with the highest F1 score, is shown in Fig. 4.

In most real-world scenarios it is only necessary to find exactly one match in a bibliographic database. Because of this we evaluate our matching algorithm
Ref String  Ref Segments  Seg probability  Classifier  Precision  Recall  F1
    ✓            ✓              ✓            SVM        0.947*     0.904   0.925*
    ✓            ✓              –            SVM        0.941      0.908*  0.924
    –            ✓              ✓            SVM        0.942      0.865   0.901
    –            ✓              –            SVM        0.836      0.869   0.852
    ✓            –              –            SVM        0.843      0.903   0.871

Table 1. Evaluation macro-metrics of different classifiers including duplicate matches for each reference string; the highest value in each column is marked by the * symbol.
Fig. 3. Frequency of references in the gold standard by number of matches in the target database sowiport
Fig. 4. Average number of references in folds with true prediction of the match class

again. For this evaluation, only one correct match has to be found for each reference. For this purpose, we pick the candidate with the highest probability generated by the classifier among the match pairs for each reference (the threshold for the decision between the two classes is 0.5, the default threshold for the SVM classifier in the scikit-learn Python package). For this evaluation, we used the combination of features based on reference strings and segments as the input (including segment probabilities). In this case, the average precision and recall scores for the SVM algorithm are 0.97 and 0.92. For the random forest algorithm, the average precision and recall scores are 0.96 and 0.93. To calculate a final score for the complete pipeline, the 10 references which could not pass the blocking step also have to be considered. Since the consideration of these items changes the number of false negatives, we see the effect on the recall score. Consequently, the recall of the pipeline with SVM would be 0.913 and of the pipeline using the random forest classifier 0.917. These evaluation scores are included in Table 2.
Table 2.
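Picking exactly one match per reference by the highest classifier probability, as described above, might be sketched as follows; the 0.5 threshold follows the scikit-learn default mentioned in the text:

```python
def best_match(candidate_pairs, threshold=0.5):
    """From (candidate_id, match_probability) pairs, return the id of the most
    probable match above the decision threshold, or None if there is none."""
    above = [(cid, p) for cid, p in candidate_pairs if p >= threshold]
    if not above:
        return None
    return max(above, key=lambda pair: pair[1])[0]
```

This post-processing turns the binary per-pair decisions into the one-match-per-reference setting evaluated here.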
In this paper, we explained our approach for handling the task of citation matching in the EXCITE project. The implemented algorithm (EXmatcher) follows the classic solution for this task, which consists of three steps (i.e., 1) data normalization and cleaning, 2) blocking, 3) feature vector creation and classification). We analyzed the impact of different inputs (i.e., reference strings, segments and the combination of both) on the performance of our citation matching algorithm. In addition, we investigated the benefit of using segment probabilities in the citation matching task. The segmentation probabilities are used both directly and as weights for creating specific features for the classifier of EXmatcher.

Using the combination of reference strings and segments as input with an SVM classifier outperforms the other configurations in terms of F1 and precision scores. Segment probabilities have a good impact on the precision score when the citation matching algorithm uses only segments as input. For example, in the configuration using only segments as input with an SVM, segment probabilities improve the precision by about 11% (Table 1). The combination of reference strings and segments can also absorb the effect of considering segment probabilities: including or excluding the segment probabilities does not affect the accuracy when the citation matching algorithm uses the combination of both input types. The effect of utilizing different classifiers on the results depends strongly on other parameters of the citation matching configuration, such as the input types (i.e. reference strings, segments or both) and the consideration of segment probabilities.

The combination of reference strings and segments as input for citation matching shows a higher recall than using either of them alone. But still, 10 references which have at least one match could not pass the blocking step when using the combination of both. One reason for this is that, when generating queries, EXmatcher combines the information from reference strings and from reference segments in one query and links them with the logical OR operator. Decreasing the number of failures in the blocking step leads to a higher recall. One solution could be to send queries based on reference strings and queries based on segments separately and then let the algorithm combine the retrieved items. Also, more items can be extracted from the reference string input (such as pages, issue, volume and DOI) with some rule-based steps and used in blocking.

The citation matching approach which has been described and evaluated in this paper is implemented in a demonstrator which connects all important steps from reference extraction, reference segmentation and matching in the EXCITE toolchain (see [19], http://excite.west.uni-koblenz.de/excite).

References
1. Boukhers, Z., et al.: An end-to-end approach for extracting and segmenting high-variance references from PDF documents. In: Proc. of JCDL 2019. ACM (2019)
2. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and regression trees. Group (15), 237–251 (1984)
3. Christen, P.: A comparison of personal name matching: Techniques and practical issues. In: Data Mining Workshops, ICDM Workshops 2006. pp. 290–294. IEEE (2006)
4. Christen, P.: Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media (2012)
5. Cochinwala, M., Kurien, V., Lalk, G., Shasha, D.: Efficient data reconciliation. Information Sciences (1-4), 1–15 (2001)
6. Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems (TOIS) (3), 288–321 (2000)
7. Constantin, A., Pettifer, S., Voronkov, A.: PDFX: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the 2013 ACM Symposium on Document Engineering. pp. 177–180. ACM (2013)
8. Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: An open-source CRF reference string parsing package. In: LREC. vol. 8, pp. 661–667 (2008)
9. Elfeky, M.G., Verykios, V.S., Elmagarmid, A.K.: TAILOR: A record linkage toolbox. In: Proc. of the 18th International Conference on Data Engineering. pp. 17–28. IEEE (2002)
10. Fedoryszak, M., Bolikowski, Ł.: Efficient blocking method for a large scale citation matching. D-Lib Magazine (11/12) (2014)
11. Fedoryszak, M., Tkaczyk, D., Bolikowski, Ł.: Large scale citation matching using Apache Hadoop. In: Proc. of TPDL 2013. pp. 362–365. Springer (2013)
12. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association (328), 1183–1210 (1969)
13. Fortini, M., Liseo, B., Nuccitelli, A., Scanu, M.: On Bayesian record linkage. Research in Official Statistics (1), 185–198 (2001)
14. Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., Ioannidis, Y.: High-pass text filtering for citation matching. In: Proc. of TPDL 2017. pp. 355–366. Springer (2017)
15. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: ACM SIGMOD Record. vol. 24, pp. 127–138. ACM (1995)
16. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery (1), 9–37 (1998)
17. Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data quality and record linkage techniques. Springer Science & Business Media (2007)
18. Hienert, D., et al.: Digital library research in action - supporting information retrieval in Sowiport. D-Lib Magazine (3/4) (2015). https://doi.org/10.1045/march2015-hienert
19. Hosseini, A., Ghavimi, B., Kern, D., Mayr, P.: EXCITE - A toolchain to extract, match and publish open literature references. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries 2019. ACM (2019), https://philippmayr.github.io/papers/JCDL2019-EXCITE-demo.pdf
20. Kang, H., et al.: Interactive entity resolution in relational data: A visual analytic tool and its evaluation. IEEE Transactions on Visualization and Computer Graphics (5), 999–1014 (2008)
21. Koo, H.K., Kim, T., Chun, H.W., Seo, D., Jung, H., Lee, S.: Effects of unpopular citation fields in citation matching performance. In: Proc. of ICISA 2011. pp. 1–7. IEEE (2011)
22. Körner, M., et al.: Evaluating reference string extraction using line-based conditional random fields: A case study with German language publications. In: New Trends in Databases and Information Systems, vol. 767, pp. 137–145. Springer International Publishing (2017)
23. Lait, A., Randell, B.: An assessment of name matching algorithms. Technical Report Series, University of Newcastle upon Tyne Computing Science (1996)
24. Lawrence, S., Giles, C.L., Bollacker, K.: Digital libraries and autonomous citation indexing. Computer (6), 67–71 (1999)
25. Lopez, P.: GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Proc. of TPDL 2009. pp. 473–474. Springer (2009)
26. Mayr, P., et al.: Introduction to the special issue on Bibliometric-Enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL). International Journal on Digital Libraries (2-3), 107–111 (2018)
27. Mining, W.I.D.: Data mining: Concepts and techniques. Morgan Kaufmann (2006)
28. Mitchell, T.M.: Artificial neural networks. Machine Learning, 81–127 (1997)
29. Moed, H.F.: Citation analysis in research evaluation, vol. 9. Springer Science & Business Media (2006)
30. Naumann, F., Herschel, M.: An introduction to duplicate detection. Synthesis Lectures on Data Management (1), 1–87 (2010)
31. Odell, M., Russell, R.: The Soundex coding system, 1918. US Patents (1918)
32. Olensky, M., et al.: Evaluation of the citation matching algorithms of CWTS and iFQ in comparison to the Web of Science. Journal of the Association for Information Science and Technology (10), 2550–2564 (2016)
33. Postel, H.J.: Die kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten 19