Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets
Athar Sefid, Jian Wu, Allen C. Ge, Jing Zhao, Lu Liu, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles
Computer Science and Engineering, Pennsylvania State University, University Park, PA 16801
Information Sciences and Technology, Pennsylvania State University, University Park, PA 16801
Computer Science, Old Dominion University, Norfolk, VA 23529
Computer Science, University of Illinois at Chicago, Chicago, IL 60607
[email protected], [email protected]

Abstract
Automatically extracted metadata from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution, which is based on information retrieval and string similarity on titles, works well only if the titles are clean. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find matching candidates from the reference data, which has been indexed by ElasticSearch. The core components use supervised methods that combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citations achieves high accuracy that significantly outperforms the baseline method on the same test dataset. We apply this system to match the database of CiteSeerX against Web of Science, PubMed, and DBLP. This method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
Introduction
Since the advent of Scholarly Big Data (SBD) (Giles 2013), there has been a growing interest in topics related to this big data instance, such as scholarly article discovery (Wesley-Smith and West 2016), semantic analysis (Al-Zaidy and Giles 2017), recommendation systems (Huang et al. 2015), citation prediction (Liu et al. 2017), scalability improvement (Kim, Sefid, and Giles 2017), and the Science of Science (Chen et al. 2018). Major SBD datasets include the Microsoft Academic Graph (MAG), CiteSeerX (Giles, Bollacker, and Lawrence 1998; Wu et al. 2014), DBLP, Web of Science (WoS), and Medline, among which MAG and CiteSeerX are the only two freely available large-scale datasets that offer citation graphs. Academic search engines such as Microsoft Academic and CiteSeerX obtain raw PDF files by actively crawling the Web. These PDF documents are then classified into academic and non-academic documents. Metadata, citations, and other types of content are then extracted from these documents.
Different from submission-based datasets such as the WoS, a large fraction of the documents crawled are pre-prints and manuscripts, which do not necessarily contain unique identifiers, e.g., DOIs. Metadata from these documents may also differ from their official published versions. In addition, errors may occur when text is extracted from PDFs and when metadata is parsed. As a result, the extracted metadata is likely to be incomplete and erroneous. The metadata is also heterogeneous, since the documents were written by authors using different conventions and templates. This noisy information can propagate through to data analytics and aggregations and distort research, making cleaning a necessity. One common approach is to link document entities to corresponding entities in a cleaned dataset (the reference dataset) and then use its records to replace the automatically extracted records.

The biggest challenge of this approach is to find the correct matching entity in the reference dataset. Unlike traditional machine learning (ML) tasks, in which there is a trade-off between precision and recall, the entity matching task must be accomplished with high precision. This is because, when the dataset is large, even a small fraction of false positives may lead to a large number of "false corrections". The problem is even more challenging because there is usually no prior knowledge of which fields are noisy.

Our contributions include: (1) developing an ML-based paper entity matching framework which uses both the header and the citation information of available scholarly documents; (2) applying the system to CiteSeerX, WoS, DBLP, and PubMed to find the overlap between these digital library databases, which is used to clean CiteSeerX data. In addition, this produces a more accurate citation graph and links records that can enrich the content of individual documents.
Related Work
There are, in general, three types of methods for entity matching across bibliographic databases.
Information Retrieval-based: This method searches one or multiple attributes of an entity in the target corpus against the index of the reference corpus and ranks the candidates using a similarity metric. This approach has been used to match CiteSeerX with DBLP (Caragea et al. 2014). The reference dataset (DBLP) was indexed by Apache Solr. Metadata from the noisy dataset (CiteSeerX) were used to query the corresponding fields, and candidates were selected based on similarity scores. It was found that using 3-grams of titles and Jaccard similarity with a tuned threshold achieves the best F1-measure. Because of the relatively low precision, the approach cannot be directly used to clean CiteSeerX data.

Machine Learning-based: These methods have been used for matching user entities in online social networks (Peled et al. 2013), where the problem is to match user profiles on Facebook and Xing. This work applies pairwise comparison to the whole dataset without applying any blocking function to reduce the search space, so the method cannot be scaled up to large digital libraries containing tens of millions of records.

Topical-based: This method is used to resolve and match entities that are represented by free text, e.g., Wiki articles. The challenge is that different sources may use different languages or terminologies to describe the same topic. A probabilistic model was proposed to integrate topic extraction and matching into a unified model (Yang et al. 2015). As we do not have access to the full text of the reference datasets, this method is not applicable to our problem.
Models
Entity Representation
Throughout this paper, we refer to a scholarly paper with full information as a paper entity. We denote the target corpus, which contains noisy data, as T; it contains n paper entities t_i, 1 ≤ i ≤ n. The reference corpus R contains reference data with m paper entities r_j, 1 ≤ j ≤ m. Each entity can be represented by a number of attributes. Our goal is to find a set M = {(t, r); t = r, t ∈ T, r ∈ R}.

An officially published paper is usually assigned a unique identifier, i.e., a DOI. A journal article can also be identified by the journal name, volume, issue number, and starting page number. However, for a large fraction of open access scholarly papers crawled from the Web, such information is usually not available. Empirically, a paper entity can be uniquely identified by four header fields (title, authors, year, venue), in which the venue is a conference or a journal name. A citation record parsed from a citation string usually contains the above four fields, so matching a single citation record can be done in the same manner as matching a paper entity. The abstract is usually a paragraph of text that may contain non-ASCII characters with different encodings; normalizing abstracts and calculating their simhash values takes heavy overhead, so in this work we use abstracts without normalization. Due to the lack of a general venue disambiguator, venue information is not incorporated into the matching features. We will show that even without it, the application still achieves high performance.

Paper entity linking can be formalized as a binary classification problem in which the classifier decides whether a candidate pair is a real match or not. Because many digital library datasets do not have citation information, we consider two separate models: one with only header information, called the Header Matching Model (HMM), and the other with only citation information, called the Citation Matching Model (CMM). Their combination is called the Integrated Matching Model (IMM). A separate model, the Title Evaluation Model (TEM), evaluates title quality.
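As a concrete illustration of this representation, the sketch below defines a minimal paper-entity record holding the four identifying header fields plus the optional abstract and parsed citations; the field and type names are ours, not those of any particular dataset.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PaperEntity:
    """Minimal paper-entity record; field names are illustrative only."""
    title: Optional[str]            # may be noisy or missing in the target corpus
    authors: List[str]              # full author names, in order
    year: Optional[int]
    venue: Optional[str]            # conference or journal name (not used as a feature here)
    abstract: Optional[str] = None
    citations: List["PaperEntity"] = field(default_factory=list)  # parsed citation records

# A candidate match is a (target, reference) pair to be classified as match / non-match.
CandidatePair = Tuple[PaperEntity, PaperEntity]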
Header Matching Model (HMM)
HMM is a supervised machine learning model that classifies candidate pairs using information from the paper header. We first index all document metadata in the reference corpus using an open source search platform. We use ElasticSearch (ES) because of its relatively low setup overhead and scalability; the default settings are applied in our experiments. The indexed metadata contains header fields including titles, authors, abstracts, and years. For each paper in the target corpus, we query the title against the index if it contains a minimum of 20 characters. Otherwise, the first author's last name and the publication year are used in the query. If the year is not available, only the first author's last name is used. For each field, the query string is segmented into unigrams connected by "OR". The ranking scores are calculated using the Okapi BM25 algorithm (Robertson, Zaragoza, and Taylor 2004). The query algorithm is shown in Algorithm 1.

For each target paper, the top 10 papers from the reference set are retrieved and 10 candidate matching pairs are formed. The features of each pair include a list of similarities calculated from the header metadata.

1. Title similarity, represented by the Levenshtein distance of simhashes of normalized titles (Charikar 2002):

Sim_L(title_t, title_r) = lev_{a,b}(|a|, |b|)    (1)
a = Simhash(Norm(title_t))    (2)
b = Simhash(Norm(title_r))    (3)

in which a simhash string contains 16 alphanumeric characters. The titles are normalized so that (1) all letters are lowercased; (2) diacritics are removed, e.g., "á" is converted to "a"; (3) consecutive spaces are collapsed; (4) punctuation marks are trimmed off; (5) the single characters "s" and "t" are removed because they mostly result from removing the apostrophe from possessives or abbreviations such as "can't".

2. Abstract similarity Sim_L(abstract_t, abstract_r), represented by the Levenshtein distance of simhashes of abstracts, calculated in a similar way as Equations (1)-(3) but without normalization.

3. Jaccard similarities between normalized titles and between original abstracts. For example,

Sim_J(title_t, title_r) = |W_t ∩ W_r| / |W_t ∪ W_r|    (4)

in which W_t and W_r represent the token sets of the title of the target and the reference paper, respectively.

4. The absolute difference of the publication years.

5. The first and the last author's full name similarity. Author similarities are measured with multiple metrics. An author's full name similarity is represented by a three-digit binary lmf, whose digits indicate whether the last name, the middle initial, and the first initial match, respectively. If a certain name component is missing or does not match, the corresponding binary digit is set to 0. The decimal value of this binary is used as the full name similarity index. Author names are also normalized before comparison: diacritics are removed and letters are lowercased; prefixes, e.g., "Prof.", suffixes, e.g., "II", and their variants are removed. For example, if the first authors are "Jane C. Huck" and "J. Huck", the binary is 101, which equals 5 in decimal.

6. The last name similarities of the first and the last author. The last name similarity is computed as
Sim(N_t, N_r) =
  0, if N_t ≠ NULL ∧ N_r ≠ NULL ∧ N_t ≠ N_r
  1, if N_t = NULL ∨ N_r = NULL
  2, if N_t ≠ NULL ∧ N_r ≠ NULL ∧ N_t = N_r    (5)

in which N stands for the last name and NULL means the value is not available.

7. All authors' last name Jaccard similarity:
Sim_J(L_t, L_r) = |L_t ∩ L_r| / |L_t ∪ L_r|    (6)

in which L_t and L_r stand for the sets of last names in the target and the reference paper, respectively.

The pseudocode of the HMM is given in Algorithm 2.
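To make the feature definitions above concrete, the following Python sketch computes the title and author features under our own assumptions: a 64-bit simhash built from MD5-hashed word tokens, a plain dynamic-programming Levenshtein distance, and simple whitespace name parsing. None of these helpers are part of the described system's released code.

import hashlib
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Lowercase, strip diacritics, trim punctuation, collapse spaces, drop stray s/t."""
    t = unicodedata.normalize("NFKD", title.lower())
    t = "".join(c for c in t if not unicodedata.combining(c))   # remove diacritics
    t = re.sub(r"[^\w\s]", " ", t)                              # trim punctuation marks
    tokens = [w for w in t.split() if w not in ("s", "t")]      # leftovers of apostrophes
    return " ".join(tokens)

def simhash(text: str, bits: int = 64) -> str:
    """64-bit simhash over word tokens, returned as 16 hex characters."""
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = sum(1 << i for i in range(bits) if v[i] > 0)
    return format(fingerprint, "016x")

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (applied to the 16-char simhashes)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def jaccard(tokens_a: set, tokens_b: set) -> float:
    """Jaccard similarity between two token sets (Equations 4 and 6)."""
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) if tokens_a | tokens_b else 0.0

def author_lmf(target: str, reference: str) -> int:
    """Three-bit last/middle/first match code, e.g. 'Jane C. Huck' vs 'J. Huck' -> 0b101 = 5."""
    def parts(name):
        toks = name.split()
        return (toks[-1].lower() if toks else "",              # last name
                toks[1][0].lower() if len(toks) > 2 else "",   # middle initial
                toks[0][0].lower() if toks else "")            # first initial
    t, r = parts(target), parts(reference)
    bits = [int(bool(x) and bool(y) and x == y) for x, y in zip(t, r)]
    return bits[0] * 4 + bits[1] * 2 + bits[2]                 # lmf as a decimal value

# Example: the title similarity feature of Equations (1)-(3)
sim_title = levenshtein(simhash(normalize_title("Entity Matching: A Survey")),
                        simhash(normalize_title("Entity matching, a survey")))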
Algorithm 1 Query Builder

function QUERY(title, lastName, year)
    if title ≠ Null and title.length > 20 then
        query ← title
    else if lastName ≠ Null and year ≠ Null then
        query ← lastName and year
    else if lastName ≠ Null then
        query ← lastName
    end if
    return query
end function
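A rough sketch of how the blocking query of Algorithm 1 could be expressed as an Elasticsearch request body; the index and field names (title, last_name, year) are placeholders we chose, and BM25 is Elasticsearch's default ranking, so no extra configuration is shown.

def build_query(title, last_name, year, size=10):
    """Mirror of Algorithm 1: prefer the title, fall back to last name (+ year)."""
    if title and len(title) > 20:
        # unigrams of the title joined by OR (the default operator for a match query)
        query = {"match": {"title": {"query": title, "operator": "or"}}}
    elif last_name and year:
        query = {"bool": {"must": [{"match": {"last_name": last_name}},
                                   {"match": {"year": year}}]}}
    elif last_name:
        query = {"match": {"last_name": last_name}}
    else:
        return None
    return {"size": size, "query": query}   # top-N candidates ranked by BM25

# Any Elasticsearch client can submit the returned body, e.g.:
#   es.search(index="reference_papers", body=build_query(t.title, last_name, t.year))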
Algorithm 2 Header Matching Model

function HMM()
    T ← target corpus
    R ← reference corpus
    Index_R ← index of reference corpus
    matchList ← Ø
    for t ∈ T do
        Q ← QUERY(t.title, t.firstLastName, t.year)
        Candidates ← query Q against Index_R
        for c ∈ Candidates do
            prediction ← Model.predict(t, c)
            if prediction = 1 then
                matchList.add(t, c)
                break
            end if
        end for
    end for
end function
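Tying the pieces together, the loop below sketches how Algorithm 2 could be driven in Python: retrieve the top candidates with the blocking query, compute the header similarity features, and accept the first candidate the trained classifier labels as a match. build_query is the sketch above, while pair_features, the es client, and the classifier are hypothetical stand-ins for the components described in this section.

def header_matching(target_papers, es, classifier, index="reference_papers"):
    """Sketch of Algorithm 2 (HMM) with a pre-trained binary classifier."""
    matches = []
    for t in target_papers:
        last_name = t.authors[0].split()[-1] if t.authors else None   # first author's last name
        body = build_query(t.title, last_name, t.year)
        if body is None:
            continue                                    # nothing to query with
        hits = es.search(index=index, body=body)["hits"]["hits"]      # top-10 BM25 candidates
        for hit in hits:
            c = hit["_source"]                          # candidate reference record
            x = pair_features(t, c)                     # similarity features of the pair
            if classifier.predict([x])[0] == 1:         # pair classified as a true match
                matches.append((t, c))
                break
    return matches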
Algorithm 3 Citation Matching Model

function CMM()
    T ← target corpus
    R ← reference corpus
    matchList ← Ø
    CitationIndex_R ← citation index of reference corpus
    for t ∈ T do
        t_citations ← citations of t
        for tc_i ∈ t_citations do
            Q ← QUERY(tc_i.title, tc_i.firstLastName, tc_i.year)
            results ← query Q against CitationIndex_R
            for rc_j ∈ results do
                prediction ← Model.predict(tc_i, rc_j)
                if prediction = 1 then        ▷ citations match
                    r_k ← paper that cites rc_j
                    a ← Simhash(r_k.title)
                    b ← Simhash(t.title)
                    title_dist ← lev(a, b)
                    if title_dist < θ_title then
                        matchList.add(t, r_k)
                        break
                    end if
                    BoW_r ← BoW(r_k.referenceTitles)
                    BoW_t ← BoW(t.referenceTitles)
                    sim ← Jaccard(BoW_t, BoW_r)
                    if sim > θ_ref then
                        matchList.add(t, r_k)
                        break
                    end if
                end if
            end for
        end for
    end for
end function

Citation Matching Model (CMM)
CMM matches paper entities by their citations. The paper entity matching problem benefits from this model when the header metadata is noisy but the references are available. Similar to HMM, a prerequisite is to index all citations in the reference corpus. On average, one paper contains about 20 citations, so the citation index is usually much larger than the document index.

Given a target paper t, its citation records tc_i are retrieved from the database. We attempt to find the matching records rc_j in the reference corpus using the query builder in Algorithm 1; each retrieved citation rc_j is matched against tc_i by the HMM classifier (Algorithm 2). Citations do not contain abstracts, so the abstract features are not used. Assuming such an rc_j exists (if not, no matching entity is found) and rc_j is cited by a reference paper r, the next step is to compare r with t. The CMM uses both the paper title and the citation titles (Algorithm 3). First, the title similarity is calculated using Equations (1)-(3). If this distance is less than a threshold θ_title, r is believed to be the matching entity of t. Otherwise, CMM extracts the tokens from all the reference titles of t, denoted by BoW_t, and the tokens from all the reference titles of r, denoted by BoW_r. The judgment is made by comparing the Jaccard similarity between BoW_t and BoW_r: if the similarity is greater than a threshold θ_ref, then (t, r) is determined to be a matching pair.

Table 1: Features used to train TEM.
Character-level features of the title string are used together with the following word-level features:
- max(DF(w)), w ∉ S
- min(DF(w)), w ∉ S
- avg(DF(w)), w ∉ S
- Exact match against the controlled list {Abstract, List, Acknowledgments, Notices, Content, Accepted, Authors, References, Null, Chapter, Discussions, Summary}; the value is set to 1 if the string contains at least one exact match to the list.

DF: document frequency, calculated on all DBLP titles. S is a set of stopwords adopted from Apache Solr.
Otherwise, the algorithm continues to examine the next paper that shares the citation rc_j with t. If no paper citing rc_j is found to be a match for t, CMM continues and attempts to find the matching record of the next citation tc_i.

Title Evaluation Model (TEM)
TEM is a light-weight supervised learning model designed to provide a quantitative evaluation of title quality. The input is a title string, and the output is a probability θ of how likely the input string is to be a paper title. The title quality is considered high if θ is greater than a threshold θ_tq. TEM exploits the features in Table 1, extracted from the original title string. TEM is trained on a sample of 8200 title strings containing 6270 high-quality titles and 1930 low-quality titles. Titles are labeled as low quality if (1) they are NULL; (2) they contain many non-ASCII characters; (3) they include evidently irrelevant information such as author names; or (4) they are not in English.

Four supervised models, including Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forests (RF), are trained. The LR model achieves the best 10-fold cross-validated F1-score, so it is the one we adopt.

Integrated Matching Model (IMM)

IMM integrates HMM, CMM, and TEM (Figure 1). If HMM is able to find the match of a paper entity, the process continues to the next paper. Otherwise, the paper title quality is evaluated by TEM. If the title quality is high (θ ≥ θ_tq), it is likely that there is no matching entity in the reference corpus; otherwise (θ < θ_tq), the matching entity was probably not found due to poor title quality. In this situation, we use the citation information to match papers.
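The decision flow of IMM can be summarized with a short sketch; hmm_match, tem_quality, and cmm_match are hypothetical wrappers around the three models described above, and theta_tq is the title-quality threshold.

def integrated_matching(t, theta_tq, hmm_match, tem_quality, cmm_match):
    """IMM sketch: header model first, citation model only for low-quality titles."""
    r = hmm_match(t)                      # HMM: classify the top BM25 candidates
    if r is not None:
        return r                          # header match found
    if tem_quality(t.title) >= theta_tq:
        return None                       # title looks clean: likely no match in the reference set
    return cmm_match(t)                   # noisy title: fall back to citation matching (CMM)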
Experiments

Data
The target data is the CiteSeerX database with about 9 million scholarly papers. The header metadata is extracted by GROBID. References are extracted and parsed by ParsCit (Councill, Giles, and Kan 2008). The reference corpora are described below.
Figure 1: The pipeline of IMM.

WoS is a digital library dataset spanning 230+ academic disciplines with citation indexing. WoS indexing coverage is from 1900 to 2015, with over 20,000 journals, books, and conference proceedings. There are about 45 million WoS papers and 906 million citation records in this corpus.
DBLP is a bibliographic dataset covering more than 5,000 conferences and 1,500 journals in computer science. We use the version published in March 2017, with about 4 million documents. This dataset does not contain citations.
Medline is the premier bibliographic dataset released by NCBI, with about 24 million academic papers in the area of biomedicine published since 1966. The dataset does not contain citation information.
The IEEE corpus is a subset of the IEEE Xplore database, containing about 2 million bibliographic records downloaded from IEEE FTP sites. It does not contain citations.
Ground Truth Labeling
The labeling procedure comprises three steps. First, for each paper in the CiteSeerX sample set, a candidate set of 10 papers is retrieved from the reference index in the same manner as described in Algorithm 1. Then, to determine the true matches among the candidate papers, other metadata of the papers, including authors, abstract, year, venue, keywords, and the number of pages, were visually inspected independently by two graduate students. Finally, if it was not possible to decide based on the paper profiles, the actual PDF files were compared side by side to make the final decision. We generated the following ground truth datasets:
CiteSeerX-IEEE
This dataset, adopted from Wu et al. (2017), is built from 1000 CiteSeerX papers, with 51 true matching pairs found in the IEEE corpus.
CiteSeerX-DBLP
This dataset, revised based on Caragea et al. (2014), contains 292 matching pairs identified between 1000 CiteSeerX papers and the DBLP dataset.
CiteSeerX-WoS
This dataset contains 345 matching papers found in WoS out of 533 CiteSeerX papers.
Combined Sample
The positive sample contains 688 matching pairs. The negative sample consists of 1845 candidate matching pairs containing the most similar but unmatched papers.
Experiment Setups
We trained binary classifiers that decide whether a pair of documents from the target and reference corpora is a true matching pair. Four machine learning models, SVM, LR, RF, and XGBoost, are trained on the Combined Sample. Grid search is applied to find the hyperparameters yielding the best results. Precision, recall, and F1-score values for 10-fold CV are reported in Table 2.

Table 2: HMM model 10-fold CV results.

Model     Precision  Recall  F1-measure
SVM       0.926      0.937   0.931
LR        0.794      0.968   0.872
RF        0.912      0.931   0.921
XGBoost   0.925      0.899   0.912
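For illustration, the snippet below reproduces this kind of setup with scikit-learn: a grid search over random forest hyperparameters followed by 10-fold cross-validated precision, recall, and F1. The feature matrix here is random placeholder data; in the actual experiments it would hold the pairwise similarity features of the Combined Sample, and the parameter grid is only an example.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

# Placeholder data; X would hold the pairwise similarity features of candidate pairs
# and y the match (1) / non-match (0) labels of the Combined Sample.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, 200)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}   # example grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="f1", cv=10)
search.fit(X, y)

scores = cross_validate(search.best_estimator_, X, y, cv=10,
                        scoring=["precision", "recall", "f1"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})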
Results and Discussion
Among the four models, SVM achieves the highest F1-measure. RF has a comparable F1-measure but requires a significantly shorter test time, so we employ RF for HMM. The information gain (IG) is calculated for each feature, indicating that the most informative features are related to titles and first authors.

To make an even comparison with the method proposed by Caragea et al. (2014), we rerun their experiments on the CiteSeerX-DBLP ground truth with their best parameter settings, i.e., n = 3 for title n-grams together with their tuned Jaccard similarity threshold θ_J. We then compare the results with our RF model trained on the combination of the CiteSeerX-WoS and CiteSeerX-IEEE datasets. The HMM outperforms the IR-based model with a 14% improvement in precision and a 3% improvement in F1 score.

We also investigate how the reference Jaccard similarity threshold θ_ref and the title Levenshtein distance threshold θ_title affect the performance of CMM (Table 3). A higher value of θ_ref indicates that two papers need more common citations to be considered a matching pair.

Table 3: The CMM performance with different θ_ref and θ_title.

θ_ref  θ_title  Precision  Recall  F1
0.40   0.15     0.876      0.719   0.790
0.40   0.25     0.877      0.725   0.794
0.40   0.35     0.878      —       —
Table 4 compares HMM, CMM, and IMM on the CiteSeerX-WoS dataset (the only ground truth dataset that contains citations) for papers whose TEM title-quality score θ falls below increasing thresholds θ_tq. As the threshold θ_tq increases, the test set encloses more papers with higher quality titles, which results in better performance of HMM. CMM alone achieves remarkably high precision but poor recall. The integrated model achieves both high recall and high precision. This indicates that (1) CMM tends to be more useful when the title quality is low, and (2) the integrated model significantly increases the overall performance, especially for papers with low-quality titles.

One result in Table 4 that is counter-intuitive is that HMM consistently achieves high performance even when the title quality is low. To explain this, we trained an RF classifier on papers with low-quality titles only. The IG of the new model reveals that the most important features in the absence of good titles are the first author features, the Jaccard similarity of all authors' last names, and the abstract features, implying that when title quality is low, accurate author information can still provide accurate matches.

Table 4: Comparisons of HMM, CMM, and IMM performance on the CiteSeerX-WoS dataset with different title quality thresholds θ_tq. For each threshold and data portion, precision (P), recall (R), F1, and testing time in seconds (T/s) are reported for HMM, CMM, and IMM.

Error Analysis
Although the combination of HMM and CMM achieves superior performance, the recall of CMM alone is poor (Table 4). This could be due to two reasons: (1) citation parsing errors; for example, more than 1 million papers in CiteSeerX contain fewer than 5 citations; (2) null-title citations: a fraction of citation records in both WoS and CiteSeerX have null titles.

The citation-based model is also slow, because (1) the large number of citations (906 million) slows down the search process and (2) the candidate set for each CiteSeerX citation can be huge for highly-cited papers. The integrated model therefore only applies the citation model to papers with low-quality titles to improve recall.

Application and Conclusion
We applied HMM to CiteSeerX documents against DBLP, WoS, and Medline. The result indicates that the current CiteSeerX dataset includes about 3 million WoS documents, 1.62 million Medline papers, and about 1.35 million DBLP papers. The matching process with HMM is done in 11 days on a machine with the following specifications: 32 logical cores of an Intel Xeon CPU E5-2630 v3 @ 2.40GHz and 330 GB of RAM. The result also reveals that there is still a large number of papers that CiteSeerX should index; the unmatched document metadata can help the CiteSeerX crawler find relevant resources.

Previous studies (Caragea et al. 2014; Wu et al. 2017) used only metadata in the header of scholarly articles for paper entity linking. In reality, the quality of a header is not always good. Hence, we investigated leveraging both header and citation information to match paper entities between two digital library datasets when the target corpus contains noisy data. We proposed an approach that integrates header and citation information for paper entity matching. Compared with the IR-based method, the header matching model improves both precision and F1 for papers with low-quality titles, and the integrated model with header and citation information achieves high F1 and high precision. We show that CiteSeerX has a huge overlap with WoS, DBLP, and Medline, which can be used for metadata correction, and that there is still a large number of scientific documents to be crawled and indexed. The framework developed can be used to match records between any bibliographic databases with or without citations. The idea of combining ML and IR is in general applicable to many information retrieval and data linking problems. We will apply this framework to clean the CiteSeerX metadata. This will generate high quality large-scale datasets that can enable the development and implementation of many graph-based AI applications.

The software implementation of this framework is available on GitHub at https://github.com/SeerLabs/entity-matching.

Acknowledgements
We gratefully acknowledge partial support from the National Science Foundation.
References

[Al-Zaidy and Giles 2017] Al-Zaidy, R. A., and Giles, C. L. 2017. A machine learning approach for semantic structuring of scientific charts in scholarly documents. In AAAI, 4644–4649.

[Caragea et al. 2014] Caragea, C.; Wu, J.; Ciobanu, A.; Williams, K.; Fernández-Ramírez, J.; Chen, H.-H.; Wu, Z.; and Giles, L. 2014. CiteSeerX: A scholarly big dataset. In Advances in Information Retrieval: 36th European Conference on IR Research, ECIR 2014, Amsterdam, The Netherlands, April 13-16, 2014, Proceedings, 311–322.

[Charikar 2002] Charikar, M. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, 380–388.

[Chen et al. 2018] Chen, C.; Wang, Z.; Li, W.; and Sun, X. 2018. Modeling scientific influence for research trending topic prediction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.

[Councill, Giles, and Kan 2008] Councill, I.; Giles, C. L.; and Kan, M.-Y. 2008. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08).

[Giles, Bollacker, and Lawrence 1998] Giles, C. L.; Bollacker, K. D.; and Lawrence, S. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, USA, 89–98.

[Giles 2013] Giles, C. L. 2013. Scholarly big data: Information extraction and data mining. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, 1–2. New York, NY, USA: ACM.

[Huang et al. 2015] Huang, W.; Wu, Z.; Chen, L.; Mitra, P.; and Giles, C. L. 2015. A neural probabilistic model for context based citation recommendation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2404–2410.

[Kim, Sefid, and Giles 2017] Kim, K.; Sefid, A.; and Giles, C. L. 2017. Scaling author name disambiguation with CNF blocking. arXiv preprint arXiv:1709.09657.

[Liu et al. 2017] Liu, X.; Yan, J.; Xiao, S.; Wang, X.; Zha, H.; and Chu, S. M. 2017. On predictive patent valuation: Forecasting patent citations and their types. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 1438–1444.

[Peled et al. 2013] Peled, O.; Fire, M.; Rokach, L.; and Elovici, Y. 2013. Entity matching in online social networks. In International Conference on Social Computing, SocialCom/PASSAT/BigData/EconCom/BioMedCom 2013, Washington, DC, USA, 8-14 September, 2013, 339–344.

[Robertson, Zaragoza, and Taylor 2004] Robertson, S.; Zaragoza, H.; and Taylor, M. 2004. Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, 42–49. New York, NY, USA: ACM.

[Wesley-Smith and West 2016] Wesley-Smith, I., and West, J. D. 2016. Babel: A platform for facilitating research in scholarly article discovery. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, 389–394. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.

[Wu et al. 2014] Wu, J.; Williams, K.; Chen, H.; Khabsa, M.; Caragea, C.; Ororbia, A.; Jordan, D.; and Giles, C. L. 2014. CiteSeerX: AI in a digital library search engine. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, 2930–2937.

[Wu et al. 2017] Wu, J.; Sefid, A.; Ge, A. C.; and Giles, C. L. 2017. A supervised learning approach to entity matching between scholarly big datasets. In Proceedings of the Knowledge Capture Conference, K-CAP 2017, 42:1–42:4. New York, NY, USA.

[Yang et al. 2015] Yang, Y.; Sun, Y.; Tang, J.; Ma, B.; and Li, J. 2015. Entity matching across heterogeneous sources. In Cao, L.; Zhang, C.; Joachims, T.; Webb, G. I.; Margineantu, D. D.; and Williams, G., eds., Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2015).