Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Kramer, Bela Gipp
University of Wuppertal, Germany ([email protected]); University of Konstanz, Germany ([email protected])

JCDL'19, Jun. 2019, Urbana-Champaign, IL, USA
ABSTRACT
Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism, primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement to conventional text-based detection approaches for academic literature in the STEM disciplines. The data and code of our study are openly available at https://purl.org/hybridPD
Academic plagiarism (AP) is "the use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originality is expected" [8, 9]. Forms of AP range from copying content (copy&paste) to reusing slightly modified content, e.g., interweaving text from multiple sources, to heavily concealing content reuse, e.g., by paraphrasing or translating text, and reusing data or ideas without proper attribution [44]. The easily recognizable copy&paste-type AP is more prevalent among students [22], while concealed AP is more characteristic of researchers, who have strong incentives to avoid detection [2]. Plagiarized research publications can have a severe negative impact by distorting the mechanisms for tracing and correcting research results and causing inefficient allocations of research funds [9]. Therefore, detecting concealed AP in research publications is a pressing problem affecting many stakeholders, including research institutions, academic publishers, digital library providers, funding agencies, and of course other researchers [28].

As we present in Section 2, many plagiarism detection (PD) approaches have been proposed that employ lexical, semantic, syntactical, or cross-lingual text analysis. These approaches reliably detect copied or moderately altered forms of AP; some approaches can also identify paraphrased and translated text. However, current approaches are computationally expensive and thus require a computationally less demanding selection of candidate documents before their application. The performance of methods that analyze textual features to retrieve candidate documents has reached a plateau. Therefore, the candidate retrieval step currently limits the effectiveness of PD approaches to detect concealed forms of AP.

Prior research (cf. Section 2) showed that approaches that analyze nontextual content features, such as academic citations, images, and mathematical content, complement the many text analysis approaches to improve the identification of concealed forms of AP. Nontextual content features in academic documents are a valuable source of semantic information that is largely independent of natural language text. Considering these sources of semantic information for similarity analysis raises the effort plagiarists must invest for obfuscating reused content [9, 24].

We extend the research on analyzing nontextual content features for PD by devising a novel PD approach that combines the analysis of mathematical content with the analysis of academic citations. We structure the presentation of our contributions as follows. Section 2 presents an overview of prior research on PD to show the benefit of analyzing nontextual content features for this purpose. Section 3 explains the conceptual design and technical realization of the math-based, citation-based, and text-based PD approaches we investigate. Section 4 presents our test collection and describes the methodology of our evaluation. Section 5 compares the performance of the three PD approaches using confirmed cases of AP. Section 6 presents the results of an exploratory study that investigates the ability of the novel PD approach to discover so far unknown cases of AP. Section 7 concludes the paper and presents future work.
External plagiarism detection is an information retrieval task with the objective of comparing an input document to a large collection and retrieving all documents exhibiting similarities above a threshold [41]. External PD approaches typically employ a two-stage process consisting of candidate retrieval and detailed analysis [23, 41]. In the candidate retrieval stage, the approaches employ computationally efficient retrieval methods to limit the collection to a set of documents that may have been the source for the content in the input document. In the detailed analysis stage, the systems perform computationally more demanding analysis steps to substantiate the suspicion and to align components in the input document and potential source documents that are similar [2, 23].

Text retrieval research has yielded mature systems that reliably detect copied or moderately altered text in an input document and retrieve its source if the source is part of the system's reference collection. For the candidate retrieval stage, such systems typically employ character-gram or word-gram fingerprinting [30, 43] or term-based vector space models [19]. For the detailed analysis stage, such systems often perform exhaustive string comparisons [43] or computationally more efficient text alignment [30]. Text alignment approaches typically use matching strings as seeds, which the procedures extend and then filter using heuristics [36].

To detect monolingual paraphrases, researchers have proposed approaches that analyze semantic and syntactic features, mostly during the detailed analysis stage of the PD process. Several researchers adapted semantic text analysis methods, such as Singular Value Decomposition [5], Latent Semantic Analysis [40], and Explicit Semantic Analysis [27] for the PD use case.
Other PD approaches employ linguistic resources, such as WordNet (https://wordnet.princeton.edu/), to analyze exactly matching and semantically related words [16]. Some works combine the analysis of word-based semantic similarity with an analysis of similarity in semantic arguments derived using Semantic Role Labeling [31]. Other approaches employ part-of-speech tagging to also compare the syntactic structure of documents [16]. To detect cross-lingual (CL) plagiarism, researchers have proposed approaches that leverage lexical similarities of languages, e.g., CL character n-gram matching, employ thesauri, parallel corpora, and machine translation followed by a mono-lingual analysis [3].

The extensive research on semantic and syntactic methods for monolingual and cross-lingual PD has yielded approaches that are highly effective for the detailed analysis stage. To our knowledge, Gupta et al. reported the highest retrieval effectiveness for the detailed analysis of realistically obfuscated plagiarism [16]. The authors analyzed artificially created plagiarism in the Webis Text Reuse Corpus 2012 (Webis-TRC-2012) [33], which is a standard corpus to evaluate PD systems as part of the PAN Workshop series (http://pan.webis.de/). For manually paraphrased instances of simulated plagiarism, Gupta et al. reported the precision P = .81 and recall R = .80. In contrast, the best-performing approaches achieved a much lower recall (R = .65) for the candidate retrieval task in all four PAN workshops (2012-2015) that evaluated research contributions for this task [20]. The organizers of the PAN workshop series noted that the workshops had reached a stable production phase, in which the submitted approaches no longer exhibited "[...] real innovations with respect to recall-oriented source retrieval." [17].

For cross-lingual PD, the candidate retrieval stage likewise seems to present an upper bound for the otherwise higher effectiveness of the analysis methods in the detailed analysis stage. Ehsan et al. reported an approach that achieved an F1 score of 0.87 (R = .82) for the detailed analysis of cross-lingual plagiarism in the Webis-TRC-2012 [7]. For candidate retrieval, Ehsan and Shakery reported a maximum recall of R = .75 using the same corpus [6]. These results suggest that the candidate retrieval step, which is necessary to enable applying semantic, syntactic, and cross-lingual PD approaches, currently limits the effectiveness of the PD approaches.

PD approaches that analyze nontextual content features in academic documents are a promising complement to the variety of text analysis approaches, both for the candidate retrieval and the detailed analysis stages of the PD process. In prior research, we showed that an analysis of in-text citation patterns in academic documents, i.e., identical citations occurring in proximity or in a similar order within two documents, can identify concealed forms of AP in real-world, large-scale collections [9-12]. This approach is computationally efficient enough to be applied in the candidate retrieval stage [9, 12, 24]. Pertile et al. confirmed the positive effect of combining citation and text analysis and devised a hybrid approach using machine learning [32]. The benefits of citation-based PD are twofold. First, citations encode semantic information that cannot easily be substituted, since leaving out citations to relevant prior work or citing sources that are not relevant to the topic would likely raise the suspicion of expert peer reviewers. Second, while citations are independent of natural language text, analyzing in-text citation patterns can indicate shared structural and semantic similarity among texts. Assessing this semantic and structural similarity using citation patterns requires significantly less computational effort than approaches for semantic and syntactic text analysis.

We also showed that analyzing image similarity in academic documents, e.g., the similarity of figures and plots, improves the detection capabilities for concealed forms of AP [25]. In a recent short paper [26], we extended the idea of citation-based PD.
We proposed that mathematical expressions share many characteristics of academic citations and hence are promising nontextual content features to be considered when searching for concealed forms of AP. Similar to academic citations, mathematical expressions are essential components of academic documents in the Science, Technology, Engineering and Mathematics fields. Furthermore, mathematical expressions are independent of natural language text and contain rich semantic information. Additionally, some STEM disciplines, such as mathematics and physics, are known for their comparably sparse use of academic citations [29]. A citation-based analysis alone is, therefore, less likely to reveal potentially suspicious content similarity for these disciplines.

Our piloting study [26] investigated measures to quantify the similarity of mathematical content features in the detailed analysis stage of the PD process. Given the infancy of the research area, we evaluated the suitability of comparing basic presentational features of mathematical expressions, i.e., elements of mathematical notation, such as identifiers, numbers, operators, and special symbols. Our goal was to identify the type of similar mathematics that we had observed in confirmed cases of AP, which we collected, e.g., by reviewing journal retractions. We embedded the retracted test documents together with their sources in the dataset of the NTCIR-11 Math task (105,120 arXiv documents) [1]. We performed pairwise comparisons of all documents in the dataset (detailed analysis approach) and evaluated similarity measures that consider identifiers, numbers, operators, and combinations thereof as the features. The best performing approach, a set-based comparison of the frequency of mathematical identifiers, retrieved eight of ten test cases at the top rank and achieved a high mean reciprocal rank.

The present paper extends our pilot study [26] by making four contributions: i) devising a candidate retrieval stage that analyzes mathematical expressions, citations, and textual features; ii) proposing new math-based similarity measures for the detailed analysis stage that consider the order of mathematical content features; iii) comparing the effectiveness of the math-based, citation-based, and text-based PD approaches using the confirmed cases of AP gathered for our pilot study; iv) analyzing math-based and citation-based content features to discover potentially suspicious cases of document similarity in a large-scale dataset of STEM documents.
Figure 1 gives an overview of our system HyPlag [28], which implements the analysis of math-based, citation-based, and text-based content features. HyPlag allows combining the analysis approaches as part of a hybrid detection process that consists of five stages: preprocessing, indexing, candidate retrieval, detailed analysis, and human inspection. We describe the stages hereafter.
HyPlag preprocesses input documents in two steps. In the first step, the system converts the documents to a unified XML-based document format used for the second preprocessing step. Our unified document format uses a subset of the TEI standard (http://tei-c.org/) defined by the information extraction tool GROBID (https://github.com/kermitt2/grobid) to represent in-text citations and bibliographic references. Additionally, the unified document format employs a subset of the Mathematical Markup Language (MathML, https://w3.org/Math/) to represent mathematical formulae.

For this study, we processed documents in two formats: PDF (confirmed cases of AP) and LaTeX source code (NTCIR-11 MathIR Task dataset). We used GROBID to obtain bibliographic references from documents in both formats because the tool achieved excellent results for extracting header metadata, citations, and references [4, 42]. Since GROBID cannot recognize mathematical formulae, we semi-automatically invoked InftyReader to convert the PDFs for confirmed cases of AP to an intermediate LaTeX format. InftyReader is currently the most commonly used OCR-based recognition system for mathematical content [18]. While the tool typically achieves a recall of at least 0.90, its precision can be considerably lower.

We used LaTeXML (https://dlmf.nist.gov/LaTeXML) to convert LaTeX documents, i.e., the documents in the NTCIR-11 dataset and the PDFs that InftyReader converted to LaTeX, to the unified document format. The LaTeXML library offers mathematical content conversion from LaTeX source code to a MathML representation. To enable conversion to our unified document format, we contributed an XSL style sheet that transforms LaTeXML's native output to TEI (https://github.com/brucemiller/LaTeXML/blob/master/lib/LaTeXML/resources/XSLT/LaTeXML-tei.xsl). The new conversion option has been included in the LaTeXML distribution.

To recognize in-text citations, LaTeXML requires the use of LaTeX tags, such as \cite{}. Many documents in the dataset do not contain such markup but state in-text citations as plain text. In such cases, our preprocessing pipeline did not recognize the in-text citations. Additionally, many documents do not use in-text citations at all but only reference items in the bibliography. Due to both errors, the number of unique in-text citations for 68,743 documents (67% of the dataset) is smaller than the number of references.

In the second preprocessing step, HyPlag splits the unified document format into separate data structures holding plain text, mathematical formulae, in-text citations, and bibliographic references. To extract plain text, the system removes all XML structures, images, formulae, and formatting instructions. Formulae in Content MathML are extracted as they are. In-text citations are linked to the corresponding reference entries in the bibliography; reference entries are split into author, title, and venue fields.
In the indexing stage, HyPlag stores the following data, extracted from the preprocessed documents, in an Elasticsearch index:

Document metadata: title, authors, publication date, and filename.
Mathematical features: the sequence of all mathematical identifiers in the order of their occurrence in the document, and the unordered histogram of the occurrence frequencies of identifiers, i.e., how often an identifier occurs in the document. We focused on analyzing identifiers since they achieved the best retrieval effectiveness in our pilot study [26]. We obtained the identifiers from the MathML markup of the formulae.
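The identifier features can be extracted from the MathML markup, for instance as follows (a minimal sketch, not HyPlag's actual code; it assumes identifiers appear as Presentation MathML `<mi>` elements):

```python
# Sketch: extract the ordered sequence and the frequency histogram of
# mathematical identifiers (<mi> elements) from a MathML formula.
from collections import Counter
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

def identifier_features(mathml: str):
    """Return (ordered identifier sequence, occurrence histogram)."""
    root = ET.fromstring(mathml)
    sequence = [el.text.strip() for el in root.iter(f"{MATHML_NS}mi")
                if el.text and el.text.strip()]
    return sequence, Counter(sequence)

formula = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup>'
    '</math>'
)
seq, histo = identifier_features(formula)
print(seq)           # ['E', 'm', 'c']
print(histo["c"])    # 1
```

In a full pipeline, the sequences of all formulae in a document would be concatenated in reading order and the histograms merged.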
Citation features: bibliographic references and in-text citations. The system consolidates the data about referenced documents by comparing the title and author names extracted from reference strings to the data of previously indexed documents, while accounting for minor spelling variations using the Levenshtein distance.
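The consolidation step can be sketched as follows; the pure-Python Levenshtein implementation and the distance threshold `max_dist` are illustrative assumptions, since HyPlag's exact matching tolerance is not stated here:

```python
# Sketch of reference consolidation: match a parsed reference title against
# already-indexed titles, tolerating minor spelling variations.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def find_match(title, indexed_titles, max_dist=3):
    """Return the first indexed title within max_dist edits, else None."""
    norm = title.lower().strip()
    for candidate in indexed_titles:
        if levenshtein(norm, candidate.lower().strip()) <= max_dist:
            return candidate
    return None

index = ["On the Electrodynamics of Moving Bodies"]
print(find_match("On the Electrodynamics of Moving Bodys", index))
```

A production system would additionally compare author names and block candidates by length or hash to avoid the quadratic cost of comparing every pair of titles.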
Textual features: full text and text fingerprints formed by chunking the document into word 3-grams and applying probabilistic chunk selection using the Sherlock tool. (The tool's website went offline recently; the source code and documentation are still available via the Web Archive: https://web.archive.org/web/20180219024142/http://web.it.usyd.edu.au/~scilect/sherlock/)

The indexing process is identical for all documents, i.e., documents that ought to be analyzed need to be indexed first.

In the candidate retrieval stage, the system queries the index using mathematical identifiers, in-text citations, and text fingerprints extracted from an input document to retrieve a set of candidate documents for the subsequent detailed analysis. To retrieve candidate documents, we employed "Lucene's practical scoring function" implemented in the Elasticsearch server as a computationally efficient, well-established heuristic. The scoring function combines a tf-idf weighted vector space model with a Boolean retrieval approach.
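The fingerprinting scheme can be sketched as follows; the hash function and the retention rate of 1/8 are assumptions for illustration, since the exact selection parameters are not given here:

```python
# Sketch of text fingerprinting: chunk the text into word 3-grams and keep a
# deterministic pseudo-random subset of chunk hashes.
import hashlib

def fingerprint(text: str, retention: int = 8):
    """Return the set of selected 3-gram hashes (~1/retention of all chunks)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    selected = set()
    for gram in grams:
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        if h % retention == 0:   # deterministic probabilistic selection
            selected.add(h)
    return selected

a = fingerprint("the quick brown fox jumps over the lazy dog " * 20)
b = fingerprint("the quick brown fox jumps over the lazy dog " * 20)
print(len(a & b) == len(a))  # identical texts yield identical fingerprints -> True
```

Because the selection is keyed on the chunk hash rather than on position, two documents sharing a passage retain the same chunks from it, which is what makes the fingerprints comparable.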
Figure 1: Overview of the hybrid plagiarism detection approach. [The figure depicts the five-stage pipeline: input documents pass through preprocessing, which produces the unified document format; indexing stores the document full text, MathML formulae, citations & references, text fingerprints, and mathematical identifiers (list & histogram) in the data storage; candidate retrieval returns candidate documents, detailed analysis returns similar documents, and a user performs the human inspection.]
We performed three queries, each retrieving the 100 documents with the highest relevance scores. For the citation-based and text-based retrieval of candidate documents, the in-text citations and, respectively, the text fingerprints of the input document represented the terms of the query. Analogously, indexed documents were represented by their sets of in-text citations and text fingerprints. We used the default parameters of Lucene's scoring function.

For the math-based retrieval of candidate documents, the set of mathematical identifiers occurring in a document was the query. The indexed documents were represented by the sequence of mathematical identifiers in a document, i.e., identifiers can occur more than once. However, using Lucene's default parameters for the relevance scoring yielded unsatisfactory results in the case of mathematical features. This finding is in line with research by Sojka and Líška [39]. Similar to Sojka and Líška, we found that query terms, i.e., mathematical identifiers, should be given additional weight for multiple occurrences. Therefore, we set the boost value boost(t) for the term t in the query, i.e., an individual identifier, to the number of occurrences of the term (identifier) in the query document.

Since we sought to investigate the effectiveness of the math-based, the citation-based, and the text-based PD approaches independently of each other, we did not consolidate the three sets of 100 candidate documents retrieved by the three queries.

In the detailed analysis stage, HyPlag compares the input document(s) to all documents in each of the three sets of 100 candidate documents retrieved in the previous stage. For each document comparison, HyPlag computes the following similarity measures.
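The boosted math-based retrieval query can be sketched in Elasticsearch's query DSL as follows; the index field name `identifiers` is an assumption, and this is a sketch rather than HyPlag's actual query:

```python
# Sketch: build an Elasticsearch bool/should query in which each identifier
# term is boosted by its occurrence count in the query document.
from collections import Counter

def build_query(identifier_sequence, size=100):
    counts = Counter(identifier_sequence)
    return {
        "size": size,  # retrieve the 100 highest-scoring documents
        "query": {"bool": {"should": [
            {"term": {"identifiers": {"value": ident, "boost": float(n)}}}
            for ident, n in counts.items()
        ]}},
    }

query = build_query(["x", "x", "y", "σ"])
print(query["query"]["bool"]["should"][0])
# {'term': {'identifiers': {'value': 'x', 'boost': 2.0}}}
```

The `should` clauses make the query disjunctive, so documents sharing more (and more frequent) identifiers with the input document score higher under Lucene's scoring function.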
We only computed math-based similarity scores for document pairs that share 20 or more identifiers to prevent high similarity scores resulting from a few shared identifiers, such as the occurrence of x and y. For documents that meet this threshold, we computed three similarity measures.

First, we computed the similarity of frequency histograms of mathematical identifiers (Histo), which performed best in our pilot study [26]. The measure quantifies the similarity of two documents d and d' based on the difference in the occurrence frequencies f_i of identifiers in d and d' according to Equation (1):

s(d, d') = 1 - \frac{\sum_{i \in I} |f_{i,d} - f_{i,d'}|}{\sum_{i \in I} \max(f_{i,d}, f_{i,d'})}    (1)

The Histo score reflects the global overlap of identifiers in two documents. The measure is most suitable for documents with comparable numbers of identifiers. Typically, this requirement is not met if the two documents strongly differ in length.

In addition to the set-based, order-agnostic Histo measure proposed in our pilot study [26], we devised two new similarity measures that consider the order of mathematical identifiers; we had not evaluated the influence of the sequential similarity of features in our pilot study. The two new measures consider the Longest Common Subsequence of Identifiers (LCIS) and the set of Greedy Identifier Tiles (GIT) for score computation. The longest common subsequence and greedy tiling algorithms are well-established approaches to identify sequential patterns and have been applied successfully for the text-based [21] and citation-based [10] detection of academic plagiarism, as well as for source code plagiarism detection [35].

The longest common subsequence of features, e.g., characters, citations, or mathematical identifiers, is the maximum number of features that match in both documents in the same order, but not necessarily in a contiguous block. Like Histo, the LCIS measure quantifies the global similarity of documents. We compute the similarity score s_LCIS(d, d') = |L(d, d')| / I_d, which represents the fraction of the I_d identifiers in the query document that are part of the longest common identifier sequence, whose length is given by L.

Greedy tiles are the set of all individually longest blocks of shared features in identical order that cannot be extended to the left or right without encountering a non-matching feature [45]. Greedy tiles are well-suited to identify confined regions with high similarity [10]. We computed the similarity of two documents using the GIT approach as s_GIT(d, d') = |T_l| / I_d, where T_l is the set of tiles with a length greater than or equal to 5 matching identifiers and I_d is the total number of identifiers in the query document. In other terms, the score quantifies the fraction of identifiers in the query document that are part of identifier tiles with a minimum length of five.

For the detailed text-based analysis, we used the Encoplot algorithm (Enco) developed by Grozea et al. [14]. Encoplot is an efficient character 16-gram comparison that achieves O(n) time complexity by ignoring repeated matches. The similarity score is the ratio of shared character 16-grams to all 16-grams of the shorter document.

For the citation-based analysis, we used three approaches that proved effective in our prior research [9-12, 24].
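The three math-based measures can be sketched as follows (illustrative implementations, not HyPlag's code; the precondition of at least 20 shared identifiers is omitted, and the greedy tiling uses a simple quadratic search rather than an optimized variant):

```python
# Sketch implementations of the Histo, LCIS, and GIT similarity measures.
from collections import Counter

def histo(d, d2):
    """1 - (sum of absolute frequency differences / sum of maxima), Eq. (1)."""
    f, g = Counter(d), Counter(d2)
    ids = set(f) | set(g)
    den = sum(max(f[i], g[i]) for i in ids)
    return 1 - sum(abs(f[i] - g[i]) for i in ids) / den if den else 0.0

def lcis(d, d2):
    """Length of the longest common identifier subsequence / |query doc|."""
    prev = [0] * (len(d2) + 1)
    for a in d:
        cur = [0]
        for j, b in enumerate(d2, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1] / len(d)

def git(d, d2, min_tile=5):
    """Fraction of query identifiers covered by maximal shared tiles of
    length >= min_tile (simplified greedy tiling)."""
    cov_d, cov_d2 = [False] * len(d), [False] * len(d2)
    while True:
        best_len, best = 0, None
        for i in range(len(d)):
            for j in range(len(d2)):
                k = 0
                while (i + k < len(d) and j + k < len(d2)
                       and not cov_d[i + k] and not cov_d2[j + k]
                       and d[i + k] == d2[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best = k, (i, j)
        if best_len < min_tile:
            break
        i, j = best
        for k in range(best_len):
            cov_d[i + k] = cov_d2[j + k] = True
    return sum(cov_d) / len(d)

q = list("xyzzyabcdeq")  # identifier sequence of the query document
c = list("abcdexyzzyk")  # identifier sequence of a candidate document
print(round(histo(q, c), 2))  # 0.83
print(round(lcis(q, c), 2))   # 0.45 (LCS length 5 of 11 identifiers)
print(round(git(q, c), 2))    # 0.91 (two tiles of length 5 cover 10 of 11)
```

The example illustrates the different sensitivities: GIT rewards the two contiguous shared blocks, LCIS counts only one of them (a subsequence must preserve a single global order), and Histo ignores order entirely.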
Bibliographic Coupling (BC) quantifies the fraction of shared bibliographic references. The similarity score is calculated as s(d, d') = |R_d ∩ R_{d'}| / |R_d ∪ R_{d'}|, where R_d and R_{d'} are the sets of references in the query and the comparison document. Like the Histo measure for mathematical features, BC is an order-agnostic measure that quantifies the global citation-based similarity of documents. It achieves high scores if documents with similar numbers of references share a significant fraction of those references.

The Longest Common Citation Sequence (LCCS) and Greedy Citation Tiling (GCT) measures follow the same idea as LCIS and GIT but consider in-text citations instead of mathematical identifiers. The similarity scores for the LCCS and GCT approaches are calculated analogously to the scores for LCIS and GIT, with the exception that the minimum length for citation tiles is two matching citations, as opposed to five matching identifiers for GIT. We only computed the three citation-based similarity measures if both compared documents contained at least three bibliographic references.
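The BC score is the Jaccard coefficient of the two reference sets and can be sketched as follows (LCCS and GCT reuse the LCIS/GIT logic above with in-text citations as features):

```python
# Sketch of Bibliographic Coupling: Jaccard coefficient of reference sets.
def bibliographic_coupling(refs_d, refs_d2):
    """|intersection| / |union| of the two documents' reference sets."""
    union = refs_d | refs_d2
    return len(refs_d & refs_d2) / len(union) if union else 0.0

r1 = {"doi:10/a", "doi:10/b", "doi:10/c", "doi:10/d"}
r2 = {"doi:10/b", "doi:10/c", "doi:10/e"}
print(bibliographic_coupling(r1, r2))  # 2 shared of 5 distinct -> 0.4
```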
We used HyPlag's web-based frontend to visualize content similarity for inspection. For details on the visualizations, see [28].
To ensure the reproducibility of our research, the data and code of our study are available at https://purl.org/hybridPD.

To achieve comparability to our prior research, we reused the dataset of our previous experiments [26]. The dataset consists of ten cases that we selected after manually reviewing 44 research publications in STEM disciplines that have been officially retracted for plagiarism and involve mathematical content. We restricted the dataset to ten cases for three reasons. First, we chose cases from research fields within our area of expertise to enable us to assess the severity of identified similarities. Second, we selected cases that are most representative of the types of mathematical similarity we observed. Third, the effort required for checking the output of InftyReader and correcting incorrectly recognized mathematical expressions prevented us from converting more cases.

We chose using real cases of AP over creating artificial test cases, although gathering and converting real cases is time-consuming and thus resulted in a smaller dataset. The reason is that we see the ability to identify real cases of AP, committed by researchers who are experts in their fields and have a strong incentive to avoid detection, as the ultimate performance test for any PD approach. Therefore, we see real cases of AP as best suited to devise and evaluate the novel hybrid detection approach.
Table 1: Overview of content features in our dataset.

Features             Total          Avg. per doc.
references           2,201,094      21
unique references    1,445,059      –
citations            3,068,865      30
text fingerprints    26,539,276     256
formulae             52,271,908     504
math. identifiers    156,706,600    1,513

We embedded the ten retracted documents and the ten source documents for our test cases in the topically related NTCIR-11 MathIR Task dataset. The NTCIR dataset, which is available for research purposes [1], consists of 105,120 scientific papers in LaTeX format from computer science, mathematics, physics, and statistics that were published in the arXiv preprint repository.

Using our preprocessing pipeline (cf. Section 3.1), we converted all LaTeX source files of the NTCIR dataset and of our test cases to HyPlag's unified document format. We excluded 2,616 documents for which LaTeXML or our TEI parser encountered critical processing errors. Approximately one-third of the remaining documents did not contain markup for authors and title. To achieve the best possible data quality, we used the arXiv API (https://arxiv.org/help/api/) to obtain author and title information for all documents instead of extracting the information from the LaTeX source files. For 6,770 documents, we were unable to extract bibliographic references due to missing markup. Since the arXiv API does not offer the data of bibliographic references, we indexed these documents without reference data.

Table 1 shows the number of content features we obtained for the final dataset of 102,524 documents. The numbers confirm that this collection of STEM documents contains a significantly higher number of mathematical formulae (52M) than academic citations (3M). Therefore, analyzing both mathematical formulae and citations is more promising in these disciplines than analyzing citations alone. The formulae contain more than 156M identifiers, which the system grouped into 4,063,354 identifier histogram entries. On average, documents contained 70 different mathematical identifiers.
To evaluate the effectiveness of the PD approaches, we performed two conceptually different investigations. The first investigation reflects the typical scenario in external PD, i.e., checking an input document for similarity to documents in a collection. We submitted the retracted paper for each test case to our system HyPlag. For each query document, the system used the math-based, citation-based, and text-based retrieval heuristics (cf. Section 3.3) to retrieve three sets of 100 candidate documents each. In the subsequent detailed analysis stage, each query document was compared to all the candidate documents in the three sets without consolidating the sets. Section 5 presents the results of this investigation.

The second investigation assesses the effectiveness of combining the math-based and citation-based similarity measures to discover so far unknown cases of potentially suspicious document similarity.
We submitted each of the N = 102,524 documents in our dataset to HyPlag. We retrieved the three sets of candidate documents by applying the math-based, citation-based, and text-based retrieval heuristics for all N documents. As opposed to the evaluation of confirmed plagiarism cases, we formed the union of the sets to enable the exploration of approaches that combine measures. In the detailed analysis stage, we compared each of the N documents in the dataset to its consolidated set of candidate documents C. We manually examined the retrieved documents with the highest similarity scores. Section 6 presents the results of this investigation.

To not establish a link between a paper on academic plagiarism detection and legitimate research papers, i.e., the source documents of our test cases and unsuspicious documents retrieved in our experiments, we do not cite the documents we discuss hereafter. However, all documents are accessible via our data and code repository.
Table 2 shows the effectiveness of the candidate retrieval approaches. Plus signs (+) in the table indicate that HyPlag retrieved the source document among the 100 candidate documents when the retracted document for each of the ten test cases (C1...C10) was the query. Minus signs (−) indicate that an analysis approach did not retrieve the source document among the candidate documents. The rightmost column in Table 2 shows the recall of the approaches.

Table 2: Recall for candidate retrieval stage.

       C1  C2  C3  C4  C5  C6  C7  C8  C9  C10    R
Math   +   +   +   −   −   −   +   +   +   +     0.7
Cit.   +   +   −   +   +   +   +   +   +   +     0.9
Text   +   +   +   +   +   +   −   +   +   +     0.9

Both the citation-based and the text-based approaches achieved a recall of 0.9; the math-based approach achieved a recall of 0.7.

To quantify the effectiveness of the similarity measures employed in the detailed analysis stage, we performed a score-based assessment and a rank-based assessment.
This score-based assessment determines which scores are significant, i.e., potentially suspicious, for our similarity measures and dataset. To our knowledge, no study (including our pilot study [26]) has quantified the mathematical similarity that can be expected by chance to derive a significance threshold.

To establish significance thresholds for the scores of all similarity measures, we analyzed a random sample of 1 million document pairs as follows. We randomly picked two documents from the dataset. If the chosen documents had (a) common author(s) or if one of the documents cited the other, we discarded the pair. We continued the process until reaching the number of 1M document pairs.
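The sampling procedure can be sketched as follows; the document records and field names are hypothetical placeholders, as the actual corpus lives in HyPlag's document index:

```python
import random

def sample_unrelated_pairs(docs, n_pairs, rng=None):
    """Draw random document pairs, discarding pairs that share an author or
    in which one document cites the other, until n_pairs remain."""
    rng = rng or random.Random(0)
    ids = list(docs)
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.sample(ids, 2)
        if docs[a]["authors"] & docs[b]["authors"]:
            continue  # shared author(s): likely legitimate reuse of own work
        if b in docs[a]["cites"] or a in docs[b]["cites"]:
            continue  # one cites the other: attributed reuse
        pairs.append((a, b))
    return pairs

docs = {
    "p1": {"authors": {"Doe"}, "cites": set()},
    "p2": {"authors": {"Doe"}, "cites": set()},
    "p3": {"authors": {"Roe"}, "cites": {"p1"}},
    "p4": {"authors": {"Poe"}, "cites": set()},
}
# p1/p2 share an author and p3 cites p1, so these pairs are never sampled
for a, b in sample_unrelated_pairs(docs, 5):
    assert {a, b} != {"p1", "p2"} and {a, b} != {"p1", "p3"}
```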
Table 3: Significance thresholds for similarity measures.

        Histo  LCIS  GIT  BC  LCCS  GCT  Enco
s ≥     .56    .     .    .   .22   .    .06

Figure 2 shows the similarity scores s (vertical axis) computed using each similarity measure for the random sample of 1M document pairs. Large horizontal bars shaded in blue indicate the median score; small horizontal bars shaded in grey mark the minimum and maximum scores; small horizontal bars shaded in green indicate the significance thresholds for each measure (cf. Table 3). The grey shapes in the chart show the smoothed probability density functions of the score frequencies, which were generated by applying a kernel-based density estimation. Red dots in the plot indicate the similarity scores of test cases for which the respective measure was applied, i.e., if the document pairs contained enough features to compute a score (cf. Section 3.4).

As shown in Figure 2, the probability density function (PrDF) of Histo is symmetrical, while the PrDF of every other measure is negatively skewed, i.e., exhibits the highest frequencies at lower scores. The more strongly the PrDF of scores is negatively skewed, the more selective the measure is. For the math-based similarity measures (Histo, LCIS, GIT), considering the order of identifiers strongly increases the selectivity of the measures. The PrDF for the order-agnostic Histo measure is symmetrical. The PrDF of scores for the LCIS measure, which leniently considers the order of identifiers in the entire document, is slightly skewed towards lower values, while the PrDF for the GIT measure, which focuses on considering identifier order, is strongly skewed towards lower values.

Figure 2: Similarity scores in 1M random document pairs.
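One plausible way to derive significance thresholds like those in Table 3 from the sampled score distributions is to take a high quantile of the scores observed in unrelated pairs; the 99th percentile used below is an assumption, as the paper does not restate its exact criterion:

```python
import statistics

def significance_threshold(random_pair_scores, percentile=99):
    """Scores above the chosen percentile of the unrelated-pair distribution
    are unlikely to occur by chance and are treated as potentially suspicious."""
    # quantiles(n=100) returns the 99 cut points between percentile buckets
    return statistics.quantiles(random_pair_scores, n=100)[percentile - 1]

# Stand-in for the 1M sampled pairs: a uniform spread of scores in [0, 1)
scores = [i / 1000 for i in range(1000)]
print(round(significance_threshold(scores), 2))  # 0.99
```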
Table 4: Retrieval effectiveness of detection approaches for confirmed cases of plagiarism.

             Math                              Citation                                  Text
        Histo      LCIS       GIT        BC              LCCS            GCT             Enco
Case    r    s     r    s     r    s     r    s    s*    r   s    s*    r   s    s*     r    s
C1      1   .68    1   .40    1   .21    1   .06   .15   1   .06  .10   -   -    .04    1   .13
C2      1   .60    1   .39    1   .12   10'  .05   .28   1   .33  .42   -   -    -      1   .16
C3      3   .29    1   .88    1   .78    -   -     -     -   -    -     -   -    -      1   .36
C4     (1) (.36) (99) (.37)  (3) (.03)   -   -     .35   -   -    .44   -   -    .25    1   .15
C5     (1) (.57) (86) (.30)  (1) (.23)   5  .02    .18   7' .02   .23   -   -    .05    1   .45
C6    (19) (.14) (98) (.40)  (1) (.15)   2  .04    .32   1  .11   .44   -   -    .22    1   .27
C7      2   .52   98   .25    1   .09    -   -     .04   -   -    .05   -   -    -     (4) (.02)
C8      1   .76    1   .65    1   .37    1  .11    .37   -   -    .25   -   -    -      1   .32
C9      1   .69    1   .51    1   .27    1  .03    .26   1  .08   .39   -   -    -      1   .68
C10     1   .85    1   .81    1   .63    1  .03    .03   1  .04   .04   -   -    -      1   .51
MRR        .58        .60        .79        .48             .60            .00              .90
          (.79)      (.60)      (.93)      (.48)           (.60)          (.00)            (.93)

Given our prior research on citation-based similarity [9, 10, 12, 13], we expected similar characteristics for the citation-based measures. However, as shown in Figure 2, the order-agnostic BC measure is more selective than the order-considering LCCS measure in this case. The reason is errors in citation extraction (cf. Section 3.1). The mismatch of references and in-text citations causes the LCCS and GCT measures to consider only a fraction of the citations in the dataset. This fraction is smaller than the fraction of extracted references, which the BC measure uses. Therefore, the BC measure is more selective than the LCCS measure for this dataset, since overlaps of the comparably sparse in-text citations increased the LCCS score more than overlaps in references increased the BC score. Unrecognized in-text citations also cause the GCT measure to be overly selective for this dataset. Due to a shortage of data points, the PrDF for scores of the GCT measure shows interpolation artifacts, i.e., the PrDF is not monotonically decreasing for larger scores. HyPlag could identify in-text citation tiles above the exclusion thresholds (cf. Section 3.4) for only 41 document pairs in our sample of 1M documents and for none of our test cases.

The PrDF of the Encoplot scores shows that the text-based measure is highly selective. Nine of the test cases have scores above the significance threshold, i.e., most verified cases of AP have a significant textual overlap with the respective source document. This characteristic is common for confirmed cases of AP [34]. Identifying literal text overlap is easier for reviewers and better supported by productive PD systems than identifying concealed content similarity. Therefore, documents with (near) copied text are more likely to be discovered and hence likely overrepresented in our dataset.
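The two citation-based measures contrasted above can be sketched roughly as follows; the BC normalization by the smaller bibliography is an assumption, while LCCS is the classic dynamic-programming longest common subsequence applied to the ordered streams of in-text citations:

```python
def bibliographic_coupling(refs_a, refs_b):
    """BC-style score: overlap of the two reference sets, here normalized
    by the smaller bibliography."""
    shared = len(set(refs_a) & set(refs_b))
    smaller = min(len(set(refs_a)), len(set(refs_b)))
    return shared / smaller if smaller else 0.0

def lccs_length(citations_a, citations_b):
    """Length of the Longest Common Citation Sequence: LCS over the
    ordered streams of in-text citations."""
    m, n = len(citations_a), len(citations_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if citations_a[i] == citations_b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

a = ["r1", "r2", "r3", "r4"]
b = ["r2", "r1", "r3", "r5"]
print(bibliographic_coupling(a, b))  # 0.75 (r1, r2, r3 shared)
print(lccs_length(a, b))             # 2 (e.g., r1 followed by r3)
```

Because LCCS operates on the in-text citation streams, missing in-text citations shrink its input directly, which matches the extraction-error effect described above.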
In addition to assessing the significance of the similarity scores, we also examined the ranks r at which the similarity measures retrieved the source document for each of the test cases. To indicate the average ranking performance of the measures, we computed the Mean Reciprocal Rank (MRR). In the case of tied ranks, we considered the mean rank, i.e., the pessimistically rounded average of the ranks of all document pairs that share the same rank. The best possible score of 1 is assigned if a similarity measure exclusively retrieves the source document at rank 1 for each test case.

Table 4 shows the results of both the rank-based and the score-based assessment. For each of the test cases (C1...C10), the table lists the rank r at which HyPlag retrieved the source document and the score s the similarity measure assigned. We mark the mean rank, which we list in the case of tied ranks, with an apostrophe, e.g., 7'. Scores above the significance threshold of a measure (see Table 3) are underlined. To gauge the performance of the similarity measures specifically for the detailed analysis stage, we also state the ranks and similarity scores for the cases not retrieved in the candidate retrieval stage. We mark such entries with parentheses, e.g., (0.15). To compute the ranks and scores for these documents, we performed a comparison of the query document to all documents in the dataset. Minus characters (−) indicate that HyPlag computed no similarity score due to the exclusion criteria of the measure. Because of the incomplete and error-prone extraction of bibliographic data, we state a separate score s* for the citation-based measures.
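The MRR computation with these conventions can be sketched as follows (the tie handling via mean ranks is omitted; a rank of None encodes a case for which the measure retrieved no source document at all):

```python
def mean_reciprocal_rank(ranks):
    """MRR over the test cases; unretrieved cases (None) contribute 0,
    which is how GCT arrives at the .00 entry in Table 4."""
    return sum(0 if r is None else 1 / r for r in ranks) / len(ranks)

# Enco in Table 4: rank 1 for nine cases; C7 only at rank 4 in the detailed analysis
print(round(mean_reciprocal_rank([1] * 9 + [4]), 2))  # 0.93, the parenthesized entry
print(mean_reciprocal_rank([1] * 9 + [None]))         # 0.9, counting C7 as not retrieved
```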
The s* score indicates the true citation-based similarity of the test cases. To compute s*, we manually corrected erroneous data for in-text citations and references before applying the similarity measures.

The text-based approach, consisting of word 3-gram fingerprinting (Sherlock) for the candidate retrieval stage and efficient string matching (Encoplot) for the detailed analysis stage, achieved the best individual result. The approach retrieved nine of the ten test cases at the top rank. Only test case C7 exhibits a textual similarity that is too low to retrieve the source document in the candidate retrieval stage and mark the document as suspicious in the detailed analysis stage. The Encoplot scores for 6 of the 10 test cases exceed 0.25 and are hence clearly suspicious. For the cases C1, C2, and C4, the Encoplot scores exceed our significance threshold of 0.06, yet are lower than 0.20. Anecdotal evidence suggests that 10%-20% of text overlap is not immediately suspicious but often tolerated by journal reviewers and editors. The practices regarding acceptable text overlap vary between research fields and even between venues. Whether a productive text-based PD system would flag C1, C2, and C4 as suspicious is thus unclear. The retraction note of C1 names the unattributed reuse of a mathematical analysis, not the textual overlap with the source, as the reason for the retraction. The scores for Histo (0.68) and GIT (0.21), which both exceed the significance thresholds, reflect this similarity in mathematical content.

The math-based similarity measures achieved the second-best result when considering both the candidate retrieval and detailed analysis stages. GIT, which we devised as a new similarity measure for this study, performed particularly well, retrieving seven cases at the top rank.
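GIT's behavior can be illustrated with a simplified greedy tiling sketch in the spirit of Greedy Citation Tiling [10] and Greedy String Tiling [45]; this omits the usual hashing speedups and the exclusion thresholds of Section 3.4, and the normalization of the tiled count to a score is left out:

```python
def greedy_identifier_tiling(seq_a, seq_b, min_tile=2):
    """Repeatedly mark the longest common contiguous run (tile) of identifiers
    not yet covered in either sequence; stop when no tile of at least min_tile
    identifiers remains. Returns the total number of tiled identifiers."""
    used_a = [False] * len(seq_a)
    used_b = [False] * len(seq_b)
    tiled = 0
    while True:
        best = None  # (length, start_a, start_b) of the longest unmarked tile
        for i in range(len(seq_a)):
            for j in range(len(seq_b)):
                k = 0
                while (i + k < len(seq_a) and j + k < len(seq_b)
                       and not used_a[i + k] and not used_b[j + k]
                       and seq_a[i + k] == seq_b[j + k]):
                    k += 1
                if k >= min_tile and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return tiled
        length, i, j = best
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        tiled += length

a = ["x", "y", "z", "a", "b"]
b = ["a", "b", "x", "y", "z"]
print(greedy_identifier_tiling(a, b))  # 5: tiles (x, y, z) and (a, b)
```

A score in [0, 1] could then divide the tiled count by the length of the longer or the shorter sequence; which normalization the paper uses is not restated here.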
When considering only the detailed analysis stage, GIT achieved the same effectiveness as the text-based analysis (9 test cases retrieved at rank one, MRR=0.93). To enable this result for the detailed analysis stage, the candidate retrieval procedure could simply combine the results of the math-based, citation-based, and text-based approaches as discussed in Section 3.3.

GIT outperformed the Histo measure, which achieved the best results in our pilot study [26]. In this prior study, Histo achieved an MRR score of 0.86. Our current implementation exhibits a slightly lower MRR of 0.79. We attribute the difference to using a different conversion and data extraction process. The good performance of GIT suggests that the pattern of reusing (nearly) identical content in confined parts of a document, known as "shake&paste" or "patchwriting" [44], also applies to mathematical content.

For our test cases, LCIS achieved no significant improvement over the set-based Histo measure. Both LCIS and Histo achieved good results for test cases that share a large fraction of their mathematical content. For such documents, the amount of shared math sufficed to retrieve the documents using the Histo approach. That the large overlap in mathematical content also yielded long identifier subsequences did not significantly improve the similarity score.

The citation-based measures achieved the lowest overall performance, largely due to the deficiencies of the extracted data. Despite the sub-optimal data, the LCCS measure retrieved 5 cases at rank one, achieving an MRR score of 0.60. The similarity scores s*, which assume the bibliographic data in the documents had been extracted and matched correctly, give a better indication of the potential effectiveness of the citation-based measures. Notably, LCCS would yield scores of approx. two times the significance threshold of 0.22, and hence strongly suspicious scores, for C2, C4, C6, and C9. Given that C2 and C4 exhibit a textual overlap that is significant but not strongly suspicious (0.16 and 0.15), the high LCCS score could provide an indicator for suspicious similarity.

For all cases except C7, which none of the measures flags as suspicious, at least one math-based or citation-based measure yields a similarity score above the individual significance thresholds. For case C7, the Histo score has the smallest difference to the measure's significance threshold, making Histo the measure most likely to retrieve the case despite the comparably low score.

In summary, the evaluation using confirmed cases of AP showed that the combined analysis of math-based and citation-based similarity identified all cases that a text-based analysis also flagged as strongly suspicious. Moreover, the two nontextual detection approaches provide valuable indicators for suspicious document similarity for cases with a comparably low textual similarity.

Table 5: Top-ranked documents in exploratory study.

Rank     1      2      3     4     5      6     7     8     9     10
Case     C3     C11    C12   C13   C10    C14   C15   C16   C17   C18
Rating   Plag.  Susp.  CR    FP    Plag.  FP    CR    CR    CR    CR
In this section, we describe our findings from manually investigating the top-ranked documents that HyPlag retrieved when applying math-based and citation-based content features to compare each document of the dataset to its individual set of candidate documents. Given the size of the result set (approx. 6M document pairs) and our primary goal of searching for undiscovered cases of plagiarism, we employed several filters to focus our manual investigations on the most critical similarities. To eliminate cases in which authors likely reused their own content, we excluded document pairs that shared at least one author. This exclusion prevents the identification of potential self-plagiarism. Similarly, we pruned document pairs for which the newer document cites the older document, to reduce results in which authors reproduced previous work with due attribution. We make these restrictions for two reasons. First, the definition of what constitutes self-plagiarism varies greatly in different research fields and even for different venues. The vagueness of the problem definition prevents a well-founded assessment of the retrieved documents. Second, because we analyze all documents in the dataset, the number of results is much larger than in the typical PD scenario, i.e., analyzing a single input document.

Since we are particularly interested in the benefit that a math-based similarity assessment can add to a combined approach, we excluded documents with a Histo score below 0.25, i.e., with little math-based similarity, and sorted the remaining results according to the GIT score in descending order. To avoid excluding cases in which documents contained unequal amounts of identifiers, e.g., because one document is significantly shorter (cf. Section 3.4), we did not require a Histo score above the significance threshold of 0.56 but only a score greater than or equal to 0.25.

Table 5 shows the ten top-ranked document pairs and our rating of the observed similarities.
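The filtering steps described above can be sketched as a small pipeline; the field names are hypothetical placeholders for HyPlag's result records:

```python
def filter_result_set(pairs, min_histo=0.25):
    """Prune scored document pairs for manual review: drop pairs with shared
    authors, drop pairs in which the newer document cites the older one, keep
    only pairs with at least min_histo math-based similarity, and rank the
    remainder by GIT score in descending order."""
    kept = [p for p in pairs
            if not p["shared_author"]       # potential self-reuse is out of scope
            and not p["newer_cites_older"]  # reproduction with due attribution
            and p["histo"] >= min_histo]    # require notable math-based similarity
    return sorted(kept, key=lambda p: p["git"], reverse=True)

pairs = [
    {"id": "A", "shared_author": False, "newer_cites_older": False, "histo": 0.29, "git": 0.78},
    {"id": "B", "shared_author": True,  "newer_cites_older": False, "histo": 0.90, "git": 0.90},
    {"id": "C", "shared_author": False, "newer_cites_older": False, "histo": 0.85, "git": 0.63},
    {"id": "D", "shared_author": False, "newer_cites_older": True,  "histo": 0.80, "git": 0.70},
]
print([p["id"] for p in filter_result_set(pairs)])  # ['A', 'C']
```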
We use the abbreviated ratings 'Plag.' for confirmed cases of plagiarism, 'Susp.' for suspicious content similarity, 'CR' for notable but legitimate content reuse, and 'FP' for false positives, i.e., documents with insignificant content overlap.

The highest-ranked document pair is the confirmed case of plagiarism C3. The author of the retracted paper copied three geometric proofs with few changes from a significantly longer paper, thus resulting in a high GIT score (0.78) but a low Histo score (0.29). Another confirmed case of AP (C10) was retrieved at rank 5. The main contribution of the retracted paper in C10, a model in Nuclear Physics, was taken from the source paper while partially renaming identifiers. Almost the entire mathematical content of the retracted paper overlaps with the source document, resulting in the highest Histo score (0.85) in our exploratory study. The differences between the identifiers in the source document and the retracted document result in a lower but still suspiciously high GIT score (0.63).

The later document in C11 (rank 2) is a mixture of idea reuse and content reuse. The author of the later paper reused the argumentative structure, sequence of formulae, several of the cited sources, many descriptions of formulae, and non-trivial remarks about the implications of the research from the earlier paper. By doing so, the author of the later paper derived a minor generalization of an entropy model for a specific type of black holes introduced in the earlier paper. The later paper cites other papers by the author of the earlier paper but not the earlier paper itself. We contacted the author of the earlier paper about our findings. In his view, the later paper "certainly constitutes a case of plagiarism". In coordination with the author of the earlier paper, we contacted the journal that published the later paper.
The journal's editorial board is currently examining the case. Since the journal has not published an official determination about the legitimacy of the paper, we classify the document as suspicious. This case exemplifies the benefits of a combined math-based, citation-based, and text-based similarity analysis. Only a combined analysis reveals the full extent of the content similarity, which encompasses approx. 80% of the paper's content.

The five cases of legitimate content reuse (C12, C15, C16, C17, and C18) exhibit similar characteristics. In all five cases, the authors of the later papers reproduce and properly cite extensive mathematical models proposed in the earlier papers. HyPlag failed to recognize the citations and to exclude the document pairs due to two challenges. First, the use of severely abridged citation styles, e.g., only stating the author name(s) and the arXiv identifier of a paper. Second, some authors cite the arXiv preprint of a paper, whereas other authors cite the journal version. The journal versions regularly exhibit differences in the order of authors and the title compared to the respective arXiv preprints. Neither case was handled correctly by our preprocessing pipeline (cf. Section 3.1). Clearly, we need to improve our procedures for extracting and disambiguating such challenging references in STEM documents. However, retrieving these five cases at top ranks is justified given the overlap in mathematical content (typically multiple pages). We expect that reviewers would like to be made aware of such content overlap, e.g., to verify the correct citation of the previous work.

The two false positives, C13 (rank 4) and C14 (rank 6), that HyPlag retrieved reveal potential improvements for the math-based similarity measures. C13 comprises two papers in Combinatorics that contain long lists of all possible combinations of the identifiers a, b, and c according to a set of production rules.
Similarly, C14 comprises two documents that analyze partition functions and contain long matches entirely made up of the identifiers p and q, which occur in large quantities within unrelated formulae.

To increase the effectiveness of the math-based similarity measures and prevent such false positives, we plan to devise measures that are confined to individual formulae. Likewise, we plan to research how an assessment of the structural and semantic similarity of formulae can be adapted for the plagiarism detection use case. Research on formula search and other mathematical information retrieval tasks has provided approaches that could prove valuable for the PD scenario [15, 26]. The math-based approach to PD is at an early stage of development. Like the early approaches for text reuse detection [2, 23], we investigated basic, computationally efficient feature analysis methods to identify the reuse of identical and slightly different mathematical content. The results of our investigations show that the math-based analysis approaches increase the detection capabilities for STEM documents, particularly when combined with other similarity assessments.

By reviewing prior research, we showed that semantic, syntactic, and cross-lingual PD approaches achieve high detection effectiveness, even for concealed forms of AP. However, these approaches require a high computational effort. The recall level of efficient text-based candidate retrieval methods stagnates. Approaches that analyze nontextual content features in academic documents, such as citations and mathematical content, show promise for being employed as computationally modest methods to retrieve candidate documents and as more elaborate detailed analysis methods.

The paper at hand extends a pilot study [26], in which we explored the potential of analyzing the similarity of mathematical content in the detailed analysis stage of the external PD process.
In the current paper, we additionally devised a computationally efficient candidate retrieval stage that analyzes mathematical content features, academic citations, and textual features using production-ready information retrieval technology. Moreover, we created the GIT and LCIS measures, which consider the order of mathematical identifiers, for the detailed analysis stage. We implemented the newly developed math-based measures, as well as established citation-based and text-based measures, in HyPlag, a working prototype of a hybrid plagiarism detection system.

Using HyPlag, we compared the effectiveness of the math-based, citation-based, and text-based PD approaches using confirmed cases of AP. We showed that a simple unification of the modestly sized sets of candidate documents retrieved by each retrieval heuristic achieved perfect recall for the candidate retrieval stage. For the detailed analysis stage, the newly developed GIT measure exceeded the effectiveness of the best-performing approach (Histo) in our pilot study and achieved the same effectiveness as the text-based similarity measure in our current study.

Errors in the acquisition of in-text citations and bibliographic references decreased the effectiveness of the citation-based similarity measures in our experiments. Despite these limitations, citation-based measures added a significant benefit to the hybrid approach, particularly for the candidate retrieval stage. LCCS also performed decently for the detailed analysis stage (MRR=0.60 for our test cases). The error-corrected similarity scores showed that the true effectiveness of the citation-based measures is much higher.

Overall, the combined analysis of math-based and citation-based similarity identified all cases that a text-based analysis flagged as strongly suspicious. Moreover, the two nontextual detection approaches provided valuable indicators for suspicious document similarity for cases with a comparably low textual similarity.
This result indicates that the best detection effectiveness can be achieved by combining the heterogeneous similarity assessments.

In an exploratory study, we showed the effectiveness of analyzing math-based and citation-based similarity for discovering unknown cases of potential AP. We used the GIT and Histo measures in combination with the citation relations of documents to reduce a result set of approx. 6M document pairs to 10 document pairs that we investigated manually. The highest-ranked document pair was a confirmed case of AP. The document retrieved at the second rank was rated as an undiscovered case of AP by the author of the apparent source document. The remaining 8 cases include one confirmed case of AP, 5 documents with high but legitimate overlap
in mathematical content, and 2 false positives. The citation-based filter would have eliminated the 5 cases of legitimate content reuse if the bibliographic data had been extracted correctly. These results show the large potential of analyzing mathematical content and academic citations as a complement to text-based PD approaches.

Our future work must improve the extraction of citation data to leverage the full potential of the citation-based detection approach. We will increase the number of test cases and their degree of obfuscation to further support our results. For this purpose, we are collaborating with a major mathematical publishing service [38]. We will also research improvements to the math-based similarity measures. We expect that the performance of math-based heuristics for candidate retrieval can be improved by incorporating positional information about mathematical features. To improve the math-based measures employed for the detailed analysis stage, we will investigate approaches that consider the structural similarity of formulae and extend our research on their semantic similarity [37].

Another promising direction for future research we plan to pursue is the application of machine learning to balance the weights of the similarity measures. Such an approach could train a detection system tailored to the domain-specific properties of a collection.

In summary, we see the integrated analysis of textual and nontextual features as the most promising approach to deter and to detect academic plagiarism in STEM research publications.
REFERENCES
[1] Akiko Aizawa, Michael Kohlhase, Iadh Ounis, and Moritz Schubotz. 2014. NTCIR-11 Math-2 Task Overview. In Proc. NTCIR.
[2] Salha M. Alzahrani, Naomie Salim, and Ajith Abraham. 2012. Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods. IEEE Trans. Syst., Man, Cybern. C, Appl. Rev. 42 (2012), 133–149.
[3] Alberto Barrón-Cedeño, Parth Gupta, and Paolo Rosso. 2013. Methods for Cross-language Plagiarism Detection. Know.-Based Syst. 50 (2013), 211–217.
[4] Hannah Bast and Claudius Korzen. 2017. A Benchmark and Evaluation for Text Extraction from PDF. In Proc. JCDL.
[5] Zdenek Ceska. 2008. Plagiarism Detection Based on Singular Value Decomposition. In Advances in Natural Language Processing. LNCS, Vol. 5221. Springer.
[6] Nava Ehsan and Azadeh Shakery. 2016. Candidate Document Retrieval for Cross-lingual Plagiarism Detection Using Two-level Proximity Information. Inf. Process. Manage. 52, 6 (2016), 1004–1017.
[7] Nava Ehsan, Frank Wm. Tompa, and Azadeh Shakery. 2016. Using a Dictionary and N-gram Alignment to Improve Fine-grained Cross-Language Plagiarism Detection. In Proc. DocEng.
[8] Teddy Fishman. 2009. "We know it when we see it" is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
[9] Bela Gipp. 2014. Citation-based Plagiarism Detection - Detecting Disguised and Cross-language Plagiarism using Citation Pattern Analysis. Springer.
[10] Bela Gipp and Norman Meuschke. 2011. Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. In Proc. DocEng.
[11] Bela Gipp, Norman Meuschke, and Joeran Beel. 2011. Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag. In Proc. JCDL.
[12] Bela Gipp, Norman Meuschke, Corinna Breitinger, Jim Pitman, and Andreas Nuernberger. 2014. Web-based Demonstration of Semantic Similarity Detection using Citation Pattern Visualization for a Cross Language Plagiarism Case. In Proc. Int. Conf. on Enterprise Inform. Sys.
[13] Bela Gipp, Norman Meuschke, and Mario Lipinski. 2015. CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central. In Proc. iConference.
[14] Christian Grozea, Christian Gehl, and Marius Popescu. 2009. ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection. In Proc. PAN WS.
[15] Ferruccio Guidi and Claudio Sacerdoti Coen. 2016. A Survey on Retrieval of Mathematical Knowledge. Mathem. in Computer Science 10, 4 (2016), 409–427.
[16] D. Gupta, Vani K, and C. K. Singh. 2014. Using Natural Language Processing techniques and fuzzy-semantic similarity for automatic external plagiarism detection. In Proc. Int. Conf. on Advances in Computing, Communications and Informatics.
[17] Matthias Hagen, Martin Potthast, and Benno Stein. 2015. Source Retrieval for Plagiarism Detection from Large Web Corpora. In Proc. PAN WS.
[18] Kenichi Iwatsuki, Takeshi Sagara, Tadayoshi Hara, and Akiko Aizawa. 2017. Detecting In-line Mathematical Expressions in Scientific Documents. In Proc. DocEng.
[19] Vani K and Deepa Gupta. 2015. Investigating the impact of combined similarity metrics and POS tagging in extrinsic text plagiarism detection system. In Proc. Int. Conf. on Advances in Computing, Communications and Informatics.
[20] Leilei Kong, Haoliang Qi, Cuixia Du, Mingxing Wang, and Zhongyuan Han. 2013. Approaches for Source Retrieval and Text Alignment of Plagiarism Detection. In Proc. PAN WS.
[21] Arun kumar Jayapal. 2012. Similarity Overlap Metric and Greedy String Tiling at PAN 2012. In Proc. PAN WS.
[22] Donald L. McCabe. 2005. Cheating among College and University Students: A North American Perspective. Int. J. for Academic Integrity 1, 1 (2005), 1–11.
[23] Norman Meuschke and Bela Gipp. 2013. State-of-the-art in detecting academic plagiarism. Int. J. for Educational Integrity (2013).
[24] Norman Meuschke and Bela Gipp. 2014. Reducing Computational Effort for Plagiarism Detection by using Citation Characteristics to Limit Retrieval Space. In Proc. JCDL.
[25] Norman Meuschke, Christopher Gondek, Daniel Seebacher, Corinna Breitinger, Daniel A. Keim, and Bela Gipp. 2018. An Adaptive Image-based Plagiarism Detection Approach. In Proc. JCDL.
[26] Norman Meuschke, Moritz Schubotz, Felix Hamborg, Tomas Skopal, and Bela Gipp. 2017. Analyzing Mathematical Content to Detect Academic Plagiarism. In Proc. CIKM.
[27] Norman Meuschke, Nicolas Siebeck, Moritz Schubotz, and Bela Gipp. 2017. Analyzing Semantic Concept Patterns to Detect Academic Plagiarism. In Proc. Int. WS on Mining Scientific Publ. (WOSP) at JCDL.
[28] Norman Meuschke, Vincent Stange, Moritz Schubotz, and Bela Gipp. 2018. HyPlag: A Hybrid Approach to Academic Plagiarism Detection. In Proc. SIGIR.
[29] H.F. Moed, W.J.M. Burger, J.G. Frankfort, and A.F.J. Van Raan. 1985. The application of bibliometric indicators: Important field- and time-dependent factors to be considered. 8, 3-4 (1985), 177–203.
[30] Gabriel Oberreuter, Gaston L'Huillier, Sebastián Ríos, and Juan Velásquez. 2011. Approaches for Intrinsic and External Plagiarism Detection. In Proc. PAN WS.
[31] Merin Paul and Sangeetha Jamal. 2015. An improved SRL based plagiarism detection technique using sentence ranking. Proc. CS 46 (2015), 223–230.
[32] Solange de L. Pertile, Viviane P. Moreira, and Paolo Rosso. 2016. Comparing and combining Content- and Citation-based approaches for plagiarism detection. JASIST 67, 10 (2016), 2511–2526.
[33] Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. 2012. Overview of the 4th International Competition on Plagiarism Detection. In Proc. PAN WS.
[34] Martin Potthast, Benno Stein, Alberto Barrón Cedeño, and Paolo Rosso. 2010. An Evaluation Framework for Plagiarism Detection. In Proc. ACL.
[35] Lutz Prechelt, Guido Malpohl, and Michael Philippsen. 2002. Finding plagiarisms among a set of programs with JPlag. J. of Univ. CS 8, 11 (2002), 1016.
[36] Miguel A. Sanchez-Perez, Alexander Gelbukh, and Grigori Sidorov. 2015. Adaptive Algorithm for Plagiarism Detection: The Best-Performing Approach at PAN 2014 Text Alignment Competition. In Proc. CLEF (LNCS), Vol. 9283.
[37] Moritz Schubotz, Alexey Grigorev, Marcus Leich, Howard S. Cohl, Norman Meuschke, Bela Gipp, Abdou S. Youssef, and Volker Markl. 2016. Semantification of Identifiers in Mathematics for Better Math Information Retrieval. In Proc. SIGIR.
[38] Moritz Schubotz, Olaf Teschke, Vincent Stange, Norman Meuschke, and Bela Gipp. 2019. Forms of Plagiarism in Digital Mathematical Libraries. In Proc. Int. Conf. on Intelligent Computer Mathematics.
[39] Petr Sojka and Martin Líška. 2011. Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In Proc. Int. Conf. on Intelligent Computer Mathematics (LNCS), Vol. 6824.
[40] S. Soleman and A. Purwarianti. 2014. Experiments on the Indonesian plagiarism detection using latent semantic analysis. In Int. Conf. on ICT.
[41] Benno Stein, Sven Meyer zu Eissen, and Martin Potthast. 2007. Strategies for Retrieving Plagiarized Documents. In Proc. SIGIR.
[42] Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Lukasz Bolikowski. 2015. CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature. Int. J. Doc. Anal. Recognit. 18, 4 (2015), 317–335.
[43] Juan D. Velásquez, Yerko Covacevich, Francisco Molina, Edison Marrese-Taylor, Cristián Rodríguez, and Felipe Bravo-Marquez. 2016. DOCODE 3.0 (DOcument COpy DEtector). Information Fusion 27 (2016).
[44] Debora Weber-Wulff. 2014. False Feathers: A Perspective on Academic Plagiarism. Springer.
[45] Michael J. Wise. 1993. String Similarity via Greedy String Tiling and Running Karp-Rabin Matching. TR (Univ. of Sydney, Basser Dept. of CS) 463.
Listing 1: Use the following BibTeX code to cite this article.