Algorithmic Detection of Computer Generated Text
Allen Lavoie – [email protected]
Mukkai Krishnamoorthy – [email protected]
Rensselaer Center for Open Source Software (RCOS), Rensselaer Polytechnic Institute
Computer generated academic papers have been used to expose a lack of thorough human review at several computer science conferences. We assess the problem of classifying such documents. After identifying and evaluating several quantifiable features of academic papers, we apply methods from machine learning to build a binary classifier. In tests with two hundred papers, the resulting classifier correctly labeled papers either as human written or as computer generated with no false classifications of computer generated papers as human and a 2% false classification rate for human papers as computer generated. We believe generalizations of these features are applicable to similar classification problems. While most current text-based spam detection techniques focus on the keyword-based classification of email messages, a new generation of unsolicited computer-generated advertisements masquerades as legitimate postings in online groups, message boards and social news sites. Our results show that taking the formatting and contextual clues offered by these environments into account may be of central importance when selecting features with which to identify such unwanted postings.
A project called Scigen [8] made waves in 2005 when it produced a paper which was accepted to WMSCI 2005. Unfortunately this was not due to breakthroughs in artificial intelligence which allowed Scigen to write about computer science effectively. Instead, Scigen relies on sentence generation based on a context free grammar. It effectively produces an assortment of random keywords spliced into predefined sentence structures. With the addition of randomly generated graphics and references, the papers are practically indistinguishable from papers that humans have written until true understanding is attempted by someone who could reasonably expect to understand an equivalent but meaningful paper. The question then arises as to whether human review is required to classify such papers.

The difficulty of this problem is quite dependent on the approach used and that approach's expected applicability. On one extreme we could look for formatting quirks or other metadata which are irrelevant to the paper itself, but which could easily be used to identify Scigen papers specifically. This approach has very little general applicability, but would be almost trivial to implement. The other extreme is the most general case of identifying papers as either examples of good scholarly work or not. A solution here would be of great practical use, but seems unlikely at the present time. A good approach, then, is one which maximizes its general applicability while remaining practical.

This paper attempts to show that there exists a non-empty subset of approaches, ranging from trivial to hopelessly complex, which are both practical and useful in a broader context. While practicality has objective measures such as algorithmic complexity, the usefulness of an approach is necessarily subjective. Thus we will analyze one approach for practicality and effectiveness in dealing with Scigen papers specifically, and then discuss other potential applications.

Having decided not to exploit shallow flaws in Scigen, and lacking any true understanding of the academic papers we seek to classify, we draw candidate features from several simple observations about human writing.

1. If a paper claims to be about a specific topic, it should actually be about that topic.
2. Papers follow a central theme.
3. Cited papers are related to the paper which cites them.

Unfortunately none of these observations are directly quantifiable as stated. Note that even the above criteria are necessary but not sufficient for identifying good scholarly work. They do, however, suggest heuristics which can easily be quantified. We can rewrite each of these observations in terms of keywords, sacrificing accuracy for computability.

1. Keywords which appear in the title and abstract of a paper should appear frequently in the body of that paper.
2. Certain keywords should be favored throughout a paper.
3. A paper should mention keywords from the titles of articles it cites.

Whereas our original observations capture part of what it means for a paper to be an example of true scholarly work, the keyword heuristics are merely plausible substitutes. On the other hand, these features are readily quantifiable. Since our features are no longer directly tied to the definition of scholarly work, a paper generator could easily be adapted to superficially fulfill the above heuristics. The simplicity of keyword heuristics in general, however, also allows us to readily produce additional features.
For example, we could examine repetition within paragraphs or patterns in the usage of certain sentence structures. Domain-specific features, such as the occurrence of keywords from a user-submitted summary in a linked article on a social news website, are also feasible.

Having a basis for quantifying the abstract concept of scholarly work, we can now build a classifier and test our intuitive notions empirically.
We begin by converting a subject paper to plain text from the typical PDF. There are several freely available tools for accomplishing this task. Next, a paper is split into word tokens.

Before additional processing is done to clean up the text, we search for certain keywords denoting the title, abstract, introduction and references sections of a paper. Keeping such structure allows us to base features on comparisons between sections, and doing it early ensures that our text processing does not interfere. In the case of missing keywords, we simply fall back to taking a certain fraction of the paper at the desired section's expected location.

Next, we clean the text by selecting only parts of speech which might reasonably have keywords related to the topic of the paper. This entails standard part of speech tagging followed by aggressive filtering. Specifically, we selected nouns, adjectives and foreign or unrecognized words. Words which were unrecognized by the tagging algorithm often turned out to be the most valuable, as many times they were examples of unique technical jargon.

The selected tokens are finally stemmed to avoid confusion between different forms of a single word. This allows for a straightforward character based comparison, which turned out to be good enough for our purposes. A more advanced approach might make use of a semantic difference metric [5].
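A minimal sketch of this preprocessing pipeline, assuming NLTK with its default tokenizer and tagger models installed; the exact tag set kept here is illustrative rather than a precise record of our implementation:

```python
# Preprocessing sketch: tokenize, part-of-speech filter, and stem.
import nltk
from nltk.stem.porter import PorterStemmer

# Nouns, adjectives, and foreign words under the Penn Treebank tag set.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "FW"}

def preprocess(text):
    """Return the stemmed, part-of-speech-filtered tokens of a section."""
    stemmer = PorterStemmer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Tokens the tagger cannot recognize are often technical jargon and
    # could be kept as well.
    return [stemmer.stem(word.lower())
            for word, tag in tagged
            if tag in KEEP_TAGS]
```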
The text can then be scored numerically based on our selected features. We need to make the three features discussed in the previous section slightly more specific in order to accomplish this. Again we trade some semantics for ease of computation as we introduce fairly arbitrary nuances to our features, but this time the sacrifices are fairly subtle.
For our first feature, we could perform a straightforward count of keywords from the title and abstract of a paper in that paper's body. This is a decent first approximation, but favors long papers unduly. Thus, we normalize this feature by scoring papers based on the number of times keywords from the title or abstract of a paper are mentioned divided by the length of the part of speech filtered body of that paper. Let $A$ be the set of keywords from the title and abstract of a paper after part of speech filtering, and $B$ be the corresponding multiset for the remainder of the paper similarly filtered. Here, $m_Q(q)$ denotes the multiplicity of element $q$ in the multiset $Q$. Then our first feature's score $s_1$ can be written as follows:

$$s_1 = \frac{\sum_{a \in A} m_B(a)}{|B|} \qquad (1)$$

We must be careful to treat $B$ as a multiset and not simply as a set, since this feature seeks to quantify the repetition of ideas from the title and abstract of a paper in its body. We could similarly take the repetition of keywords from the title and abstract of a paper into account by making $A$ a multiset as well, but doing this simply rewards repetition rather than completeness. Whereas the abstract and title of a paper taken together should be a succinct summary of the paper's content, rephrasing and repetition is both common and desirable in a paper's body.

Next, we seek to quantify the repetition of a certain set of words throughout a paper. Here we have chosen to compare the occurrence of the top $N$ most used words in a paper to the occurrence of all other words. Let $P$ denote the multiset of all words in a given paper, and let $W = \{w_i\}$ denote the set of distinct elements in $P$ sorted in decreasing order of their multiplicity $m_P(w_i)$. Thus $w_0$ occurs with the highest frequency in the paper, followed by $w_1$ and so on. Here $|P| = \sum_{w_i \in W} m_P(w_i)$. Then the second feature can be written as follows:

$$s_2 = \frac{\sum_{i=0}^{N-1} m_P(w_i)}{|P| - \sum_{i=0}^{N-1} m_P(w_i)} \qquad (2)$$

In general $N$ must be between 1 and $|W| - 1$; we found $N = 10$ to be a fairly good tradeoff for this feature. Note that part of speech tagging and filtering plays a very important role here. Without it we would almost certainly be selecting words which are simply common in the English language, at which point our feature would cease to make sense.

Our final feature is calculated much like the first. Without parsing references at all, we simply use the tokenized, filtered and stemmed set of keywords from the references section. This includes paper titles, authors and a lot of other irrelevant information. The irrelevant tokens do not affect the feature's score, since we do not normalize on the number of tokens in the references section. If we let $R$ denote the set of tokens from the references section of a paper and again let $B$ denote the multiset of keywords in the remainder of the paper, we can compute the third feature as follows:

$$s_3 = \frac{\sum_{r \in R} m_B(r)}{|B|} \qquad (3)$$

While feature three's computation is nearly identical to (1), its semantics are quite different. While our first feature attempts to quantify the relevance of the title and abstract of a paper, our third feature seeks to quantify the relevance of its chosen references. Thus despite the similar computation, we do not see a strong linear dependence between these two features.

Finally, we build a classifier based on all three features. Let a paper $p$ be represented by a point $(s_1, s_2, s_3)$ in a three dimensional space, where $s_1$, $s_2$ and $s_3$ are our first, second and third features respectively.
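The three scores map directly onto multiset counts. The following is a minimal sketch using Python's collections.Counter as the multiset; the argument names are illustrative, and whether the references section is counted as part of $P$ is left as an implementation choice:

```python
# Sketch of feature scores (1)-(3). `title_abstract`, `body`, and
# `references` are the stemmed, part-of-speech-filtered token lists
# for each section of a single paper.
from collections import Counter

def feature_scores(title_abstract, body, references, n=10):
    body_counts = Counter(body)            # multiset B
    body_size = sum(body_counts.values())  # |B|

    # Feature 1: title/abstract keywords A treated as a set.
    s1 = sum(body_counts[a] for a in set(title_abstract)) / body_size

    # Feature 2: mass of the top-N words relative to all other words.
    # Here P is taken to be every filtered token in the paper.
    paper_counts = Counter(title_abstract) + body_counts + Counter(references)
    top_n = sum(count for _, count in paper_counts.most_common(n))
    s2 = top_n / (sum(paper_counts.values()) - top_n)

    # Feature 3: reference-section tokens R treated as a set; tokens that
    # never occur in the body contribute nothing, so irrelevant reference
    # tokens do not affect the score.
    s3 = sum(body_counts[r] for r in set(references)) / body_size

    return (s1, s2, s3)
```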
We can then build a classifier based on one of several well known methods from machine learning. A nearest neighbor classifier [4] which takes a vote of the $k$ nearest points ($k = 3$ in this case) with known classifications was used for its simplicity on small data sets for the purposes of this paper, but support vector machines [3] or other more advanced algorithms would be more efficient when dealing with larger data sets.

The running time of our implementation of the algorithm outlined above is dominated by the part of speech tagging used during preprocessing. We use the default part of speech tagger from the Natural Language Toolkit [6], but a faster and slightly less accurate tagging algorithm could dramatically reduce the running time of the algorithm presented above. Tagging accuracy should not affect the accuracy of the overall algorithm overmuch, since we are simply using the tags to accept or reject word tokens.

After preprocessing, feature scores can generally be calculated sub-quadratically in the length of a paper by using either search trees or hash tables for calculating the multiplicity of a given keyword. In practice, this step is quite fast compared to preprocessing.

Finally, classification relies on one of several well known binary classifiers. For simplicity, we used a nearest neighbor search with a KD-tree [9] based on a Euclidean distance metric. A support vector machine or other classifier whose running time during classification is independent of the size of the set of training data can be substituted where efficiency is a concern.
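A minimal sketch of such a classifier, using a KD-tree from SciPy (a substitute for whichever KD-tree implementation is at hand) with a majority vote among the $k$ nearest training points; the class interface and label encoding are illustrative:

```python
# k-nearest-neighbor classifier sketch: a KD-tree over the (s1, s2, s3)
# feature vectors with a majority vote among the k nearest labeled papers.
import numpy as np
from scipy.spatial import cKDTree

class KNNClassifier:
    def __init__(self, points, labels, k=3):
        self.tree = cKDTree(points)       # points: (n, 3) feature vectors
        self.labels = np.asarray(labels)  # 1 = computer generated, 0 = human
        self.k = k

    def classify(self, point):
        # Euclidean distance is cKDTree's default metric.
        _, indices = self.tree.query(point, k=self.k)
        # Majority vote among the k nearest training papers.
        return int(round(self.labels[indices].mean()))

# Usage: clf = KNNClassifier(training_points, training_labels, k=3)
#        clf.classify((s1, s2, s3))
```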
After scoring 200 papers based on the above features, a 3-nearest-neighbor classifier misclassified only 2 papers for an error rate of 1%. For error estimation, we used leave one out cross validation. The data set consisted of 100 computer generated papers from Scigen and 100 randomly chosen papers from the computer science and mathematics sections of the ArXiv [1].

[Figure 1: Error rates for nearest neighbor classifiers of order k using leave one out cross validation.]

Of the two misclassified papers, the first [7] has an exceptionally short abstract and a great deal of formulas. The short abstract yields a low score for feature 1, since it mentions very few relevant keywords. The formulas did not translate well to text, and our failure to filter out the artifacts of this translation made for a slightly reduced score for feature 2. The second paper [2] had no exceptionally low score, but simply did not score well on any feature, placing it well within a cluster of computer generated papers. Neither paper appears to be computer generated upon human inspection.

Notably, we did not have any false negative classifications. While human papers showed a great deal of variation leading to the errors mentioned above, Scigen papers fell within fairly well defined ranges on each feature.

We mention above that our error was measured with a 3-nearest-neighbor classifier. This is perhaps somewhat surprising, since the model provides almost no regularization and we do none outside of the model. However, Figure 1 indicates that this level of regularization is preferable to that found in higher order nearest neighbor classifiers. The same general trend is visible when pruning is employed, and so we do not believe that the trend is simply an artifact of density effects.

[Figure 2: A two dimensional cross section of the classifier, ignoring the references score.]

Figure 2 shows the distribution of papers and the classification boundary based on our 3-nearest-neighbor classifier and two of the three features: word repetition and title and abstract scores. The image is somewhat misleading, as the classifier works with all three features in a three dimensional space. All but two of the points which appear misclassified in Figure 2 are differentiated by their references score. The paper you are now reading was determined to be a human product by the 3-nearest-neighbor classifier.

We have shown that the problem of computer generation of text is not quite so simple as stringing together keywords from a predefined distribution. Coherent human writing has many subtly self-referential elements which can be exploited to classify products of unwary text generators. Relying simply on keyword-based features, it is possible to exploit these elements efficiently to attack the problem of paper classification with only moderately sized data sets.

Traditionally, attempts to filter out automated messages have focused on their keyword composition rather than their structure. Such techniques are ideal when very little context is available to the message generator, such as in the case of a new email message arriving. Previously received messages provide a good deal of context for the classifier, but are unavailable when generating the unsolicited messages. These techniques are less effective when a context is readily available to all parties, where the keyword composition of generated messages can be carefully selected to avoid detection.

The structural elements described in this paper are certainly not difficult to duplicate in a text generator.
When generating papers, one must simply favor words chosen previously or in certain sections of the text. However, the specific features discussed in this paper only scratch the surface of possible structural considerations when building classifiers. As text generation evolves, text classifiers have many aspects of text structure from which to draw new heuristics. Many of these heuristics will be domain specific, just as some of the features discussed in this paper are applicable only to academic papers.

As user generated content becomes more prevalent, there is an increasing monetary incentive to pass unsolicited and automated messages off as human content in information sharing networks. In many cases it is undesirable or infeasible to have all such messages moderated by a human, in which case techniques from machine learning such as those described in this paper will become increasingly expedient.
It is worth considering the limits of text classifiers in general. Consider a perfect binary classifier of papers as either scholarly work or not. We could then construct a program to enumerate all examples of scholarly work by filtering an enumeration of the set of all strings. This is certainly no proof of impossibility, but such a program does seem unlikely. It is then natural to ask how close a classifier can reasonably get to the aforementioned ideal, or even how effective a specific classifier is.

In this paper we perform a straightforward character based comparison on stemmed words. There has been some work [5] in determining semantic differences between words. We could then ask how a word relatedness heuristic in place of a simple character comparison affects a text classifier's accuracy; a possible starting point is sketched below. Is automatic deep reference checking feasible, and if so can it make a reference score more effective? There are automated tools which index academic papers and analyze citation networks, providing information which could be very useful in creating additional features for classifying computer generated academic papers specifically.

We pose quantifying the effectiveness of text classifiers as an open question above. An analogous question can be asked about text generation. Can we quantify the difficulty of classifying text generated by certain means? Does the availability of a corpus of human texts make text generation with plausible structure easier?
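One hedged sketch of the word relatedness idea substitutes a WordNet path similarity check for exact stem matching, using NLTK's WordNet interface rather than the SD-form metric of [5]; the threshold is an illustrative assumption:

```python
# Word relatedness check that could replace exact stem comparison.
# Assumes NLTK with the WordNet corpus installed.
from nltk.corpus import wordnet as wn

def related(word_a, word_b, threshold=0.5):
    """Return True if any senses of the two words are close in WordNet."""
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            similarity = syn_a.path_similarity(syn_b)
            # path_similarity returns None when no path connects the senses.
            if similarity is not None and similarity >= threshold:
                return True
    return False
```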
The source code for an implementation of the algorithm discussed in this paper is available online at http://code.google.com/p/paper-detection/.

This work was made possible by the generous support from Sean O'Sullivan (RPI '85) of the Rensselaer Center for Open Source Software.

[1] ArXiv e-print archive. http://arxiv.org/.
[2] Vicente H. F. Batista, George O. Ainsworth Jr., and Fernando L. B. Ribeiro. Parallel structurally-symmetric sparse matrix-vector products on multi-core processors. ArXiv e-prints, 2010.
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[4] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[5] Eiji Kawaguchi, Seiichiro Kamata, Masahiro Wakiyama, and Koichi Nozaki. An algorithm to compute semantic metric in the SD-form semantics model. Information Modelling and Knowledge Bases, IV, 1995.
[6] Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, 2002.
[7] P. Scholze. The Langlands-Kottwitz approach for the modular curve. ArXiv e-prints, 2010.
[8] Jeremy Stribling, Max Krohn, and Dan Aguayo. Scigen: An automatic CS paper generator. http://pdos.csail.mit.edu/scigen/.
[9] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993.