Algorithmic Detection of Computer Generated Text
Allen Lavoie – [email protected]
Mukkai Krishnamoorthy – [email protected]
Rensselaer Center for Open Source Software (RCOS), Rensselaer Polytechnic Institute
Computer generated academic papers have been used to expose a lack of thorough human review at several computer science conferences. We assess the problem of classifying such documents. After identifying and evaluating several quantifiable features of academic papers, we apply methods from machine learning to build a binary classifier. In tests with two hundred papers, the resulting classifier correctly labeled papers either as human written or as computer generated with no false classifications of computer generated papers as human and a 2% false classification rate for human papers as computer generated. We believe generalizations of these features are applicable to similar classification problems. While most current text-based spam detection techniques focus on the keyword-based classification of email messages, a new generation of unsolicited computer-generated advertisements masquerades as legitimate postings in online groups, message boards and social news sites. Our results show that taking the formatting and contextual clues offered by these environments into account may be of central importance when selecting features with which to identify such unwanted postings.
A project called Scigen [8] made waves in 2005 when it produced a paper which was accepted to WMSCI 2005. Unfortunately this was not due to breakthroughs in artificial intelligence which allowed Scigen to write about computer science effectively. Instead, Scigen relies on sentence generation based on a context free grammar. It effectively produces an assortment of random keywords spliced into predefined sentence structures. With the addition of randomly generated graphics and references, the papers are practically indistinguishable from papers that humans have written until true understanding is attempted by someone who could reasonably expect to understand an equivalent but meaningful paper. The question then arises as to whether human review is required to classify such papers.

The difficulty of this problem is quite dependent on the approach used and that approach's expected applicability. On one extreme we could look for formatting quirks or other metadata which are irrelevant to the paper itself, but which could easily be used to identify Scigen papers specifically. This approach has very little general applicability, but would be almost trivial to implement. The other extreme is the most general case of identifying papers as either examples of good scholarly work or not. A solution here would be of great practical use, but seems unlikely at the present time. A good approach, then, is one which maximizes its general applicability while remaining practical.

This paper attempts to show that there exists a non-empty subset of approaches, ranging from trivial to hopelessly complex, which are both practical and useful in a broader context. While practicality has objective measures such as algorithmic complexity, the usefulness of an approach is necessarily subjective. Thus we will analyze one approach for practicality and effectiveness in dealing with Scigen papers specifically, and then discuss other potential applications.

Having decided not to exploit shallow flaws in Scigen, and lacking any true understanding of the academic papers we seek to classify, we draw candidate features from several simple observations about human writing.

1. If a paper claims to be about a specific topic, it should actually be about that topic.
2. Papers follow a central theme.
3. Cited papers are related to the paper which cites them.

Unfortunately none of these observations are directly quantifiable as stated. Note that even the above criteria are necessary but not sufficient for identifying good scholarly work. They do, however, suggest heuristics which can easily be quantified. We can rewrite each of these observations in terms of keywords, sacrificing accuracy for computability.

1. Keywords which appear in the title and abstract of a paper should appear frequently in the body of that paper.
2. Certain keywords should be favored throughout a paper.
3. A paper should mention keywords from the titles of articles it cites.

Whereas our original observations capture part of what it means for a paper to be an example of true scholarly work, the keyword heuristics are merely plausible substitutes. On the other hand, these features are readily quantifiable. Since our features are no longer directly tied to the definition of scholarly work, a paper generator could easily be adapted to superficially fulfill the above heuristics. The simplicity of keyword heuristics in general, however, also allows us to readily produce additional features.
For example, we could examine repetition within paragraphs or patterns in the usage of certain sentence structures. Domain-specific features, such as the occurrence of keywords from a user-submitted summary in a linked article on a social news website, are also feasible.

Having a basis for quantifying the abstract concept of scholarly work, we can now build a classifier and test our intuitive notions empirically.
We begin by converting a subject paper to plain text from the typical PDF. There are several freely available tools for accomplishing this task. Next, a paper is split into word tokens.

Before additional processing is done to clean up the text, we search for certain keywords denoting the title, abstract, introduction and references sections of a paper. Keeping such structure allows us to base features on comparisons between sections, and doing it early ensures that our text processing does not interfere. In the case of missing keywords, we simply fall back to taking a certain fraction of the paper at the desired section's expected location.

Next, we clean the text by selecting only parts of speech which might reasonably have keywords related to the topic of the paper. This entails standard part of speech tagging followed by aggressive filtering. Specifically, we selected nouns, adjectives and foreign or unrecognized words. Words which were unrecognized by the tagging algorithm often turned out to be the most valuable, as many times they were examples of unique technical jargon.

The selected tokens are finally stemmed to avoid confusion between different forms of a single word. This allows for a straightforward character based comparison, which turned out to be good enough for our purposes. A more advanced approach might make use of a semantic difference metric [5].
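A minimal sketch of this preprocessing pipeline, assuming NLTK with its default tokenizer and tagger models installed; the exact tag set kept here is illustrative rather than a precise record of our implementation:

```python
# Preprocessing sketch: tokenize, part-of-speech filter, and stem.
import nltk
from nltk.stem.porter import PorterStemmer

# Nouns, adjectives, and foreign words under the Penn Treebank tag set.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "FW"}

def preprocess(text):
    """Return the stemmed, part-of-speech-filtered tokens of a section."""
    stemmer = PorterStemmer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Tokens the tagger cannot recognize are often technical jargon and
    # could be kept as well.
    return [stemmer.stem(word.lower())
            for word, tag in tagged
            if tag in KEEP_TAGS]
```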
The text can then be scored numerically based on our selected features. We need to make the three features discussed in the previous section slightly more specific in order to accomplish this. Again we trade some semantics for ease of computation as we introduce fairly arbitrary nuances to our features, but this time the sacrifices are fairly subtle.
For our first feature, we could perform a straightforward count of keywords from the title and abstract of a paper in that paper's body. This is a decent first approximation, but favors long papers unduly. Thus, we normalize this feature by scoring papers based on the number of times keywords from the title or abstract of a paper are mentioned divided by the length of the part of speech filtered body of that paper. Let $A$ be the set of keywords from the title and abstract of a paper after part of speech filtering, and $B$ be the corresponding multiset for the remainder of the paper similarly filtered. Here, $m_Q(q)$ denotes the multiplicity of element $q$ in the multiset $Q$. Then our first feature's score $s_1$ can be written as follows:

$$s_1 = \frac{\sum_{a \in A} m_B(a)}{|B|} \qquad (1)$$

We must be careful to treat $B$ as a multiset and not simply as a set, since this feature seeks to quantify the repetition of ideas from the title and abstract of a paper in its body. We could similarly take the repetition of keywords from the title and abstract of a paper into account by making $A$ a multiset as well, but doing this simply rewards repetition rather than completeness. Whereas the abstract and title of a paper taken together should be a succinct summary of the paper's content, rephrasing and repetition is both common and desirable in a paper's body.

Next, we seek to quantify the repetition of a certain set of words throughout a paper. Here we have chosen to compare the occurrence of the top $N$ most used words in a paper to the occurrence of all other words. Let $P$ denote the multiset of all words in a given paper, and let $W = \{w_i\}$ denote the set of distinct elements in $P$ sorted in decreasing order of their multiplicity $m_P(w_i)$. Thus $w_0$ occurs with the highest frequency in the paper, followed by $w_1$ and so on. Here $|P| = \sum_{w_i \in W} m_P(w_i)$. Then the second feature can be written as follows:

$$s_2 = \frac{\sum_{i=0}^{N-1} m_P(w_i)}{|P| - \sum_{i=0}^{N-1} m_P(w_i)} \qquad (2)$$

In general $N$ must be between 1 and $|W| - 1$; we found $N = 10$ to be a fairly good tradeoff for this feature. Note that part of speech tagging and filtering plays a very important role here. Without it we would almost certainly be selecting words which are simply common in the English language, at which point our feature would cease to make sense.

Our final feature is calculated much like the first. Without parsing references at all, we simply use the tokenized, filtered and stemmed set of keywords from the references section. This includes paper titles, authors and a lot of other irrelevant information. The irrelevant tokens do not affect the feature's score, since we do not normalize on the number of tokens in the references section. If we let $R$ denote the set of tokens from the references section of a paper and again let $B$ denote the multiset of keywords in the remainder of the paper, we can compute the third feature as follows:

$$s_3 = \frac{\sum_{r \in R} m_B(r)}{|B|} \qquad (3)$$

While feature three's computation is nearly identical to (1), its semantics are quite different. While our first feature attempts to quantify the relevance of the title and abstract of a paper, our third feature seeks to quantify the relevance of its chosen references. Thus despite the similar computation, we do not see a strong linear dependence between these two features.

Finally, we build a classifier based on all three features. Let a paper $p$ be represented by a point $(s_1, s_2, s_3)$ in a three dimensional space, where $s_1$, $s_2$ and $s_3$ are our first, second and third features respectively.
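The three scores map directly onto multiset counts. The following is a minimal sketch using Python's collections.Counter as the multiset; the argument names are illustrative, and whether the references section is counted as part of $P$ is left as an implementation choice:

```python
# Sketch of feature scores (1)-(3). `title_abstract`, `body`, and
# `references` are the stemmed, part-of-speech-filtered token lists
# for each section of a single paper.
from collections import Counter

def feature_scores(title_abstract, body, references, n=10):
    body_counts = Counter(body)            # multiset B
    body_size = sum(body_counts.values())  # |B|

    # Feature 1: title/abstract keywords A treated as a set.
    s1 = sum(body_counts[a] for a in set(title_abstract)) / body_size

    # Feature 2: mass of the top-N words relative to all other words.
    # Here P is taken to be every filtered token in the paper.
    paper_counts = Counter(title_abstract) + body_counts + Counter(references)
    top_n = sum(count for _, count in paper_counts.most_common(n))
    s2 = top_n / (sum(paper_counts.values()) - top_n)

    # Feature 3: reference-section tokens R treated as a set; tokens that
    # never occur in the body contribute nothing, so irrelevant reference
    # tokens do not affect the score.
    s3 = sum(body_counts[r] for r in set(references)) / body_size

    return (s1, s2, s3)
```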
We can then build a classifier based on one of several well known methods from machine learning. A nearest neighbor classifier [4] which takes a vote of the $k$ nearest points ($k = 3$ in this case) with known classifications was used for its simplicity on small data sets for the purposes of this paper, but support vector machines [3] or other more advanced algorithms would be more efficient when dealing with larger data sets.

The running time of our implementation of the algorithm outlined above is dominated by the part of speech tagging used during preprocessing. We use the default part of speech tagger from the Natural Language Toolkit [6], but a faster and slightly less accurate tagging algorithm could dramatically reduce the running time of the algorithm presented above. Tagging accuracy should not affect the accuracy of the overall algorithm overmuch, since we are simply using the tags to accept or reject word tokens.

After preprocessing, feature scores can generally be calculated sub-quadratically in the length of a paper by using either search trees or hash tables for calculating the multiplicity of a given keyword. In practice, this step is quite fast compared to preprocessing.

Finally, classification relies on one of several well known binary classifiers. For simplicity, we used a nearest neighbor search with a KD-tree [9] based on a Euclidean distance metric. A support vector machine or other classifier whose running time during classification is independent of the size of the set of training data can be substituted where efficiency is a concern.
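A minimal sketch of such a classifier, using a KD-tree from SciPy (a substitute for whichever KD-tree implementation is at hand) with a majority vote among the $k$ nearest training points; the class interface and label encoding are illustrative:

```python
# k-nearest-neighbor classifier sketch: a KD-tree over the (s1, s2, s3)
# feature vectors with a majority vote among the k nearest labeled papers.
import numpy as np
from scipy.spatial import cKDTree

class KNNClassifier:
    def __init__(self, points, labels, k=3):
        self.tree = cKDTree(points)       # points: (n, 3) feature vectors
        self.labels = np.asarray(labels)  # 1 = computer generated, 0 = human
        self.k = k

    def classify(self, point):
        # Euclidean distance is cKDTree's default metric.
        _, indices = self.tree.query(point, k=self.k)
        # Majority vote among the k nearest training papers.
        return int(round(self.labels[indices].mean()))

# Usage: clf = KNNClassifier(training_points, training_labels, k=3)
#        clf.classify((s1, s2, s3))
```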
After scoring 200 papers based on the above features, a 3-nearest-neighbor classifier misclassified only 2 papers for an error rate of 1%. For error estimation, we used leave one out cross validation. The data set consisted of 100 computer generated papers from Scigen and 100 randomly chosen papers from the computer science and mathematics sections of the ArXiv [1].

[Figure 1: Error rates for nearest neighbor classifiers of order k using leave one out cross validation.]

Of the two misclassified papers, the first [7] has an exceptionally short abstract and a great deal of formulas. The short abstract yields a low score for feature 1, since it mentions very few relevant keywords. The formulas did not translate well to text, and our failure to filter out the artifacts of this translation made for a slightly reduced score for feature 2. The second paper [2] had no exceptionally low score, but simply did not score well on any feature, placing it well within a cluster of computer generated papers. Neither paper appears to be computer generated upon human inspection.

Notably, we did not have any false negative classifications. While human papers showed a great deal of variation leading to the errors mentioned above, Scigen papers fell within fairly well defined ranges on each feature.

We mention above that our error was measured with a 3-nearest-neighbor classifier. This is perhaps somewhat surprising, since the model provides almost no regularization and we do none outside of the model. However, Figure 1 indicates that this level of regularization is preferable to that found in higher order nearest neighbor classifiers. The same general trend is visible when pruning is employed, and so we do not believe that the trend is simply an artifact of density effects.

[Figure 2: A two dimensional cross section of the classifier, ignoring the references score.]

Figure 2 shows the distribution of papers and the classification boundary based on our 3-nearest-neighbor classifier and two of the three features: word repetition and title and abstract scores. The image is somewhat misleading, as the classifier works with all three features in a three dimensional space. All but two of the points which appear misclassified in Figure 2 are differentiated by their references score. The paper you are now reading was determined to be a human product by the 3-nearest-neighbor classifier.

We have shown that the problem of computer generation of text is not quite so simple as stringing together keywords from a predefined distribution. Coherent human writing has many subtly self-referential elements which can be exploited to classify products of unwary text generators. Relying simply on keyword-based features, it is possible to exploit these elements efficiently to attack the problem of paper classification with only moderately sized data sets.

Traditionally, attempts to filter out automated messages have focused on their keyword composition rather than their structure. Such techniques are ideal when very little context is available to the message generator, such as in the case of a new email message arriving. Previously received messages provide a good deal of context for the classifier, but are unavailable when generating the unsolicited messages. These techniques are less effective when a context is readily available to all parties, where the keyword composition of generated messages can be carefully selected to avoid detection.

The structural elements described in this paper are certainly not difficult to duplicate in a text generator.
When generating papers, one must simply favor words chosen previously or in certain sections of the text. However, the specific features discussed in this paper only scratch the surface of possible structural considerations when building classifiers. As text generation evolves, text classifiers have many aspects of text structure from which to draw new heuristics. Many of these heuristics will be domain specific, just as some of the features discussed in this paper are applicable only to academic papers.

As user generated content becomes more prevalent, there is an increasing monetary incentive to pass unsolicited and automated messages off as human content in information sharing networks. In many cases it is undesirable or infeasible to have all such messages moderated by a human, in which case techniques from machine learning such as those described in this paper will become increasingly expedient.
It is worth considering the limits of text classifiers in general. Consider a perfect binary classifier of papers as either scholarly work or not. We could then construct a program to enumerate all examples of scholarly work by filtering an enumeration of the set of all strings. This is certainly no proof of impossibility, but such a program does seem unlikely. It is then natural to ask how close a classifier can reasonably get to the aforementioned ideal, or even how effective a specific classifier is.

In this paper we perform a straightforward character based comparison on stemmed words. There has been some work [5] in determining semantic differences between words. We could then ask how a word relatedness heuristic in place of a simple character comparison affects a text classifier's accuracy; a possible starting point is sketched below. Is automatic deep reference checking feasible, and if so can it make a reference score more effective? There are automated tools which index academic papers and analyze citation networks, providing information which could be very useful in creating additional features for classifying computer generated academic papers specifically.

We pose quantifying the effectiveness of text classifiers as an open question above. An analogous question can be asked about text generation. Can we quantify the difficulty of classifying text generated by certain means? Does the availability of a corpus of human texts make text generation with plausible structure easier?
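One hedged sketch of the word relatedness idea substitutes a WordNet path similarity check for exact stem matching, using NLTK's WordNet interface rather than the SD-form metric of [5]; the threshold is an illustrative assumption:

```python
# Word relatedness check that could replace exact stem comparison.
# Assumes NLTK with the WordNet corpus installed.
from nltk.corpus import wordnet as wn

def related(word_a, word_b, threshold=0.5):
    """Return True if any senses of the two words are close in WordNet."""
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            similarity = syn_a.path_similarity(syn_b)
            # path_similarity returns None when no path connects the senses.
            if similarity is not None and similarity >= threshold:
                return True
    return False
```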
The source code for an implementation of the algorithm discussed in this paper is available online at http://code.google.com/p/paper-detection/.

This work was made possible by the generous support from Sean O'Sullivan (RPI '85) of the Rensselaer Center for Open Source Software.

[1] ArXiv e-print archive. http://arxiv.org/.
[2] Vicente H. F. Batista, George O. Ainsworth Jr., and Fernando L. B. Ribeiro. Parallel structurally-symmetric sparse matrix-vector products on multi-core processors. ArXiv e-prints, 2010.
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[4] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[5] Eiji Kawaguchi, Seiichiro Kamata, Masahiro Wakiyama, and Koichi Nozaki. An algorithm to compute semantic metric in the SD-form semantics model. Information Modelling and Knowledge Bases, IV, 1995.
[6] Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, 2002.
[7] P. Scholze. The Langlands-Kottwitz approach for the modular curve. ArXiv e-prints, 2010.
[8] Jeremy Stribling, Max Krohn, and Dan Aguayo. Scigen: An automatic CS paper generator. http://pdos.csail.mit.edu/scigen/.
[9] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993.