Linguistic data mining with complex networks: a stylometric-oriented approach
Tomasz Stanisz a, Jarosław Kwapień a, Stanisław Drożdż a,b,∗

a Complex Systems Theory Department, Institute of Nuclear Physics, Polish Academy of Sciences, ul. Radzikowskiego 152, Kraków 31-342, Poland
b Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology, ul. Warszawska 24, Kraków 31-155, Poland
Abstract
By representing a text as a set of words and their co-occurrences, one obtains a word-adjacency network, a reduced representation of a given language sample. In this paper, the possibility of using the network representation to extract information about the individual language styles of literary texts is studied. By determining selected quantitative characteristics of the networks and applying machine learning algorithms, it is possible to distinguish between texts by different authors. Within the studied set of texts, English and Polish, properly rescaled weighted clustering coefficients and weighted degrees of only a few nodes in the word-adjacency networks are sufficient to obtain an authorship attribution accuracy of over 90%. A correspondence between the text authorship and the word-adjacency network structure can therefore be found. The network representation makes it possible to distinguish individual language styles by comparing the way the authors use particular words and punctuation marks. The presented approach can be viewed as a generalization of the authorship attribution methods based on simple lexical features. Additionally, other network parameters are studied, both local and global, for both the unweighted and the weighted networks. Their potential to capture the diversity of writing styles is discussed; some differences between the languages are observed.
Keywords: complex networks, natural language, data mining, stylometry, authorship attribution
∗ Corresponding author
Email address: [email protected] (Stanisław Drożdż)

Preprint submitted to Information Sciences, January 18, 2019

1. Introduction

Many systems studied in contemporary science can, from the standpoint of their structure, be viewed as ensembles of a large number of elements interacting with each other. Such systems can often be conveniently represented by networks, consisting of nodes and links between these nodes. As nodes and links are abstract concepts, they may refer to many different objects and types of interactions, respectively. Therefore, the rapidly developing so-called network science has found application in studies of a great variety of systems, such as social networks, biological networks, networks representing financial dependencies, the structure of the Internet, or the organization of transportation systems [8, 12, 16, 23, 34, 38, 43]. The mentioned networks are usually referred to as complex networks; they owe their name to the general concept of complexity. The above-listed systems are often viewed as examples of complex systems. Although complexity does not have a precise definition, it can be stated that a complex system typically consists of a large number of elements nonlinearly interacting with each other, is able to display collective behavior, and, by exchanging energy or information with its surroundings, can modify its internal structure [23]. Many complex systems exhibit emergent properties, which are characteristic of the system as a whole and cannot be simply derived or predicted solely from the properties of the system's individual elements. The existence of emergent phenomena is often summarized with phrases such as "more is different" or "the whole is something besides the parts".

In this context, natural language is clearly a vivid instance of a complex system. Higher levels of its structure usually cannot be simply reduced to the sum of the elements involved. For example, phonemes or letters basically do not have any meaning, but the words consisting of them are references to specific objects and concepts. Likewise, knowing the meaning of separate words does not necessarily provide the understanding of a sentence composed of them, as a sentence can carry additional information, like an emotional load or a metaphorical message. Other features of natural language that are typical of complex systems have also been studied, for example long-range correlations [47], fractal and multifractal properties [5, 10, 13], self-organization [9, 40], or the lack of a characteristic scale, which manifests itself in power laws such as the well-known Zipf's law or Heaps' law (the latter also referred to as Herdan's law) [14, 35, 48].

The network formalism has proven to be useful in studying and processing natural language. It allows language to be represented on many levels of its structure - linguistic networks can be constructed to reflect word co-occurrence, semantic similarity, grammatical relationships, etc. It also has a significant practical importance - methods of analysis of graphs and networks have been employed in natural language processing tasks such as keyword selection, document summarization, word-sense disambiguation, or machine translation [2, 3, 32].
Networks also seem to be a promising tool in research on language itself; they have been applied to study language complexity [17, 21, 27], the structural diversity of different languages [28], or the role of punctuation in written language [22]. Furthermore, the network formalism appears at the interface between linguistics and other scientific fields, for example in sociolinguistics, which, by studying social networks, investigates human language usage and evolution [11, 19].

In this paper the properties of word-adjacency networks constructed from literary texts are studied. Such networks may be treated as a representation of a given text, taking into account the mutual co-occurrences of words. They can be characterized by parameters that describe their structure quantitatively. Determining these parameters yields a set of a few numbers capturing the considered features of the networks and, consequently, features of the underlying text sample. In that sense, the procedure of constructing a network from a text and calculating its characteristics may be viewed as extracting essential traits of the language sample and putting them into a compact form. Subjecting the obtained data to data mining algorithms reveals that the networks representing texts of different authors can be distinguished from one another by their parameters. After appropriately proposed preparation of the data, such a procedure of authorship attribution, though quite simple, results in relatively high accuracy. It therefore leads to the statement that information on the individual language style used in a given text can be drawn from the set of characteristics describing the corresponding word-adjacency network. This is comparable with known results concerning the identification of individual language styles using linguistic networks [1, 30]; however, the previous research has typically focused on network types and aspects of network structure other than those considered here.
2. Data and methods
The set of the studied texts consists of novels in English and Polish. For each language, there are 48 novels, 6 novels for each of 8 different authors. The full list of the books and authors is presented in the Appendix. The texts were downloaded from the websites of Project Gutenberg [36], Wikisources [45], and Wolne Lektury [46]. Both the Polish and the English-language authors considered in this paper are widely recognized as notable writers whose works had an impact on the shaping of culture; two of the Polish writers - Henryk Sienkiewicz and Władysław Reymont - are laureates of the Nobel Prize in Literature. Among the selection criteria for the authors, worth mentioning are the number of books (each studied author has at least 6 books, each of them with at least 25000 words) and their availability (all the books are publicly available).

Each text studied in this paper is transformed into a word-adjacency network, and the resulting networks are subjected to adequate calculations and analyses. However, before the construction of the networks, the texts are appropriately pre-processed. Preparation includes: removing annotations and redundant whitespaces, converting all characters to lowercase, and replacing some types of punctuation marks by special symbols, which are treated as ordinary words in the subsequent steps of the analysis. This approach is justified by the observation that, in terms of some statistical properties, punctuation follows the same patterns as words [22] and may be a valuable carrier of information about the language sample. The punctuation taken into consideration is: full stop, question mark, exclamation mark, ellipsis, comma, dash, colon, semicolon, and brackets. The remaining punctuation types (for example, quotation marks) are removed from the text samples, along with marks that do not indicate the structure of a sentence (examples are the full stops used after abbreviations, or quotation dashes, i.e.
the dashes placed at the beginning of quotations in some languages, including Polish).

After pre-processing, the texts are transformed into word-adjacency networks. In such networks, each vertex represents one of the words appearing in the underlying text. The edges reflect the co-occurrence of words. In an unweighted network, an edge is present between two vertices if the words that correspond to these vertices appear at least once next to each other in the text (see Figure 3). In a weighted network, edges are assigned weights that represent the exact number of co-occurrences of the respective words. The theoretically possible case of words occurring next to themselves is ignored, i.e. there are no edges connecting a vertex with itself in the resulting network.

An unweighted network may be treated as a special case of a weighted network (with all edge weights equal to 1). When determining the characteristics of a weighted network, one may either take the weights into account or neglect them; the latter case is equivalent to calculating the parameters of an unweighted network. Therefore, all networks studied in this paper may be viewed as weighted ones. With such an approach, the weighted and unweighted versions of a given network characteristic can be treated as two distinct quantities describing a weighted network, with two different definitions.

In the process of network construction, words are not lemmatized, which means that their inflected forms are distinguished. Hence, all such forms occurring in a particular text end up as separate vertices in the resulting network. All studied networks are undirected, which means that a pair of adjacent words is counted in the same way as the pair with reversed order.
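The construction described above can be sketched in a few lines of Python. The tokenizer and the punctuation list below are simplified stand-ins for the paper's pre-processing pipeline, not the authors' actual code; the network is stored as a plain dictionary of dictionaries.

```python
from collections import defaultdict

# Simplified punctuation set; the paper also handles ellipses, brackets,
# and distinguishes sentence-structuring marks from other uses.
PUNCT = ".?!,;:()-"

def tokenize(text):
    """Lowercase the text and split punctuation off as separate tokens."""
    for p in PUNCT:
        text = text.replace(p, f" {p} ")
    return text.lower().split()

def word_adjacency_network(tokens):
    """Return {u: {v: weight}} counting co-occurrences of consecutive tokens."""
    w = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        if a == b:          # ignore words occurring next to themselves
            continue
        w[a][b] += 1
        w[b][a] += 1        # undirected: both orders counted the same way
    return {u: dict(nbrs) for u, nbrs in w.items()}

def strength(net, v):
    """Weighted degree of v: the sum of weights of the edges touching v."""
    return sum(net[v].values())
```

For a word v that occurs f(v) times and is neither the first nor the last token of the sample, strength(net, v) equals 2 f(v), matching the relation stated in the text.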
Throughout this paper, a few concepts and quantities known from the field of complex network analysis are used. This section presents them briefly. For each studied network, a number of characteristics are calculated. Depending on the context, these characteristics are either global (related to the network as a whole) or local (related to particular vertices in the network). The quantities describing the network's structure can be unweighted or weighted - the latter take into account the weights assigned to edges in the weighted networks. The characteristics studied in this paper are: vertex degree, average shortest path length, clustering coefficient, assortativity coefficient, and modularity. Although these quantities are widely applied in studies on complex networks and are well described in the literature [15, 37], there are some ambiguities, especially pertaining to their generalizations onto weighted networks. For this reason, the adequate definitions are given below.
2.2.1. Degree

Degree is a quantity describing individual vertices. In simple (i.e. unweighted) networks, the degree of a given vertex v is the number of edges touching v, and is denoted by deg(v). In weighted networks, the degree can be generalized to the weighted degree (also called strength and therefore denoted by str(v)), which is the sum of the weights of the edges touching v. In word-adjacency networks, the vertex degree is obviously related to the frequency of the corresponding word in the considered text sample. In its weighted version, it is equal to 2 f(v), where f(v) stands for the overall number of occurrences of the word v (apart from the first and the last word in a sample, for which the weighted degree is equal to 2 f(v) − 1).

2.2.2. Clustering coefficient

In unweighted networks, the clustering coefficient of a given vertex represents the probability that two randomly chosen direct neighbours of this vertex are also direct neighbours of each other. By a direct neighbour of a vertex v we understand a vertex connected with v by an edge. Let m_v stand for the number of edges in the network that link the direct neighbours of v with other direct neighbours of v. Then the clustering coefficient of v is given by:

C_u(v) = 2 m_v / [deg(v) · (deg(v) − 1)],   (1)

where the subscript "u" comes from the word "unweighted". In a word-adjacency network, the clustering coefficient of a given word v describes the probability that any two words that appear next to v in the considered text sample also appear next to each other at least once.

The generalization of the clustering coefficient onto weighted networks can be done in multiple ways. In this paper, a definition proposed by Barrat et al. [6] is used. Let S(v) denote the set of neighbours of a vertex v, and let w_uv denote the weight of the edge connecting vertices u and v (if there is no such edge, then w_uv = 0). Let a_uv denote an unweighted adjacency matrix element, i.e.
a number defined as follows: a_uv = 1 if there exists an edge connecting u and v, and a_uv = 0 otherwise. Then the weighted clustering coefficient of v is written as:

C_w(v) = 1 / [str(v) · (deg(v) − 1)] · Σ_{u,t ∈ S(v)} [(w_vu + w_vt) / 2] a_vu a_ut a_tv,   (2)

where the summation is over all pairs (u, t) of neighbours of v. It is worth noting that if deg(v) = 0 or deg(v) = 1, the clustering coefficient cannot be determined from the above formulas, and it is assumed that in such cases it is equal to 0.

The above definitions pertain to individual vertices of a network. The global clustering coefficient can be defined in more than one way; in this work, a simple approach based on averaging the local clustering coefficients is applied. If V stands for the set of all vertices of a network and N is the number of elements in V, then the global clustering coefficient of the network is given by:

C = (1/N) Σ_{v ∈ V} C(v).   (3)

Here the subscript "u" or "w", indicating the unweighted or weighted network, is omitted, because the formula is identical in both cases.

2.2.3. Average shortest path length

In unweighted networks, the length of a path between two vertices is the number of edges on that path. In weighted networks, by the length of a path we understand the sum of the reciprocals of the edge weights on that path. The length of the shortest path between vertices u and v is also called the distance between u and v and is denoted by d(u, v). The average shortest path length ℓ(v) of a vertex v is the average distance from v to every other vertex in the network. It is a measure of the centrality of a vertex in the network, and is given by the formula:

ℓ(v) = 1/(N − 1) Σ_{u ∈ V∖{v}} d(v, u),   (4)

in which V is the set of all vertices of the network, and N is the number of elements in V. The quantity defined above has finite values only in connected networks.
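The weighted clustering coefficient of Eq. (2) can be sketched directly on the dictionary representation of a network. The toy graphs below are arbitrary illustrations, not word-adjacency networks from the studied corpus; for a triangle with equal weights the weighted coefficient reduces to the unweighted one, as the Barrat et al. definition requires.

```python
# Barrat et al. weighted clustering coefficient of a vertex v in a
# network stored as {u: {v: weight}}; the sum runs over ordered pairs
# of neighbours, so each unordered pair is counted twice.

def weighted_clustering(net, v):
    nbrs = list(net[v])
    deg = len(nbrs)
    if deg < 2:
        return 0.0                       # convention adopted in the text
    s = sum(net[v].values())             # strength str(v)
    total = 0.0
    for u in nbrs:
        for t in nbrs:
            if u != t and t in net.get(u, {}):   # a_vu * a_ut * a_tv = 1
                total += (net[v][u] + net[v][t]) / 2.0
    return total / (s * (deg - 1))

tri = {"a": {"b": 3, "c": 3}, "b": {"a": 3, "c": 3}, "c": {"a": 3, "b": 3}}
print(weighted_clustering(tri, "a"))   # 1.0
```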
If there are at least two vertices that are not connected by any path, the distance between them is not defined; usually it is treated as infinite, and then ℓ(v) cannot be calculated.

The global average shortest path length is a quantity describing the whole network; it is the average distance between all pairs of vertices. If the local average distances ℓ(v) for all v in V are given, then the global mean distance in the whole network can be expressed by:

ℓ = (1/N) Σ_{v ∈ V} ℓ(v).   (5)

Formulas (4) and (5) apply both to unweighted and weighted networks; the difference between the unweighted and the weighted average shortest path length lies in the definition of the distance in the respective cases.

2.2.4. Assortativity coefficient

Assortativity is a global characteristic of a network, describing the preference of vertices to attach to others that have a similar degree. A network is called assortative if vertices with high degree tend to be directly connected with other vertices with high degree, and low-degree vertices are typically directly connected to vertices which also have low degree. In disassortative networks, the high-degree nodes are typically directly connected to nodes with low degree. In word-adjacency networks, assortativity provides information about the extent to which frequent words co-occur with rare ones.

In unweighted networks, the assortativity coefficient can be defined as the Pearson correlation coefficient between the degrees of nodes that are connected by an edge. Let (u, v) denote an ordered pair of vertices that are connected by an edge. Since the edges are undirected and the pair (u, v) is ordered, two such pairs can be assigned to each edge in the network. For each pair one can calculate the degrees of vertices u and v, and form a pair (deg(u), deg(v)). The set of all pairs (deg(u), deg(v)) for all edges can be treated as the set of values of a certain two-dimensional random variable (X, Y).
With such notation, the assortativity coefficient r_u is expressed by the Pearson correlation coefficient of the variables X and Y:

r_u = corr(X, Y).   (6)

The generalization of the above formula to weighted networks is done in this paper by replacing the degrees of the vertices by their strengths, and calculating the weighted correlation coefficient instead of the ordinary one. Let (X, Y) be a two-dimensional random variable whose values are pairs (x, y) = (str(u), str(v)) for all pairs of vertices (u, v) connected by an edge. Let w be a function that to each pair (x, y) = (str(u), str(v)) assigns the weight of the edge connecting u and v. Then the weighted assortativity coefficient r_w can be written as:

r_w = wcorr(X, Y; w),   (7)

where wcorr(X, Y; w) denotes the weighted Pearson correlation coefficient of the variables X and Y with the weighing function w. This formula is equivalent to the one that can be found, for example, in [26]. Since the assortativity coefficient is expressed by a correlation coefficient, it has values between -1 and 1. Networks with positive r are assortative, while networks with negative r are disassortative.

2.2.5. Modularity

Modularity is a global characteristic of a network, measuring the extent to which the set of the network's vertices can be divided into disjoint subsets which maximize the density of edges within them and minimize the number of edges connecting one with another. In word-adjacency networks, it can be interpreted as the divisibility of the set of words appearing in the text into groups of words that frequently co-occur with each other.

Consider an unweighted network with the set of vertices V. By a partition of the network we understand the division of V into disjoint subsets (called modules, clusters, or communities). Let a_uv denote the adjacency matrix element (defined in the same way as in subsection 2.2.2).
Let c_v denote the module to which the vertex v is assigned by the given partition. The modularity of a partition is defined as:

q_u = 1/(2m) Σ_{u,v ∈ V} [a_uv − deg(u) deg(v)/(2m)] δ(c_u, c_v),   (8)

where m is the number of edges of the network, deg(u), deg(v) are the degrees of vertices u and v, and the function δ(c_u, c_v) has the value 1 if c_u = c_v and 0 otherwise.

Figure 1: Unweighted networks with various assortativity coefficients. The network in (a) is assortative (r > 0), the network in (b) is disassortative (r < 0), and the network in (c) has |r| close to 0.

The modularity of a partition has a value between -1 and 1, and indicates whether the density of edges within the given groups is higher or lower than it would be if the edges were distributed at random. The random network that serves as a reference in this definition can be constructed using the so-called configuration model [18]. The modularity of a network, Q_u, is the maximum value among the modularities q_u of all possible partitions. Determining the network's modularity precisely is computationally intractable, hence a number of heuristic algorithms have been proposed. In this paper, modularity is calculated using the Louvain method [7].

The generalization of modularity onto weighted networks can be done by replacing the quantities appearing in Eq. (8) by their weighted counterparts. If w_uv denotes the weight of the edge connecting vertices u and v, M is the sum of all edge weights, and str(u), str(v) are the strengths of vertices u, v, then the modularity of a given partition is equal to:

q_w = 1/(2M) Σ_{u,v ∈ V} [w_uv − str(u) str(v)/(2M)] δ(c_u, c_v).   (9)

Again, the weighted modularity of the network, Q_w, is the greatest of the modularities obtained in all possible partitions of the network.
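Equation (8) can be checked on a small example. The sketch below evaluates the modularity q_u of a partition that is fixed by hand; in the paper the partition maximizing q_u is searched for with the Louvain method, and the example graph (two triangles joined by one edge) is purely illustrative.

```python
# Modularity q_u of a given partition of an unweighted network, per Eq. (8).

def partition_modularity(edges, communities):
    """edges: set of frozensets {u, v}; communities: {vertex: module id}."""
    m = len(edges)
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for u in deg:
        for v in deg:                      # sum over ordered pairs (u, v)
            if communities[u] == communities[v]:
                a_uv = 1 if frozenset((u, v)) in edges else 0
                q += a_uv - deg[u] * deg[v] / (2 * m)
    return q / (2 * m)

edges = {frozenset(e) for e in [("a", "b"), ("a", "c"), ("b", "c"),
                                ("d", "e"), ("d", "f"), ("e", "f"),
                                ("c", "d")]}
communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(partition_modularity(edges, communities), 3))   # 0.357
```

Splitting the graph into its two triangles gives q_u = 5/14 ≈ 0.357, while merging everything into one community gives q_u = 0, as expected for a partition no better than random.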
2.3. Normalization of the network characteristics

The studied texts vary in length. To allow for a reasonable comparison between them, all calculated quantities are normalized in a specific way. The normalization of each network characteristic is achieved by dividing the value of the considered parameter by the average value of the same parameter in a network constructed from the randomly shuffled text (see Figure 3). As an example, consider any text and any of the above-mentioned quantities 2.2.1-2.2.5, say, the assortativity coefficient (it does not matter whether the weighted or the unweighted version is considered). To obtain the normalized (in the described sense) value of the assortativity coefficient related to the given text, one needs to create the word-adjacency network from the text and calculate the assortativity coefficient of this network, r. Then the original text needs to be subjected to the random shuffling of words, k times (in this study, k = 50 was used). From each of the k obtained "random texts", a network can be created, in the same manner as in the previous step. Determining the assortativity coefficients for all of these networks and calculating their average gives r̄, the typical value of the assortativity coefficient in the randomized text. The normalized value of the considered parameter - in this case, the assortativity coefficient - is then r̃ = r/r̄. It reflects how much the parameter related to the original text differs from the one obtained for a text with the same word distribution, but a random word order. The normalization is performed in the described way for all the mentioned characteristics 2.2.1-2.2.5. Normalized quantities are denoted with the symbol "~" throughout this paper.

Figure 2: Examples of unweighted networks with (a) low modularity (Q_u = 0.07) and (b) high modularity.

To explore the diversity of the investigated networks' properties, two methods of data analysis were employed in this paper - hierarchical clustering and decision tree bagging. The former is a simple data clustering method which, being an example of an unsupervised machine learning technique, attempts to identify the internal structure of a given set of data, based only on the data itself and on the assumed measure of similarity. The latter is a supervised learning method based on ensembles of decision trees.
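The normalization by randomly shuffled texts described in 2.3 can be sketched as follows. The statistic used here (the average edge weight of the word-adjacency network) is only a convenient stand-in for the characteristics 2.2.1-2.2.5; the paper's worked example uses the assortativity coefficient, with k = 50 shufflings.

```python
import random

def edge_weights(tokens):
    """Weighted word-adjacency edges as {frozenset({u, v}): weight}."""
    w = {}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:                        # no self-loops
            key = frozenset((a, b))
            w[key] = w.get(key, 0) + 1
    return w

def avg_edge_weight(tokens):
    """A simple illustrative network characteristic."""
    w = edge_weights(tokens)
    return sum(w.values()) / len(w)

def normalized_stat(tokens, stat, k=50, seed=0):
    """Divide stat(original text) by its mean over k word-shuffled copies."""
    rng = random.Random(seed)
    shuffled_values = []
    for _ in range(k):
        t = tokens[:]
        rng.shuffle(t)
        shuffled_values.append(stat(t))
    return stat(tokens) / (sum(shuffled_values) / k)
```

A value close to 1 means the original word order does not distinguish the text from its shuffled counterparts with respect to the chosen characteristic.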
Hierarchical clustering [42], performed on a given set of vectors, aims to group together the vectors that are close to each other according to a certain metric (a function that defines the distance between vectors). The method, as used in this paper, works as follows: given an m-element set of n-dimensional vectors of real numbers and a metric (here the Euclidean metric was used), it starts by creating m clusters and assigning one vector to each cluster. Then the clusters that are closest to each other are merged into one, and this is repeated until there is only one cluster. As clusters are not vectors, but sets of vectors, the distance between such sets must be defined to allow for determining which are closest to each other. This can be done in multiple ways; here the so-called furthest neighbour method was applied. It defines the distance between two clusters as the greatest distance between pairs of elements of these clusters (each element in a pair belongs to a different cluster). The result of clustering can be presented in the form of a dendrogram - a tree-like diagram which shows the consecutive merges and allows one to determine what kind of clusters can be discerned within the data set.

Statistical classification with decision trees [41] is an example of supervised learning, which means that it allows for the categorization of data, provided that it is given a set of already categorized examples, called the training set.
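The furthest-neighbour (complete-linkage) agglomeration with the Euclidean metric can be sketched as below. This is a minimal illustration recording the merge history rather than drawing a dendrogram; it is not the implementation used by the authors.

```python
# Agglomerative clustering with the furthest-neighbour (complete-linkage)
# rule: the distance between two clusters is the greatest pairwise
# distance between their elements.

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def complete_linkage(points):
    """Return the list of merges as (sorted member indices, merge distance)."""
    clusters = [{i} for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[j] = clusters[i] | clusters[j]
        del clusters[i]
        merges.append((sorted(clusters[j - 1]), d))
    return merges
```

In the paper, the vectors being clustered are the tuples of normalized network characteristics of the individual books.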
Figure 3: An unweighted word-adjacency network and its randomization. Panel (a) presents the network created from one sentence excerpted from The Federalist Paper No. 1 by Alexander Hamilton: "It has been frequently remarked, that it seems to have been reserved to the people of this country, by their conduct and example, to decide the important question, whether societies of men are really capable or not, of establishing good government from reflection and choice, or whether they are forever destined to depend, for their political constitutions, on accident and force." Panel (b) shows the network created from the same piece of text, but with randomly shuffled words. Performing such shuffling repeatedly leads to a set of randomized networks which serve as a benchmark in the calculation of the normalized characteristics of the original network. Note that the punctuation is taken into consideration in the construction of the network; the comma and the full stop appear as separate vertices.
Creating an ensemble of decision trees requires constructing the individual trees in the first place. Let A denote a set of n-dimensional vectors of real numbers; the elements of A are called observations, and their coordinates are called attributes. Each observation is labeled with one of K categories, also called classes. The training of a single classification tree consists of: considering all possible one-dimensional splits of A, selecting and executing the best split, and repeating these steps recursively in the resulting subsets; the splitting stops when A is partitioned in such a way that each subset contains observations of only one category. A one-dimensional split related to some attribute x_i is the choice of a constant number S and the grouping of the observations according to whether their coordinate x_i is smaller or greater than S. The best split is the one that maximizes the decrease of the diversity of the distribution of classes in the considered set. The diversity can be measured, for example, by the information entropy: H = − Σ_{k=1}^{K} p_k log p_k (p_k denotes the fraction of the observations in the set that belong to category k). In such a case, the maximization of the diversity decrease is equivalent to the maximization of the quantity H − H_split, where H is the initial entropy, and H_split is the weighted sum of the entropies in the resulting subsets, with weights proportional to the numbers of elements in these subsets and adding up to 1.

The scheme of the consecutive splits of A is equivalent to a system of conditions imposed on the observations' attributes; such a system is a classification tree. A trained tree can be used to categorize observations with unknown class membership, by assigning them to the appropriate subsets of A, according to the conditions fulfilled by their components. Classification with a single decision tree may suffer from instability, which means that the classifier may produce significantly different results for only slightly different training sets.
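The entropy-based split selection described above can be sketched for a single attribute: every threshold between consecutive sorted values is tried, and the one maximizing the information gain H − H_split is kept. This is an illustrative sketch, not the classifier used in the paper.

```python
from math import log2

def entropy(labels):
    """Information entropy of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in {k: labels.count(k) for k in set(labels)}.values())

def best_split(values, labels):
    """Best threshold on a 1-D attribute and the information gain it yields."""
    h0 = entropy(labels)
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_gain, best_thr = 0.0, None
    for a, b in zip(order, order[1:]):
        thr = (values[a] + values[b]) / 2
        left = [labels[i] for i in range(len(values)) if values[i] <= thr]
        right = [labels[i] for i in range(len(values)) if values[i] > thr]
        # H_split: entropies of the subsets weighted by their sizes
        h = (len(left) * entropy(left)
             + len(right) * entropy(right)) / len(labels)
        if h0 - h > best_gain:
            best_gain, best_thr = h0 - h, thr
    return best_thr, best_gain
```

For perfectly separable data, the gain of the best split equals the initial entropy, since both resulting subsets become pure.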
Also, decision trees are prone to overfitting, which leads to a decrease of the classification accuracy on unknown observations. Decision tree bagging (bootstrap aggregating) [41] is a method of enhancing the performance of classification based on decision trees. Given a training set with m observations, one can create N new training sets of size m by sampling with replacement from the original set. The obtained sets, called bootstrap samples, for large m are expected to contain the fraction 1 − 1/e (which is roughly 63.2%) of the unique observations from the original set, the rest being duplicates. A decision tree is trained on each of the bootstrap samples, and the ensemble of N trees becomes a new classifier. When such an ensemble is given an observation to classify, each tree in the ensemble classifies the observation on its own, and then the class that was chosen by most of the trees becomes the final result of the classification.

A typical method of verifying the performance of classification is cross-validation [4]. Its general idea is to divide the set A of observations with known class membership into two disjoint sets: the training set A_train and the test set A_test. The classifier is trained on A_train and then it classifies the observations in A_test, treating them as if their class memberships were unknown. Then the results are compared with the true class memberships of the elements of A_test, and the number of correct matches indicates the classifier's performance. Partitioning A, training the classifier, and testing its performance is repeated a certain number of times, and the average result becomes the final assessment of the classification accuracy. The methods and rules of partitioning A may vary; the approach utilized in this paper is repeated random sub-sampling validation (also called Monte Carlo validation), with stratification.
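Two of the claims above can be illustrated with a short simulation: the roughly 63.2% share of unique observations in a bootstrap sample, and one iteration of the stratified random sub-sampling partition. The set sizes and seeds are arbitrary choices for the demonstration.

```python
import random

def unique_fraction(m, rng):
    """Fraction of unique items in one bootstrap sample of size m."""
    sample = [rng.randrange(m) for _ in range(m)]
    return len(set(sample)) / m

def stratified_split(labels, train_frac, rng):
    """One stratified partition: within each class, a fixed fraction of
    the observations goes to the training set, the rest to the test set."""
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return train, test

rng = random.Random(42)
mean = sum(unique_fraction(5000, rng) for _ in range(20)) / 20
print(round(mean, 2))   # close to 0.63, i.e. 1 - 1/e
```

Repeating the stratified partition with independent random draws in each iteration gives the Monte Carlo validation scheme used in the paper.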
It simply performs an independent random stratified partition in each iteration, and each partition obeys a fixed proportion of the numbers of elements in the training and test sets. Stratification ensures that all classes are equally represented in the training set.

The computational effort needed to perform the classification depends on the choice of the classifier type, which is not necessarily restricted to decision trees and their ensembles. The analysis of networks, on the other hand, is based on algorithms whose exact running times can be found in the literature; they may depend on the type of network (whether it is a sparse network, for example) and its representation (adjacency matrix, adjacency list, etc.). However, it is worth mentioning which stages are crucial for the overall processing time. For networks of the sizes studied in this paper, the only noticeably time-consuming part of the computation of network characteristics is determining the global average shortest path length, as it requires finding all shortest paths in a graph. To illustrate with an example: for the network constructed from Charles Dickens'
David Copperfield (the longest English-language book considered in the analysis), computing all network parameters that do not rely on determining shortest paths (namely, all vertex degrees and clustering coefficients, assortativity, modularity), both in the unweighted and the weighted version, takes 1.8 s on the PC used to perform the presented analysis (2.7 GHz CPU, 12 GB RAM). Obtaining the (local) average shortest path lengths of the nodes corresponding to the 100 most frequent words requires 0.1 s and 0.8 s for the unweighted and the weighted network, respectively. But getting the global average shortest path length takes 12.3 s for the unweighted, and 141.3 s for the weighted network. Considering that the normalization of the characteristics requires repeating the computations for a number of randomized networks, this makes analysing the global shortest-path-related properties of the networks rather impractical. However, as shown below, it turns out that the global characteristics are not as effective at distinguishing between authors as the local ones; furthermore, among the studied network characteristics, those related to the lengths of the shortest paths seem to have the weakest discriminative potential. Therefore, for practical purposes, they can be omitted without loss of accuracy of the proposed approach.

3. Results
At the beginning, the unweighted networks are studied, as they are simpler and less computationally demanding than the weighted ones. Each of the books listed in Appendix is converted into a network. Then the unweighted global parameters are calculated (all normalized, as described in 2.3): the average shortest path length ℓ̃_u, the clustering coefficient C̃_u, the assortativity coefficient r̃_u, and the modularity Q̃_u. The obtained sets of quantities are treated as vectors in 4-dimensional Euclidean space; each vector corresponds to one text, and each parameter corresponds to one component of a vector. The Euclidean distance is introduced as the measure of similarity (the greater the distance, the lesser the similarity), and hierarchical clustering is applied. The result indicates a clear separation of the books with respect to the language (Figure 4). The books cluster into two groups, each related to one language; only a few of them fall into the "wrong" group. The set of books is then divided according to the language, and hierarchical clustering is performed again, separately for the books in English and in Polish. The outcome of this analysis (Figure 5) reveals the emergence of small clusters (consisting of 2-3 texts) corresponding to certain authors. This suggests the existence of a connection between the authorship of a text and the structure of the related network, expressed by appropriate quantities. Such a hypothesis is also supported by analysing samples of text much shorter than the whole books. Figure 6 presents the plots of the characteristics ℓ_u, Q_u, C_u of networks that are created from text samples randomly drawn from the books of selected authors, each sample containing 5000 words. It can be observed that although derived from different books, the networks corresponding to texts of the same author tend to be similar in terms of the calculated characteristics.
This effect is present for any 3-dimensional plot of an arbitrarily chosen triad of the studied characteristics ℓ_u, Q_u, C_u, r_u.

Figure 4: The dendrogram of the hierarchical clustering of all books listed in Appendix, in the space of unweighted global characteristics of networks. Red squares and blue dots denote English and Polish texts, respectively. The dashed line separates two general clusters, corresponding to distinct languages. The books falling into the "wrong" clusters are:
Faraon , Lalka , Emancypantki (Polish),
Animal Farm , The Prince and the Pauper (English).
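As an illustration of the clustering step described above, hierarchical clustering of 4-dimensional feature vectors under the Euclidean distance can be sketched in a few lines. The feature vectors below are hypothetical placeholders, not values computed from the studied books, and the paper does not state which linkage criterion was used, so single linkage is an assumption here:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(vectors):
    """Agglomerative clustering with single linkage.

    Returns the merge history: a list of (cluster_a, cluster_b, distance)
    tuples, where the clusters are frozensets of item indices.
    """
    clusters = [frozenset([i]) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 4-d vectors (l, C, r, Q) for four texts; two tight pairs.
vecs = [(1.0, 0.2, -0.1, 0.5), (1.05, 0.21, -0.12, 0.52),
        (2.0, 0.6, 0.3, 0.1), (2.1, 0.62, 0.28, 0.12)]
history = single_linkage(vecs)
# The first two merges should join the two tight pairs.
first_two = {history[0][0] | history[0][1], history[1][0] | history[1][1]}
print(first_two == {frozenset({0, 1}), frozenset({2, 3})})
```

A dendrogram such as the one in Figure 4 is simply a plot of this merge history, with each merge drawn at the height given by its distance.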
Figure 5: The dendrograms of the hierarchical clustering of (a) English and (b) Polish books, in the space of unweighted global characteristics of networks. Each text is labeled by the surname of its author.

To further examine the capability of the network parameters of texts to distinguish between authors, statistical classification by an ensemble of decision trees is applied. The analysis is performed in the following way. For a given language, 4 randomly chosen books of each author are taken to form the training set, and the remaining 2 books constitute the test set. Then the classifier (an ensemble of 100 decision trees) is trained on the training set to categorize books with respect to their author; each book constitutes one observation, whose attributes are the calculated network parameters. After that, the classifier categorizes the observations in the test set. Picking the training and test sets, training the classifier, and performing the classification in the test set is repeated 10000 times. The average accuracy of the classifier (the fraction of correct classifications) in the test set is treated as a measure of the possibility to distinguish between the authors. The reasoning behind this approach can be understood as follows: the greater the dissimilarities between individual authors (in terms of the network parameters), the easier it should be to train an accurate classifier and to categorize the observations in the test set correctly. Thus, the classifier's performance, which is given by the average accuracy in the test set, may serve as an indicator of the network parameters' potential to distinguish between different writing styles.
If the texts were not distinguishable at all, the classification would not be much different from a random choice, and in that case the average accuracy would be equal to 1/n, where n denotes the number of considered authors.

The detailed results of the classification are presented in Table 1, which is organized as follows: the number in the i-th row and j-th column is the probability of classifying a text of the i-th author as a text of the j-th author, obtained by counting such classifications in the test set and dividing the number of counts by the number of performed repetitions of the test set selection (10000). The probabilities of correct classifications reside on the diagonal of the table. The sum of the values in each row is equal to 1, as it is the probability of assigning a text to any available author; the sums of the values in the columns are not subject to any constraint, as they do not have the interpretation of a probability.

Figure 6: The projections of the triplets of network characteristics (ℓ_u, Q_u, C_u) onto the planes (ℓ_u, Q_u), (ℓ_u, C_u), (Q_u, C_u), for texts in English (a, b) and Polish (c, d). Each triplet of characteristics pertains to one chunk of text of length 5000 words. Text samples were randomly chosen from all of the studied works of the considered authors. Different markers denote different authors: red dots, green triangles, and blue squares denote, respectively, Charles Dickens, Daniel Defoe, and Mark Twain in (a); George Eliot, Jane Austen, and Joseph Conrad in (b); Władysław Reymont, Janusz Korczak, and Jan Lam in (c); and Henryk Sienkiewicz, Józef Ignacy Kraszewski, and Stefan Żeromski in (d). It can be seen that points representing texts of different authors tend to occupy different regions of the space. Note that the characteristics are not normalized, as all the samples have the same length.
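The repeated train/test protocol described above can be sketched as follows. The classifier here is a simple nearest-centroid rule standing in for the decision-tree ensemble used in the paper, and the feature vectors are synthetic, so the resulting accuracy is purely illustrative:

```python
import random
from collections import defaultdict

def stratified_split(labels, n_train, rng):
    """Pick n_train random items of each class for training; rest is test."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)
        train += idxs[:n_train]
        test += idxs[n_train:]
    return train, test

def nearest_centroid_accuracy(features, labels, n_train, n_rounds, seed=0):
    """Average test-set accuracy over repeated random stratified splits."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_rounds):
        train, test = stratified_split(labels, n_train, rng)
        # class centroids computed from the training set only
        sums = defaultdict(lambda: [0.0] * len(features[0]))
        counts = defaultdict(int)
        for i in train:
            counts[labels[i]] += 1
            for d, x in enumerate(features[i]):
                sums[labels[i]][d] += x
        centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
        # classify each test item by its nearest class centroid
        for i in test:
            pred = min(centroids, key=lambda lab: sum(
                (x - c) ** 2 for x, c in zip(features[i], centroids[lab])))
            correct += pred == labels[i]
            total += 1
    return correct / total

# Hypothetical, well-separated toy data: 6 "books" per "author",
# split 4:2 into training and test sets, as in the paper.
feats = [(0.0 + 0.01 * k, 1.0) for k in range(6)] + \
        [(5.0 + 0.01 * k, -1.0) for k in range(6)]
labs = ["A"] * 6 + ["B"] * 6
acc = nearest_centroid_accuracy(feats, labs, n_train=4, n_rounds=100)
print(acc)  # separable classes -> accuracy 1.0
```

With real features and overlapping authors the accuracy falls between the random-choice baseline 1/n and 1, which is exactly the quantity reported in the tables below.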
Table 1: The results of the classification of (a) English and (b) Polish books with respect to the authorship, in the space of unweighted global characteristics of networks. The number in the i-th row and j-th column is the probability of classifying a text of the i-th author as a text of the j-th author.

(a) The classification of English books. The authors are denoted by the first two letters of their surnames: Au - Austen, Co - Conrad, De - Defoe, Di - Dickens, Do - Doyle, El - Eliot, Or - Orwell, Tw - Twain.

      Au   Co   De   Di   Do   El   Or   Tw
Au   .33  .22  .17  .06  .01  .11  .10  .00
Co   .08  .34  .01  .21  .02  .11  .14  .09
De   .21  .00  .54  .00  .09  .04  .12  .00
Di   .18  .21  .00  .05  .23  .18  .08  .07
Do   .00  .12  .06  .09  .28  .03  .24  .18
El   .10  .22  .16  .05  .04  .39  .04  .00
Or   .10  .04  .09  .07  .23  .12  .18  .17
Tw   .00  .00  .00  .03  .12  .16  .03  .66

(b) The classification of Polish books. The authors are denoted by the first two letters of their surnames: Ko - Korczak, Kr - Kraszewski, La - Lam, Or - Orzeszkowa, Pr - Prus, Re - Reymont, Si - Sienkiewicz, Że - Żeromski.
      Ko   Kr   La   Or   Pr   Re   Si   Że
Ko   .18  .23  .14  .18  .04  .00  .09  .14
Kr   .16  .57  .00  .02  .00  .18  .03  .04
La   .21  .06  .54  .02  .05  .00  .08  .04
Or   .14  .00  .02  .26  .11  .00  .22  .25
Pr   .02  .00  .09  .12  .65  .00  .10  .02
Re   .05  .21  .00  .01  .00  .44  .07  .22
Si   .04  .00  .18  .21  .09  .00  .46  .02
Że   .19  .10  .01  .32  .00  .17  .00  .21
The analysis of the weighted network parameters is performed in the same manner as in the case of the unweighted ones. The whole procedure described above is repeated, but this time the networks created from the texts are weighted networks, and the characteristics (average shortest path length, clustering coefficient, assortativity coefficient, modularity) are calculated in both the unweighted and the weighted version. The vectors of parameters calculated for each book (this time 8-dimensional) again serve as an input to the data clustering and classification algorithms.

Table 2 presents the outcome of the classification. It can be seen that including the weighted network characteristics in the analysis gives a slight improvement of the distinguishability between the English-language authors: the overall average accuracy of the English texts' classification is 42% with a standard deviation of 11%. For the Polish books, the increase in the classifier's performance is rather negligible; the accuracy obtained is 44% with a standard deviation of 11%. For some individual authors, the probability of correct classification decreases after including the weighted characteristics of the networks. The results of hierarchical clustering in the space with 4 additional dimensions do not change significantly; they are analogous to what can be seen in Figure 5, and for this reason the dendrograms are not presented here.

The classification of texts using only the weighted characteristics ℓ̃_w, Q̃_w, C̃_w, r̃_w gives results very similar to those from the analysis of the unweighted networks: the classification accuracy for English and Polish is 37% and 41%, with standard deviations of 11% and 10%, respectively.
This similarity, along with the fact that the classification for the unweighted and weighted parameters together is not much more accurate, suggests that the information carried by these two types of characteristics overlaps to a large extent.

Table 2: The results of the classification of (a) English and (b) Polish books with respect to the authorship, in the space of unweighted and weighted global characteristics of networks. The number in the i-th row and j-th column is the probability of classifying a text of the i-th author as a text of the j-th author.

(a) The classification of English books. The authors are denoted by the first two letters of their surnames: Au - Austen, Co - Conrad, De - Defoe, Di - Dickens, Do - Doyle, El - Eliot, Or - Orwell, Tw - Twain.

      Au   Co   De   Di   Do   El   Or   Tw
Au   .20  .17  .18  .12  .00  .22  .11  .00
Co   .12  .62  .00  .06  .02  .07  .03  .08
De   .11  .00  .52  .11  .09  .14  .01  .02
Di   .16  .03  .03  .21  .22  .18  .09  .08
Do   .02  .10  .00  .22  .33  .01  .14  .18
El   .17  .16  .13  .04  .00  .46  .00  .04
Or   .06  .03  .05  .04  .05  .01  .55  .21
Tw   .00  .01  .03  .02  .10  .13  .22  .49

(b) The classification of Polish books. The authors are denoted by the first two letters of their surnames: Ko - Korczak, Kr - Kraszewski, La - Lam, Or - Orzeszkowa, Pr - Prus, Re - Reymont, Si - Sienkiewicz, Że - Żeromski.
      Ko   Kr   La   Or   Pr   Re   Si   Że
Ko   .22  .22  .16  .09  .01  .01  .15  .14
Kr   .21  .35  .03  .07  .01  .29  .00  .04
La   .06  .04  .55  .05  .12  .00  .03  .15
Or   .16  .04  .02  .34  .13  .00  .12  .19
Pr   .01  .00  .08  .18  .63  .00  .10  .00
Re   .01  .12  .00  .00  .00  .50  .18  .19
Si   .10  .00  .07  .06  .15  .01  .61  .00
Że   .22  .03  .08  .08  .00  .28  .00  .31
The analysis of the local parameters of networks is an approach different from what has been done in the previous steps. Instead of studying the properties of texts by determining quantities describing the structure of the whole networks, here the characteristics of the individual vertices (words) are considered. In each language, the n most frequently occurring words are chosen from the whole set of texts. Each book is transformed into a network, and the characteristics of the vertices corresponding to the selected words are determined. The characteristics taken into account are: vertex degree, local clustering coefficient, and average shortest path length, all in both the unweighted and the weighted version. All of these quantities are normalized in the same manner as in the previous calculations, by dividing their values by the average values obtained for the same word in a randomized network. Vertex strength (weighted degree) is an exception here: since it is equal to the frequency of the related word multiplied by 2 (and frequency is invariant under text randomization), its normalized value would be equal to 1 for any word. Therefore str̃(v), the normalized strength of a vertex v, is defined as the strength of vertex v in the original network divided by the sum of the strengths of all vertices in the same network: str̃(v) = str(v) / Σ_{u∈V} str(u). This is equal to the word's relative frequency (the fraction of the total number of words in the text that comes from the occurrences of the considered word).

The sets of parameters related to the n most frequent words in each text are again supplied to the data clustering and classification algorithms. Figure 7 presents the dendrogram of the clustering in the 12-dimensional space of the weighted clustering coefficients of the n = 12 most frequent words. The results of the text classification in this space are shown in Table 3.
The obtained overall classification accuracy is 90% with a standard deviation of 8% for the English texts, and 86% with a standard deviation of 8% for the Polish ones.

The choice of the number of words, n = 12, and of the weighted clustering coefficient C̃_w as the only parameter considered in the above example is justified by studying the effectiveness of classification with each of the network characteristics separately. It can be noticed (Figure 8) that the weighted clustering coefficient gives the best results in English, and one of the best in Polish, for a wide range of the numbers of most frequent words studied. In both languages, it is sufficient to analyse the 11-12 most frequent words to obtain an accuracy of 85-90%; a further increase in the number of words does not improve the classifier's performance.

From the obtained results, it can be concluded that analysing the local parameters of selected words leads to a much better distinguishability between the authors than studying the global characteristics of the networks. In the "local approach" the number of considered attributes of each text can be much greater than when studying the networks globally, as one can choose the number of analysed words arbitrarily. However, even for an equal dimension of the attribute space, the networks' local properties (the weighted clustering coefficient, for example) provide a significantly higher classification accuracy.

Figure 7: The dendrograms of the hierarchical clustering of (a) English and (b) Polish books, in the space of the weighted clustering coefficients of the networks' nodes corresponding to the 12 most frequent words. Each text is labeled by the surname of its author.
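A minimal sketch of the word-adjacency network construction and of the normalized vertex strength discussed above; the tokenizer and the toy sentence are illustrative assumptions (the actual preprocessing follows Section 2), and punctuation marks are treated as ordinary words, as in the studied approach:

```python
from collections import Counter
import re

def tokenize(text):
    """Words and punctuation marks as separate tokens."""
    return re.findall(r"[\w']+|[.,;:!?]", text.lower())

def word_adjacency_network(tokens):
    """Weighted undirected network: edge weight = number of times the
    two tokens appear next to each other in the text."""
    weights = Counter()
    for a, b in zip(tokens, tokens[1:]):
        weights[frozenset((a, b)) if a != b else (a, a)] += 1
    return weights

def strengths(tokens):
    """Vertex strength = sum of incident edge weights; here computed
    directly as the number of adjacencies each token takes part in."""
    s = Counter()
    for a, b in zip(tokens, tokens[1:]):
        s[a] += 1
        s[b] += 1
    return s

text = "the cat sat on the mat , and the dog sat too ."
toks = tokenize(text)
net = word_adjacency_network(toks)
s = strengths(toks)
total = sum(s.values())          # equals 2 * (len(toks) - 1)
rel_freq = {t: c / len(toks) for t, c in Counter(toks).items()}
# The normalized strength approximates the relative frequency exactly,
# up to the two tokens at the text boundaries:
print(abs(s["the"] / total - rel_freq["the"]) < 0.05)
```

For a book-length text the boundary correction is negligible, which is why the normalized strength can be identified with the word's relative frequency in the discussion above.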
It can be found in the literature on computational stylometry that a frequently utilized approach to authorship attribution is representing a text as a collection of words along with their frequencies (the so-called bag-of-words representation), and that comparing the frequencies of particular words is among the most reliable methods of discriminating between authors [20, 29, 33, 39, 44, 49]. As mentioned in Section 2, the normalized node strength in a weighted word-adjacency network is equal to the corresponding word's relative frequency; therefore, network analysis may be viewed as a generalization of the bag-of-words approach: it incorporates all the information about the numbers of word occurrences.

The fact that the word frequency works well in identifying authors can be observed in Figure 8. The classification based on the normalized vertex strength leads to the best, or one of the best, results among all the studied network characteristics. A question arises whether it is viable to introduce the network formalism when a satisfying result can be obtained by studying only the numbers of word occurrences. It turns out that a profit can be made from combining the purely network-based parameters with the frequencies. For example, a classification in the space constructed from both the relative frequencies and the normalized weighted clustering coefficients leads to an accuracy higher than that achievable by either of the two methods separately, of course at the cost of doubling the dimension of the attribute space. This is shown in Figure 9. Furthermore, these two parameters seem to exploit all the available information that is useful with respect to authorship recognition, because including further network characteristics in the classification does not improve its accuracy.

Table 3: The results of the classification of (a) English and (b) Polish books with respect to the authorship, in the space of the weighted clustering coefficients of the networks' nodes corresponding to the 12 most frequent words.
The number in the i-th row and j-th column is the probability of classifying a text of the i-th author as a text of the j-th author.

(a) The classification of English books. The authors are denoted by the first two letters of their surnames: Au - Austen, Co - Conrad, De - Defoe, Di - Dickens, Do - Doyle, El - Eliot, Or - Orwell, Tw - Twain.

      Au   Co   De   Di   Do   El   Or   Tw
Au   .96  .00  .00  .01  .01  .02  .00  .00
Co   .00  .92  .00  .00  .00  .04  .04  .00
De   .00  .00 1.00  .00  .00  .00  .00  .00
Di   .01  .00  .00  .88  .04  .00  .00  .07
Do   .01  .00  .00  .05  .93  .01  .00  .00
El   .02  .02  .00  .03  .01  .87  .02  .03
Or   .00  .13  .00  .01  .01  .07  .78  .00
Tw   .00  .01  .00  .00  .01  .02  .07  .89

(b) The classification of Polish books. The authors are denoted by the first two letters of their surnames: Ko - Korczak, Kr - Kraszewski, La - Lam, Or - Orzeszkowa, Pr - Prus, Re - Reymont, Si - Sienkiewicz, Że - Żeromski.
      Ko   Kr   La   Or   Pr   Re   Si   Że
Ko   .73  .00  .00  .00  .00  .06  .03  .18
Kr   .00  .80  .00  .12  .00  .00  .05  .03
La   .00  .00 1.00  .00  .00  .00  .00  .00
Or   .00  .15  .00  .83  .00  .00  .01  .01
Pr   .00  .00  .00  .00 1.00  .00  .00  .00
Re   .04  .00  .00  .00  .00  .77  .02  .17
Si   .00  .05  .00  .00  .04  .01  .90  .00
Że   .01  .15  .00  .00  .00  .03  .02  .79
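The weighted clustering coefficient invoked here, described later in the text as the contribution to a vertex's strength coming from pairs of its neighbours that are connected by an edge, can be sketched in the form proposed by Barrat et al.; whether the paper uses exactly this variant is an assumption. The normalization C̃_w (not shown) would divide this value by its average over randomized networks:

```python
def barrat_clustering(adj, v):
    """Weighted clustering coefficient of vertex v in the Barrat form:
    the strength-weighted fraction of v's neighbour pairs that are
    themselves linked. `adj` maps each vertex to {neighbour: weight}."""
    neigh = adj[v]
    k = len(neigh)
    if k < 2:
        return 0.0
    s = sum(neigh.values())                 # vertex strength
    contrib = 0.0
    for j in neigh:
        for h in neigh:
            if j < h and h in adj[j]:       # neighbours j and h are linked
                contrib += (neigh[j] + neigh[h]) / 2
    # factor 2 because each unordered neighbour pair is counted once
    return 2 * contrib / (s * (k - 1))

# Toy weighted network: "a" linked to "b", "c", "d"; only "b"-"c" closed.
adj = {
    "a": {"b": 1, "c": 1, "d": 2},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1},
    "d": {"a": 2},
}
print(barrat_clustering(adj, "a"))  # 0.25
```

With all weights equal, the formula reduces to the ordinary unweighted clustering coefficient, which is the property that makes the weighted and unweighted versions directly comparable.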
From a practical point of view, the above result may be useful in classifying texts in which the set of available words is limited. This may happen, for example, when the studied texts are much shorter than those considered here, and the words that are most frequent in them are topic-specific and do not carry valuable information about an individual writing style. According to Zipf's law, the frequency of the n-th most frequent word in a text is proportional to 1/n, so the number of words with a number of occurrences high enough to be reasonably included in the analysis decreases substantially when texts are shorter.

The results presented in Figure 8 may suggest which properties of the networks seem to be in a way universal for the language, and which differ among authors. Aside from the word frequency, the weighted clustering coefficient turned out to be the most effective in capturing the diversity of writing styles, when considering both the English and the Polish texts. The (unnormalized) clustering coefficient C_w(v) of a vertex v describes the structure of the neighbourhood of v: it measures the contribution to the strength of v coming from pairs of its neighbours that are connected by an edge. The normalized coefficient, C̃_w(v), reflects the deviation of this structure from the structure that would typically be observed in a randomized network. The fact that it allows for a quite accurate classification suggests that the organization of such local structures in the word-adjacency networks may be considered a feature of individual language use that is able to distinguish one author from another.

On the other hand, the average shortest path length, especially in its weighted version, gives the weakest classifier. Thus, one may anticipate that this quantity is shared among the word-adjacency networks even when they are associated with different language users.
Indeed, during the construction of a word-adjacency network from any sufficiently long text, the set of the most frequent words forms a densely connected cluster. Such a cluster is a subnetwork in which any two vertices are connected by a path of small length. This can be noticed when comparing the average shortest path length in such a subnetwork, ℓ_sub, with the average shortest path length in the whole network, ℓ. Figure 11 presents the strip charts of the distributions of the quotient ℓ_sub/ℓ for all the studied networks, with the subnetworks consisting of the 50 most frequent words. For the unweighted networks, the value of ℓ_sub/ℓ is usually around 0.4-0.5 (because the smallest possible distance between two distinct vertices is equal to 1); in the weighted networks, however, it is typically smaller than 0.1 (as distances can become arbitrarily small when the edge weights increase). Hence, in the weighted networks, for any v being one of, say, the 50 most frequent words, the local average shortest path length ℓ(v) has roughly the same value. This is the reason why the average shortest path length gives poor insight into the way particular words are used by different authors. The existence of a cluster with a high edge density, composed of the most frequent words, is observable both in the original and in the randomized networks; it can therefore be viewed as an effect of the word-adjacency network construction principle, related to the word frequency distribution.

Figure 8: The classification of books in the feature spaces constructed from a single network parameter determined for a set of the n words occurring most frequently in the whole collection of books.
Charts (a) and (b) present the average classification error as a function of n, for English and Polish books, respectively. Each point on a chart represents the average classification error in the test set, obtained in one experiment. One experiment consists of selecting the n most frequent words, calculating one network parameter for each of these words in each text, and performing the cross-validation of the classification in the so-obtained n-dimensional space, 10000 times. The studied network characteristics are (all normalized): vertex degree and strength (deg̃(v), str̃(v)), unweighted and weighted average shortest path length (ℓ̃_u(v), ℓ̃_w(v)), and unweighted and weighted clustering coefficient (C̃_u(v), C̃_w(v)).

Figure 9: The classification of books in the feature spaces constructed from sets of network parameters determined for a set of the n words occurring most frequently in the whole collection of books. Charts (a) and (b) present the average classification error in the test set (obtained from 10000 repetitions of cross-validation) as a function of n, for English and Polish books, respectively. Three sets of quantities describing the words in the texts were investigated: (1) the normalized vertex strength str̃(v), (2) the normalized weighted clustering coefficient C̃_w(v), and (3) the normalized vertex strength str̃(v) together with the normalized weighted clustering coefficient C̃_w(v).
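For the unweighted case, the quotient ℓ_sub/ℓ discussed above can be computed with plain breadth-first search. The toy network below is an illustrative assumption, with a small dense core mimicking the cluster of the most frequent words; the weighted case would instead require Dijkstra's algorithm with the weight-to-distance conversion defined in Section 2:

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from `source` (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_shortest_path(adj, nodes=None):
    """Average over all ordered pairs of distinct, mutually reachable
    vertices; if `nodes` is given, paths are restricted to the induced
    subnetwork."""
    if nodes is None:
        sub = adj
    else:
        nodes = set(nodes)
        sub = {u: [v for v in adj[u] if v in nodes] for u in nodes}
    total = pairs = 0
    for u in sub:
        d = bfs_distances(sub, u)
        for v in sub:
            if v != u and v in d:
                total += d[v]
                pairs += 1
    return total / pairs

# Toy network: a dense core {a, b, c, d} (all mutually linked) with a
# pendant path e - f attached; the core mimics the frequent-word cluster.
adj = {
    "a": ["b", "c", "d", "e"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["a", "b", "c"],
    "e": ["a", "f"], "f": ["e"],
}
l_full = average_shortest_path(adj)
l_sub = average_shortest_path(adj, nodes={"a", "b", "c", "d"})
print(l_sub / l_full < 1)  # the dense core has the shorter average path
```

Repeating such a ratio computation over many networks yields the distributions shown in Figure 11.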
It may be interesting to see how including punctuation in the calculations affects the distinguishability of authors. It turns out that the utilized approach of treating punctuation marks as ordinary words [22] is justified: the classification accuracy decreases if punctuation is removed, even when the dimension of the attribute space is preserved (the removed punctuation marks are replaced with subsequent words from the list of the most frequent words). For example, for the normalized weighted clustering coefficient C̃_w(v) of the n = 12 most frequent words (not including punctuation marks), the average classification accuracy is 75% for English and 76% for Polish. Such a substantial drop from the values obtained previously suggests that the role of punctuation in expressing an individual writing style is non-negligible.

Another interesting subject is how the classification is affected by the length of the timespan during which the studied books were published. The publication dates of the studied books vary from 1719 to 1949. It is well known that language evolves over time; some aspects of its evolution can be captured by quantitative means and related to processes taking place in a society and its culture [24, 25, 31]. Such cultural and linguistic changes may possibly affect authorship attribution: it can be suspected that literary works written around the same time are more likely to be confused than those separated by a long time interval. In order to get some preliminary insight into this issue (based on the texts selected here), the following approach is utilized. For each author, a "time centroid" is determined as the arithmetic mean of the publication dates of the works studied in this paper. Then a distance between the centroids (in years) is assigned to each pair of authors and treated as the time interval separating them.
Next, for each pair of authors, a classification considering only their works is performed multiple times (the feature space consists of the weighted degrees and clustering coefficients of the 12 most frequent words; the texts of each author are split into a training and a test set with ratio 4:2). The error rates of such pairwise classifications are then plotted against the time intervals (Figure 10) to assess whether the authors separated by greater time intervals are easier to distinguish. At first glance, it seems that in the case of English, for writers separated by more than 100 years the probability of misclassification is lower than 2%, while for the others it varies more significantly. However, it must be noticed that all but one of the time intervals greater than 100 years pertain to Daniel Defoe, who lived much earlier than the rest of the writers. When considering only the time intervals shorter than 100 years (Figure 10, inset), no clear relationship between the classification accuracy and the length of the time interval separating the authors can be observed. It must be remembered, though, that these results do not pretend to be general: they apply to the set of studied texts, but not necessarily to other sets of literary works; a reliable analysis of the effect of language evolution on authorship attribution, possibly with the proposed approach, would require texts from a larger timespan.

Two additional remarks regarding the obtained results can be made. Firstly, it can be noticed in Figure 9 that for the Polish books, reaching the limit of the classification accuracy requires fewer words than for the books in English (this is most visible in the case of the classification based on str̃(v), and on both str̃(v) and C̃_w(v)).
Secondly, the differences between the accuracies of the classification based on the weighted network characteristics and the accuracies of the classification employing the unweighted versions of these characteristics are smaller for the Polish texts than for the English ones. This can be observed in Figure 8, for example, for the clustering coefficient.

These effects may perhaps be related to general structural differences between the languages: while in Polish, as in all Slavic languages, the syntactic role of a word in a sentence is determined mainly by inflection, English syntax relies more on the word order and on function words (articles, auxiliary verbs, etc.). Function words are among the most frequent ones; one may anticipate that in a language whose grammar imposes stronger conditions on their usage and order, there is less room for the diversity related to an individual writing style. In such a case, more words need to be included in the analysis to obtain a reliable accuracy of the authorship classification. Inflection, on the other hand, increases the number of words that are treated as distinct, since the words are not lemmatized during the construction of the networks. If one constructs weighted word-adjacency networks from English and Polish text samples of equal length, these networks will have the same sum of edge weights, but the network corresponding to the Polish text will have more vertices. Since word-adjacency networks are connected (every vertex is reachable from any other vertex) and have edge weights expressed by natural numbers, it can be hypothesized that from
two weighted networks with the same sum of edge weights, the one with the larger number of vertices should generally have more low-weight edges and should therefore be more similar to an unweighted network. If so, the weighted and the unweighted characteristics of the networks should differ less for the Polish texts than for the English ones, as observed. It must be remembered, however, that the presented explanations can only be treated as suppositions; reliably determining the relationship between the two discussed effects and the attributes of the languages would require a separate analysis, which is beyond the scope of this paper.

Figure 10: The scatterplot of the error rates of the pairwise classifications versus the time intervals separating the classified writers. Each marker denotes one pair of authors: squares for English, and dots for Polish. All points with a time interval greater than 100 years, except one, correspond to pairs in which one of the authors is Daniel Defoe. These points are labeled with the writers' names, abbreviated in the same manner as in Tables 1, 2, and 3. For the remaining pairs, the scatterplot is presented in more detail in the inset.
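The "time centroid" construction used in the analysis of publication dates above reduces to a few lines; the dates below are hypothetical placeholders, not the actual publication years of the studied works:

```python
from statistics import mean
from itertools import combinations

def time_centroids(publication_dates):
    """Per-author "time centroid": the arithmetic mean of the
    publication dates of that author's studied works."""
    return {author: mean(dates) for author, dates in publication_dates.items()}

def pairwise_intervals(centroids):
    """Distance in years between the centroids of every pair of authors."""
    return {frozenset(pair): abs(centroids[pair[0]] - centroids[pair[1]])
            for pair in combinations(sorted(centroids), 2)}

# Hypothetical publication dates, for illustration only.
dates = {"Defoe": [1719, 1722], "Dickens": [1849, 1861]}
cents = time_centroids(dates)
ivals = pairwise_intervals(cents)
print(ivals[frozenset(("Defoe", "Dickens"))])  # 134.5
```

Plotting the pairwise classification error against these intervals gives scatterplots of the kind shown in Figure 10.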
4. Summary
The presented results confirm the usability of complex networks as a representation of natural language. Though simple in construction and intuitive, they carry within their structure exploitable information about the underlying language sample. Constructing such networks from English and Polish literary texts and studying their properties has made it possible to distinguish individual writing styles. The fact that the presented approach works well for both Polish and English texts suggests that it can probably be treated as applicable to a large group of languages, as Polish and English are examples of languages substantially different from each other in terms of origin, morphology, and grammatical features.

It can be observed that the vectors representing texts belonging to the same author have a noticeable tendency to group together in the space of global network characteristics, both in the case of the unweighted and of the weighted networks. Investigating this tendency, by employing decision tree ensembles to perform the classification of texts with respect to authorship, with the network parameters as attributes, results in a classification accuracy higher than that of a random choice; nevertheless, it is rather questionable to state that the global properties of networks are able to distinguish between authors effectively (at least within the studied set of texts). The networks' property of having different global parameters for texts with different authorship, observed for selected groups of authors (Figure 6), becomes much less evident as the number of authors grows. This suggests that the global features of word-adjacency networks are to a considerable extent universal for the language and non-specific to individual language users.
Figure 11: The distributions of the quantity ℓ_sub/ℓ, for the networks constructed from (a) English and (b) Polish books. ℓ_sub denotes the global average shortest path length in the subnetwork consisting of the 50 most frequent words, and ℓ denotes the average shortest path length in the whole network. Both the original and the randomized networks are considered, in both the weighted and the unweighted version.

However, a division of the set of texts into subsets consisting of the works of the same author is possible by analysing the local properties of the networks. Calculating the vertex parameters corresponding to the several most frequent words (or punctuation marks), creating a vector of these parameters for each text, and applying a clustering or classification algorithm allows for an evident separation of the texts with respect to their authorship. Among the studied network characteristics, the vertex strength and the weighted clustering coefficient (in an appropriately normalized form) turn out to be the most effective in capturing individual language features, for both the English and the Polish texts. Data clustering in the space of the clustering coefficients of only the 12 most frequent words (including punctuation marks) reveals the emergence of groups of texts that can clearly be related to particular authors. The classification with the decision tree ensembles yields an average accuracy of around 85-90%. This confirms that the attributes of selected vertices in a word-adjacency network can to a large extent be treated as typical of an individual style.
As these attributes describe a given vertex along with its network neighbourhood, it can be concluded that they are able to capture the specific way individual authors use particular words and punctuation marks in their writing.

One could argue that, because the vertex strength is directly related to the word frequency, and because various local characteristics may be correlated with vertex strength, the method of distinguishing between authors based on an analysis of these characteristics does not differ significantly from the basic approach to authorship attribution, which relies on comparing word frequencies. However, it must be remembered that, in order to eliminate purely frequency-based effects on the characteristics other than vertex strength, all the studied quantities are normalized. This is done by relating the characteristics' values to the average values in the networks corresponding to randomized texts. It may therefore be stated that the analysis of word frequencies and the analysis of network parameters are distinct; moreover, the fact that network parameters and word frequencies combined lead to better classification results than either of them studied separately indicates that these two types of text features carry complementary, non-redundant information.

From a practical perspective, the correspondence between the text authorship and the structure of the word-adjacency network seems to be particularly useful when combined with other lexical features of the text. As mentioned above, a classification involving word frequencies (which do not have to be treated as network parameters) and clustering coefficients (which are purely network-based) achieves better results than either of the two methods separately; with such an approach it turns out to be sufficient to select only 4-5 words to obtain an average of 80% correct classifications.
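The frequency-removing normalization mentioned above can be sketched as follows. This is a simplified illustration assuming networkx: a node's weighted clustering coefficient is divided by its average over networks built from shuffled versions of the same text, which preserve word frequencies but destroy the authorial word order. The exact rescaling used in the paper may differ in detail.

```python
# Sketch: relate a node's weighted clustering coefficient to its average
# over networks built from randomly shuffled versions of the text.
import random
import networkx as nx

def wan(tokens):
    """Weighted word-adjacency network; self-loops are skipped."""
    g = nx.Graph()
    for a, b in zip(tokens, tokens[1:]):
        if a == b:
            continue
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

def normalized_clustering(tokens, word, n_shuffles=20, seed=0):
    """Observed weighted clustering of `word`, divided by its mean over
    networks from shuffled token sequences (same frequencies, random order)."""
    observed = nx.clustering(wan(tokens), word, weight="weight")
    rng = random.Random(seed)
    values = []
    for _ in range(n_shuffles):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        values.append(nx.clustering(wan(shuffled), word, weight="weight"))
    mean = sum(values) / len(values)
    return observed / mean if mean > 0 else float("nan")

tokens = ("the cat sat on the mat . " * 20).split()
print(normalized_clustering(tokens, "the"))
```

A ratio close to 1 indicates that a word's clustering is explained by its frequency alone; marked deviations point to author-specific usage patterns.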
When dealing with the problem of authorship attribution in textual data sets where the number of words available for analysis is limited, it may be profitable to combine the extraction of basic text features with network analysis, especially since the latter requires little preprocessing.

It is worth noting that the presented approach to exploring the diversity of language styles is simple and straightforward: it relies on studying the characteristics of a few of the most frequent words and punctuation marks. In a future study, it would be worthwhile to examine the possibility of selecting words by criteria other than frequency, and to investigate whether there are other features of language that can be captured by studying the structure of linguistic networks.
References

[1] Akimushkin, C., Amancio, D. R., & Oliveira, O. N. (2017). Text authorship identified using the dynamics of word co-occurrence networks. PLOS ONE.
[2] Amancio, D., Nunes, M., Oliveira, O., Pardo, T., Antiqueira, L., & da F. Costa, L. (2011). Using metrics from complex networks to evaluate machine translation. Physica A: Statistical Mechanics and its Applications.
[3] Antiqueira, L., Oliveira, O. N., da Fontoura Costa, L., & das Graças Volpe Nunes, M. (2009). A complex network approach to text summarization. Information Sciences.
[4] Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys.
[5] Ausloos, M. (2012). Generalized Hurst exponent and multifractal function of original and translated texts mapped into frequency and length time series. Physical Review E.
[6] Barrat, A., Barthelemy, M., Pastor-Satorras, R., & Vespignani, A. (2004). The architecture of complex weighted networks. Proceedings of the National Academy of Sciences.
[7] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment.
[8] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D. (2006). Complex networks: Structure and dynamics. Physics Reports.
[9] de Boer, B. (2011). Self-organization and language evolution. In The Oxford Handbook of Language Evolution. Oxford University Press. URL: https://doi.org/10.1093/oxfordhb/9780199541119.013.0063.
[10] Chatzigeorgiou, M., Constantoudis, V., Diakonos, F., Karamanos, K., Papadimitriou, C., Kalimeri, M., & Papageorgiou, H. (2017). Multifractal correlations in natural language written texts: Effects of language family and long word statistics. Physica A: Statistical Mechanics and its Applications.
[11] Dall'Asta, L., & Baronchelli, A. (2006). Microscopic activity patterns in the naming game. Journal of Physics A: Mathematical and General.
[12] Drożdż, S., Kulig, A., Kwapień, J., Niewiarowski, A., & Stanuszek, M. (2017). Hierarchical organization of H. Eugene Stanley scientific collaboration community in weighted network representation. Journal of Informetrics.
[13] Drożdż, S., Oświęcimka, P., Kulig, A., Kwapień, J., Bazarnik, K., Grabska-Gradzińska, I., Rybicki, J., & Stanuszek, M. (2016). Quantifying origin and character of long-range correlations in narrative texts. Information Sciences.
[14] Egghe, L. (2007). Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments. Journal of the American Society for Information Science and Technology.
[15] da F. Costa, L., Rodrigues, F. A., Travieso, G., & Boas, P. R. V. (2007). Characterization of complex networks: A survey of measurements. Advances in Physics.
[16] Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999). On power-law relationships of the internet topology. In Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication - SIGCOMM'99. ACM Press.
[17] Grabska-Gradzińska, I., Kulig, A., Kwapień, J., & Drożdż, S. (2012). Complex networks analysis of literary and scientific texts. International Journal of Modern Physics C.
[18] van der Hofstad, R. (2016). Random Graphs and Complex Networks. Cambridge University Press.
[19] Kalampokis, A., Kosmidis, K., & Argyrakis, P. (2007). Evolution of vocabulary on scale-free and random networks. Physica A: Statistical Mechanics and its Applications.
[20] Koppel, M., Schler, J., & Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology.
[21] Kulig, A., Drożdż, S., Kwapień, J., & Oświęcimka, P. (2015). Modeling the average shortest-path length in growth of word-adjacency networks. Physical Review E.
[22] Kulig, A., Kwapień, J., Stanisz, T., & Drożdż, S. (2017). In narrative texts punctuation marks obey the same statistics as words. Information Sciences.
[23] Kwapień, J., & Drożdż, S. (2012). Physical approach to complex systems. Physics Reports.
[24] Lansdall-Welfare, T., Sudhahar, S., Thompson, J., Lewis, J., & Cristianini, N. (2017). Content analysis of 150 years of British periodicals. Proceedings of the National Academy of Sciences.
[25] Lansdall-Welfare, T., Sudhahar, S., Veltri, G. A., & Cristianini, N. (2014). On the coverage of science in the media: A big data study on the impact of the Fukushima disaster. IEEE.
[26] Leung, C., & Chau, H. (2007). Weighted assortative and disassortative networks model. Physica A: Statistical Mechanics and its Applications.
[27] Liu, H., & Hu, F. (2008). What role does syntax play in a language network? EPL (Europhysics Letters).
[28] Liu, H., & Xu, C. (2011). Can syntactic networks indicate morphological complexity of a language? EPL (Europhysics Letters).
[29] Madigan, D., Genkin, A., Lewis, D. D., Argamon, S., Fradkin, D., & Ye, L. (2005). Author identification on the large scale. In Proc. of the Meeting of the Classification Society of North America.
[30] Mehri, A., Darooneh, A. H., & Shariati, A. (2012). The complex networks approach for authorship attribution of books. Physica A: Statistical Mechanics and its Applications.
[31] Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M. A., & Aiden, E. L. (2010). Quantitative analysis of culture using millions of digitized books. Science.
[32] Mihalcea, R. F., & Radev, D. R. (2011). Graph-based Natural Language Processing and Information Retrieval (1st ed.). New York, NY, USA: Cambridge University Press.
[33] Neal, T., Sundararajan, K., Fatima, A., Yan, Y., Xiang, Y., & Woodard, D. (2017). Surveying stylometry techniques and applications. ACM Computing Surveys.
[34] Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review.
[35] Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review.
[36] Project Gutenberg. https://www.gutenberg.org/ (accessed: 30.03.2017).
[37] Saramäki, J., Kivelä, M., Onnela, J.-P., Kaski, K., & Kertész, J. (2007). Generalizations of the clustering coefficient to weighted complex networks. Physical Review E.
[38] Sienkiewicz, J., & Hołyst, J. A. (2005). Statistical analysis of 22 public transport networks in Poland. Physical Review E.
[39] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology.
[40] Steels, L. (2011). Modeling the cultural evolution of language. Physics of Life Reviews.
[41] Sutton, C. D. (2005). Classification and regression trees, bagging, and boosting. In Handbook of Statistics. Elsevier.
[42] Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Pearson.
[43] Valencia, M., Pastor, M. A., Fernández-Seara, M. A., Artieda, J., Martinerie, J., & Chavez, M. (2009). Complex modular structure of large-scale brain networks. Chaos: An Interdisciplinary Journal of Nonlinear Science.
[44] de Vel, O., Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record.
[45] Wikisource, the free library. https://wikisource.org/ (accessed: 30.03.2017).
[46] Wolne Lektury. https://wolnelektury.pl/ (accessed: 30.03.2017).
[47] Yang, T., Gu, C., & Yang, H. (2016). Long-range correlations in sentence series from 'A Story of the Stone'. PLOS ONE.
[48] Yang, Y., Gu, C., Xiao, Q., & Yang, H. (2017). Evolution of scaling behaviors embedded in sentence series from 'A Story of the Stone'. PLOS ONE.
[49] Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology.

Appendix

Table A.1: The set of English books studied in this paper. Usual words and punctuation marks are listed separately, although they are treated in the same way during the analysis. The average length of a sentence is the average number of words between punctuation marks denoting the end of a sentence (full stop, question mark, exclamation mark, ellipsis).

Arthur Conan Doyle (1859-1930)
Micah Clarke
The Adventures of Sherlock Holmes
The Exploits of Brigadier Gerard
The Lost World
The Refugees
The Valley of Fear

Charles Dickens (1812-1870)
A Tale of Two Cities
Barnaby Rudge
David Copperfield
Oliver Twist
The Mystery of Edwin Drood
The Pickwick Papers

Daniel Defoe (1660-1731)
Colonel Jack
Memoirs of a Cavalier
Roxana: The Fortunate Mistress
Moll Flanders
Robinson Crusoe
Captain Singleton

George Eliot (1819-1880)
Adam Bede
Daniel Deronda
Felix Holt, the Radical
Middlemarch
Romola
The Mill on the Floss

George Orwell (1903-1950)
Animal Farm
Burmese Days
Coming up for Air
Down and Out in Paris and London
Keep the Aspidistra Flying
Nineteen Eighty-Four

Jane Austen (1775-1817)
Emma
Mansfield Park
Northanger Abbey
Persuasion
Pride and Prejudice
Sense and Sensibility

Joseph Conrad (1857-1924)
An Outcast of the Islands
Chance: A Tale in Two Parts
Lord Jim
Nostromo: A Tale of the Seaboard
Under Western Eyes
Victory: An Island Tale

Mark Twain (1835-1910)
Following the Equator
Life on the Mississippi
The Adventures of Huckleberry Finn
The Adventures of Tom Sawyer
The Innocents Abroad
The Prince and the Pauper

Table A.2: The set of Polish books studied in this paper. Usual words and punctuation marks are listed separately, although they are treated in the same way during the analysis. The average length of a sentence is the average number of words between punctuation marks denoting the end of a sentence (full stop, question mark, exclamation mark, ellipsis).

Bolesław Prus (1847-1912)
Anielka
Dzieci
Emancypantki
Faraon
Lalka
Placówka

Eliza Orzeszkowa (1841-1910)
Cham
Dziurdziowie
Jędza
Marta
Meir Ezofowicz
Nad Niemnem

Henryk Sienkiewicz (1846-1916)
Ogniem i mieczem
Potop
Quo Vadis
Rodzina Połanieckich
W pustyni i w puszczy
Wiry

Jan Lam (1838-1886)
Dziwne karyery
Humoreski
Idealiści
Koroniarz w Galicyi
Rozmaitości i powiastki
Wielki świat Capowic

Janusz Korczak (1878-1942)
Bankructwo małego Dżeka
Dzieci ulicy
Dziecko salonu
Kajtuś czarodziej
Król Maciuś na wyspie bezludnej
Król Maciuś Pierwszy

Józef Ignacy Kraszewski (1812-1887)
Barani Kożuszek
Boża opieka
Boży gniew
Bracia rywale
Infantka
Złote jabłko

Stefan Żeromski (1864-1925)
Dzieje grzechu
Ludzie bezdomni
Popioły
Przedwiośnie
Syzyfowe prace
Wierna rzeka

Władysław Reymont (1867-1925)
Bunt
Fermenty
Lili
Marzyciel
Wampir
Ziemia obiecana (1899; 170.9 thousand words; 38.0 thousand punctuation marks; 13.3 thousand sentences; average sentence length 12.9)