Quantifying origin and character of long-range correlations in narrative texts
Stanisław Drożdż, Paweł Oświęcimka, Andrzej Kulig, Jarosław Kwapień, Katarzyna Bazarnik, Iwona Grabska-Gradzińska, Jan Rybicki, Marek Stanuszek
Stanisław Drożdż a,d,*, Paweł Oświęcimka a, Andrzej Kulig a, Jarosław Kwapień a, Katarzyna Bazarnik b, Iwona Grabska-Gradzińska c, Jan Rybicki b, Marek Stanuszek d
a Complex Systems Theory Department, Institute of Nuclear Physics, Polish Academy of Sciences, ul. Radzikowskiego 152, 31-342 Kraków, Poland
b Institute of English Studies, Faculty of Philology, Jagiellonian University, ul. prof. S. Łojasiewicza 4, 30-348 Kraków, Poland
c Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, ul. Łojasiewicza 11, 30-348 Kraków, Poland
d Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology, ul. Warszawska 24, 31-155 Kraków, Poland
Abstract
In natural language, using short sentences is considered efficient for communication. However, a text composed exclusively of such sentences looks technical and reads as boring. A text composed of long ones, on the other hand, demands significantly more effort for comprehension. Studying characteristics of the sentence length variability (SLV) in a large corpus of world-famous literary texts shows that an appealing and aesthetic optimum appears somewhere in between and involves a self-similar, cascade-like alternation of sentences of various lengths. A related quantitative observation is that the power spectra $S(f)$ of the SLV so characterized universally develop a convincing '$1/f^\beta$' scaling with the average exponent $\beta \approx 1/2$, close to what has been identified before in musical compositions or in brain waves. An overwhelming majority of the studied texts simply obey such fractal attributes, but especially spectacular in this respect are hypertext-like, "stream of consciousness" novels. In addition, they appear to develop structures characteristic of irreducibly interwoven sets of fractals called multifractals. Scaling of $S(f)$ in the present context implies the existence of long-range correlations in texts, and the appearance of multifractality indicates that they carry even a nonlinear component. A distinct role of the full stops in inducing the long-range correlations in texts is evidenced by the fact that the above quantitative characteristics of the long-range correlations manifest themselves in the variation of the full-stop recurrence times along texts, thus in SLV, but to a much lesser degree in the recurrence times of the most frequent words. In this latter case the nonlinear correlations, thus multifractality, disappear completely for all the texts considered. Treated as one extra word, the full stops nevertheless appear to obey the Zipfian rank-frequency distribution.

* Corresponding author. Email address: [email protected] (Stanisław Drożdż)
Preprint submitted to Journal of Information Sciences, October 15, 2015
Keywords: natural language, consciousness, correlations, multifractals, hypertext
1. Introduction
Mirroring cultural progress [13, 1], natural languages - the most imaginative carriers of information, and the principal clue to the mind and to consciousness [24] - have developed during their evolution remarkable quantifiable patterns of behaviour, such as hierarchical structure in their syntactic organization [32, 39], a corresponding lack of characteristic scale [41, 38] as evidenced by the celebrated Zipf law [52], small-world properties [22, 14, 33], long-range correlations in the use of words [42, 7, 8, 3], or a stretched exponential distribution [36] of word recurrence times [4]. A majority of such patterns are common to a large class of natural systems known as complex systems [46, 34]. Without doubt, language constitutes a great complexity, as it is especially true for language [28] that "more is different" [5], and language has the capacity to generate an infinite range of expressions from a finite set of elements [26, 21]. This suggests inspecting correlations also among linguistic constructs longer than mere words. The most natural of them are sentences - strings of words structured according to syntactical principles [2, 11]. Typically it is within a sentence that words acquire a specific meaning. Furthermore, in a text the sentence structure is expected to be correlated with the surrounding sentences, as dictated by the intended information to be encoded, fluency, rhythm, harmony, intonation, and possibly many other factors and feedbacks, including the authors' preferences. Consequently, this may introduce even more complex correlations than those identified so far. In fact, already the Hurst-exponent-based study [42] of correlations among words in Shakespeare's plays and in Dickens' and Darwin's books suggests that the range of such correlations extends far beyond the span of sentences.
Indeed, shuffling sentences while preserving their internal structure appears to bring the Hurst exponents to values even closer to the noise level than their original values. This indicates that long-range correlations among words are induced by factors other than grammar, since the range of grammar is restricted essentially to a sentence. As an indication of potential factors generating correlations far beyond the range of single sentences, one should notice that the composition of sentences of varied length dictates the reading rhythm, which involves sound and perception. This opens up the possibility that the Weber-Fechner law [15] - stating that in perception it is the relative proportions that matter primarily, and not differences in absolute magnitudes - leaves its imprints also in the sentence arrangement, by making some variant of the multiplicative cascade a likely component of the mechanism that amplifies the associative turns and thus induces correlations of significantly longer range than the ones due to grammar. There are, however, also 'coarse graining' constraints on such a mechanism, as sentences usually cannot be expanded continuously but only by adding clauses, so that syntactical rules are obeyed. The multifractal formalism [25, 45] offers a particularly appropriate framework to gain insight into such effects and to quantify their relative significance and extent.
2. Materials and Methods
In order to study the long-range correlations among sentences, particularly those that refer to fractals and cascade effects, we select a corpus of 113 English, French, German, Italian, Polish, Russian, and Spanish literary texts of considerable size and for each individually form a series $l(j)$ from the lengths of the consecutive sentences $j$, expressed in terms of the number of words. A sentence is thus defined in purely orthographic terms, as a sequence of words starting with a capital letter and ending in a full stop. Equivalently, in a text, such a series can be considered a sequence of the recurrence times of the full stops. Based on this criterion an initial selection of sentences is performed automatically, but then further processing is executed, in some cases even manually, in order to identify such (not very frequent) instances where a full stop does not terminate a sentence, as for instance in Mr., in initials, or where question or exclamation marks appear in parentheses. Since the present study has a statistical character, an additional criterion we impose is that each text contains no fewer than 5000 sentences. For correlated series such as the ones studied here, such a lower bound on the number of sentences is dictated by the requirement to obtain reliable results even at the level of the multifractal analysis [18]. A complete list of the titles included in this corpus is given in the Appendix.

The simplest, second-order linear characteristics are measured in terms of the power spectra $S(f)$ of such series. Such spectra are calculated as the Fourier transform modulus squared

$$S(f) = \Big| \sum_{j=1}^{j_{\max}} l(j)\, e^{-2\pi i f j} \Big|^2 \qquad (1)$$

of the series $l(j)$ representing lengths of the consecutive sentences $j$. A complementary approach towards higher-order correlations consists in the wavelet decomposition of $l(j)$.
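As an illustration of Eq. (1), the following is a minimal Python sketch, assuming NumPy and not taken from the authors' code, of how the power spectrum of a sentence-length series and the exponent $\beta$ might be estimated. The function names `power_spectrum` and `fit_beta`, the removal of the mean, and the plain least-squares fit over all frequencies are assumptions made for illustration only; the normalisation and fitting range actually used for the corpus are not specified in the text.

```python
import numpy as np


def power_spectrum(lengths):
    """Return positive frequencies f and S(f) = |sum_j l(j) exp(-2*pi*i*f*j)|^2."""
    l = np.asarray(lengths, dtype=float)
    # Assumption: subtract the mean so that the f = 0 component does not dominate.
    l = l - l.mean()
    spectrum = np.abs(np.fft.rfft(l)) ** 2
    freqs = np.fft.rfftfreq(l.size)  # in cycles per sentence, from 0 to 0.5
    return freqs[1:], spectrum[1:]   # drop the f = 0 term


def fit_beta(freqs, spectrum):
    """Estimate beta in S(f) ~ 1/f^beta as minus the log-log slope."""
    slope, _intercept = np.polyfit(np.log(freqs), np.log(spectrum), 1)
    return -slope
```

Under these assumptions, a randomly shuffled (uncorrelated) series would be expected to give a fitted $\beta$ close to zero, while a persistent series would give a positive value.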
The corresponding 'mathematical microscope' wavelet coefficient maps $T_\psi(s,k)$ are obtained as

$$T_\psi(s,k) = \frac{1}{\sqrt{s}} \sum_{j=1}^{j_{\max}} l(j)\, \psi\!\left(\frac{j-k}{s}\right) \qquad (2)$$

where $k$ represents the wavelet position in a text and $s$ the wavelet resolution scale. The wavelet $\psi$ used in the present study is a Gaussian third derivative. It is orthogonal - hence insensitive - to quadratic trends in a signal and thus effectively leads to their removal [43, 44], as demanded by consistency with the other method described and used below.

The wavelet decomposition is optimal for visualization and, in principle, it is well suited to extract the multifractal characteristics [43]. However, the newer method, termed Multifractal Detrended Fluctuation Analysis (MFDFA) [30], is numerically more stable and often more accurate [44], though even here the convergence to a correct result is a delicate matter [18]. Accordingly, for a series $l(j)$ of sentence lengths one evaluates its signal profile $L(j) \equiv \sum_{k=1}^{j} [l(k) - \langle l \rangle]$, where $\langle \cdot \rangle$ denotes the series average and $j = 1, ..., j_{\max}$, with $j_{\max}$ standing for the number of sentences in a series. This profile is then divided into $2M_s$ disjoint segments $\nu$ of length $s$ starting from both end points of the series. Next, the detrended variance

$$F^2(\nu,s) = \frac{1}{s} \sum_{k=1}^{s} \left\{ L((\nu-1)s+k) - P^{(m)}_{\nu}(k) \right\}^2 \qquad (3)$$

is determined, where a polynomial $P^{(m)}_{\nu}$ of order $m$ serves detrending. Finally, a $q$-th order fluctuation function

$$F_q(s) = \left\{ \frac{1}{2M_s} \sum_{\nu=1}^{2M_s} \left[ F^2(\nu,s) \right]^{q/2} \right\}^{1/q} \qquad (4)$$

is calculated and its dependence on the scale $s$ inspected. Scale invariance in the form

$$F_q(s) \sim s^{h(q)} \qquad (5)$$

indicates the most general multifractal structure if the generalised Hurst exponent $h(q)$ is explicitly $q$-dependent, while it reduces to monofractal when $h(q)$ becomes $q$-independent. The well-known Hurst exponent is identical to $h(2)$.
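The MFDFA steps of Eqs. (3)-(5) can be sketched in Python as follows. This is a minimal illustration assuming NumPy, not the authors' implementation: the names `mfdfa` and `generalized_hurst` are hypothetical, the treatment of $q = 0$ by a logarithmic average is a common convention that the text does not discuss, and the choice of scales is left to the caller. The sketch also assumes that no segment is fitted exactly by the polynomial, since a zero variance would break the negative-$q$ averages.

```python
import numpy as np


def mfdfa(lengths, scales, q_values, order=2):
    """Return F_q(s) as an array of shape (len(q_values), len(scales))."""
    l = np.asarray(lengths, dtype=float)
    profile = np.cumsum(l - l.mean())           # signal profile L(j)
    n = profile.size
    fq = np.empty((len(q_values), len(scales)))
    for si, s in enumerate(scales):
        s = int(s)
        m_s = n // s                            # segments per direction
        # M_s segments counted from the start and M_s from the end: 2*M_s in total.
        starts = [v * s for v in range(m_s)]
        starts += [n - (v + 1) * s for v in range(m_s)]
        k = np.arange(s)
        variances = np.empty(len(starts))
        for i, a in enumerate(starts):
            segment = profile[a:a + s]
            trend = np.polyval(np.polyfit(k, segment, order), k)
            variances[i] = np.mean((segment - trend) ** 2)   # Eq. (3)
        for qi, q in enumerate(q_values):
            if q == 0:
                # Assumed convention for q = 0 (limit of Eq. (4)).
                fq[qi, si] = np.exp(0.5 * np.mean(np.log(variances)))
            else:
                fq[qi, si] = np.mean(variances ** (q / 2.0)) ** (1.0 / q)  # Eq. (4)
    return fq


def generalized_hurst(scales, fq):
    """Estimate h(q) as the slope of log F_q(s) versus log s, Eq. (5)."""
    log_s = np.log(np.asarray(scales, dtype=float))
    return np.array([np.polyfit(log_s, np.log(row), 1)[0] for row in fq])
```

For an uncorrelated Gaussian series one would expect $h(2)$ from such a sketch to come out near 1/2 and to depend only weakly on $q$; how close it gets depends on the series length and on the scale range chosen.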
$h(q)$ determines the Hölder exponents $\alpha = h(q) + q h'(q)$ and the singularity spectrum

$$f(\alpha) = q[\alpha - h(q)] + 1, \qquad (6)$$

the latter being the fractal dimension of the set of points with this particular $\alpha$. For a model multifractal series (like a binomial cascade), $f(\alpha)$ typically assumes a shape resembling an inverted parabola whose width $\Delta\alpha = \alpha_{\max} - \alpha_{\min}$ is considered a measure of the degree of multifractality and thus often also of complexity.

3. Results and discussion

A highly significant result is obtained already by evaluating the power spectra $S(f)$ according to Eq. (1) of the series representing the sentence length variability (SLV) of all the texts considered. As documented in Fig. 1, the overall trend of essentially all sample texts, and especially its average, shows a clear

$$S(f) = 1/f^{\beta} \qquad (7)$$

scaling with $\beta \approx 1/2$, in most cases over the entire range of about two orders of magnitude in frequencies $f$ spanned by the number of sentences of a typical text analyzed here. The statistical significance of this result is inspected by randomly shuffling sentences within texts. For the randomized texts so obtained, the calculated power spectra fluctuate trivially along a horizontal line ($\beta = 0$) on all scales.

For the individual original texts $\beta$ is seen to range between 1/4 and 3/4. This kind of scaling points to the existence of power-law long-range temporal correlations in SLV - thus to its fractal organization - and indicates that it balances randomness and orderliness, just as it does for music, speech [49], heart rate [31], cognition [23], spontaneous brain activity [35], and for other 'sounds of Nature' [9, 48]. From this perspective human writing appears to correlate with them. Even the range of the corresponding $\beta$-values from about 1/4 to 3/4 overlaps significantly (more on Mozart's than Beethoven's side) with those (1/2 to 1) found [37] for musical compositions. This perhaps provides a quantitative argument for our tendency to refer to writing as 'being composed' when we care about all its aspects, including aesthetics and rhythm to be experienced in reading.

The two extremes in the corpus, explicitly indicated in Fig. 1, are The Ambassadors (upper) and
Artamène ou le Grand Cyrus, the 17th century French novel sequence (lower), considered the longest novel ever published. The first of them appears to be most peculiar among all the texts, as at small frequencies it visibly departs from $1/f$ by bending down and displaying the two preferred […] $\beta = 0.$[…] $\pm\ 0.02$ in the region of higher frequencies where $1/f$ scaling applies. The latter novel, on the other hand, with $\beta = 0.$[…] $\pm\ 0.02$ appears closest to the white noise, whose $S(f)$ is flat. It is also appropriate to notice that at the largest frequencies, which correspond to the smallest scales, the power spectra of all the texts have some tendency to flatten. This may suggest that the long-range coherence in its $1/f$ organization is somewhat perturbed on shorter scales by coarsening due to grammatical constraints.

Figure 1: Power spectra $S(f)$ of the sentence length variability for 113 world-famous literary works. They are calculated from the series $l(j)$ representing lengths of the consecutive sentences $j$ expressed in terms of the number of words. $S(f)$ is seen to display $1/f^{\beta}$ scaling. The middle solid line (green) denotes the average over the properly normalised individual power spectra of all the corpus elements, and it is fitted well by $\beta = 1/2$. Boundaries of the dispersion in $\beta$ are indicated by taking the average over the 10 corpus elements with the largest $\beta$-values, which results in $\beta = 3/4$, and over the 10 elements with the smallest $\beta$-values, which results in $\beta = 1/4$. The two extremes in the corpus, explicitly indicated, are Henry James's The Ambassadors (upper) and Madeleine and/or Georges de Scudéry's Artamène ou le Grand Cyrus, the 17th century novel sequence (lower), considered the longest novel ever published. The straight-line fits to these two extremes are represented by the dash-dotted lines.

Our central result relates, however, to the nonlinear characteristics that may manifest themselves in heterogeneous, self-similarly convoluted structures, undetectable by $S(f)$. Such structures may demand using the whole spectrum of the scaling exponents and are then termed multifractals. That such structures may be present in SLV within the corpus analysed here can be inferred from Fig. 2, which shows four somewhat distinct categories of behaviour. A majority of the texts in our study resemble the case displayed in (I). SLV is here seen to be rather homogeneously 'erratic' and, consequently, the distribution of cascades seen through the wavelet decomposition is largely uniform. The three other cases, (II), (III) and (IV), commonly considered representatives of the stream of consciousness (SoC) literary style that seeks "to depict the multitudinous thoughts and feelings which pass through the mind" [17], are visibly inhomogeneous in this respect, as SLV displays clusters of intermittent bursts of much longer sentences. Such structures are characteristic of multifractals and thus an appropriate subject of the analysis within the above formalism.

The fluctuation functions $F_q(s)$ obtained according to Eq. (4) display (Fig. 2) convincing scaling, though with different degrees of $q$-dependence.
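The singularity spectra discussed in this section follow from $h(q)$ through the relations given with Eq. (6). A minimal Python sketch of that step, assuming NumPy and a numerical derivative of $h(q)$ on the sampled $q$ grid, might look as follows; the name `singularity_spectrum` is hypothetical and the finite-difference derivative is an assumption, not necessarily the procedure used for the figures.

```python
import numpy as np


def singularity_spectrum(q_values, h_values):
    """Return (alpha, f_alpha) from sampled h(q), following Eq. (6)."""
    q = np.asarray(q_values, dtype=float)
    h = np.asarray(h_values, dtype=float)
    dh_dq = np.gradient(h, q)            # numerical estimate of h'(q)
    alpha = h + q * dh_dq                # Hoelder exponents
    f_alpha = q * (alpha - h) + 1.0      # singularity spectrum
    return alpha, f_alpha
```

With a $q$-independent $h(q)$ the derivative vanishes, so the spectrum collapses to the single point $\alpha = h$, $f(\alpha) = 1$, which is the monofractal case; the width $\Delta\alpha$ can then be read off as `alpha.max() - alpha.min()`.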
This iscorroborated by the corresponding singularity spectra f( α ), which range fromvery narrow in Voyna i mir ( I ), indicating essentially monofractal structure,through significantly broader - thus already multifractal - but asymmetric, likethe strongly left sided Rayuela ( II ) or right sided The Waves ( III ), up to the ex-ceptionally broad and simultaneously almost symmetric case ( IV ) of Finnegans igure 2: Variety of multifractal sentence arrangements in literary texts: Four examplesillustrating different fractal/multifractal characteristics identified within the corpus of thecanonical literary texts: I , Voyna i mir (War and Peace) by Lev Tolstoy; II , Rayuela (Hop-scotch) by Julio Cort´azar;
III , The Waves by Virginia Woolf and IV , Finnegans Wake (FW) by James Joyce. The panels inside each contain correspondingly: (a)
The series l ( j ) of theconsecutive sentence lengths throughout the whole text. Insets illustrate the correspondingprobability distributions P ( l ) of l ( j ); (b) Wavelet coefficient maps ( T ψ ( s, k )) obtained for l ( j ).The wavelet ψ used is a Gaussian third derivative. The horizontal axis represents the sentenceposition in a text while the vertical axis - the wavelet resolution scale s . Colour codes denotemagnitude of the coefficient from the smallest (dark blue) to the largest (red); (c) q -th orderfluctuation functions calculated according to Eq.(4) using the detrending polynomial P ( m ) ν ofsecond order ( m =2), for q ∈ [ − ,
4] and s ⊂ [20 , j max /
5] (for
Rayuela a consistent scalingregime stops somewhat earlier at s ≈ (d) The resulting singularity spectra f( α ) for (i) the series l ( j ) representing original texts(black), (ii) for their Fourier-phase randomised counterparts (blue); here f( α ) is seen shrunkessentially to a point as is characteristic of a pure monofractal, and (iii) for their randomlyshuffled counterparts (gray). ake (FW) .The left side of f( α ) is determined by the positive q -values, which filterout larger events (here longer sentences), and its right side reflects behaviourfor smaller events as filtered out by the negative q -values. Hence, asymmetry inf( α ) signals non-uniformity of the underlying hypothesized cascade [19]. Rayuela is thus seen to be more multifractal in the composition of long sentences andalmost monofractal on the level of small ones. To some extent the oppositeapplies to
The Waves. In fact, these effects can be inferred already from the non-uniformities of the corresponding SLV wavelet decompositions (Fig. 2). In this respect FW appears impressively consistent, being one of the most intriguing literary 'compositions' ever, mastered imaginatively in the SoC technique, freely exploring the mental labyrinth of dreams and thus often breaking conventional rules of syntax and of linguistic rigour. However, from the perspective of our formal quantitative approach, its architecture looks - or perhaps just is, as a result of these factors - to be governed consistently by the same 'generators' on all scales of sentence length. An extra intellectual factor shaping FW is very likely also related to its top-down development - much like model mathematical cascades - as evidenced by its chronology of writing [12], graphically sketched in Fig. 3.

Figure 3: Chronological progress of James Joyce's engineering work on writing FW, which he described as boring a mountain from two sides [20]. This chart may also be taken as a visualisation of Joyce's dream about a Turk picking threads from heaps on his left and right sides, and weaving a fabric in the colours of the rainbow, which the writer interpreted as a symbolic picture of Books I and III of FW.

The significance of the above results for the singularity spectra $f(\alpha)$ of the series $l(j)$ representing the original texts has also been tested against two corresponding surrogates. One standard surrogate in this kind of analysis is obtained by generating the Fourier-phase randomised counterparts of $l(j)$. This destroys nonlinear correlations and makes the probability distribution of fluctuations Gaussian-like, but preserves the linear correlations and, as is clearly seen in Fig. 2, shrinks $f(\alpha)$ essentially to a point, as is characteristic of a pure monofractal. Another surrogate is obtained by randomly shuffling the original series $l(j)$. Consequently, any temporal correlations get destroyed but the probability distributions of fluctuations remain unchanged. The corresponding singularity spectra calculated according to the same MFDFA algorithm are also shown in Fig. 2 (gray). Consistently with the lack of any temporal correlations they all get shifted down to $\alpha \approx$ […] $f(\alpha)$ still remains to be observed. However, at least a large part of this remaining multifractality in this last case may be apparent, due to the relatively small size of the samples. For uncorrelated series the result of calculating the multifractal spectra is known [18] to end up either mono-fractal, for series whose fluctuation probability distributions are Lévy-unstable, or bi-fractal, for those whose distributions are Lévy-stable. Contrary to the correlated series, the convergence to the ultimate correct results in this case is very slow. We also wish to note at this point that, in spite of the Menzerath-Altmann law, all the relevant results shown here remain essentially unchanged if the sentence length is measured in terms of the number of characters instead of the number of words.

Figure 4: Special case of Ulysses. The same convention as in Fig. 2 is used here. The two additional insets in panels a and c display results for Ulysses after bisecting it into halves. Ulysses-I corresponds to the text from the beginning to the end of Chapter 10 and Ulysses-II to the remaining text without its last two disproportionately long sentences.

Another, even better known SoC novel by Joyce - Ulysses, which played a central role in Zipf's formulation of the scale-free word rank-frequency distribution law - also deserves extended attention here, though for a different reason. For this novel, as illustrated in Fig. 4, no unique multifractal scaling can be […] $f(\alpha)$ assigned. The SLV, inspected both in terms of the sentence length distribution and through its wavelet transform, indicates clearly that Ulysses splits into two parts such that each of them may independently have well-defined scaling properties. Indeed, bisecting it approximately into halves (between Chapters 10 and 11) allows us again to comprise Ulysses within the present formalism. The first part appears essentially monofractal, while the other is clearly multifractal, though asymmetrically left-sided, just as Rayuela. In fact, this result provides a quantitative argument in favour of the "doubleness" of
Ulysses [40].

The results, represented in terms of the width $\Delta\alpha = \alpha_{\max} - \alpha_{\min}$ of $f(\alpha)$, where $\alpha_{\max}$ and $\alpha_{\min}$ denote the beginning and the end of the $f(\alpha)$ support, and of the Hurst exponents $H = h(2)$ measuring the degree of persistence in SLV, are collected for the whole studied corpus in Fig. 5. The relation between the Hurst exponent $H$ and the scaling exponent $\beta$ in Eq. (7) reads [27]:

$$\beta = 2H - 1. \qquad (8)$$

As the upper-right inset to Fig. 5 visibly documents, this relation appears to be very satisfactorily fulfilled when comparing the results presented in Fig. 1 with the Hurst exponents in Fig. 5 for all the texts studied, which thus provides an independent test of the correctness of the results presented. All the explicit values of $H$ and $\beta$, together with their error bars $\sigma_H$ and $\sigma_\beta$ measured in terms of the mean standard deviations, are listed in the Appendix in parallel with the titles included in the corpus.

[Figure 5 plot: width $\Delta\alpha$ (degree of complexity) versus Hurst exponent $H$ (degree of persistency), with regions marked 'Multifractality' and 'Monofractality'. Labeled works: Finnegans Wake, U.S.A. trilogy, The Waves, Artamène ou le Grand Cyrus, Rayuela, A Heartbreaking Work of Staggering Genius, Bible (New Testament), The Goldfinch, Bible (Old Testament), The Ambassadors, Tristram Shandy, The Portrait of a Lady, Mort à crédit, Ulysses-I, Ulysses-II, W. Shakespeare, À la recherche du temps perdu.]

Figure 5: 'Scatter plot' which, for a collection of 113 most representative literary works indicated by their numbers on our list (see Appendix; the numbers in red indicate those works that are usually considered as belonging to the SoC narrative), displays the width $\Delta\alpha$ (schematically defined in the upper-left inset) and the Hurst exponent $H$. The shaded area marks the transition (uncertainty) region between fully developed multifractality and definite monofractality. We find it reasonable to assume that the shuffled series are mono-fractal (or at most bi-fractal) and that any trace of multifractality in this case is an artifact of the finiteness of a series. Therefore, the lower bound of the shaded area is determined as an average of the $\Delta\alpha$'s for all the shuffled series (texts). Owing to the thickest tails in the probability distribution $P(l)$ of $l(j)$ in FW (seen in the inset to panel IV of Fig. 2), which after shuffling the corresponding series may yield the strongest apparent multifractality signal, the upper bound of the uncertainty region is taken as $\Delta\alpha$ of the shuffled FW. The upper-right inset shows the location of the pair of $H$ and $\beta$ values for each book, represented by a point, relative to the straight line determined by Eq. (8).

Figure 6: Sentence length complementary cumulative distributions $F(\ell) = \Pr(l \geq \ell)$ for the whole corpus of 113 books considered, split into three groups according to their location in Fig. 5. Tails ($\ell >$ […] $F(\ell) = \exp(-\mu \ell^{b})$ and the corresponding best-fit $b$-values listed. The intermediate ($10 < \ell <$ […] $b = 1$) dependence. The inset shows the Zipfian rank-frequency distribution plot for Ulysses where the full stops are treated as another word.

The 'scatter plot' shown in Fig. 5 opens up room for many further interesting observations and hypotheses, or even definite conclusions of general interest. Some of them can be straightforwardly listed as follows: (i) Essentially all the studied texts that are seen in the multifractality region are commonly classified as SoC literature. The only exception found here, the Old Testament, has not been considered before in this context. (ii) $\Delta\alpha$ for all the texts that do not belong to SoC is located below the border of definite multifractality. Their complexity is thus poorer. (iii) Also, several texts, considered by some as SoC, appear to be located significantly below this border. An important example of this is À la recherche du temps perdu by Marcel Proust (no.
76 in the list given in the Appendix), which is clearly monofractal. (iv)
Artamène ou le Grand Cyrus is seen to have characteristics just opposite to FW. Here $\Delta\alpha$ equals nearly zero and $H$ gets shifted down towards 1/2, which complements its flat power spectrum seen in Fig. 1 and means that the corresponding SLV is of the white noise type.

Finally, examples of the sentence length distributions shown in Fig. 2 (Ia to IVa) indicate some differences among works, related more to style than to language. Those representing the SoC kind of narrative have a clear tendency to develop thicker tails. In fact, attempts to exploit the sentence length distribution in stylometry and in authorship attribution have a long history [51, 50, 47, 6]. In order to approach this issue quantitatively for the present corpus, the complementary cumulative distributions $F(\ell) = \Pr(l \geq \ell)$ for all the texts, collected separately from the three distinct regions identified through Fig. 5, are displayed in Fig. 6. The tails of these distributions, starting at around $l \approx$ […] $F(\ell) = \exp(-\mu \ell^{b})$ with the smallest $b$ parameter and thus the thickest tail for the SoC narrative, thus multifractal in SLV. This parameter remains, however, significantly smaller than unity even in the monofractal regime. This result closely resembles the stretched exponential distribution of recurrence times between words [4]. In a sense the sentence length can be interpreted as a recurrence time between the full stops, and they, in a text, play at least as important a role as words. Their large recurrence times appear to be governed by a distribution similar to the one for words. Interesting in this context, and also complementary, is the observation, documented in the upper-left inset of Fig.
6, that the full stops (dots, question and exclamation marks) treated as one more word in the rank-frequency dependence typically belong to the same Zipfian plot as words, as shown for the original Zipf's example of Ulysses.

These observations thus prompt the question of whether a series of recurrence times (measured in terms of the number of words) between the same words develops long-range correlations similar to the ones identified above for the full stops. For the lowest-rank - thus most frequent - words like English 'the', 'and', 'of' or 'to' (only for such words does one obtain sufficiently long series), multifractality turns out to disappear in all the texts studied. This is true even for FW, as shown by the $q$-th order fluctuation functions $F_q(s)$ calculated according to Eq. (4) for this originally extreme case of multifractality and displayed in Fig. 7. For the series of recurrence times of the words 'the', 'and' and 'of', the $F_q(s)$ functions become very weakly $q$-dependent, as is characteristic of monofractals. In fact, even this monofractal scaling is not fully convincing here in some places. The long-range nonlinear correlations responsible for multifractality thus appear to originate exclusively from the specific arrangement of the full stops in texts.

Figure 7: The $q$-th order fluctuation functions calculated using the same algorithm and convention as in Fig. 2 for the sentence length variability (SLV), which is equivalent to the recurrence times of the full stops, versus the series of recurrence times of three different words, 'the', 'of' and 'and', considered separately, in Finnegans Wake by James Joyce.

The situation with the long-range linear correlations is not that extreme, as some representative cases expressed in terms of the power spectra $S(f)$ shown in Fig. 8 illustrate. These power spectra for cases other than SLV still show a trace of scaling of the form $S(f) = 1/f^{\beta}$, though with significantly smaller values of the $\beta$-parameter (denoted by $\beta_s$ for SLV and by $\beta_w$ for the word recurrence times) and larger dispersion of fluctuations along the fit, which indicates a very weak character of the underlying long-range linear correlations. Very interestingly, however, for those texts that are multifractal on the level of SLV, the linear long-range correlations for the recurrence times between words do not depart so much from the correlations measured by SLV in the same texts and are significantly stronger than in the monofractal texts, as the examples in Fig. 8 show.

Figure 8: Power spectra $S(f)$ of the sentence length variability (scaling exponent $\beta_s$) and of the series of recurrence times for the three most frequent words in four texts representative of the present corpus. Straight-line fits are indicated by the dashed lines. The $S(f)$ for the recurrence times is fitted globally and the corresponding scaling exponent is denoted by $\beta_w$.

4. Summary

The present analysis, based on a large corpus which includes world-famous literary texts, uncovers the long-range correlations in their sentence arrangement. The linear component of these correlations universally reveals the scale-free '$1/f^{\beta}$' form, characteristic of many other 'sounds of Nature', and thus this observation may serve as an indicator of those factors that shape human language. The corresponding $\beta$-values typically range from about 1/4 to 3/4 and may thus serve also as a very useful and inspiring stylometry measure. As far as correlations in the sentence length variability are concerned, some texts - within the present corpus exclusively belonging to the stream of consciousness narrative - develop even more complex scale-free patterns of nonlinear character and heavy-tailed intermittent bursts in SLV, similar to the ones identified in other areas of human activity [10]. In quantitative terms this results in a whole spectrum of scaling exponents, as compactly captured by the multifractal spectrum $f(\alpha)$, whose width reflects the degree of nonlinearity involved. The greater complexity of such hypertext-like narrative finds an interesting parallel in a biological dynamical system, as documented [29] for the healthy human heartbeat, which develops broader multifractal spectra than in heart dysfunction. That the SoC kind of narrative should simultaneously activate a greater variety of brain areas seems quite natural. Whether this indicates a means to more efficient sharing of information also emerges as an intriguing perspective to study.
A further argument in favour of such a likely correspondence is that hypertext is paralleled by the underlying architecture of the World Wide Web, which indeed proves easy to use and flexible in its self-similar traffic [16] of sharing information over the Internet.

SLV is equivalent to the variability of the recurrence times, measured in terms of the number of words, between the full stops. Analogous variability of the recurrence times between words, which for statistical reasons can be studied only for the most frequent ones, also involves '$1/f^{\beta}$'-type long-range correlations, but they appear significantly weaker than for the full stops. What is more, the nonlinear correlations inducing multifractal characteristics seem to take place exclusively on the level of SLV. Thus, even though the full stops together with words appear to belong to the same Zipfian distribution, they form a frame for long-range correlations in narrative texts. This frame seems to obey stricter universal principles of organization, likely shaped also by the factors listed in the Introduction, and words have some more freedom in filling and complementing it, as can be inferred from their weaker mutual correlations. Such a scenario offers one possible visualization of the results obtained.

Finally, the results presented - like imprints of the Weber-Fechner law through SLV cascades - appear largely consistent with the working hypothesis formulated in the Introduction by listing factors that may induce long-range correlations during the process of narrative text formation. In order to further illuminate such issues, some empirical study, like quantifying the vocal and perception characteristics of texts with varying strength of SLV correlations, would be crucially helpful, and our results indicate the direction.

Acknowledgment
We thank Krzysztof Bartnicki (who translated FW into Polish) for constructive exchanges at the early stage of this Project.

Appendix

List of the considered literary works.

References