Quantifying origin and character of long-range correlations in narrative texts

Abstract

In natural language, using short sentences is considered efficient for communication. However, a text composed exclusively of such sentences looks technical and reads as boring. A text composed of long ones, on the other hand, demands significantly more effort for comprehension. Studying the characteristics of the sentence length variability (SLV) in a large corpus of world-famous literary texts shows that an appealing and aesthetic optimum appears somewhere in between and involves a self-similar, cascade-like alternation of sentences of various lengths. A related quantitative observation is that the power spectra S(f) of the SLV thus characterized universally develop a convincing '1/f^β' scaling with the average exponent β ≈ 1/2, close to what has been identified before in musical compositions or in brain waves. An overwhelming majority of the studied texts simply obey such fractal attributes, but especially spectacular in this respect are hypertext-like, "stream of consciousness" novels. In addition, they appear to develop structures characteristic of irreducibly interwoven sets of fractals called multifractals. Scaling of S(f) in the present context implies the existence of long-range correlations in texts, and the appearance of multifractality indicates that they carry even a nonlinear component. A distinct role of the full stops in inducing the long-range correlations in texts is evidenced by the fact that the above quantitative characteristics of the long-range correlations manifest themselves in the variation of the full stop recurrence times along texts, thus in SLV, but to a much lesser degree in the recurrence times of the most frequent words. In this latter case the nonlinear correlations, and thus multifractality, disappear completely for all the texts considered. At the same time, however, the full stops treated as one extra word appear to obey the Zipfian rank-frequency distribution.

arXiv preprint [cs.CL]

Stanisław Drożdż (a,d,∗), Paweł Oświęcimka (a), Andrzej Kulig (a), Jarosław Kwapień (a), Katarzyna Bazarnik (b), Iwona Grabska-Gradzińska (c), Jan Rybicki (b), Marek Stanuszek (d)

(a) Complex Systems Theory Department, Institute of Nuclear Physics, Polish Academy of Sciences, ul. Radzikowskiego 152, 31-342 Kraków, Poland
(b) Institute of English Studies, Faculty of Philology, Jagiellonian University, ul. prof. S. Łojasiewicza 4, 30-348 Kraków, Poland
(c) Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, ul. Łojasiewicza 11, 30-348 Kraków, Poland
(d) Faculty of Physics, Mathematics and Computer Science, Cracow University of Technology, ul. Warszawska 24, 31-155 Kraków, Poland

∗ Corresponding author. Email address: [email protected] (Stanisław Drożdż)

Preprint submitted to Journal of Information Sciences, October 15, 2015

Keywords: natural language, consciousness, correlations, multifractals, hypertext

1. Introduction

Mirroring cultural progress [13, 1], natural languages - the most imaginative carriers of information, and the principal clue to the mind and to consciousness [24] - have developed during their evolution remarkable quantifiable patterns of behaviour, such as a hierarchical structure in their syntactic organization [32, 39], a corresponding lack of characteristic scale [41, 38] as evidenced by the celebrated Zipf law [52], small world properties [22, 14, 33], long-range correlations in the use of words [42, 7, 8, 3] or a stretched exponential distribution [36] of word recurrence times [4]. A majority of such patterns are common to a large class of natural systems known as complex systems [46, 34]. Without doubt, language constitutes a system of great complexity, as for language it is especially true [28] that "more is different" [5] and language has the capacity to generate an infinite range of expressions from a finite set of elements [26, 21]. This suggests inspecting correlations also among linguistic constructs longer than mere words. The most natural of them are sentences - strings of words structured according to syntactical principles [2, 11]. Typically it is within a sentence that words acquire a specific meaning. Furthermore, in a text the sentence structure is expected to be correlated with the surrounding sentences as dictated by the intended information to be encoded, fluency, rhythm, harmony, intonation and possibly many other factors and feedbacks, including the authors' preferences. Consequently, this may introduce even more complex correlations than those identified so far.

In fact, already the Hurst exponent based study [42] of correlations among words in Shakespeare's plays and in Dickens' and Darwin's books suggests that the range of such correlations extends far beyond the span of sentences. Indeed, shuffling sentences while preserving their internal structure appears to bring the Hurst exponents to values even closer to the noise level as compared to their original values. This indicates that long-range correlations among words are induced by factors other than grammar, as the range of grammar is restricted essentially to a sentence. As an indication of potential factors generating correlations far beyond the range of single sentences, one should notice that the composition of sentences of varied length dictates the reading rhythm, which involves sound and perception. This, therefore, opens up a possibility that the Weber-Fechner law [15] - stating that in perception it is the relative proportions that matter primarily, and not differences in absolute magnitudes - leaves its imprints also in the sentence arrangement by making some variant of the multiplicative cascade a likely component of the mechanism that amplifies the associative turns and thus induces correlations of significantly longer range than the ones due to grammar. There are, however, also 'coarse graining' constraints to such a mechanism, as sentences usually cannot be expanded continuously but only by adding clauses, so that syntactical rules are obeyed. The multifractal formalism [25, 45] offers a particularly appropriate framework to get insight into such effects and to quantify their relative significance and extent.

2. Materials and Methods

In order to study the long-range correlations among sentences, particularly those that refer to fractals and cascade effects, we select a corpus of 113 English, French, German, Italian, Polish, Russian, and Spanish literary texts of considerable size and for each individually form a series l(j) from the lengths of the consecutive sentences j expressed in terms of the number of words. Thus, a sentence is defined in purely orthographic terms, as a sequence of words starting with a capital letter and ending in a full stop. Equivalently, in a text, such a series can be considered a sequence of the recurrence times of the full stops. Based on this criterion an initial selection of sentences is performed automatically, but then further processing is executed, in some cases even manually, in order to identify such (not very frequent) instances where a full stop does not terminate a sentence, for instance in Mr., in initials, or with question or exclamation marks in parentheses. Since the present study has a statistical character, an additional criterion we impose specifies that each text contains no fewer than 5000 sentences. For correlated series, such as the ones to be studied here, such a lower bound on the number of sentences is dictated by the requirement to obtain reliable results even on the level of the multifractal analysis [18]. A complete list of the titles included in this corpus is given in the Appendix.

The simplest, second-order linear characteristics are measured in terms of the power spectra S(f) of such series. Such spectra are calculated as the Fourier transform modulus squared

$$ S(f) = \Big| \sum_{j=1}^{j_{\max}} l(j)\, e^{-2\pi i f j} \Big|^{2} \tag{1} $$

of the series l(j) representing the lengths of the consecutive sentences j. A complementary approach towards higher order correlations consists in the wavelet decomposition of l(j).
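The construction of l(j) and the spectrum of Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`sentence_lengths`, `power_spectrum`, `fit_beta`) are hypothetical, the sentence splitter is deliberately naive (it treats every '.', '!' or '?' as a full stop and skips the manual corrections for cases such as "Mr." described above), and the mean removal and logarithmic binning before the straight-line fit are assumptions made here for a stable estimate of β.

```python
import re
import numpy as np

def sentence_lengths(text):
    """Naive l(j): number of words between consecutive full stops.
    Assumption: '.', '!' and '?' all end a sentence; abbreviations are not handled."""
    pieces = re.split(r"[.!?]+", text)
    return [len(p.split()) for p in pieces if p.split()]

def power_spectrum(lengths):
    """Return positive frequencies f and S(f) = |sum_j l(j) exp(-2 pi i f j)|^2."""
    x = np.asarray(lengths, dtype=float)
    x = x - x.mean()                      # assumption: drop the mean (f = 0 term)
    spec = np.abs(np.fft.rfft(x)) ** 2    # squared modulus of the Fourier transform
    freqs = np.fft.rfftfreq(x.size, d=1.0)  # in cycles per sentence
    return freqs[1:], spec[1:]            # skip f = 0

def fit_beta(freqs, spec, n_bins=20):
    """Estimate beta in S(f) ~ 1/f^beta from a log-log straight-line fit
    to S(f) averaged in logarithmically spaced frequency bins."""
    edges = np.logspace(np.log10(freqs[0]), np.log10(freqs[-1]), n_bins + 1)
    idx = np.digitize(freqs, edges[1:-1])  # bin index 0 .. n_bins - 1
    f_mean, s_mean = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                     # low-frequency bins may be empty
            f_mean.append(freqs[mask].mean())
            s_mean.append(spec[mask].mean())
    slope, _ = np.polyfit(np.log10(f_mean), np.log10(s_mean), 1)
    return -slope
```

For a randomly shuffled series one would expect `fit_beta` to return a value near zero, while a series with 1/f^β correlations should give a clearly positive β.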
The corresponding 'mathematical microscope' wavelet coefficient maps T_ψ(s, k) are obtained as

$$ T_{\psi}(s,k) = \frac{1}{\sqrt{s}} \sum_{j=1}^{j_{\max}} l(j)\, \psi\!\left(\frac{j-k}{s}\right) \tag{2} $$

where k represents the wavelet position in a text and s the wavelet resolution scale. The wavelet ψ used in the present study is a Gaussian third derivative. It is orthogonal - hence insensitive - to quadratic trends in a signal and thus effectively leads to their removal [43, 44], as demanded by consistency with the other method described and used below.

The wavelet decomposition is optimal for visualization and, in principle, it is well suited to extract the multifractal characteristics [43]. However, the newer method, termed Multifractal Detrended Fluctuation Analysis (MFDFA) [30], is numerically more stable and often more accurate [44], though even here the convergence to a correct result is a delicate matter [18]. Accordingly, for a series l(j) of sentence lengths one evaluates its signal profile L(j) ≡ Σ_{k=1}^{j} [l(k) − <l>], where <·> denotes the series average and j = 1, ..., j_max, with j_max standing for the number of sentences in a series. This profile is then divided into 2M_s disjoint segments ν of length s, counted from both end points of the series. Next, the detrended variance

$$ F^{2}(\nu,s) = \frac{1}{s} \sum_{k=1}^{s} \left\{ L((\nu-1)s+k) - P^{(m)}_{\nu}(k) \right\}^{2} \tag{3} $$

is determined, where a polynomial P_ν^(m) of order m serves detrending. Finally, a q-th order fluctuation function

$$ F_{q}(s) = \left\{ \frac{1}{2M_{s}} \sum_{\nu=1}^{2M_{s}} \left[ F^{2}(\nu,s) \right]^{q/2} \right\}^{1/q} \tag{4} $$

is calculated and its dependence on the scale s inspected. Scale invariance in the form

$$ F_{q}(s) \sim s^{h(q)} \tag{5} $$

indicates the most general multifractal structure if the generalised Hurst exponent h(q) is explicitly q-dependent, while it is reduced to monofractal when h(q) becomes q-independent. The well-known Hurst exponent is identical to h(2).
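The MFDFA steps of Eqs. (3)-(5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `mfdfa` and `generalized_hurst` are hypothetical, M_s is taken as the integer part of j_max/s, and the q = 0 case (where Eq. (4) is undefined) is handled with the commonly used logarithmic average, which the text above does not specify.

```python
import numpy as np

def mfdfa(lengths, scales, q_values, order=2):
    """Return F_q(s) as an array of shape (len(q_values), len(scales))."""
    x = np.asarray(lengths, dtype=float)
    profile = np.cumsum(x - x.mean())          # signal profile L(j)
    n = profile.size
    fq = np.empty((len(q_values), len(scales)))
    for si, s in enumerate(scales):
        m_s = n // s                           # assumption: M_s = floor(j_max / s)
        # 2*M_s segments: M_s counted from the start and M_s from the end
        starts = [v * s for v in range(m_s)]
        starts += [n - (v + 1) * s for v in range(m_s)]
        k = np.arange(s)
        f2 = np.empty(len(starts))
        for i, a in enumerate(starts):
            seg = profile[a:a + s]
            trend = np.polyval(np.polyfit(k, seg, order), k)  # polynomial P of order m
            f2[i] = np.mean((seg - trend) ** 2)               # detrended variance, Eq. (3)
        for qi, q in enumerate(q_values):
            if abs(q) < 1e-12:
                # assumption: logarithmic averaging as the q -> 0 limit of Eq. (4)
                fq[qi, si] = np.exp(0.5 * np.mean(np.log(f2)))
            else:
                fq[qi, si] = np.mean(f2 ** (q / 2.0)) ** (1.0 / q)  # Eq. (4)
    return fq

def generalized_hurst(fq, scales):
    """Slope of log F_q(s) versus log s for each q, i.e. h(q) in Eq. (5)."""
    log_s = np.log(scales)
    return np.array([np.polyfit(log_s, np.log(row), 1)[0] for row in fq])
```

For an uncorrelated Gaussian series the estimated h(2) should come out close to 1/2; a clear spread of h(q) over q is what signals multifractality.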
h(q) determines the Hölder exponents α = h(q) + q h′(q) and the singularity spectrum

$$ f(\alpha) = q\,[\alpha - h(q)] + 1, \tag{6} $$

the latter being the fractal dimension of the set of points with this particular α. For a model multifractal series (like a binomial cascade), f(α) typically assumes a shape resembling an inverted parabola whose width Δα = α_max − α_min is considered a measure of the degree of multifractality and thus often also of complexity.

3. Results and discussion

A highly significant result is obtained already by evaluating the power spectra S(f) according to Eq. (1) of the series representing the sentence length variability (SLV) of all the texts considered. As documented in Fig. 1, the overall trend of essentially all sample texts, and especially its average, shows a clear

$$ S(f) = 1/f^{\beta} \tag{7} $$

scaling with β ≈ 1/2, in most cases over the entire range of about two orders of magnitude in frequencies f spanned by the number of sentences of a typical text analyzed here. The statistical significance of this result is inspected by randomly shuffling sentences within texts. For the randomized texts so obtained, the calculated power spectra appear to fluctuate trivially along a horizontal line (β = 0) on all the scales.

For the individual original texts β is seen to range between 1/4 and 3/4. This kind of scaling points to the existence of power-law long-range temporal correlations in SLV - thus to its fractal organization - and indicates that it balances randomness and orderliness, just as it does for music, speech [49], heart rate [31], cognition [23], spontaneous brain activity [35], and for other 'sounds of Nature' [9, 48]. From this perspective human writing appears to correlate with them. Even the range of the corresponding β-values, from about 1/4 to 3/4, overlaps significantly (more on Mozart's than on Beethoven's side) with those (1/2 to 1) found [37] for musical compositions. This perhaps provides a quantitative argument for our tendency to refer to writing as 'being composed' when we care about all its aspects, including aesthetics and rhythm to be experienced in reading.

The two extremes in the corpus, explicitly indicated in Fig. 1, are The Ambassadors (upper) and Artamène ou le Grand Cyrus, the 17th century French novel sequence (lower), considered the longest novel ever published. The first of them appears to be most peculiar among all the texts as at the small frequencies it visibly departs from 1/f by bending down and displaying the two preferred […] β = 0.[…] ± […]02 in the region of higher frequencies where 1/f scaling applies. The latter novel, on the other hand, with β = 0.[…] ± […]02 appears closest to white noise, whose S(f) is flat. It is also appropriate to notice that at the largest frequencies, which correspond to the smallest scales, the power spectra of all the texts have some tendency to flatten. This may suggest that the long-range coherence in its 1/f organization is somewhat perturbed on shorter scales by coarsening due to grammatical constraints.

Figure 1: Power spectra S(f) of the sentence length variability for 113 world-famous literary works. They are calculated from the series l(j) representing the lengths of the consecutive sentences j expressed in terms of the number of words. S(f) is seen to display 1/f^β scaling. The middle solid line (green) denotes the average over the individual, properly normalised power spectra of all the corpus elements and is fitted well by β = 1/2. The boundaries of the dispersion in β are indicated by averaging over the 10 corpus elements with the largest β-values, which results in β = 3/4, and over the 10 elements with the smallest β-values, which results in β = 1/4. The two extremes in the corpus, explicitly indicated, are Henry James's The Ambassadors (upper) and Madeleine and/or Georges de Scudéry's Artamène ou le Grand Cyrus, the 17th century novel sequence (lower), considered the longest novel ever published. The straight-line fits to these two extremes are represented by the dash-dotted lines.

Our central result relates, however, to the nonlinear characteristics that may manifest themselves in heterogeneous, self-similarly convoluted structures, undetectable by S(f). Such structures may demand using the whole spectrum of scaling exponents and are then termed multifractals. That such structures in SLV may be present within the corpus analysed here can be inferred from Fig. 2, which shows four somewhat distinct categories of behaviour. A majority of the texts in our study resemble the case displayed in (I). SLV is here seen to be rather homogeneously 'erratic' and, consequently, the distribution of cascades seen through the wavelet decomposition is largely uniform. The three other cases, (II), (III) and (IV), commonly considered representatives of the stream of consciousness (SoC) literary style that seeks "to depict the multitudinous thoughts and feelings which pass through the mind" [17], are visibly inhomogeneous in this respect, as SLV displays clusters of intermittent bursts of much longer sentences. Such structures are characteristic of multifractals and thus an appropriate subject of the analysis within the above formalism.

The fluctuation functions F_q(s) obtained according to Eq. (4) display (Fig. 2) a convincing scaling, however with a different degree of q-dependence.
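How a q-dependent h(q) translates into a singularity spectrum via Eq. (6) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the derivative h′(q) is approximated numerically with `numpy.gradient`, which is an assumption of this sketch rather than a procedure stated in the text.

```python
import numpy as np

def singularity_spectrum(q_values, h_values):
    """Return (alpha, f_alpha) from h(q) using
    alpha = h(q) + q h'(q) and f(alpha) = q [alpha - h(q)] + 1."""
    q = np.asarray(q_values, dtype=float)
    h = np.asarray(h_values, dtype=float)
    dh = np.gradient(h, q)            # numerical estimate of h'(q)
    alpha = h + q * dh                # Hoelder exponents
    f_alpha = q * (alpha - h) + 1.0   # singularity spectrum, Eq. (6)
    return alpha, f_alpha

def spectrum_width(alpha):
    """Width Delta alpha = alpha_max - alpha_min of the f(alpha) support."""
    return float(np.max(alpha) - np.min(alpha))
```

A q-independent h(q) collapses the spectrum to the single point α = h, f(α) = 1 (monofractal), whereas any q-dependence of h(q) opens up a finite width Δα.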
This is corroborated by the corresponding singularity spectra f(α), which range from very narrow in Voyna i mir (I), indicating an essentially monofractal structure, through significantly broader - thus already multifractal - but asymmetric, like the strongly left-sided Rayuela (II) or right-sided The Waves (III), up to the exceptionally broad and simultaneously almost symmetric case (IV) of Finnegans Wake (FW).

Figure 2: Variety of multifractal sentence arrangements in literary texts. Four examples illustrating different fractal/multifractal characteristics identified within the corpus of the canonical literary texts: I, Voyna i mir (War and Peace) by Lev Tolstoy; II, Rayuela (Hopscotch) by Julio Cortázar; III, The Waves by Virginia Woolf and IV, Finnegans Wake (FW) by James Joyce. The panels inside each contain correspondingly: (a) The series l(j) of the consecutive sentence lengths throughout the whole text. Insets illustrate the corresponding probability distributions P(l) of l(j); (b) Wavelet coefficient maps T_ψ(s, k) obtained for l(j). The wavelet ψ used is a Gaussian third derivative. The horizontal axis represents the sentence position in a text while the vertical axis represents the wavelet resolution scale s. Colour codes denote the magnitude of the coefficient from the smallest (dark blue) to the largest (red); (c) q-th order fluctuation functions calculated according to Eq. (4) using the detrending polynomial P_ν^(m) of second order (m = 2), for q ∈ [−[…], 4] and s ⊂ [20, j_max/5] (for Rayuela a consistent scaling regime stops somewhat earlier at s ≈ […]); (d) The resulting singularity spectra f(α) for (i) the series l(j) representing original texts (black), (ii) their Fourier-phase randomised counterparts (blue), where f(α) is seen shrunk essentially to a point as is characteristic of a pure monofractal, and (iii) their randomly shuffled counterparts (gray).

The left side of f(α) is determined by the positive q-values, which emphasise larger events (here longer sentences), and its right side reflects the behaviour of smaller events, as emphasised by the negative q-values. Hence, asymmetry in f(α) signals non-uniformity of the underlying hypothesized cascade [19]. Rayuela is thus seen to be more multifractal in the composition of long sentences and almost monofractal on the level of small ones. To some extent the opposite applies to
The Waves. In fact, these effects can be inferred already from the non-uniformities of the corresponding SLV wavelet decompositions (Fig. 2). In this respect FW appears impressively consistent, being one of the most intriguing literary 'compositions' ever, mastered imaginatively in the SoC technique, freely exploring the mental labyrinth of dreams and thus often breaking conventional rules of syntax and of linguistic rigour. However, from the perspective of our formal quantitative approach, its architecture appears - or perhaps, as a result of these factors, simply is - governed consistently by the same 'generators' on all scales of sentence length. An extra intellectual factor shaping FW is very likely related also to its top-bottom development - much like model mathematical cascades - as evidenced by its chronology of writing [12], graphically sketched in Fig. 3.

Figure 3: Chronological progress of James Joyce's engineering work on writing FW, which he described as boring a mountain from two sides [20]. This chart may also be taken as a visualisation of Joyce's dream about a Turk picking threads from heaps on his left and right sides, and weaving a fabric in the colours of the rainbow, which the writer interpreted as a symbolic picture of Books I and III of FW.

The significance of the above results for the singularity spectra f(α) of the series l(j) representing the original texts has also been tested against the two corresponding surrogates. One standard surrogate in this kind of analysis is obtained by generating the Fourier-phase randomised counterparts of l(j). This destroys nonlinear correlations and makes the probability distribution of fluctuations Gaussian-like, but preserves the linear correlations and, as is clearly seen in Fig. 2, shrinks f(α) essentially to a point, as is characteristic of a pure monofractal. Another surrogate is obtained by randomly shuffling the original series l(j). Consequently, any temporal correlations get destroyed but the probability distributions of fluctuations remain unchanged. The corresponding singularity spectra calculated according to the same MFDFA algorithm are also shown in Fig. 2 (gray). Consistently with the lack of any temporal correlations they all get shifted down to α ≈ […] f(α) still remains to be observed. However, at least a large part of this remaining multifractality in this last case may be only apparent, due to the relatively small size of the samples. For uncorrelated series the result of calculating the multifractal spectra is known [18] to end up either in a mono-fractal, for the series whose fluctuation probability distributions are Lévy-unstable, or in a bi-fractal, for those whose distributions are Lévy-stable. Contrary to the correlated series, the convergence to the ultimate correct results in this case is very slow. We also wish to note at this point that, in spite of the Menzerath-Altmann law, all the relevant results shown here remain essentially unchanged if the sentence length is measured in terms of the number of characters instead of the number of words.

Figure 4: Special case of Ulysses. The same convention as in Fig. 2 is used here. The two additional insets in the panels a and c display results for Ulysses after bisecting it into halves. Ulysses-I corresponds to the text from the beginning to the end of Chapter 10 and Ulysses-II to the remaining text without its last two disproportionately long sentences.

Another, even better known SoC novel by Joyce, Ulysses, which played a central role in the formulation of the scale-free word rank-frequency distribution law by Zipf, also deserves extended attention here, though for a different reason. For this novel, as illustrated in Fig. 4, no unique multifractal scaling can be […] f(α) assigned. The SLV inspected both in terms of the sentence length distribution and through its wavelet transform indicates clearly that Ulysses splits into two parts such that each of them may independently have well defined scaling properties. Indeed, bisecting it approximately into halves (between Chapters 10 and 11) allows us again to comprise
Ulysses within the present formalism. The first part appears essentially monofractal, while the other is clearly multifractal, though asymmetrically left-sided, just as Rayuela. In fact, this result provides a quantitative argument in favour of the "doubleness" of Ulysses [40].

The results, represented in terms of the width Δα = α_max − α_min of f(α), where α_min and α_max denote the beginning and the end of the f(α) support, and of the Hurst exponents H = h(2) measuring the degree of persistence in SLV, are collected for the whole studied corpus in Fig. 5. The relation between the Hurst exponent H and the scaling exponent β in Eq. (7) reads [27]:

$$ \beta = 2H - 1. \tag{8} $$

As the upper-right inset to Fig. 5 visibly documents, this relation appears to be very satisfactorily fulfilled when comparing the results presented in Fig. 1 with the Hurst exponents in Fig. 5 for all the texts studied, which thus provides an independent test of the correctness of the results presented. All the explicit values of H and β, together with their error bars σ_H and σ_β measured in terms of the mean standard deviations, are listed in the Appendix in parallel with the titles included in the corpus.

Figure 5: 'Scatter plot' which, for a collection of 113 most representative literary works indicated by their numbers on our list (see Appendix; the numbers in red indicate those works that are usually considered as belonging to the SoC narrative), displays the width Δα (schematically defined in the upper-left inset) and the Hurst exponent H. The shaded area marks the transition (uncertainty) region between fully developed multifractality and definite monofractality. We find it reasonable to assume that the shuffled series are mono-fractal (or at most bi-fractal) and that any trace of multifractality in this case is an artifact of the finiteness of a series. Therefore, the lower bound of the shaded area is determined as an average of the Δα's for all the series (texts) shuffled. Due to the thickest tails in the probability distributions P(l) of l(j) in FW (seen in the inset to panel IV of Fig. 2), which after shuffling the corresponding series may yield the strongest apparent multifractality signal, the upper bound of the uncertainty region is taken as Δα of the shuffled FW. The upper-right inset shows the location of a pair of the H and β values for each book, represented by a point, relative to the straight line determined by Eq. (8). [Plot labels: axes H (degree of persistency) and Δα (degree of complexity); regions marked Multifractality and Monofractality; labelled works: Finnegans Wake, U.S.A. trilogy, The Waves, Artamène ou le Grand Cyrus, Rayuela, A Heartbreaking Work of Staggering Genius, Bible (New Testament), The Goldfinch, Bible (Old Testament), The Ambassadors, Tristram Shandy, The Portrait of a Lady, Mort à crédit, Ulysses-I, Ulysses-II, W. Shakespeare, À la recherche du temps perdu.]

Figure 6: Sentence length complementary cumulative distributions F(ℓ) = Pr(l ≥ ℓ) for the whole corpus of 113 books considered but split into three groups according to their location in Fig. 5. Tails (ℓ > […]) […] F(ℓ) = exp(−μℓ^b) and the corresponding best fit b-values listed. The intermediate (10 < ℓ < […]) […] (b = 1) dependence. Inset shows the Zipfian rank-frequency distribution plot for Ulysses where the full stops are treated as another word.

The 'scatter plot' shown in Fig. 5 opens up room for many further interesting observations and hypotheses, or even definite conclusions of general interest. Some of them can be straightforwardly listed as follows: (i) Essentially all the studied texts that are seen in the multifractality region are commonly classified as SoC literature. The only exception found here, the Old Testament, has not been considered before in this context. (ii) Δα for all the texts that do not belong to SoC is located below the border of definite multifractality. Their complexity is thus lower. (iii) Also, several texts, by some considered as SoC, appear to be located significantly below this border. An important example of this is À la recherche du temps perdu by Marcel Proust (no.
76 in the list given in the Appendix), which is clearly monofractal. (iv) Artamène ou le Grand Cyrus is seen to have characteristics just opposite to FW. Here Δα equals nearly zero and H gets shifted down towards 1/2, which complements its flat power spectrum seen in Fig. 1 and means that the corresponding SLV is of the white noise type.

Finally, examples of the sentence length distributions shown in Fig. 2 (Ia to IVa) indicate some differences among works, related more to the style than to the language. Those representing the SoC kind of narrative have a clear tendency to develop thicker tails. In fact, attempts to exploit the sentence length distribution in stylometry and in authorship attribution have a long history [51, 50, 47, 6]. In order to approach this issue quantitatively for the present corpus, the complementary cumulative distributions F(ℓ) = Pr(l ≥ ℓ) for all the texts, collected separately from the three distinct regions identified through Fig. 5, are displayed in Fig. 6. The tails of these distributions, starting at around l ≈ […] F(ℓ) = exp(−μℓ^b) with the smallest b parameter, and thus the thickest tail, for the SoC narrative, which is multifractal in SLV. This parameter remains, however, significantly smaller than unity even in the monofractal regime. This result closely resembles the stretched exponential distribution of recurrence times between words [4]. In a sense the sentence length can be interpreted as a recurrence time between the full stops, which in a text play at least as important a role as words. Their large recurrence times appear to be governed by a distribution similar to the one for words. Interesting in this context, and also complementary, is the observation documented in the upper-left inset of Fig. 6 that the full stops (dots, question and exclamation marks), treated as one more word in the rank-frequency dependence, typically belong to the same Zipfian plot as words, as shown for the original Zipf's example of Ulysses.

These observations thus prompt the question whether a series of recurrence times (measured in terms of the number of words) between the same words develops long-range correlations similar to the ones identified above for the full stops. For the lowest-rank - thus most frequent - words like English the, and, of or to (only for such words does one obtain sufficiently long series) multifractality turns out to disappear in all the texts studied. This is true even for FW, as shown by the q-th order fluctuation functions F_q(s) calculated according to Eq. (4) for this originally extreme case of multifractality and displayed in Fig. 7. For the series of recurrence times of the words the, and and of, the F_q(s) functions become very weakly q-dependent, as is characteristic of monofractals. In fact, even this monofractal scaling is not fully convincing in some places. The long-range nonlinear correlations responsible for multifractality thus appear to originate exclusively from the specific arrangement of the full stops in texts.

Figure 7: The q-th order fluctuation functions calculated using the same algorithm and convention as in Fig. 2 for the sentence length variability (SLV), which is equivalent to the recurrence times of the full stops, versus the series of recurrence times of three different words, the, of and and, considered separately, in Finnegans Wake by James Joyce. [Panels: Full stop, the, of, and; axes: scale s and F_q(s).]

The situation with the long-range linear correlations is not that extreme, as some representative cases expressed in terms of the power spectra S(f) shown in Fig. 8 illustrate. These power spectra for cases other than SLV still show a trace of scaling of the form S(f) = 1/f^β, however with significantly smaller values of the β-parameter (denoted by β_s for SLV and by β_w for the word recurrence times) and a larger dispersion of fluctuations along the fit, which indicates a very weak character of the underlying long-range linear correlations. Very interestingly, however, for those texts that are multifractal on the level of SLV, the linear long-range correlations for the recurrence times between words do not depart so much from the correlations measured by SLV in the same texts and are significantly stronger than in the monofractal texts, as the examples in Fig. 8 show.

Summary

The present analysis, based on a large corpus which includes world-famous literary texts, uncovers the long-range correlations in their sentence arrangement. The linear component of these correlations universally reveals the scale-free '1/f^β' form, characteristic of many other 'sounds of Nature', and thus this observation may serve as an indicator of those factors that shape human language. The corresponding β-values typically range from about 1/4 to 3/4 and may thus serve also as a very useful and inspiring stylometry measure.

Figure 8: Power spectra S(f) of the sentence length variability (scaling exponent β_s) and of the series of recurrence times for the three most frequent words in four texts representative of the present corpus. Straight-line fits are indicated by the dashed lines. The S(f) for the recurrence times is fitted globally and the corresponding scaling exponent is denoted by β_w.

As far as correlations in the sentence length variability are concerned, some texts, within the present corpus exclusively belonging to the stream of consciousness narrative, develop even more complex scale-free patterns of nonlinear character and heavy-tailed intermittent bursts in SLV, similar to the ones identified in other areas of human activity [10]. In quantitative terms this results in a whole spectrum of scaling exponents, as compactly grasped by the multifractal spectrum f(α), whose width reflects the degree of nonlinearity involved. A greater complexity of such hypertext-like narrative finds an interesting parallel in biological dynamical systems, as documented [29] for the healthy human heartbeat, which develops broader multifractal spectra as compared to heart dysfunction. That the SoC kind of narrative should simultaneously activate a greater variety of brain areas seems quite natural. Whether this indicates a means to more efficient sharing of information also emerges as an intriguing perspective to study.
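As a rough illustration of the quantities summarized here, the sketch below extracts recurrence times from a tokenized text and estimates the exponent β of S(f) ∝ 1/f^β by a straight-line fit in log-log coordinates. It is a minimal sketch and not the procedure used to produce the figures: the function names `recurrence_times` and `spectral_exponent`, the tokenization (full stops as separate tokens that are not themselves counted as words), and the plain unbinned least-squares fit over all nonzero frequencies are assumptions made for illustration only.

```python
import numpy as np

# Assumption: dots, question marks and exclamation marks arrive as separate tokens.
FULL_STOPS = {".", "?", "!"}


def recurrence_times(tokens, targets):
    """Distances, counted in words, between consecutive occurrences of `targets`.

    With targets=FULL_STOPS this gives sentence lengths (the first sentence,
    which has no preceding full stop, is not included).
    """
    positions, n_words = [], 0
    for tok in tokens:
        if tok in targets:
            positions.append(n_words)  # number of words seen before this occurrence
        if tok not in FULL_STOPS:
            n_words += 1  # assumption: full stops are not counted as words here
    return np.diff(positions)


def spectral_exponent(series):
    """Estimate beta in S(f) ~ 1/f**beta from the periodogram of `series`."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size)
    keep = (freqs > 0) & (power > 0)  # drop f = 0 and empty bins before taking logs
    slope, _intercept = np.polyfit(np.log10(freqs[keep]), np.log10(power[keep]), 1)
    return -slope


if __name__ == "__main__":
    tokens = "the cat sat . the dog ran off . it was the end !".split()
    print(recurrence_times(tokens, FULL_STOPS))  # sentence lengths after the first
    print(recurrence_times(tokens, {"the"}))     # word distances between "the"
```

For a real text one would compare `spectral_exponent(recurrence_times(tokens, FULL_STOPS))` with the same quantity for a frequent word. An uncorrelated series gives β close to 0, so values well above zero point to long-range linear correlations; a serious estimate would need much longer series and a more careful choice of the fitting range.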
A further argument in favour of such a likely correspondence is that hypertext is paralleled by the underlying architecture of the World Wide Web, which indeed proves easy to use and flexible in its self-similar traffic [16] of sharing information over the Internet.

SLV is equivalent to the variability of the recurrence times, measured in terms of the number of words, between the full stops. Analogous variability of the recurrence times between words, which for statistical reasons can be studied only for the most frequent ones, also involves '1/f^β'-type long-range correlations, but they appear significantly weaker than for the full stops. What is more, the nonlinear correlations inducing multifractal characteristics seem to take place exclusively on the level of SLV. Thus, even though the full stops together with words appear to belong to the same Zipfian distribution, they form a frame for long-range correlations in narrative texts. This frame seems to obey stricter universal principles of organization, likely shaped also by the factors listed in the Introduction, and words have some more freedom in filling and complementing it, as can be inferred from their weaker mutual correlations. Such a scenario offers one possible visualization of the results obtained.

Finally, the results presented, like imprints of the Weber-Fechner law through SLV cascades, appear largely consistent with the working hypothesis formulated in the Introduction by listing factors that may induce long-range correlations during the process of narrative text formation. In order to further illuminate such issues, some empirical study, like quantifying the vocal and perception characteristics of texts with varying strength of SLV correlations, would be crucially helpful, and our results indicate the direction.

Acknowledgment

We thank Krzysztof Bartnicki (who translated FW into Polish) for constructive exchanges at the early stage of this Project.

Appendix

List of the considered literary works.

References