Differential Entropy Rate Characterisations of Long Range Dependent Processes
11 Differential Entropy Rate Characterisations of LongRange Dependent Processes
Andrew Feutrill, and Matthew Roughan,
Fellow, IEEE,
Abstract —A quantity of interest to characterise continuous-valued stochastic processes is the differential entropy rate. Therate of convergence of many properties of LRD processes isslower than might be expected, based on the intuition forconventional processes, e.g. Markov processes . Is this also trueof the entropy rate?In this paper we consider the properties of the differentialentropy rate of stochastic processes that have an autocorrelationfunction that decays as a power law. We show that power lawdecaying processes with similar autocorrelation and spectral den-sity functions, Fractional Gaussian Noise and ARFIMA(0,d,0),have different entropic properties, particularly for negativelycorrelated parameterisations. Then we provide an equivalencebetween the mutual information between past and future andthe differential excess entropy for stationary Gaussian processes,showing the finiteness of this quantity is the boundary betweenlong and short range dependence. Finally, we analyse the con-vergence of the conditional entropy to the differential entropyrate and show that for short range dependence that the rate ofconvergence is of the order O ( n − ) , but it is slower for longrange dependent processes and depends on the Hurst parameter. Index Terms —Stochastic Processes, Long Range Dependence,Differential Entropy Rate
I. I
NTRODUCTION T HE entropy rate of discrete time stochastic processes hasbeen studied as a measure of the average uncertainty.Most investigations of this type have focussed on processeswhose correlations decay quickly, and hence the dependenceon past observations disappears rapidly. However, many realprocesses from a variety of contexts, i.e., data networks [25],[37], [38], climate [34], hydrology [1], [24], economics [7],[39], have been shown to exhibit long range dependence,meaning correlations exist between past and future observa-tions that cannot be ignored at any time lag.Information and coding theory have had profoundly im-portant uses in signal processing and communication. Noiseprocesses are an important part of this story. However, inmost works, for instance on designing optimal codes onnoisy channels, the noise processes are presumed to be shortrange dependent. However, as far back as 1965, Mandelbrot
This research has been supported by the Australian Research Council Centreof Excellence for Mathematical and Statistical Frontiers (ACEMS), DefenceScience and Technology Group (DST) and Data61A. Feutrill is with CSIRO’s Data 61, the School of Mathematical Sciences,University of Adelaide, Adelaide, Australia and the Australian ResearchCouncil’s Centre of Excellence for Mathematical and Statistical Frontiers(ACEMS), e-mail: [email protected]. Roughan is with the School of Mathematical Sciences, University ofAdelaide, Adelaide, Australia and the Australian Research Council’s Centreof Excellence for Mathematical and Statistical Frontiers (ACEMS). showed [28] that some noise processes are also long-rangedependent (though the terminology was still developing).Recent work has investigated an information theoretic char-acterisation of long range and short range processes [5],[10], [26], using the finiteness of mutual information betweenpast and future. We aim to clarify this characterisation andinvestigate its implications.This paper calculates the differential entropy rate for thetwo most common stationary Gaussian Long Range Depen-dent (LRD) processes: Fractional Gaussian Noise (FGN) andthe Auto-Regressive Fractionally-Integrated Moving Average(ARFIMA) process. We start by deriving the entropy rate forthese processes, and show that they both have negative poles asthe processes tend towards strong long-range correlations, butthat their behaviour when anti-correlated is surprisingly differ-ent: FGN has a pole similar to that for positive correlations, butARFIMA does not. This contradicts common intuition basedon their similar spectral densities that these two processes areclose to equivalent models in the case of ARFIMA(0,d,0).We also investigate the links between the two informationmeasures: excess entropy and the mutual information betweenpast and future processes, and compare these to the differentialentropy rate. We show that the differential entropy rate defini-tion for excess entropy is equivalent to the mutual informationbetween past and future for continuous valued discrete timeGaussian processes, and hence that excess entropy is infinitefor all long range dependent Gaussian processes.Finally, estimators, such as the sample mean, applied toLRD processes have been shown to have slow convergencerates, which can lead to a larger than expected uncertaintywhen investigating these processes. We ask, “Does this be-haviour apply to estimators of entropy?” and “What is theimpact of the degree of positive or negative correlations onthe entropy rate?”
We show that while the convergence rateof the conditional entropy of short range dependent Gaussianprocesses is in the order of n − , the rate of convergence forLRD Gaussian processes is slower. Although this parallelsmany of the other results for LRD processes, the actual rateis different at O (cid:16) (1 − H ) log( n ) /n (cid:17) , where H is the Hurstparameter. As H → or we can see that this convergenceis at its worst. a r X i v : . [ c s . I T ] F e b II. B
ACKGROUND
A. Long range dependence
LRD refers to a process where correlations decay slowerover time such that the future is non-trivially dependent onthe past no matter how far forward we proceed. A samplepath of an LRD fractional Gaussian noise process, with H = 0 . , is shown in Figure 1. As is typical for LRDprocesses, there is the appearance of long periods of upwardsand downwards“trends”, even though the process is stationary.LRD can be defined in two equivalent ways, via the au-tocorrelation or spectral density. These definitions are relatedthrough the Fourier Transform [4, pg. 117], and are equivalentvia the Kolmogorov Isomorphism Theorem [2]. The followingstatement defines the concept of long range dependence interms of its autocorrelation function. For reference, the auto-covariance function, γ ( k ) , and autocorrelation function, ρ ( k ) ,are related by ρ ( k ) = γ ( k ) /γ (0) = γ ( k ) /σ , where σ is thevariance of the stochastic process. Definition II.1.
Let { X n } n ∈ N be a stationary process. If thereexists α ∈ (0 , , and c γ > , such that the auto-covariance γ ( k ) satisfies lim k →∞ γ ( k ) c γ k − α = 1 , then we say that the process is long range dependent. The equivalent definition in the frequency domain considersthe limit of the spectral density near the origin.
Definition II.2.
Let { X n } n ∈ N be a stationary process. If thereexists β ∈ (0 , , and c f > , such that the spectral density f ( λ ) satisfies lim λ → f ( λ ) c f | λ | − β = 1 , then we say that the process is long range dependent. The concept of LRD is strongly linked to statistical self-similarity, which is often characterised by the Hurst parameter, H . It is perhaps unfortunate that H is used as standard notationboth for the Hurst parameter, and for Shannon entropy, but weshall side-step that issue here as we are mainly concerned withentropy rates, which we will designate with lower-case h .The parameter was developed by Hurst when he was mea-suring flows in the Nile River [18], and it is related to theself-similarity structure of a process. There is a relationshipbetween H and the α in Definition II.1, namely H = 1 − α/ . LRD processes have α ∈ (0 , and thence have H ∈ (0 . , .However, when describing self-similar processes the parametertakes on values between 0 and 1, with H > / representinga positively correlated process and H < / representing anegatively correlated process and H = / being a short rangecorrelated process, such as white Gaussian noise.Another property of note, that has been used as the def-inition of LRD processes, is the (un)summability of theautocorrelation function. For a LRD processes the sum of the X n Fig. 1: Sample path of Fractional Gaussian Noise with Hurstparameter, H = 0 . . We can see that the high correlationslead to the appearance of longer trends than we would expectfor an independent and identically distributed process.autocovariances diverges, i.e., (cid:80) ∞ k =1 γ ( k ) → ∞ , whereas forSRD processes this is finite [1]. Then we can interpret LRD ashaving such strong correlations that the autocovariance valuesdecay such that the distant past still influences the future. Theintuition for conventional processes is that the autocovariancefunction decay exponentially, or quicker, and in many casesthis means we can ignore correlations beyond some short lag.The negatively correlated processes, H < / , have notreceived as much consideration at the SRD and LRD casesbut we include them in the analysis here. These are similar toshort range dependent processes, such as white Gaussian noise,as they still have a summable autocorrelation function [1],[13]. In fact their structure enforces that (cid:80) ∞ k =1 γ ( k ) = 0 .This is quite a strict and surprising property, and hence theseprocesses are often called constrained short range dependent(CSRD) [13].We will be working with Gaussian processes in this paper. Definition II.3.
A stochastic process is a Gaussian process ifand only if each finite collection of random variables from theprocess has a multivariate Gaussian distribution.
This definition applies to both discrete and continuous timeprocesses, though here we are principally interested in theformer. Gaussian processes are completely characterised bytheir second order statistics (the mean and autocovariancefunction) [3], which makes them the primary type of stochasticmodel used in this context.
B. Entropy rate
As we are considering continuous random variables in thispaper, we will be considering differential entropy, which is acontinuous extension of Shannon entropy for discrete randomvariables. In this paper, we will be using the natural logarithmin all of the definitions, and hence the units of entropy that wewill be working with are nats. We include standard definitionsin order that all notation be precisely defined.
Definition II.4.
The differential entropy, h ( X ) , of a continu-ous random variable with probability density function, f ( x ) ,is defined as, h ( X ) = − (cid:90) Ω f ( x ) log f ( x ) dx, where Ω is the support of the random variable. Differential entropy has some important properties whichare different from Shannon entropy. For example, differentialentropy can be negative, or even diverge to −∞ , which we cansee by considering the Dirac delta function, δ ( x ) . The delta, i.e., the unit impulse, is defined by the properties δ ( x ) = 0 for x (cid:54) = 0 and (cid:82) ∞−∞ δ ( x ) dx = 1 . The Dirac delta can be thoughtof in terms of probability as a completely determined pointin time, that is, a function possessing no uncertainty. It canbe constructed as the limit of rectangular pulses of constantarea 1 as their width decreases, and hence we can calculatethe entropy of the delta as h ( X ) = − (cid:90) a − a a log (cid:18) a (cid:19) dx, = log(2 a ) , which tends to −∞ as a → .The intuition for h ( X ) = −∞ from Cover and Thomas [8,pg. 248] is that the number of bits (note we are working innats) on average required to fix a random variable, X to n-bitaccuracy is h ( X ) + n . Meaning h ( X ) = −∞ , can be readas requiring n − ∞ bits, or that we can describe the randomvariable arbitrarily accurately without any using any bits.We will see the same type of asymptotic behaviour for LRDprocesses as H → . Effectively, in the limit the correlations inthe process straight-jacket it, such that the future is completelydetermined by the past, and so the incremental uncertainty inthe process is the same as that of the delta.The differential entropy can be extended into the multi-variate case and hence to stochastic processes using the jointentropy for a collection of random variables. Definition II.5.
The joint differential entropy of a col-lection of random variables X , X , ..., X n , with density f ( x , x , ..., x n ) = f ( x ) is defined as, h ( X , X , ..., X n ) = − (cid:90) Ω f ( x ) log f ( x ) d x , where Ω = Ω × Ω × ... × Ω n is the support of the randomvariables. Similarly, the conditional differential entropy can be definedfor a random variable, given knowledge of other variables.
Definition II.6.
The conditional differential entropy of arandom variable, X n , given a collection of random variables X , X , ..., X n − , with a joint density f ( x , x , ..., x n ) = f ( x ) , is defined as, h ( X n | X n − , ..., X ) = − (cid:90) Ω f ( x ) log f ( x n | x n − , ..., x ) d x , where Ω = Ω × Ω × ... × Ω n is the support of the randomvariables. Finally, we define the concept of differential entropy rate,which can be thought of as the average amount of newinformation from each sample of a random variable in adiscrete time process.
Definition II.7.
Where the limit exists, the differential entropyrate of a stochastic process χ = { X i } i ∈ N is defined to be, h ( χ ) = lim n →∞ h ( X , ..., X n ) n . An example of a process which is non-stationary but hasa differential entropy rate is the Gaussian walk, S n . This isdefined as the process of sums of i.i.d. normally distributedrandom variables, i.e., S n = (cid:80) ni =1 X i , where X i ∼ N (0 , σ ) .The process has mean 0 for all n , however it is non-stationaryas the variance depends on n as,Var ( S n ) = Var (cid:32) n (cid:88) i =1 X i (cid:33) = nσ . However, the entropy rate converges and is equal to lim n →∞ h ( S , ..., S n ) n = lim n →∞ (cid:80) ni =1 h ( S i | S i − , . . . , S ) n , = lim n →∞ nh ( X i ) n , = h ( X i ) , as each random variable X i is independent.An alternative characterisation of the differential entropyrate is given by the following theorem for stationary processes.This was developed for the Shannon entropy of discreteprocesses, however this has been extended to differentialentropy [8, pg. 416]. Theorem II.1.
For a stationary stochastic process, χ = { X i } i ∈ N , the differential entropy rate is equal to, h ( χ ) = lim n →∞ h ( X n | X n − , ..., X ) . The second equivalent definition is useful because it will allowus to analyse the convergence rates of conditional entropy tothe differential entropy rate, which is important in estimationof entropy rates.III. E
NTROPY RATE FUNCTION FOR F RACTIONAL G AUSSIAN N OISE
We want to understand the effect of memory on the entropicproperties of a stochastic process. We start with the entropyrate characterisation for Gaussian processes originally derivedby Kolmogorov (see Ihara [21, pg. 76]) h ( χ ) = 12 log(2 πe ) + 14 π (cid:90) π − π log(2 πf ( λ )) dλ, (1)where f ( λ ) is the spectral density, i.e., the Fourier transformof the autocovariance function for a mean zero process.We’ll begin by investigating the spectral density of Frac-tional Gaussian Noise (FGN), which is given by [1, pg. 53] f ( λ ) = 2 c f (1 − cos λ ) ∞ (cid:88) j = −∞ | πj + λ | − H − , (2)where, c f = σ π sin( πH )Γ(2 H +1) , H is the Hurst parameter,and σ is the variance of the process. A. Comparison of approximate and analytical spectral densityfor entropy rate calculation
Substituting the spectral density of FGN (2) into the secondterm in the entropy rate expression (1) we get (cid:90) π − π log (cid:0) πf ( λ ) (cid:1) dλ = 2 π log(4 πc f ) + (cid:90) π − π log(1 − cos λ ) dλ + (cid:90) π − π log ∞ (cid:88) j = −∞ | πj + λ | − H − dλ, = 2 π log(4 πc f ) − π log 2+ (cid:90) π − π log ∞ (cid:88) j = −∞ | πj + λ | − H − dλ. The last term is finite for all H ∈ (0 , , since the singularitythat exists when λ, j = 0 in the absolute value is integrable.This is important as we can then see that this does not affectthe asymptotic behaviour of FGN processes. The resultingentropy rate is h ( χ ) = 12 log(2 πe ) + 12 log (cid:0) σ sin( πH )Γ(2 H + 1) (cid:1) + 14 π (cid:90) π − π log ∞ (cid:88) j = −∞ | πj + λ | − H − dλ. We calculate h ( χ ) using numerical integration via Python’sScipy library [36]. We plot the differential entropy rate ofFractional Gaussian Noise as a function of the Hurst parame-ter, H, in Figure 2. The plot shows the impact of the varianceon entropy rate calculation, and hence that the entropy rateof Fractional Gaussian Noise has a large dependence on thevariance, i.e. the second order properties. Each unit increasein variance has a smaller effect on the value of the differentialentropy, due to the log( σ ) term.The spectral density expression is quite cumbersome towork with and an approximation is often used, which isaccurate at low frequencies [1, pg. 53]. It is derived froma Taylor series expansion of the spectral density and is givenby, f ( λ ) ≈ c f | λ | − H . To calculate the entropy rate we substitute this approximationinto the integral in the entropy rate expression (1) to get (cid:90) π − π log(2 πf ( λ )) dλ = (cid:90) π − π log (2 πc f ) dλ + (cid:90) π − π log (cid:0) | λ | − H (cid:1) dλ, = 2 π log (2 πc f ) + 2 (1 − H ) (cid:90) π log ( λ ) dλ, = 2 π log (2 πc f ) + 2 (1 − H ) ( π log π − π ) . Note that there is a singularity at the origin of the spectraldensity of LRD processes. However, the integral is still well E n t r o p y R a t e ( n a t s ) = 1 = 2 = 3 = 4 Fig. 2: Entropy rate of Fractional Gaussian Noise as a functionof the Hurst Parameter. The maximum is at H = 0 . , wherethe process is white Gaussian noise. As H → or , thefunction tends towards −∞ , as the strength of the negative orpositive correlations increase. The impact of changing variancedecreases as the variance increase, due to the log( σ ) term. E n t r o p y R a t e ( n a t s ) Actual Entropy RateApproximate Entropy Rate
Fig. 3: Comparison of the numerically integrated spectral den-sity and the spectral density approximation. The approximationis relatively good for H ≥ / but an underestimate for H ≥ / .defined and finite in this case. Therefore the entropy rateapproximation is, ˜ h ( χ ) = 12 log(2 πe ) + 12 log 2 πc f + (1 − H )(log π − , which differs from the exact formulation only in the last term.Figure 3 shows the entropy rate and its approximation. We cansee that the entropy rate approximation is very good for thepositively correlated cases H ≥ . and at the limits around H = 0 or . However for moderately, negatively-correlatedprocesses the approximation is a noticeable underestimate ofthe entropy rate. B. Properties of Entropy rate for Fractional Gaussian Noise
Figure 3 shows some interesting properties • The entropy rate function is not symmetric. Negativelycorrelated processes seem to have higher uncertainty thesame distance from H = 0 . . • The entropy rate asymptotically tends to −∞ as H → • The maximum entropy rate occurs at 0.5. Indicating thatthe maximum entropy occurs for white Gaussian noise.We explain how these properties emerge below.
1) Asymptotic behaviour:
Theorem III.1.
The approximate differential entropy rate ofFractional Gaussian Noise, h ( χ ) → −∞ as H → or .Proof. When H → or , the term c f → , as the gammafunction terms are finite, however the trigonometric terms tendto 0 as H tends to an integer value. Hence, asymptotically theapproximate entropy rate expression is dominated by log c f →−∞ , as c f → . Remark.
Note that the approximation works well in the limits H → or 1, and so the theorem describes the aysmptoticbehaviour of entropy rate well. Moreover, the theorem linesup with the intuition for an LRD process. As we move closerto either perfectly positively or negatively correlated, theprocess becomes “less uncertain” i.e. , we have less entropyon average. Then the uncertainty disappears, by viewing theentire past we can accurately infer the current value. It’simportant to reiterate that the differential entropy can be −∞ ,which can be interpreted as least uncertainty for a process.2) Maximum: We want to understand the maximum ofdifferential entropy rate, as function of the Hurst parameter.This will provide an understanding of which parameter choicesrepresent the highest uncertainty. We differentiate the entropyrate, with respect to H and then solve for H when thederivative equals zero. Here we need to apply this to the exactformula because the approximation distorts the location of themaximum. Therefore, dropping constant terms, we get dhdH = 12 ddH log (cid:0) σ sin( πH )Γ(2 H + 1) (cid:1) + 14 π ddH (cid:90) π − π log ∞ (cid:88) j = −∞ | πj + λ | − H − dλ = 12 ddH log (cid:0) sin( πH ) (cid:1) + 12 ddH log (cid:0) Γ(2 H + 1) (cid:1) − π (cid:90) π − π (cid:80) j log( | πj + λ | ) | πj + λ | − H − (cid:80) j | πj + λ | − H − dλ = π πH ) + ψ (2 H + 1) − π (cid:90) π − π (cid:80) j log( | πj + λ | ) | πj + λ | − H − (cid:80) j | πj + λ | − H − dλ. where ψ ( z ) = Γ (cid:48) ( z ) / Γ( z ) is the digamma function.Then we set this expression to zero, and solve for H . Thisis a transcendental equation with no closed form. We solve itnumerically using Python’s SciPy package [36], which yields H ≈ . . Thus the maximum is at H = 0 . , which alignswith the idea that a SRD process has more uncertainty thanany equivalent LRD process.Note that from the solution of the spectral density approx-imation is H ≈ . . So although using the spectral densityapproximation is acceptable for many purposes, it can leadto false conclusions about the properties of the differentialentropy rate.IV. E NTROPY RATE FUNCTION FOR
ARFIMA( P ,d,q)We consider the differential entropy rate function ofa related process to Fractional Gaussian Noise, which isARFIMA(p,d,q), the fractional extension of the ARIMA (Au-toregressive Integrated Moving Average) processes, by ex-tending to non-integer differencing parameters, d [14], [17].FGN and ARFIMA(0,d,0) are commonly used stationary LRDprocesses for modelling real phenomena, and in particularFGN and ARFIMA(0,d,0) have very similar properties in thetime and frequency domains. Additionally, these processeshave been linked by the limit operation of their behaviourof their autocorrelation functions, ρ ( k ) := γ ( k ) /σ , underaggregation and rescaling [13]. From this perspective FGN andARFIMA processes have the same fixed point as the limit ofthese operations tend to infinity. However, ARFIMA processesdo differ from FGN in that you could change the fixed point, i.e. , alter the eventual limit under aggregation and rescaling,with the addition of additive noise [35], i.e. , this class was lessrobust to the addition of noise. Hence, there may be somedifferences in behaviour when looking through an entropiclens.Before we define an ARFIMA(p,d,q) process, we definetwo polynomials that are required for the ARFIMA defini-tion. These are polynomials of the lag operator, L , where LX n = X n − , defined by φ ( x ) = 1 − p (cid:88) j =1 φ j x j , for coefficients φ j and p ∈ Z + ,ψ ( x ) = 1 + q (cid:88) j =1 ψ j x j , for coefficients ψ j and q ∈ Z + . Now we can define an ARFIMA(p,d,q) process,
Definition IV.1.
For a stationary stochastic process { X i } i ∈ Z ,such that φ ( L )(1 − L ) d X i = ψ ( L ) (cid:15) i , for some − / < d < / and (cid:15) i , a zero mean normallydistributed random variable with variance σ (cid:15) . { X i } i ∈ Z iscalled a ARFIMA ( p, d, q ) process. An ARFIMA(0,d,0) process is a special case of anARFIMA(p,d,q), where φ ( x ) = ψ ( x ) = 1 , that is that there isno lag on the noise, (cid:15) , and the all the auto-regressive lags onthe previous values come from the differencing operator.The spectral density of an ARFIMA(p,d,q) is given by [1], f ( λ ) = σ (cid:15) | ψ ( e iλ ) | π | φ ( e iλ ) | | − e iλ | − d . The following theorem from Hosking [17] and Beran [1, pg.64] gives infinite autoregressive and moving average represen-tations for ARFIMA(0,d,0) processes.
Theorem IV.1.
Let X n be a fractional ARIMA(0,d,0) processwith - < d < . Then(i) the following infinite autoregressive representation holds: ∞ (cid:88) k =0 π k X n − k = (cid:15) n , where (cid:15) n ( n = 1 , , ... ) are independent identically distributedrandom variables and π k = Γ( k − d )Γ( k + 1)Γ( − d ) . For k → ∞ we have, π k ∼ − d ) k − d − . (ii) The following infinite moving average representationholds: X n = ∞ (cid:88) k =0 a k (cid:15) n − k where (cid:15) n ( n = 1 , , ... ) are independent identically distributedrandom variables and a k = Γ( k + d )Γ( k + 1)Γ( d ) . For k → ∞ we have a k ∼ d ) k d − . We will express an entropy rate characterisation for ARMAprocesses in terms of its innovation process variance, fromIhara [21, pg. 78], and show that this can be extended toARFIMA(0,d,0) and ARFIMA(p,d,q) processes. Then we willuse the characterisation to characterise the entropy rate of anARFIMA(0,d,0) process in terms of its process variance.
Theorem IV.2 (From Ihara [21, pg. 78]) . The entropy rate ofan ARMA(p,q) process is given by, h ( χ ) = log(2 πeσ (cid:15) ) . Now, we state our extension to ARFIMA(0,d,0).
Theorem IV.3.
The entropy rate of an ARFIMA(0,d,0) processis given by, h ( χ ) = log(2 πeσ (cid:15) ) .Proof. First we calculate (cid:82) π − π log(2 πf ( λ )) dλ . (cid:90) π − π log(2 πf ( λ )) dλ = (cid:90) π − π log( σ (cid:15) | − e iλ | − H ) dλ, = (cid:90) π − π log( σ (cid:15) ) dλ + (1 − H ) (cid:90) π − π log | − e iλ | dλ. Now we transform the elements in the last term using theirtrigonometric representation, | − e iλ | = | − cos( λ ) − i sin( λ ) | , = (cid:113) (1 − cos( λ )) + sin ( λ ) , = (cid:112) − λ ) , = (cid:115) (cid:18) λ (cid:19) , = 2 sin (cid:18) λ (cid:19) . This makes the integral of the log spectral density, (cid:90) π − π log | − e iλ | dλ = 2 (cid:90) π log (cid:18) (cid:18) λ (cid:19)(cid:19) dλ, = 2 (cid:90) π log(2) dλ + 2 (cid:90) π log (cid:18) sin (cid:18) λ (cid:19)(cid:19) dλ, We substitute y = λ/ , (cid:90) π − π log | − e iλ | dλ = 2 π log(2) + 2 (cid:90) λ log(sin y )2 dy, = 2 π log(2) + 4 (cid:16) − π (cid:17) , = 0 . Where the equality (cid:82) λ log(sin y ) dy = − π log(2) is givenby [23].So the last term of the spectral density vanishes, and (cid:90) π − π log(2 πf ( λ )) dλ = (cid:90) π − π log( σ (cid:15) ) dλ, + (1 − H ) (cid:90) π − π log | − e iλ | dλ, = 2 π log( σ (cid:15) ) . Using Kolmogorov’s entropy rate expression, the entropy rateis therefore, h ( χ ) = 12 log(2 πe ) + 14 π (2 π log( σ (cid:15) )) , = 12 log(2 πeσ (cid:15) ) . Remark.
This can be shown also using the infinite autore-gressive expression above, X n = (cid:15) n − (cid:80) ∞ k =1 π k X n − k , andsubstituting into the conditional entropy rate for stationaryprocesses, h ( χ ) = lim n →∞ h ( X n | X n − , ..., X ) . Then we canremove the conditioning from the entropy rate calculation, h ( χ ) = lim n →∞ h ( (cid:15) n − (cid:80) ∞ k =1 π k X n − k | X n − , ..., X ) =lim n →∞ h ( (cid:15) n ) . Which then implies that h ( χ ) = log(2 πeσ (cid:15) ) , i.e. , the entropy rate of the process depends only on the entropyintroduced at each step by the innovations. Therefore, weconclude that the entropy rate of an ARFIMA(0,d,0) processdepends on the innovation variance and not H . We can generalise to ARFIMA(p,d,q) process by adding anadditional condition, the invertibility of the moving average polynomial. This is an extremely common condition applied inthe theory of autoregressive-moving average, i.e. , ARMA(p,q)processes. The condition implies that all roots of the movingaverage polynomial lie outside of the unit circle, and similarlythe stationarity condition of the process ensures that all roots ofthe autoregressive polynomial lie outside of the unit circle [3].We will use these conditions on the ARFIMA processes toanalyse their properties.
Theorem IV.4.
The entropy rate of a stationaryARFIMA(p,d,q) process with invertible moving averagepolynomial is given by, h ( χ ) = log(2 πeσ (cid:15) ) .Proof. Since ARFIMA(p,d,q) processes are stationary and in-vertible, this implies that the polynomials φ ( x ) and ψ ( x ) , haveroots outside of the unit circle, i.e. each root z ∈ C is such that | z | > . By the Fundamental Theorem of Algebra, both the au-toregressive and moving average polynomials can be factoredinto linear factors. As the constant terms are 1, this impliesthe polynomials can be factored as φ ( x ) = (cid:81) pi =1 (1 − a i e iλ ) and ψ ( x ) = (cid:81) qi = i (1 − b i e iλ ) , where | a i | , | b i | < , ∀ i . Now,recall that the spectral density is given by, f ( λ ) = σ (cid:15) π | − e iλ | − d | ψ ( e iλ ) | | φ ( e iλ ) | . Hence, log((2 πf ( λ )))= log( σ (cid:15) ) − d log | − e iλ | +2 log (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) q (cid:89) j =1 (1 − a j e iλ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (cid:89) j =1 (1 − b j e iλ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , = log( σ (cid:15) ) − d log | − e iλ | + q (cid:88) j =1 | − a j e iλ | − p (cid:88) j =1 | − b j e iλ | . Now we calculate the integral of the log spectral density, (cid:90) π − π log(2 πf ( λ )) dλ = (cid:90) π − π log( σ (cid:15) ) dλ − (cid:90) π − π d log | − e iλ | dλ + q (cid:88) j =1 (cid:90) π − π log | − a j e iλ | dλ − p (cid:88) j =1 (cid:90) π − π log | − b j e iλ | dλ, = 2 π log( σ (cid:15) ) . Where the third equality is given as all the integrals of log | − ae iλ | over [ − π, π ] vanish for | a | ≤ [31].We substitute this expression into Kolmogorov’s entropyrate expression for Gaussian processes. h ( χ ) = 12 log(2 πe ) + 14 π (2 π log( σ (cid:15) )) , = 12 log(2 πeσ (cid:15) ) . This result leads to the following corollary, which canfinalise the discussion of the differential entropy rate in termsof innovation variance for the classes of AR, MA, ARMAprocesses. This is relevant as the definition in terms of theinnovation variance is the perspective that is commonly used inthe time series literature, when modelling real world processes.
Corollary IV.4.1.
The differential entropy rate of station-ary AR(p), invertible MA(q) and, stationary and invertibleARMA(p,q) processes is h ( χ ) = log(2 πeσ (cid:15) ) . Hence, for these models the entropy rate can be calculated interms of the variance of its innovations. However we wantto compare the entropy rates, as a function of their Hurstparameter, between ARFIMA(0,d,0) and FGN, so we wantto fix the variance of process itself, σ . We will use theautocovariance function of ARFIMA(0,d,0), from Beran [1,pg. 63], γ ( k ) = σ (cid:15) ( − k Γ(1 − d )Γ( k − d + 1)Γ(1 − k − d ) . Note that γ (0) = σ , σ = γ (0) = σ (cid:15) Γ(1 − d )Γ(1 − d ) , and hence, σ (cid:15) = σ Γ(1 − d ) Γ(1 − d ) . This leads to the following characterisation ofARFIMA(0,d,0) processes in terms of the Hurst parameter, H , noting that d = H − / . Theorem IV.5.
The entropy rate of an ARFIMA(0,d,0) pro-cess for a fixed process variance, σ , is given by, h ( χ ) = log(2 πeσ ) + log(Γ( − H )) − log(Γ(2 − H )) .Proof. By Theorem IV.3 and from the characterisation of σ (cid:15) above, h ( χ ) = 12 log (cid:18) πeσ Γ(1 − d ) Γ(1 − d ) (cid:19) , = 12 log(2 πeσ ) + log(Γ(1 − d )) −
12 log(Γ(1 − d )) , = 12 log(2 πeσ ) + log (cid:18) Γ (cid:18) − H (cid:19)(cid:19) −
12 log(Γ(2 − H )) . Remark.
The same approach can be used for more generalARFIMA(p,d,q) processes. However, there is no general closedform for the autocovariance function, so the variance canbe calculated for each process and then substituted for theinnovation variance. Interestingly, this result indicates that theeffect of the changing the process variance is balanced by theeffect of the change in the Hurst parameter, with respect tothe innovation variance. This results in the constant differen-tial entropy rate when considered in terms of its innovationvariance. E n t r o p y R a t e ( n a t s ) = 1 = 2 = 3 = 4 Fig. 4: The entropy rate of ARFIMA(0,d,0) as a function ofthe Hurst parameter, H , for variance, σ = 1 , , , . On thepositively correlated side, H > . , we see a similar asymp-totic behaviour to FGN. However, for negatively correlatedprocesses, the amount of entropy in the process, stays quitehigh. We see the maximum of the function at H = 0 . , whichintuitively shows that the highest uncertainty occurs for thewhite Gaussian noise process.We show the plot of the ARFIMA(0,d,0) entropy rate asa function of the Hurst parameter, H , with process variance, σ = 1 , , , , in Figure 4. The plot shows some interestingbehaviour, particularly when compared to the FGN entropyrate function in Figure 5. Some of these observed propertiesare: • The entropy rate is not symmetric, much less so thanFGN. The positively correlated side has a dramatic drop,however the negatively correlated side stays relativelyhigh. In order words, there is a demonstrable differencebetween FGN and ARFIMA(0,d,0) in the behaviour asCSRD processes. • The entropy rate asymptotically tends to −∞ as H → • The maximum entropy rate occurs at the same point asFGN, H = 0 . . Indicating that the maximum entropyoccurs for white Gaussian noise.Similar to the previous section, we will prove the asymptoticsof the entropy rate function, and show that the maximumoccurs at H = 0 . . Theorem IV.6.
The differential entropy rate ofARFIMA(0,d,0), h ( χ ) → −∞ as H → , for a fixedvariance σ ∈ R .Proof. As H → , the term Γ( − H ) is finite, as well asthe log(2 πeσ ) , for a fixed variance < σ < ∞ . Now,as H → , the term Γ(2 − H ) → Γ(0) . There exists asingularity for the gamma function at 0, which diverges toinfinity. Which implies that the term − log(Γ(2 − H )) →−∞ , since Γ( x ) → ∞ , as x → . This implies that h ( χ ) →−∞ , as H → . E n t r o p y R a t e ( n a t s ) FGN - = 1ARFIMA(0,d,0) - = 1 Fig. 5: The comparison of the entropy rate as function of theHurst parameter, for both ARFIMA(0,d,0) and FGN processes,with variance 1. It appears that the ARFIMA(0,d,0) processhas an entropy rate which is greater than or equal to FGNfor all values of H . The negatively correlated portion fallsaway quickly as H → for FGN but stays relatively high forthe ARFIMA(0,d,0) process. The maximum of the functionscoincide at H = 0 . . Remark.
Note that the value of the entropy rate function foran ARFIMA(0,d,0) process as H → , is h ( χ ) = 12 log(2 πeσ ) + log(Γ( 32 )) −
12 log(Γ(2)) ≈ . . To complete this section of the analysis, we will consider themaximum of the entropy rate function of ARFIMA(0,d,0), andconclude which Hurst parameter has the highest uncertainty,in the sense of maximum differential entropy rate.
Theorem IV.7.
The differential entropy rate of ARFIMA(0,d,0)as a function of H attains the maximum at H = / .Proof. We differentiate the entropy rate function with respectto H , and we get dh ( χ ) dH = ddH (cid:18)
12 log(2 πeσ ) + log(Γ( 32 − H )) −
12 log (Γ(2 − H )) (cid:19) , = Γ( − H ) ψ (cid:0) − H (cid:1) Γ( − H ) − Γ(2 − H ) ψ (2 − H )Γ(2 − H ) , = ψ (cid:18) − H (cid:19) − ψ (2 − H ) , where ψ ( x ) is the digamma function.Then we set dh ( χ ) dH = 0 , and solve for H . Since ψ ( x ) isa monotonically increasing function on R + , this implies that dh ( χ ) dH has one solution. Since − H = 2 − H only when H = / , this implies that h ( χ ) achieves a unique maximumat this point.This aligns with our intuition, that the highest uncertaintyoccurs for this model when it is uncorrelated and equal to white Gaussian noise, as it simplifies to X n = (cid:15) n , identical toFGN processes. This explains why the maxima coincide forthe two processes, given the same process variance, althoughARFIMA(0,d,0) appears to higher differential entropy acrossthe entire parameter range, when not at H = 0 . . This mightindicate that the maximal entropy process for LRD covarianceconstraints is the ARFIMA class, which echos previous resultsin this area such as Burg’s Theorem [6] and the ARMA classgiven appropriate constraints on the covariances and impulseresponses [12], [20].We have shown in this section that the behaviour forthe ARFIMA(0,d,0) model differs from that of FGN in thebehaviour of their CSRD processes. This is a surprisingdiscovery and warrants further investigation. Both models,however, have much less uncertainty as the strength of thepositive correlations increases, as well as a maximum un-certainty occurring for uncorrelated processes. Hence, wemay be able to characterise the behaviour of LRD processeson the entropy rate as tending to −∞ as the strength ofcorrelations increases. In remainder of the paper we look atother information theoretic measures as way to characterisethe behaviour of SRD and LRD processes.V. M UTUAL I NFORMATION AND E XCESS E NTROPY FOR L ONG R ANGE D EPENDENT P ROCESSES
In this section we continue analysing of the differentialentropy rate for stochastic processes with power-law decayingcovariance function. We investigate the links between theamount of entropy that is accumulated during the convergenceof the conditional entropy to the entropy rate and the amountof information that is shared between the past and future of astochastic process.We extend the standard notion of mutual information to thespecial case of mutual information between past and future, I p-f , which will measure the amount of information aboutthe infinite future, given knowledge of the infinite past ofstochastic processes. Definition V.1.
The mutual information for continuous ran-dom variables is defined as I ( X ; Y ) = (cid:90) Y (cid:90) X f ( x, y ) log (cid:18) f ( x, y ) f ( x ) f ( y ) (cid:19) dxdy, and in particular the mutual information between past andfuture with n lags, I ( n ) , for a stochastic process { X i } i ∈ Z isdefined as I ( { X s , s < } , { X s , s ≥ n } ) . The case with n = 0 is called the mutual information between past and future andis of special interest. An alternative characterisation for the mutual information, thatapplies to both shannon and differential entropy, is I ( X ; Y ) = h ( X ) − h ( X | Y ) = h ( Y ) − h ( Y | X ) [8, pg.251].We present a theorem from Li [26], that links the value of I p-f , and autocovariance function and the fourier coefficientsof the logarithm of the spectral density function. Theorem V.1 (From Li [26]) . Let { X i } i ∈ Z be a stationaryGaussian stochastic process: • if the spectral density f ( λ ) is continuous and f ( λ ) > , then I p-f is finite if and only if the autocovariancefunction satisfies the condition (cid:80) ∞ k = −∞ kγ ( k ) < ∞ . • I p-f is finite if and only if the cepstrum coefficients, b k = π (cid:82) π − π log f ( λ ) e − ikλ dλ , satisfy the condition (cid:80) ∞ k =1 kb k < ∞ . In this case, I p − f = (cid:80) ∞ k =1 kb k . Remark.
The convergence of the sum (cid:80) ∞ k = −∞ kb k requiresthat (cid:80) ∞ k =1 kb k < ∞ and (cid:80) ∞ k =1 − kb − k < ∞ separately. Theconvergence of sums of this type is related to whether themutual information between pass and future with n lags decaysto 0 as n → ∞ [19, pg. 131]. This theorem gives us a way to classify whether processes haveinfinite mutual information between past and future, in thispaper we will use this quantity to analyse convergence towardsthe entropic rate, in particular for LRD stochastic processes.In the next result we make an explicit link between LRDprocesses and the finiteness of the Mutual Information be-tween past and future. This perspective provides us with acharacterisation of LRD processes, these are processes that“share infinite information from the past to the future”. Thischaracterisation is intuitive if we think about these processesas retaining strong correlations with the past, and if we areable to have a complete understanding of the infinite past wecan retain a finite amount from each random variable observed,to the point we have a infinite amount of information accrued.
Theorem V.2.
The Mutual Information between past andfuture, I p-f , for stationary LRD Gaussian processes is infinite.Proof. To analyse LRD processes we use the spectral den-sity asymptotic representation around the origin, f ( λ ) ∼ c f | λ | − H , as we can see the divergence by considering theasymptotic behaviour around the singularity at the origin.Hence, as b k = π (cid:82) π − π log f ( λ ) e − ikλ dλ , then b k ∼ π (cid:90) π − π log (cid:0) c f | λ | − H (cid:1) e − ikλ dλ, = log c f π (cid:90) π − π e − ikλ dλ + 1 − H π (cid:90) π − π log | λ | e − ikλ dλ We split the integral into positive and negative components, b k = 2 log c f sin ( πk )2 πk + 1 − H π (cid:90) π log λ (cid:0) e − ikλ + e ikλ (cid:1) dλ. Since sin( πk ) = 0 , ∀ k ∈ Z , the first term vanishes and we onlyneed to consider the integral. Then we can decompose into thetrigonometric representations since e − ikλ + e ikλ = 2 cos ( kλ ) , b k ∼ − H π (cid:90) π λ cos ( kλ ) dλ. Integrating this expression by parts, we get b k ∼ − H π (cid:20) log λ sin ( kλ ) k (cid:21) π − (cid:90) π sin ( kλ ) kλ dλ. To analyse the integral in the second term we use the substi-tution, u = kλ , and therefore (cid:90) π sin ( kλ ) kλ dλ = (cid:90) kπ sin ( u ) u du. Then undoing the u-substitution and noting that Si ( x ) = (cid:82) x tt dt , is the Sine integral, b k ∼ − Hπ (cid:20) log λ sin ( kλ ) − Si ( λ ) k (cid:21) π , = (1 − H ) ( − Si ( π )) πk . The partial sum has the asymptotic form, m (cid:88) k =1 kb k ∼ m (cid:88) k =1 (1 − H ) Si ( π ) π k , = (1 − H ) Si ( π ) π m (cid:88) k =1 k , ∼ (1 − H ) Si ( π ) π log m. As the rate of growth of the harmonic series, (cid:80) nk =1 1 k ∼ log( n ) as n → ∞ . Hence, as n → ∞ , then the sum, (cid:80) mk =1 kb k , diverges. This implies that the sum, (cid:80) ∞ k = −∞ kb k ,diverges and therefore by Theorem V.1, I p-f is infinite. Remark.
We can quite easily show this result with the addi-tional assumptions that the spectral density, f ( λ ) is positiveand continuous. The asymptotic expression of the autocovari-ance function, γ ( k ) ∼ σ c ρ | k | − α . Hence, considering the sum, (cid:80) ∞ k =1 kγ ( k ) , from the first part of Theorem V.1, ∞ (cid:88) k =1 kγ ( k ) ∼ ∞ (cid:88) k =1 k ( σ c ρ | k | − α ) = σ c ρ ∞ (cid:88) k =1 k − α . This sum is diverges in the parameter range, α ∈ (0 , ,and hence by Theorem V.1, I p-f is infinite for LRD processes.However, there exist many processes that have infinite excessentropy but are not long range dependent. Some examples aregiven, including deterministic processes, in Crutchfield andFeldman [9]. Crutchfield and Feldman [9] analysed a quantity namedexcess entropy, (cid:80) ∞ n =1 ( H ( X n | X n − , . . . , X ) − H ( χ )) , forthe Shannon entropy H and corresponding entropy rate H ( χ ) ,which has been shown to be equivalent to the mutual in-formation between past and future. As note by Crutchfieldand Feldman [9], this has been interpreted as stored informa-tion [30], effective measure complexity [15], [27], predictiveinformation [29], and they describe it as the excess informationrequired to reveal the entropy rate. Importantly, it has beenused to measure the convergence rate of the conditionalentropy, based on past observations, to the entropy rate. Weaim to extend this result to the differential entropy, and thenthe question of classification of LRD processes via the amountof shared information can be made by the convergence rate tothe entropy rate. We extend the definition of the excess entropyto the case of differential entropy. Definition V.2.
The differential excess entropy, E , of astochastic process, { X i } i ∈ N , is defined as, E = ∞ (cid:88) n =1 ( h e ( n ) − h ( χ )) , = lim n →∞ [ h ( X n , . . . , X ) − nh ( χ )] . where h e ( n ) = h ( X , .., X n ) − h ( X , ..., X n − ) , = h ( X n | X n − , ..., X ) . We have the tools available to make an explicit link betweenthe mutual information between past and future and theexcess entropy of a continuous-valued, discrete-time stochasticprocess. This is an exact analogue of Proposition 8 fromCrutchfield and Feldman [9].
Theorem V.3.
For a stationary, continuous-valued stochasticprocess, the mutual information between past and future, I p-f ,is equal to the differential excess entropy, E .Proof. The mutual information for a process X , with a pastand future of n observations, I [ { X s , − n ≤ s < } ; { X s , ≤ s ≤ n } ] , = h ( X , ..., X n − ) − h ( X , ..., X n − | X − n , ..., X − ) , = n − (cid:88) i =0 h ( X i | X i − , ..., X ) − n − (cid:88) i =0 h ( X i | X i − , ..., X − n ) , = n − (cid:88) i =0 h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n ) , by the chain rule of differential entropy [8, pg. 253]. Then weconsider the mutual information between past and future, bytaking the limit of the above expression as n → ∞ , whichleads to I p-f = lim n →∞ (cid:34) n − (cid:88) i =0 h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n ) (cid:35) , = lim n →∞ (cid:34) ∞ (cid:88) i =0 ( h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n )) { i ≤ n } (cid:35) . We define the sequence of measurable functions, f n ( i ) as f n ( i ) = ( h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n ) { i ≤ n } , and we define the function, g ( i ) as g ( i ) = h ( X i | X i − , ..., X ) − h ( χ ) . We want to show that | f n ( i ) | ≤ g ( i ) for all n and forall i ∈ N . In this case it is equivalent to showing that f n ( i ) ≤ g ( i ) , since f n ( i ) ≥ for all n, i ∈ N , as thesecond term of f n ( i ) conditions on more random variables,and since conditioning cannot increase entropy this impliesthat h ( X i | X i − , ..., X ) ≥ h ( X i | X i − , ..., X − n ) . We considertwo cases, i ≤ n and i > n , separately. In the case, i > n , wehave that f n ( i ) = 0 , and since g ( i ) ≥ for all i , this impliesthat f n ( i ) ≤ g ( i ) . Considering the second case, i ≤ n , wehave that f n ( i ) = ( h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n ) , and therefore, g ( i ) − f n ( i ) = h ( X i | X i − , ..., X − n ) − h ( χ ) . Again, since conditioning does not increase entropy and thecharacterisation of entropy rate for stationary processes fromTheorem II.1 this implies that g ( i ) − f n ( i ) ≥ and therefore g ( i ) ≥ f n ( i ) for all n, i such that i ≤ n . Then we can apply thedominated convergence theorem [11, pg. 26], since f n ( i ) → ( X i | X i − , ..., X ) − h ( χ ) pointwise, this implies that I p-f = lim n →∞ ∞ (cid:88) i =0 ( h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n )) { i ≤ n } , = ∞ (cid:88) i =0 lim n →∞ ( h ( X i | X i − , ..., X ) − h ( X i | X i − , ..., X − n )) { i ≤ n } , = ∞ (cid:88) i =0 h ( X i | X i − , ..., X ) − h ( χ ) . Remark.
This proof is similar to that of Proposition 8 fromCrutchfield and Feldman [9]. However, it is more rigoroussince the limit is kept out the front of the sum while simulta-neously applied to the second term in the sum. This approachusing dominated convergence can resolve the issue in theirproof.
This link shows us that the infinite mutual information betweenpast and future has an impact on the amount of informationthat accumulates during the convergence to the entropy rate. InCrutchfield and Feldman [9], they analyse the excess entropyof discrete random variables to understand the convergencerate of the conditional entropy to the entropy rate. Hence,utilising our knowledge of the mutual information betweenthe past and future may be able to inform us about the rate ofconvergence of the conditional entropy to the entropy rate.This result and Theorem V.2, lead to the following corollarywhich gives us an approach to understand the entropy rate con-vergence by conditional entropy, of Gaussian LRD processes.
Corollary V.3.1.
The excess entropy of an LRD Gaussianprocess is infinite.Proof.
This is shown by combining Theorem V.2 and Theo-rem V.3.We will use this idea in the subsequent sections to analyse theexcess entropy, given its relationship to convergence to entropyrate which is noted for Shannon entropy rate by Crutchfieldand Feldman [9].VI. E
XCESS E NTROPY FOR S TATIONARY G AUSSIAN P ROCESSES
In this section we investigate the behaviour of the excessentropy, E , for Gaussian processes which have an autocor-relation function which decays as a power law, we apply alimit theorem to the terms inside the excess entropy, whichshows that summation terms asymptotically tend to zero. In the following section, we utilise a stronger version of the theoremwith an additional term that we use to classify the convergencerate of the conditional entropy to the entropy rate.For stationary processes, the terms in the excess entropywill tend to zero as n increases because lim n →∞ h e ( n ) = lim n →∞ h ( X n | X n − , ..., X ) = h ( χ ) . However, we will investigate the nature of the convergenceusing the conditional entropy to gain some additional insight.We begin by looking at the behaviour of the individual termsof the excess entropy series to understand why these termsdecay to zero.We consider h e ( n ) from Definition V.2 of differential excessentropy, and given that the joint entropy of a finite collectionof random variables of a gaussian process is h ( X n , ..., X ) = log (cid:0) (2 πe ) n | K ( n ) | (cid:1) [8, pg. 416], we have h e ( n ) = h ( X n , ..., X ) − h ( X n − , ..., X ) , = 12 log (cid:16) (2 πe ) n | K ( n ) | (cid:17) −
12 log (cid:16) (2 πe ) n − | K ( n − | (cid:17) , = 12 log (cid:18) πe | K ( n ) || K ( n − | (cid:19) , until time n where K ( n ) is the autocovariance matrix of theprocess X n , and note that this is a Toeplitz matrix of size n × n .We analyse each summand, n , of the infinite series ofdifferential excess entropy by substituting the characterisationof h e ( n ) above and utilising the entropy rate characterisationfor gaussian processes given in Section III as equation (1).This gives h e ( n ) − h ( χ )= 12 log(2 πe ) | K ( n ) || K ( n − | −
12 log(2 πe ) − π (cid:90) π − π log f ( λ ) dλ, = 12 log | K ( n ) || K ( n − | − π (cid:90) π − π log f ( λ ) dλ. (3)The expression, 3, will be able to tell us information about therate of convergence of the conditional entropy to the entropyrate, and about the general convergence of the excess entropy.If log f ( λ ) is integrable, there is a well known limit theoremby Szeg¨o [2], [33], which will be defined below and used toevaluate the limit of | K ( n ) || K ( n − | as n → ∞ .For LRD processes we show that log( f ( λ )) ∈ L [ − π, π ] and hence that Szeg¨o’s limit theorem can be applied. We usethe asymptotic form to analyse this case, as the issue withintegrability is the singularity that exists at the origin of thespectral density functions, since there are finite bounds onthe frequencies considered. Therefore, the integrability of theasymptotic form implies the integrability for LRD processes. Theorem VI.1.
For a spectral density, f ( λ ) such that itdefines a LRD stochastic process, log( f ( λ )) ∈ L [ − π, π ] .Proof. Using the asymptotic expression for spectral density forLRD processes, f ( λ ) ∼ c f | λ | − β , β ∈ (0 , , < c f < ∞ .Hence, (cid:90) π − π log( c f | λ | − β ) dλ = (cid:90) π − π log c f dλ − β (cid:90) π − π log | λ | dλ, = 2 π log c f − β (cid:90) π log λdλ, = 2 π log c f − β [ λ log λ − λ ] π , = 2 π log c f − β [ π log π − π ] . By the finiteness of all of the terms, this implies that, (cid:90) π − π log f ( λ ) dλ < ∞ . The following theorem was originally formulated bySzeg¨o [33], and then extended to include another term inSzeg¨o [32], the second theorem will be used in the nextsection. The statement of the theorem that we will be using areby Bingham [2], which are from the probabilistic perspectiveand include the most recent generalisations in the conditions.However, this theorem has a history of having the conditionsgeneralised, and applications being found in many areas infunctional analysis, statistics and probability [16, pg. 145-228],to calculate functions of eigenvalues of Toeplitz matrices.
Theorem VI.2 (Szeg¨o’s Theorem [2]) . For a sequence ofToeplitz matrices, Γ ( n ) = [ γ ( | i − j | )] i,j , of increasing size n , where γ ( n ) = (cid:82) π − π f ( λ ) e inλ dλ , and f ( λ ) is the spectraldensity such that log f ( λ ) ∈ L [ − π, π ] , then there exists alimit lim n →∞ | Γ ( n ) | n = exp (cid:26) π (cid:90) π − π log f ( λ ) dλ (cid:27) . We apply this to the ratios of the autocovariance function, K ( n ) , of the process, { X n } n ∈ N . Due to the finiteness of thislimit we can take the n th power, for a covariance matrix K ( n ) .Which has the following form, lim n →∞ | K ( n ) | n = exp (cid:26) π (cid:90) π − π log f ( λ ) dλ (cid:27) . Applying this result to the limit of the n th term of thedifferential excess entropy (3), gives lim n →∞ h e ( n ) − h ( χ )= 12 log (cid:18) exp (cid:26) π (cid:90) π − π log f ( λ ) dλ (cid:27)(cid:19) − π (cid:90) π − π log f ( λ ) dλ, = 0 . As n → ∞ , h e ( n ) − h ( χ ) → , which implies the convergenceof the conditional entropy, conditioned on the infinite past,to the entropy rate, equivalent ot the result of Theorem II.1.We have gained some additional insight that the summandsconverge to zero in the limit as n → ∞ because the ratioof subsequent covariance matrix determinants of finite sets ofobservations for stationary stochastic processes converges tothe exponential of an integral of the logarithm of the spectraldensity function. However, this does not tell us anything aboutthe rate of convergence of the terms or whether E convergesat all.In following section we will use a stronger version of theSzeg¨o theorem, with an additional term in the limit. Thisgives an approach to analyse the convergence properties to theentropy rate that arise from observing the conditional entropy,conditioned on the past observations. VII. E NTROPY RATE CONVERGENCE FOR POWER - LAWDECAY PROCESSES
From the previous sections we have gained an understandingof some of the properties of the entropy rate function for com-mon LRD models. In this section, we classify the convergencerate of the conditional entropy to the entropy rate for SRD andLRD processes.In Section VI we used Szeg¨o’s theorem to show, lim n →∞ | K ( n ) || K ( n − | = exp (cid:18) π (cid:90) π − π log f ( λ ) dλ (cid:19) , if log f ( λ ) ∈ L [ − π, π ] , to provide another perspective toexplain why the conditional entropy converges to the entropyrate. There is an extension to the original result, the StrongSzeg¨o Theorem, which provides an additional term to thislimit, with the same regularity conditions, however we haveto be careful with some quantities that can become infinite.From the Szeg¨o theorem we could show why the convergenceoccured for the conditional entropy, and with the strong Szeg¨otheorem we can explain the convergence rate of the conditionalentropy to the entropy rate. We will state the version as givenin Bingham [2].We additionally define the limit of the Szeg¨o theorem, G ( µ ) := exp (cid:18) (cid:90) π − π log f ( λ ) dλ (cid:19) , for ease of notation and define the partial autocorrelationcoefficients, α n , as, α n = corr ( X n − P [1 ,n − X n , X − P [1 ,n − X ) , where P [1 ,n − is the projection onto the linear space spannedby { X − n , ..., X − } and the correlation function is defined as, corr ( X, Y ) = E [ X ¯ Y ] (cid:112) E [ | X | ] E [ | Y | ] , for X and Y zero mean random variables. Note that P [1 ,n − X is the best linear predictor of X given the finitepast of length n .The partial autocorrelation function is related to the autocor-relation function, by the removal of the linear dependence onthe variables within n lags. For example, in the case of finitelag processes, such as AR(p), the partial correlation functionis 0 for a lag greater than p. In the case of the ARFIMA(p,d,q)process, the decay is slower for LRD parametrisations [22].We define the Hardy space H , which is the subspace of (cid:96) of sequences a = ( a n ) such that || a || := (cid:80) n (1 + | n | ) | a n | < ∞ , which is a well defined norm on (cid:96) . We use this to describethe conditions on the strong Szeg¨o theorem. Also, the cepstralcoefficients that were defined in Theorem V.1 will be usedin the definition of the additional term of the Szeg¨o limittheorem, reinforcing their connection with the informationtheoretic perspective of stochastic processes. Theorem VII.1 (Strong Szeg¨o Theorem [2]) . For a sequenceof Toeplitz matrices, Γ ( n ) = [ γ ( | i − j | )] i,j , of increasing size n ,for covariance function γ ( n ) , with associated spectral density f ( λ ) , such that log f ( λ ) ∈ L [ − π, π ] , then lim n →∞ | Γ ( n ) | G ( µ ) n → E ( µ ) , where, E ( µ ) := ∞ (cid:89) j =1 (1 − | α j | ) − j = exp (cid:32) ∞ (cid:88) k =1 kb k (cid:33) , note that these expressions may be infinite. The infinite productconverges if and only if • The strong Szeg¨o condition, α ∈ H holds; • The sum of cepstral coefficients (cid:80) ∞ k =1 kb k ∈ H . Using this theorem we will be able to characterise the con-vergence rate of the short range dependent processes that wehave been considering in this paper.The SRD and CSRD versions of ARFIMA and FractionalGaussian Noise meet the conditions of the convergence of theinfinite product and sum of Strong Szeg¨o Theorem. This isshown by Theorem V.1. 
For a positive, continuous spectral density, $\sum_{k=1}^{\infty} k\,\gamma(k)^2 < \infty$ if and only if $I_{p-f}$, the mutual information between past and future, is finite, which in turn holds if and only if $\sum_{k=-\infty}^{\infty} |k|\,|b_k|^2 < \infty$. In the proof of Theorem V.2, we showed that the boundary between finiteness and infiniteness of $\sum_{k=1}^{\infty} k\, b_k^2$ coincides with the boundary between SRD and LRD processes. For example, for ARFIMA and Fractional Gaussian Noise, $\sum_{k=1}^{\infty} k\,\gamma(k)^2 < \infty$ when $H \le 1/2$, and therefore by Theorem V.2, $\sum_{k=1}^{\infty} |k|\,|b_k|^2 < \infty$. This indicates that the convergence or divergence of the infinite product and sum may influence the resulting rate of convergence to the entropy rate if we consider the $n$-th root of the asymptotic form,
\[
|\Gamma_n|^{1/n} \sim G(\mu)\, E(\mu)^{1/n}.
\]
One additional note about the conditions of the infinite product and sum in this theorem: the conditions for $\ell^2$ and $H^{1/2}$ coincide for sequences that decay as power laws, i.e., $n^{-\alpha}$, which is the common situation when considering the convergence or divergence of sequences associated with LRD processes. However, a sequence may be in $\ell^2$ and not in $H^{1/2}$, e.g., $a_n = 1/(\sqrt{n}\log n)$ [2].
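As a small side illustration of the $\ell^2$ versus $H^{1/2}$ distinction (an author-added sketch using the example sequence as reconstructed above, $a_n = 1/(\sqrt{n}\log n)$), the two partial sums behave very differently:

import numpy as np

# Partial sums for a_n = 1/(sqrt(n) log n): the l^2 sum settles towards a finite value,
# while the H^{1/2}-type sum, sum (1+n) a_n^2, keeps growing (slowly).
for N in [10**3, 10**5, 10**6]:
    n = np.arange(2, N, dtype=float)
    a = 1.0 / (np.sqrt(n) * np.log(n))
    print(N, np.sum(a**2), np.sum((1 + n) * a**2))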
Before we continue, we require a lemma, used to prove a theorem about the convergence rate of the differential entropy rate of SRD and LRD processes, which equates the two determinant limits appearing in the Szegő limit theorems.

Lemma VII.2. For all discrete-time stationary Gaussian processes,
\[
\lim_{n\to\infty} \left( |K^{(n)}| \right)^{1/n} = \lim_{n\to\infty} \frac{|K^{(n)}|}{|K^{(n-1)}|},
\]
where $K^{(n)}$ is the $n \times n$ autocovariance matrix of the process.

Proof. For discrete-time stationary Gaussian processes,
\[
\lim_{n\to\infty} \frac{1}{n} h(X_1, \dots, X_n) = \lim_{n\to\infty} h(X_n \mid X_{n-1}, \dots, X_1),
\]
by Theorem II.1, noting that both of the limits exist for stationary processes. The joint entropy of a multivariate Gaussian random vector of length $n$ [8, pg. 249] is
\[
h(X_1, \dots, X_n) = \frac{1}{2}\log\left( (2\pi e)^n |K^{(n)}| \right).
\]
Therefore,
\[
\lim_{n\to\infty} \frac{1}{n} h(X_1, \dots, X_n) = \lim_{n\to\infty} \frac{1}{n}\left( \frac{1}{2}\log\left( (2\pi e)^n |K^{(n)}| \right) \right) = \lim_{n\to\infty} \frac{1}{2}\log\left( 2\pi e\, |K^{(n)}|^{1/n} \right).
\]
Equating this with the entropy rate characterised using the conditional entropy, we get
\[
\lim_{n\to\infty} \frac{1}{2}\log\left( 2\pi e\, |K^{(n)}|^{1/n} \right) = \lim_{n\to\infty} h(X_n, \dots, X_1) - h(X_{n-1}, \dots, X_1)
= \lim_{n\to\infty} \frac{1}{2}\log\left( (2\pi e)^n |K^{(n)}| \right) - \frac{1}{2}\log\left( (2\pi e)^{n-1} |K^{(n-1)}| \right)
= \lim_{n\to\infty} \frac{1}{2}\log\left( 2\pi e\, \frac{|K^{(n)}|}{|K^{(n-1)}|} \right).
\]
Hence, we conclude that
\[
\lim_{n\to\infty} \left( |K^{(n)}| \right)^{1/n} = \lim_{n\to\infty} \frac{|K^{(n)}|}{|K^{(n-1)}|}.
\]

Now we can characterise the convergence rate of the conditional entropy of SRD ARFIMA and Fractional Gaussian Noise processes to the differential entropy rate. Then we will show that a slower decay exists in the case of long range dependence.
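Before stating the result, here is a quick numerical sanity check of Lemma VII.2 (an author-added sketch with an assumed MA(1) example, $\theta = 0.5$, $\sigma^2 = 1$): the $n$-th root of the determinant and the ratio of successive determinants approach the same limit, here the innovation variance 1.

import numpy as np
from scipy.linalg import toeplitz

theta = 0.5
gamma = np.zeros(400)
gamma[0] = 1 + theta**2   # MA(1) autocovariances (all higher lags are zero)
gamma[1] = theta

def logdet(n):
    # log-determinant of the n x n autocovariance matrix
    return np.linalg.slogdet(toeplitz(gamma[:n]))[1]

for n in [10, 50, 200]:
    nth_root = np.exp(logdet(n) / n)
    ratio = np.exp(logdet(n) - logdet(n - 1))
    print(n, nth_root, ratio)   # both tend to 1; the ratio converges faster in this example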
Theorem VII.3.
For all Gaussian SRD processes such that the autocovariance function $\gamma \in \ell^1$, the convergence rate of the conditional entropy, $h(X_n \mid X_{n-1}, \dots, X_1)$, to the differential entropy rate of the process is $O(n^{-1})$.

Proof. We rearrange the asymptotic expression of Szegő's strong theorem to get
\[
\frac{|K^{(n)}|}{G(\mu)^n} \sim E(\mu) \implies |K^{(n)}|^{1/n} \sim G(\mu)\, E(\mu)^{1/n}.
\]
Then we use the asymptotic form of the determinant limit from Lemma VII.2,
\[
\frac{|K^{(n)}|}{|K^{(n-1)}|} \sim G(\mu) \exp\left( \sum_{k=1}^{\infty} k\, b_k^2 \right)^{1/n} \implies \frac{|K^{(n)}|}{|K^{(n-1)}|} \sim G(\mu) \exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right).
\]
From the asymptotic form of the Szegő theorem, Theorem VI.2, the left-hand side converges to $G(\mu)$ as $n$ increases. This implies that the convergence rate of the conditional entropy is controlled by the term $\exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right)$ and its rate of convergence to 1. That is,
\[
\left| G(\mu)\exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right) - G(\mu) \right| = G(\mu)\left| \exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right) - \exp(0) \right| \to 0,
\]
as $n \to \infty$. These two expressions converge equivalently since $\frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \to 0$ as $n \to \infty$.
Since $\sum_{k=1}^{\infty} k\, b_k^2$ is finite for $H \le 1/2$, as $\gamma \in \ell^1$, this implies that the convergence is at the rate $O(n^{-1})$, as
\[
\left| \exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right) - \exp(0) \right| \to 0 \implies \left| \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 - 0 \right| \to 0,
\]
and
\[
\left| \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right| \le \frac{C}{n}, \quad C \in \mathbb{R}^+, \; \forall n \in \mathbb{Z}^+,
\]
for some finite $C$ such that $C \ge \sum_{k=1}^{\infty} k\, b_k^2$.
The key part of the proof that allows us to characterise the convergence rate of the conditional entropy is that the series $\sum_{k=1}^{\infty} k\, b_k^2$ is finite. By Theorem V.1, we know that this is the boundary between LRD and SRD processes. This behaviour intuitively indicates that the limit would take longer to converge in the LRD case, similar to many estimators for LRD processes. Hence, we expect that the convergence of the conditional entropy to the entropy rate will be slower for LRD processes. This leads to the following theorem and a characterisation of LRD processes through the convergence of the conditional entropy.
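Before stating the LRD result, a small numerical sketch of the $O(n^{-1})$ behaviour just established (an author-added illustration, assuming a CSRD Fractional Gaussian Noise parameterisation with $H = 0.3$ and using the conditional entropy at a large $N$ as a proxy for the entropy rate):

import numpy as np
from scipy.linalg import toeplitz

H, N = 0.3, 1500
k = np.arange(N + 1, dtype=float)
gamma = 0.5 * ((k + 1)**(2*H) - 2 * k**(2*H) + np.abs(k - 1)**(2*H))  # FGN autocovariance

def logdet(n):
    return np.linalg.slogdet(toeplitz(gamma[:n]))[1]

# h(X_n | X_{n-1}, ..., X_1) = 0.5*log(2*pi*e) + 0.5*(logdet(n) - logdet(n-1));
# the 2*pi*e terms cancel in the gap computed below.
h_ref = 0.5 * (logdet(N) - logdet(N - 1))   # large-N proxy for the entropy rate
for n in [10, 50, 200]:
    gap = 0.5 * (logdet(n) - logdet(n - 1)) - h_ref
    print(n, gap, n * gap)   # n * gap stays bounded, consistent with an O(1/n) rate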
Theorem VII.4. For all Gaussian LRD processes, the convergence rate of the conditional entropy, $h(X_n \mid X_{n-1}, \dots, X_1)$, to the differential entropy rate of the process is $O\!\left( \log\!\left( n^{(1-H)^2} \right) / n \right)$.

Proof.
Similar to the theorem above, we consider the convergence of the term $\exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right)$ to $\exp(0) = 1$, and use the following expansion of the term,
\[
\lim_{n\to\infty} \exp\left( \frac{1}{n}\sum_{k=1}^{\infty} k\, b_k^2 \right) = \lim_{n\to\infty} \exp\left( \frac{1}{n}\lim_{m\to\infty}\sum_{k=1}^{m} k\, b_k^2 \right) = \lim_{n\to\infty}\lim_{m\to\infty} \exp\left( \frac{1}{n}\sum_{k=1}^{m} k\, b_k^2 \right),
\]
where we use the continuity of the exponential function to exchange the limit and the function. In Theorem V.2, we showed that the rate of divergence of the partial sums of $k\, b_k^2$ is
\[
\sum_{k=1}^{m} k\, b_k^2 \sim (1-H)^2 \left( \frac{\mathrm{Si}(\pi)}{\pi} \right)^2 \log m = \left( \frac{\mathrm{Si}(\pi)}{\pi} \right)^2 \log\left( m^{(1-H)^2} \right).
\]
Hence we can consider the limits as $n$ and $m$ tend to infinity,
\[
\lim_{n\to\infty}\lim_{m\to\infty} \exp\left( \frac{1}{n}\sum_{k=1}^{m} k\, b_k^2 \right) \sim \lim_{n\to\infty}\lim_{m\to\infty} \exp\left( \frac{1}{n}\left( \frac{\mathrm{Si}(\pi)}{\pi} \right)^2 \log\left( m^{(1-H)^2} \right) \right).
\]
If we take $n \to \infty$ and $m \to \infty$ such that $m = n$, then we have
\[
\lim_{n\to\infty} \exp\left( \frac{1}{n}\left( \frac{\mathrm{Si}(\pi)}{\pi} \right)^2 \log\left( n^{(1-H)^2} \right) \right) \to 1,
\]
and the rate at which this expression tends to 1 is governed by
\[
\frac{\log\left( n^{(1-H)^2} \right)}{n} \to 0,
\]
which gives the rate of convergence of the conditional entropy to the entropy rate.

Therefore, the convergence of the conditional entropy to the entropy rate is slower for LRD processes than for SRD processes. This provides an information theoretic characterisation of LRD by convergence properties, similar to those of the covariance function, the sample mean and the parameters of linear predictors on the infinite past. The rate of convergence to the entropy rate decreases rapidly as $H \to 1$, because in $n^{(1-H)^2}$ the influence of the Hurst parameter is squared. The convergence becomes much quicker as the Hurst parameter approaches 0.5, and eventually reaches the point where it converges immediately at $H = 0.5$, as $h(X_i) = h(\chi)$, $\forall i \in \mathbb{Z}^+$, in the absence of any correlations.
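To see the contrast numerically, the following author-added sketch (assuming FGN autocovariances, the Durbin-Levinson recursion for the conditional variances $v_n = \mathrm{Var}(X_n \mid X_{n-1}, \dots, X_1)$, and a large-$N$ proxy for the entropy rate) compares the conditional-entropy gap $\frac{1}{2}\log(v_n / v_N)$ for a CSRD ($H = 0.3$) and an LRD ($H = 0.9$) parameterisation:

import numpy as np

def fgn_acvf(H, N):
    # Autocovariances gamma(0..N) of Fractional Gaussian Noise with unit variance.
    k = np.arange(N + 1, dtype=float)
    return 0.5 * ((k + 1)**(2*H) - 2 * k**(2*H) + np.abs(k - 1)**(2*H))

def cond_vars(gamma, N):
    # Durbin-Levinson recursion: v[n] = Var(X_n | X_{n-1}, ..., X_1) for n = 1..N.
    v = np.zeros(N + 1)
    v[1] = gamma[0]
    phi = np.zeros(N + 1)
    for n in range(1, N):
        refl = (gamma[n] - np.dot(phi[1:n], gamma[n-1:0:-1])) / v[n]
        phi[1:n], phi[n] = phi[1:n] - refl * phi[n-1:0:-1], refl
        v[n + 1] = v[n] * (1 - refl**2)
    return v

N = 2000
for H in [0.3, 0.9]:
    v = cond_vars(fgn_acvf(H, N), N)
    gaps = [0.5 * np.log(v[n] / v[N]) for n in [10, 100, 1000]]
    print(H, gaps)   # the LRD case (H = 0.9) shows noticeably larger gaps at each n

The LRD gaps are larger at every lag shown, consistent with the slower convergence described above, although a small experiment of this kind cannot by itself separate the $O(n^{-1})$ and $O(\log(n^{(1-H)^2})/n)$ rates.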
VIII. CONCLUSION

In this paper, we are concerned with the behaviour of the differential entropy rate as a means to understand and characterise the behaviour of LRD and SRD processes. Analysing two common LRD processes, FGN and ARFIMA(0,d,0), we have shown that the maximum occurs in the absence of correlations, i.e., at $H = 0.5$, and that the differential entropy rate tends to its minimum, $-\infty$, as the strength of positive correlations increases; i.e., as we receive more information from correlations, the entropy of the process decreases. However, the behaviour is very different for negatively correlated processes, where ARFIMA(0,d,0) processes do not tend to $-\infty$ as the strength of the negative correlations increases. Further research is required to understand this behaviour for these processes.
In addition, we have made a link, analogous to the Shannon entropy case, between the mutual information between past and future and the excess entropy, meaning that the amount of shared information between the complete past and future of a process is the same as the additional information that accrues when converging to the entropy rate, based on past observations. This leads to a characterisation of LRD processes as those having infinite mutual information between past and future.
Using this and Szegő's limit theorems we can then classify LRD and SRD processes by their convergence rates, and show that LRD processes have slower convergence of the conditional entropy, conditioned on the past, to the entropy rate.

ACKNOWLEDGMENT
This research was funded by CSIRO's Data61, the Australian Research Council's Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS) and the Defence Science and Technology Group. Thanks to Adam Hamilton for his assistance in providing feedback on this manuscript.

REFERENCES
[1] J. Beran. Statistics for Long-Memory Processes. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1994.
[2] N. Bingham. Szegő's theorem and its probabilistic descendants. Probab. Surveys, 9:287–324, 2012.
[3] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[4] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, Berlin, Heidelberg, 1986.
[5] G. Chavez. Conditional and marginal mutual information in Gaussian and hyperbolic decay time series. Journal of Time Series Analysis, 37(6):851–861, 2016.
[6] B. Choi and T. M. Cover. An information-theoretic proof of Burg's maximum entropy spectrum. Proceedings of the IEEE, 72(8):1094–1096, 1984.
[7] R. Cont. Long range dependence in financial markets, pages 159–179. 2005.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
[9] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science, 13(1):25–54, 2003.
[10] Y. Ding and X. Xiang. An entropic characterization of long memory stationary process, 2016. arXiv preprint arXiv:1604.05453.
[11] R. Durrett. Probability: Theory and Examples. Cambridge University Press, New York, NY, USA, 4th edition, 2010.
[12] J. Franke. ARMA processes have maximal entropy among time series with prescribed autocovariances and impulse responses. Advances in Applied Probability, pages 810–840, 1985.
[13] A. Gefferth, D. Veitch, I. Maricza, S. Molnár, and I. Ruzsa. The nature of discrete second-order self-similarity. Advances in Applied Probability, pages 395–416, 2003.
[14] C. W. J. Granger and R. Joyeux. An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1):15–29, 1980.
[15] P. Grassberger. Toward a quantitative theory of self-generated complexity. International Journal of Theoretical Physics, 25(9):907–938, 1986.
[16] U. Grenander and G. Szegő. Toeplitz Forms and Their Applications. Univ. of California Press, 1958.
[17] J. R. M. Hosking. Fractional differencing. Biometrika, 68(1):165–176, 1981.
[18] H. E. Hurst. Long-term storage capacity of reservoirs. Trans. Amer. Soc. Civil Eng., 116:770–799, 1951.
[19] I. A. Ibragimov and Y. A. Rozanov. Gaussian Random Processes. Translated by A. B. Aries. Springer-Verlag, New York, 1978.
[20] S. Ihara. Maximum entropy spectral analysis and ARMA processes (Corresp.). IEEE Transactions on Information Theory, 30(2):377–380, 1984.
[21] S. Ihara. Information Theory for Continuous Systems, volume 2. World Scientific, 1993.
[22] A. Inoue. Asymptotic behavior for partial autocorrelation functions of fractional ARIMA processes. Annals of Applied Probability, pages 1471–1491, 2002.
[23] S.-Y. Koyama and N. Kurokawa. Euler's integrals and multiple sine functions. Proceedings of the American Mathematical Society, 133(5):1257–1265, 2005.
[24] A. J. Lawrance and N. T. Kottegoda. Stochastic modelling of riverflow time series. Journal of the Royal Statistical Society, Series A (General), 140(1):1–47, 1977.
[25] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic. In Conference Proceedings on Communications Architectures, Protocols and Applications, SIGCOMM '93, pages 183–193, New York, NY, USA, 1993. ACM.
[26] L. M. Li. Some notes on mutual information between past and future. Journal of Time Series Analysis, 27(2):309–322, 2006.
[27] K. Lindgren and M. G. Nordahl. Complexity measures and cellular automata. Complex Systems, 2(4):409–440, 1988.
[28] B. Mandelbrot. Self-similar error clusters in communication systems and the concept of conditional stationarity. IEEE Transactions on Communication Technology, 13(1):71–90, 1965.
[29] I. Nemenman. Information theory and learning: A physical approach. arXiv preprint physics/0009032, 2000.
[30] R. Shaw. The Dripping Faucet as a Model Chaotic System. Aerial Press, 1984.
[31] P. Shiu and R. Shakarchi. Complex analysis. The Mathematical Gazette, 88(512), 2004.
[32] G. Szegő. On certain Hermitian forms associated with the Fourier series of a positive function. Comm. Sém. Math. Univ. Lund [Medd. Lunds Univ. Mat. Sem.], (Tome Supplémentaire):228–238, 1952.
[33] G. Szegő. Ein Grenzwertsatz über die Toeplitzschen Determinanten einer reellen positiven Funktion. Mathematische Annalen, 76(4):490–503, 1915.
[34] C. Varotsos and D. Kirk-Davidoff. Long-memory processes in ozone and temperature variations at the region 60°S–60°N. Atmos. Chem. Phys., 6, 2006.
[35] D. Veitch, A. Gorst-Rasmussen, and A. Gefferth. Why FARIMA models are brittle. Fractals, 21(02):1350012, 2013.
[36] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
[37] W. Willinger, M. S. Taqqu, W. E. Leland, and D. V. Wilson. Self-similarity in high-speed packet traffic: Analysis and modeling of Ethernet traffic measurements. Statist. Sci., 10(1):67–85, 1995.
[38] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson. Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 5(1):71–86, 1997.
[39] W. Willinger, M. S. Taqqu, and V. Teverovsky. Stock market prices and long-range dependence.