AUTOCORRELATION TYPE FUNCTIONS FOR BIG AND DIRTY DATA SERIES
CHRISTOPH BANDT
One form of big data are signals - time series of consecutive values. In physical experiments, billions of values can now be measured within a second [1, 2]. Signals of heart and brain in intensive care, as well as seismic waves, are measured with 100 up to 1000 Hz over hours, days or even years [3, 4]. A song of 3 minutes on CD comprises 16 million values. Music and seismic vibrations basically consist of harmonic oscillations for which classical tools like autocorrelation and spectrogram work well [5, 6, 7]. This note presents similar tools for all kinds of rhythmic processes, with non-linear distortion, artefacts, and outliers. Permutation entropy [8, 9, 10] has been used in physics [2, 11], medicine [12, 13, 14], and engineering [15, 16]. Now ordinal patterns [17, 18, 19] are studied in detail for big data. As a new version of permutation entropy, we define a distance to white noise consisting of four curious components. Applications to a variety of medical and sensor data are discussed.
Figure 1.
Nose breathing of a healthy volunteer in normal life. a: One minute of data without artefacts. b: Mean absolute flow for each 30 seconds of a 24 hours measurement. c: The new function τ̃(d) for each minute shows more structure and needs no calibration of the data.
Classical time series from quarterly reports of companies, monthly unemployment figures, daily statistics of accidents etc. consist of 20 up to a few thousand values which were determined with great care. Machine-generated data amount to millions and are not double-checked. Usually it is easy to let the machine run faster and longer, but processing goes the other way: data are smoothed out by downsampling. This note shows how high time resolution can be exploited with simple ordinal autocorrelation functions. To demonstrate the remarkable effect of these tools, they will be applied to raw data: no preprocessing, no filters, no detection of outliers, no removal of artefacts, and no polishing of the results.

Figure 1 comes from ongoing work with Achim Beule (Department ENT, Head and Neck Surgery, University of Greifswald) on respiration of healthy volunteers in everyday life. Sensors measuring air flow with sampling frequency 50 Hz were fixed in both nostrils, gently enough to ensure comfort for a 24 hours measurement. Mouth breathing was not controlled, so the signals contain lots of artefacts. Traditional analysis takes averages over 30 seconds to obtain a better signal. More information is found in a function τ̃(d) for each minute of the dirty signal. As explained below, τ̃ measures the percentage of "elliptical symmetry" of the signal, depending on time and the delay d which runs from 0.02 to 5 seconds. The collection of these functions, visualized like a spectrogram, shows phases of activity and sleep, various interruptions of sleep, inaccurate measurements around 8 am, and a little nap after 2 pm. The frequency of respiration can be read from the lower dark red stripe, which marks half of the wavelength, while the upper red and yellow stripe marks the full wavelength (4 seconds in sleep, less than 3 in daily life).
1. Up-down balance.
Only the order relation between the values of the time series x = (x_1, ..., x_T) will be used, not the values themselves. In the simplest case, the question is whether the time series goes more often upwards than downwards. For a delay d between 1 and T/2, let n_{12}(d) and n_{21}(d) denote the number of time points t for which x_t < x_{t+d} and x_t > x_{t+d}, respectively. We determine the relative frequencies of increase and decrease over d steps: p_{12}(d) = n_{12}(d)/(n_{12}(d) + n_{21}(d)) and p_{21}(d) = 1 − p_{12}(d). Ties are disregarded. The up-down balance

β(d) = p_{12}(d) − p_{21}(d) for d = 1, 2, ...

is a kind of autocorrelation function. It reflects the dependence structure of the underlying process and has nothing to do with the size of the data. For illustration we take water levels from the database of the National Ocean Service [20]. Since water tends to come fast and disappear more slowly, we could expect β(d) to be negative, at least for small d. This is mostly true for water levels from lakes, like Milwaukee in Figure 2. For sea stations, tidal influence makes the data almost periodic, with a period slightly smaller than 25 hours. The data have 6 minute intervals and we measure d in hours, writing d = 250 as d = 25 h. A strictly periodic time series with period L fulfils β(L − d) = −β(d), which in the case L = 25 h implies β(12.5 h) = 0, visible at all sea stations in Figure 2. Otherwise there are big differences: at Honolulu and Baltimore the water level is more likely to fall within the next few hours, at San Francisco it is more likely to increase, and at Anchorage there is a change at 6 hours.
Figure 2.
Water levels at 6 minute intervals from [20]. Original data shown for Sep 1-2, 2014. Functions β are given for September of the 18 years 1997-2014, and also for January in the case of Anchorage and San Francisco. d runs from 6 min to 27 hours.

Each station has its specific β-profile, almost unchanged during 18 years, which characterizes its coastal shape. β can also change with the season, but these differences are smaller than those between stations. Figure 2 indicates that β, as well as related functions below, can solve some basic problems of statistics: describe, distinguish, and classify objects. Cf. Figure 13.
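As an illustration of the definition, here is a minimal MATLAB sketch (my illustration, not the program used for the figures) which computes β(d) for a range of delays. It assumes a series x without missing values, simply ignores ties, and uses a random walk as stand-in data:

% up-down balance beta(d) = p12(d) - p21(d) for d = 1, ..., dmax
x = cumsum(randn(1,10000));              % example series (random walk)
dmax = 100; beta = zeros(1,dmax);
for d = 1:dmax
    up   = sum(x(1:end-d) < x(d+1:end)); % n12(d): increases over d steps
    down = sum(x(1:end-d) > x(d+1:end)); % n21(d): decreases over d steps
    beta(d) = (up - down)/(up + down);   % ties drop out of both counts
end
plot(1:dmax, beta)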
2. Sliding windows and statistical variation.
The time series is divided into pieces, so-called windows, and β is determined for every window. If we have some hundred windows, which may overlap, we can compose them in a map like Figure 1 or 4, where each time point corresponds to a window, the β-functions are represented by vertical lines, and the values β(d) are color coded. Such maps show whether the regime of the underlying process changes in time, and indicate transitions between different regimes: activity and rest, or sleep stages.

If there are no systematic changes in the appearance of the functions, we can consider the underlying process as stationary and estimate statistical fluctuations. Mathematical arguments say that the standard deviation of β(d) for fixed d is c/√n, where n is the window length and c some constant. In our data, c varied up to 4 (Methods 2). For the data of Figure 2, the underlying tidal process is certainly not stationary - but the influence of the moon was reduced by calculating β over one month, and the annual component was minimized by taking only September. We can roughly estimate σ, with deviations still depending on d.
For medical data, errors were obtained from stationary segments, e.g. deep sleep. Another error, of order d/n, is observed for very large delays d (Methods 2, 3). The error estimate σ ≈ c/√n can be used to determine the appropriate window length. To have the 2σ-radius of the 95% confidence interval for β(d) smaller than 0.01, we must take n of the order 10^5, and for Figures 2 and 3 n is indeed of this order of magnitude. Thus our method works only for big data.

A simple exact method checks whether β(d) significantly differs from zero, or whether the β-curves of two sites in Figure 2 significantly differ at some d. When windows do not overlap, the median test is applicable. If, for instance, β(20) has the same sign in all 11 non-overlapping windows, then β(20) for the underlying process cannot be zero, since the p-value for 11 heads in 11 coin tosses is less than 0.001. Thus, except for γ, most values of the functions in Figure 3 differ significantly from zero.
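A sliding-window map as in Figures 1 and 4 can be sketched along the following lines. This is an illustrative MATLAB fragment, not the original program; the window length n = 3000, the delay range dmax = 250 and the simulated data are arbitrary choices:

% beta(d) for non-overlapping windows, displayed as a color-coded map
x = cumsum(randn(1, 600000));            % stand-in for a long measured signal
n = 3000; dmax = 250;                    % window length and largest delay
nwin = floor(length(x)/n);
B = zeros(dmax, nwin);
for w = 1:nwin
    xw = x((w-1)*n+1 : w*n);
    for d = 1:dmax
        up   = sum(xw(1:end-d) < xw(d+1:end));
        down = sum(xw(1:end-d) > xw(d+1:end));
        B(d,w) = (up - down)/(up + down);
    end
end
imagesc(B); axis xy; colorbar            % columns = time (windows), rows = delay d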
3. Patterns of length 3.
Three equidistant values x_t, x_{t+d}, x_{t+2d} without ties can realize six order patterns. 213 denotes the case x_{t+d} < x_t < x_{t+2d}. For each pattern π and each d we determine the number n_π(d) of appearances in the same way as n_{12}(d). In case π = 312 we count all t = 1, ..., T − 2d with x_{t+d} < x_{t+2d} < x_t. Let S be the sum of the six numbers. Patterns with ties are not counted. Next, we compute the relative frequencies p_π(d) = n_π(d)/S. For white noise it is known that all p_π(d) are 1/6 [8].

As autocorrelation type functions, we now define certain sums and differences of the p_π. The function

τ(d) = p_{123}(d) + p_{321}(d) − 1/3

will be called persistence, since it measures whether an increase or decrease x_{t+d} − x_t persists when we go d time steps ahead. The largest possible value of τ(d) is 2/3, assumed for monotone time series. The minimal value is −1/3. The constant 1/3 was chosen so that white noise has persistence zero. The letter τ indicates that this is one way to transfer Kendall's tau to an autocorrelation function. Another version was studied in [22].

Similar to the classical autocorrelation function ρ, a period of length L in a signal is indicated by minima of τ at d = L/2, 3L/2, 5L/2, ... For these d we have x_t ≈ x_{t+2d}, so that the patterns 123 and 321 are rare. Near to d = L, 2L, 3L, ... the function τ is large, as is ρ. For a noisy sequence, however, a local minimum appears directly at d = L, 2L, ..., since patterns tend to have equal probability there. The larger the noise, the deeper the bump. In Figure 3b, τ at d = 12 and d = 24 shows exactly this appearance and proves the existence of a 24 hour rhythm better than ρ.

Beside τ and β we define two other autocorrelation type functions. For convenience, we drop the argument d. The function

γ = p_{213} + p_{231} − p_{132} − p_{312}

is a measure of time irreversibility of the process, and

δ = p_{132} + p_{213} − p_{231} − p_{312}

describes up-down scaling, since it approximately fulfils δ(d) = β(2d) − β(d) (Methods 3).
Figure 3.
Autocorrelation ρ and ordinal functions ∆, τ, β, γ, δ for a: the almost periodic series of water levels at Los Angeles (September 1997-2014, data from [20]), b: the noisy series of hourly particulate values at nearby San Bernardino 2000-2011 from [21], with weak daily rhythm, see Figure 15. The functions on the left are about three times larger, for ∆ nine times, but fluctuations are of the same size.

Like β, the functions γ and δ measure certain symmetry breaks in the distribution of the time series. They all have strong invariance properties (cf. Figure 9): When a nonlinear monotonous transformation is applied to the data - many sensors measure proxy effects which are monotonously, but not linearly related to the target variable - all ordinal functions remain unchanged. They are not influenced by low frequency components with wavelength much larger than d, which often appear as artefacts in data. Since τ, β, γ, and δ are defined by assertions like X_{t+d} − X_t > 0, they do not require full stationarity of the underlying process. Stationary increments suffice - Brownian motion for instance has τ(d) = 1/6 for all d [17]. Finally, ordinal functions are not influenced much by a few outliers. One wrong value, no matter how large, can change each count n_π(d) by at most 3.
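The invariance under monotone transformations is easy to check numerically. The following MATLAB fragment (mine, not the author's; the transformation exp(x/2) is just one example of a strictly increasing map) verifies that the pattern symbols of Methods 1 do not change:

% order patterns are unchanged by a monotone transformation of the values
x = cumsum(randn(1,5000)); d = 10;
y = exp(x/2);                            % nonlinear but strictly increasing
s  = 2*((x(1:end-2*d) > x(d+1:end-d)) + (x(1:end-2*d) > x(2*d+1:end))) ...
       + (x(d+1:end-d) > x(2*d+1:end));
sy = 2*((y(1:end-2*d) > y(d+1:end-d)) + (y(1:end-2*d) > y(2*d+1:end))) ...
       + (y(d+1:end-d) > y(2*d+1:end));
isequal(s, sy)                           % true: identical pattern symbols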
4. Permutation entropy.

Permutation entropy is the Shannon entropy of the distribution of order patterns:

H = − Σ_π p_π log p_π.

H can be defined for any level n, using the vectors (x_t, x_{t+d}, ..., x_{t+(n−1)d}) and their n! order patterns π [8]. In practice we hardly go beyond n = 7. Used as a measure of complexity and disorder, H can be calculated for time series of less than a thousand values, since statistical inaccuracies of the p_π are smoothed out by averaging.
Applications include EEG data [12, 13, 18, 14], optical experiments [2, 1, 11], river flow data [19], control of rotating machines [15, 16], economic applications (cf. [9]), and the theory of dynamical systems [23, 24]. Recent surveys on H are [9, 10]. As a measure of disorder, H assumes its maximum log n! for white noise. D = log n! − H is called divergence or Kullback-Leibler distance to the uniform distribution p_π = 1/n! of white noise.

In this note all functions, including autocorrelation, measure the distance of the data from white noise. For this reason, we take divergence rather than entropy, and we replace −p_π log p_π by p_π². As before, we drop the argument d. The function

∆ = Σ_π (p_π − 1/6)² = Σ_π p_π² − 1/6,

where the sum runs over the six order patterns π of length 3, will be called the distance of the data from white noise. Of course, we can define ∆ for any length n, replacing 1/6 by 1/n!. More precisely, ∆ is the squared Euclidean distance between the observed order pattern distribution and the order pattern distribution of white noise. Considering white noise as complete disorder, ∆ measures the amount of rule and order in the data. The minimal value 0 is obtained for white noise, and the maximum 5/6 for monotone time series.

From a practical viewpoint, ∆ is just a rescaling of H, related to the quadratic Taylor approximation of H around the white noise parameters p_π = 1/6: H ≈ log 6 − 3∆. For our data these two functions can hardly be distinguished by eyesight.
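The relation between ∆ and H can be checked with a few lines of MATLAB; the frequencies below are hypothetical numbers, not taken from any dataset:

% Delta, permutation entropy H, and the Taylor approximation H = log 6 - 3*Delta
q = [0.21 0.15 0.17 0.16 0.14 0.17];     % six pattern frequencies, sum = 1
Delta = sum((q - 1/6).^2);               % squared distance to white noise
H = -sum(q .* log(q));                   % permutation entropy of order 3
[H, log(6) - 3*Delta]                    % the two values nearly coincide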
5. Partition of the distance to white noise.
A Pythagoras type formula combines ∆ with the ordinal functions:

4∆ = 3τ² + 2β² + γ² + δ².

This holds for each d = 1, 2, ... The equation is exact for random processes with stationary increments as well as for cyclic time series. The latter means that we calculate p_π(d) from the cyclically extended series (x_1, x_2, ..., x_T, x_1, x_2, ..., x_{2d}), where t runs from 1 to T. For real data we go only to T − 2d and have a boundary effect which causes the equation to be only approximately fulfilled (Methods 4). The difference is smaller than 1% in most of our data, see Figures 5, 11, and 16.

This partition is related to orthogonal contrasts in the analysis of variance. When ∆(d) is significantly different from zero, we can define new functions of d:

τ̃ = 3τ²/(4∆), β̃ = 2β²/(4∆), γ̃ = γ²/(4∆), δ̃ = δ²/(4∆).

By taking squares, we lose the sign of the values, but we gain a natural scale. τ̃, β̃, γ̃ and δ̃ lie between 0 and 1 = 100%, and they sum up to 1. For each d, they describe the percentage of order in the data which is due to the corresponding difference of patterns. Figure 1 shows τ̃ for one-minute non-overlapping sliding windows as color code on vertical lines.
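A minimal sketch of the computation of these percentages, assuming the six frequencies are given in the order 123, 132, 213, 231, 312, 321 of the code in Methods 1 (the frequencies themselves are made up):

% percentages tilde tau, tilde beta, tilde gamma, tilde delta
q = [0.21 0.15 0.17 0.16 0.14 0.17];
tau   = q(1) + q(6) - 1/3;               % persistence
beta  = q(1) - q(6);                     % up-down balance
gam   = q(3) + q(4) - q(2) - q(5);       % time irreversibility
del   = q(2) + q(3) - q(4) - q(5);       % up-down scaling
Delta = sum((q - 1/6).^2);
parts = [3*tau^2, 2*beta^2, gam^2, del^2] / (4*Delta)
sum(parts)                               % close to 1; exact when the boundary term vanishes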
For Gaussian and elliptically symmetric processes, the functions β, γ, and δ are all zero, and τ̃ is 1, for every d [17, Section 5]. An image of τ̃, as in Figure 1, shows to which extent the data come from a Gaussian process and where the symmetry is broken. We get one percentage for each time and delay, 300000 values in Figure 1, and an overall average. As a rule of thumb, we exclude points for which ∆(d) lies below a bound of order 1/n, where n is the window length (Methods 2). This bird's-eye view method is a big data counterpart of rigorous tests for elliptical symmetry based on the covariance matrix [25, 26, 27]. For ARMA processes, the average τ̃ is above 97%, as it should be, for EEG data about 80%, for music above 50%, and for heart data only 30-50%. The functions β and δ, and less frequently γ, can represent the main part of ∆, but usually only for special values of d. Details are discussed in the supplement.
Figure 4.
Sleep stages of healthy subject n3 from [28]. a: 10 seconds of data from EEG channel Fp2-F4, ECG and plethysmogram. b: Expert annotation of sleep depth from [28] agrees with the distance ∆. c: Function β̃ of the plethysmogram describes changes including REM.
6. Sleep data.
As an application, we study data from the CAP sleep database by Terzano et al. [28], available at physionet [3]. Sleep stages S1-S4 and R for REM sleep were already annotated by experts, mainly using the EEG channel Fp2-F4 and the oculogram. Figure 4 demonstrates that permutation entropy ∆ of that EEG channel, averaged over d = 2, ..., 20, gives an almost identical estimate of sleep depth. In [13, 29] permutation entropy was already recommended as an indicator of sleep stages. We verified this coincidence in all data of the database with a normal EEG, which seems magic since ∆ was introduced as a simple quantity, without any regard to sleep data. For 512 Hz sampling frequency, d = 2, ..., 20 corresponds to the time scale between 4 and 40 ms. Thus our ∆ is a measure of smoothness of data at small scales which increases when high-frequency oscillations disappear. This seems a complementary viewpoint to the classical R&K rules for sleep stage classification and their recent modification [30], which refer to the appearance of low-frequency waves and patterns.
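The sleep-depth estimate amounts to a small extension of the sliding-window computation. The following MATLAB sketch is my illustration, with simulated data in place of the EEG channel; it averages ∆(d) over d = 2, ..., 20 for 30 second windows:

% mean Delta over d = 2..20 per 30 s window, as a sleep depth curve
fs = 512; n = 30*fs;
eeg = cumsum(randn(1, 120*n));           % stand-in for one EEG channel
nwin = floor(length(eeg)/n); depth = zeros(1,nwin);
for w = 1:nwin
    xw = eeg((w-1)*n+1 : w*n); D = zeros(1,19);
    for d = 2:20
        y0 = xw(1:end-2*d); y1 = xw(d+1:end-d); y2 = xw(2*d+1:end);
        s = 2*((y0>y1)+(y0>y2)) + (y1>y2);
        q = zeros(1,6); for k = 0:5, q(k+1) = mean(s==k); end
        D(d-1) = sum((q-1/6).^2);
    end
    depth(w) = mean(D);
end
plot(depth)                              % sleep-depth estimate, one value per 30 s window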
REM phases are more difficult to detect. The oculogram was used for annotation. Figure 4c presents β̃ for the plethysmogram, measured with an optical sensor at the fingertip. We can see all interruptions of sleep and a lot of breakpoints which coincide with changes detected by the annotator. The REM phases are characterized by high values of β̃(d), as well as an increased heart rate and a strongly increased heart rate variability: the characteristic wavelength d goes down. These variations influence ordinal functions so strongly that changes become more apparent than in a plot of heart rate.

Oximeters measuring the plethysmogram are cheap and easy to apply, in bed at home and possibly in daily life. Many similar devices - wristbands, shirts, smartphone apps - are currently being developed. Our ordinal functions can evaluate and visualize their high-resolution data.
7. Conclusion.
On the basis of order patterns, error-resistant autocorrelation functions for big data were introduced. For patterns of length 3, the distance to white noise was divided into four interesting components. Various extensions seem possible. To develop a coherent theory of ordinal time series remains a challenge, despite groundbreaking work by Marc Hallin and co-authors [31, 22, 25] on rank statistics. On the practical side, a spectrogram-like visualization was introduced which allows one to see at a glance the course of heart and respiration data over 24 hours (Figures 1, 4). Simple as it is, this technique may apply to data of sensors on satellites, weather stations, factory chimneys etc., as well as to experiments on the nanoscale [1, 11] which may lay the ground for future generations of sensors. In any application, the methods presented here have to be modified and combined with established techniques, specific machine-learning and imaging tricks. Mathematical extraction of information must keep pace with the revolution in sensor and computer technology: this seems an emerging field of study.
References

[1] A. Aragoneses, N. Rubido, J. Tiana-Alsina, M.C. Torrent & C. Masoller, Distinguishing signatures of determinism and stochasticity in spiking complex systems, Scientific Reports 3, Article 1778 (2012)
[2] M.C. Soriano, L. Zunino, O.A. Rosso, I. Fischer & C.R. Mirasso, Time scales of a chaotic semiconductor laser with optical feedback under the lens of a permutation information analysis, IEEE J. Quantum Electronics 47(2), 252-261 (2011)
[3] A.L. Goldberger et al., PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals, Circulation 101(23), e215-e220 (2000). Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215. Data at: physionet.org/physiobank/database/capslpdb
[4] GFZ Seismological Data Archive: geofon.gfz-potsdam.de/waveform
[5] P.J. Brockwell & R.A. Davis, Time Series: Theory and Methods, 2nd ed., Springer, New York 1991
[6] R.H. Shumway & D.S. Stoffer, Time Series Analysis and Its Applications, 2nd ed., Springer, New York 2006
[7] A. Davis, M. Rubinstein, N. Wadhwa, G.J. Mysore, F. Durand & W.T. Freeman, The visual microphone: passive recovery of sound from video, SIGGRAPH 2014
[8] C. Bandt & B. Pompe, Permutation entropy: a natural complexity measure for time series, Phys. Rev. Lett. 88, 174102 (2002)
[9] M. Zanin, L. Zunino, O.A. Rosso & D. Papo, Permutation entropy and its main biomedical and econophysics applications: a review, Entropy 14, 1553-1577 (2012)
[10] J. Amigo, K. Keller & J. Kurths (eds.), Recent progress in symbolic dynamics and permutation entropy, Eur. Phys. J. Special Topics 222 (2013)
[11] J.P. Toomey & D.M. Kane, Mapping the dynamical complexity of a semiconductor laser with optical feedback using permutation entropy, Optics Express 22(2), 1713-1725 (2014)
[12] M. Staniek & K. Lehnertz, Symbolic transfer entropy, Phys. Rev. Lett. 100, 158101 (2008)
[13] N. Nicolaou & J. Georgiou, The use of permutation entropy to characterize sleep encephalograms, Clin. EEG Neurosci. 42, 24 (2011)
[14] E. Ferlazzo et al., Permutation entropy of scalp EEG: a tool to investigate epilepsies, Clinical Neurophysiology 125, 13-20 (2014)
[15] U. Nair, B.M. Krishna, V.N.N. Namboothiri & V.P.N. Nampoori, Permutation entropy based real-time chatter detection using audio signal in turning process, Int. J. Adv. Manuf. Technol. 46, 61-68 (2010)
[16] R. Yan, Y. Liu & R.X. Gao, Permutation entropy: a nonlinear statistical measure for status characterization of rotary machines, Mechan. Syst. Signal Processing 29, 474-484 (2012)
[17] C. Bandt & F. Shiha, Order patterns in time series, J. Time Series Analysis 28, 646-665 (2007)
[18] G. Ouyang, C. Dang, D.A. Richards & X. Li, Ordinal pattern based similarity analysis for EEG recordings, Clinical Neurophysiology 121, 694-703 (2010)
[19] H. Lange, O.A. Rosso & M. Hauhs, Ordinal pattern and statistical complexity analysis of daily stream flow time series, Eur. Phys. J. Special Topics 222, 535-552 (2013)
[20] The National Water Level Observation Network
[21] California Environmental Protection Agency, Air Resources Board
[22] T.S. Ferguson, C. Genest & M. Hallin, Kendall's tau for serial dependence, Canadian J. Stat. 28, 587-604 (2000)
[23] J. Amigo, Permutation complexity in dynamical systems, Springer 2010
[24] A.M. Unakafov & K. Keller, Conditional entropy of ordinal patterns, Physica D 269, 94-102 (2014)
[25] M. Hallin & D. Paindaveine, Semiparametrically efficient rank-based inference for shape I, Ann. Statistics 34, 2707-2756 (2008)
[26] L. Sakharenko, Testing for ellipsoidal symmetry: a comparison study, Computational Stat. Data Analysis 53, 565-581 (2008)
[27] A. Onatski, M.J. Moreira & M. Hallin, Asymptotic power of sphericity tests for high-dimensional data, Ann. Statistics 41, 1204-1231 (2013)
[28] M.G. Terzano et al., Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep, Sleep Med. 2(6), 537-553 (2001)
[29] C.-E. Kuo & S.-F. Liang, Automatic stage scoring of single-channel sleep EEG based on multiscale permutation entropy, Biomedical Circuits and Systems Conference (BioCAS), IEEE 2011, 448-451
[30] C. Iber, S. Ancoli-Israel, A. Chesson & S.F. Quan, The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications, Westchester: American Academy of Sleep Medicine (2007)
[31] M. Hallin & M.L. Puri, Aligned rank tests for linear models with autocorrelated error terms, J. Multivariate Analysis 50, 175-237 (1994)
Methods
1. The program code.
A few lines of MATLAB code provide all our functions. x(1:T) denotes the given time series, d is between 1 and dmax.

y0=x(1:T-2*d); y1=x(d+1:T-d); y2=x(2*d+1:T);
s=2*((y0>y1)+(y0>y2))+(y1>y2);

s is a vector which contains the symbols 0, 1, ..., 5 for the order patterns of (x_t, x_{t+d}, x_{t+2d}), t = 1, ..., T-2*d. We correct for missing values NaN and ties by assigning symbols 6 to 11:

h=isnan(y0)|isnan(y1)|isnan(y2)|(y0==y1)|(y0==y2)|(y1==y2);
s=s+6*h;
Here | means 'or'. If patterns with equality are to be counted, this line has to be modified. We now use the histogram function with bin centers 0, ..., 11 to determine the frequencies of the six patterns.

p=hist(s,0:11)/(T-2*d-sum(h));   % relative frequencies of the symbols 0..11
q=p(1:6);                        % the six valid pattern frequencies

This will determine all our functions, for example:

beta=p(1)-p(6);                  % up-down balance
tau=p(1)+p(6)-1/3;               % persistence
h2=(q-1/6)*(q-1/6)';             % Delta, squared distance to white noise
H=-log(q)*q';                    % permutation entropy
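The two remaining functions and the percentages of Section 5 can be obtained from the same vector. The following continuation is my addition to the listing, not part of the original code; gam and del were chosen as names to avoid MATLAB's built-in gamma:

gam=p(3)+p(4)-p(2)-p(5);                       % time irreversibility gamma
del=p(2)+p(3)-p(4)-p(5);                       % up-down scaling delta
parts=[3*tau^2,2*beta^2,gam^2,del^2]/(4*h2);   % tilde tau, beta, gamma, delta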
On a PC, the algorithm obtains Figure 3 within a few seconds and Figure 1 within a few minutes. In some papers [10], permutation entropy of scale d uses patterns for sums of d consecutive terms of the x_t, which makes sense if the x_t represent a density function, like precipitation or workload on a server. This version is easily implemented by cumulative sums, adding x=cumsum(x) as the first line of the program.
2. Statistical accuracy.
We consider a fixed d and windows of length n. Then p_{12}(d) is a random average of n terms 1, 0 which say whether x_t < x_{t+d} is true or false. According to the binomial model, the standard deviations of p_{12}(d) and β(d) = 2p_{12}(d) − 1 are 1/(2√n) and 1/√n, respectively. In reality, the variation is bigger because the terms in the sum are correlated. The correlations between the differences Y_t = X_{t+d} − X_t are relevant. As a simple model, we take the sum of n/k independent terms 1, 0, each of which is repeated k times. For β(d) this gives the standard deviation c/√n with c = √k. An estimate of c can only come from data; in our data it did not exceed 4. For τ the variation is a bit smaller, for γ and δ a bit larger. Note that correlations can increase with d. The partition formula and some assumptions give a rough estimate of order 1/n for the standard deviation of ∆. Twice that value is taken as a kind of confidence bound for the hypothesis ∆ = 0. This is just a guideline: for the data in the supplement it works well.
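A possible way to estimate c from a stationary segment is to compute β(d) over many non-overlapping windows and rescale the empirical standard deviation. The following MATLAB sketch uses simulated data in place of a real segment; window length, delay and the number of windows are arbitrary choices:

% empirical constant c in the error model sigma = c/sqrt(n)
n = 5000; d = 20; nwin = 100;
xs = cumsum(randn(1, nwin*n));           % stand-in for a stationary segment
b = zeros(1,nwin);
for w = 1:nwin
    xw = xs((w-1)*n+1 : w*n);
    up   = sum(xw(1:end-d) < xw(d+1:end));
    down = sum(xw(1:end-d) > xw(d+1:end));
    b(w) = (up - down)/(up + down);
end
c = std(b)*sqrt(n)                       % compare with the binomial value c = 1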
3. Identities for pattern frequencies and boundary effect.
We fix d and consider the p_π for the whole time series. The equation

p_{12} = p_{123} + p_{132} + p_{231}

holds since on the right we count all t = 1, ..., T − 2d with x_t < x_{t+d}. Similarly,

p_{12} = p_{123} + p_{213} + p_{312},

because on the right we count all t = d + 1, ..., T − d with x_t < x_{t+d}. Strictly speaking, both relations are not exactly correct, since the count for p_{12} is over t = 1, ..., T − d. The largest possible difference is d/(T − d) when we exclude ties for simplicity. This upper bound is negligible for large T and comparably small d, and random fluctuations usually make the difference much smaller than the upper bound. Taking the difference of the two equations we see that the quantity

ε = p_{132} + p_{231} − p_{213} − p_{312}

equals zero, with the same degree of accuracy. In other words, the numbers of local maxima and of local minima in a time series coincide. Adding both identities for p_{12} and subtracting 1 = Σ_π p_π, we obtain

β = p_{123} − p_{321},

which has been used throughout the paper to calculate β, see the code above. Another relation of this type is p_{12}(2d) = p_{123}(d) + p_{132}(d) + p_{213}(d). As a consequence, we get

β(2d) = 2p_{12}(2d) − 1 = β(d) + δ(d),

which says that the functions β and δ are tightly connected. For the mathematical model of processes with stationary increments, as well as for the cyclic time series mentioned in Section 5, all these equations are exact [17].
4. Proof of the ∆ partition formula.

As in the code, we set q_1 = p_{123} − 1/6, q_2 = p_{132} − 1/6, q_3 = p_{213} − 1/6, q_4 = p_{231} − 1/6, q_5 = p_{312} − 1/6, q_6 = p_{321} − 1/6. By high-school algebra,

4 Σ q_k² = 2(q_1 + q_6)² + 2(q_1 − q_6)² + (q_2 + q_3 + q_4 + q_5)² + (q_2 − q_3 − q_4 + q_5)² + (q_2 + q_3 − q_4 − q_5)² + (q_2 − q_3 + q_4 − q_5)².

Since Σ q_k = 0, the third square on the right is the same as the first. Using the definitions of ∆, τ, γ, δ and the identities above for β and ε, we get

4∆ = 3τ² + 2β² + δ² + γ² + ε²,

which implies the formula because ε = 0. The equation is exact for cyclic time series, and for random processes with stationary increments. For ordinary time series there is a small error, as discussed above, and we must check the accuracy of the equation from the data, see Figures 5, 11, and 16 where differences are below 1%.
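The algebraic identity can be confirmed numerically for arbitrary frequencies. In the following MATLAB sketch eps0 denotes the quantity ε above; the random p is not meant to resemble real data:

% check of 4*Delta = 3*tau^2 + 2*beta^2 + gamma^2 + delta^2 + eps^2
p = rand(1,6); p = p/sum(p);             % arbitrary pattern distribution
q = p - 1/6;
tau = q(1)+q(6);           beta = q(1)-q(6);
gam = q(3)+q(4)-q(2)-q(5); del = q(2)+q(3)-q(4)-q(5);
eps0 = q(2)+q(4)-q(3)-q(5);
4*sum(q.^2) - (3*tau^2 + 2*beta^2 + gam^2 + del^2 + eps0^2)   % zero up to rounding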
5. Contents of supplement.
We give numerical and visual evidence for our partition formula and check our bound for small ∆. We explore the potential of our methods for various applications - medical data, speech, weather and ecological data - and test how far we can decrease the window size. Autocorrelation and persistence are compared in Figure 9 for a simple AR2-process, in Figure 10 for a song of The Beatles, and in Figure 17 for a laser experiment of Soriano et al. [2]. Further details of ordinal autocorrelation functions are discussed.
Acknowledgement.
I am grateful to Bernd Pompe, Marcus Vollmer, Luciano Zunino, Mathias Bandt, Petra Gummelt and Helena Peña for suggestions and comments on drafts of this paper.

Institute of Mathematics, University of Greifswald, 17487 Greifswald, Germany
[email protected]
Supplement
1. Overview
We start with a list of our datasets. The essential parameters are window length and mean ∆, which quantifies the amount of structure in the data. The latter is not so easy to fix since it depends very much on d. For d near zero, autocorrelation, persistence and hence also ∆ assume very large values.

subject / data    window          range of d    mean ∆    motivation / application
plethysmogram     3840 / 30s      0.3-1.5s      0.06      proof of partition formula
ECG               15360 / 30s     0.3-1.5s      0.04      visualize long-term data
EEG               15360 / 30s     0.25-1.5s     0.002     screening for structure
                                  0-0.25s       0.013     formula for sleep scoring
AR2 process       2000            1-100         0.01      theoretical model
music/speech      2205 / 50ms     0-7ms         0.024     speech analysis/synthesis
temperature       720 / 1 month   1-50h         0.03      exploratory data analysis
particulates      1200 / 50 days  1-72h         0.003     notoriously noisy data
tides             1242 / 5 days   10.5-14.5h    0.10      visualization of dynamics
laser data        3520 / 88ns     741-800       0.01      detection of period

Table 1. Data studied in this supplement.
2. Heart and brain data
Figure 5 verifies the partition formula for heart data, and supports our 'confidence bound' for ∆ = 0 (Methods 2). The vast majority of excluded places comes from artefacts, and it would make no sense to exclude still more places since the partition equation holds true with such a small error. We note that in millions of places in Figure 5 and similar data, τ̃ + β̃ + γ̃ + δ̃ was always smaller than 100% plus a tiny rounding error. This inequality seems to be generally true.

Processes with small average ∆ are more difficult to handle. Figure 6 concerns the EEG channel Fp2-F4 of the same record, n2 of [28], sampled with 512 Hz. For d = 0.25, ..., 1.5 s the average ∆ is only 0.0017, and 47% of the places have small ∆. They are spread rather uniformly, so there is little chance to get information from this range of d. We concentrate on d ≤ 0.25 s with average ∆ about 0.013 and only 10% of places below the bound. There seems some structure connected with sleep annotations. We have chosen the white field near the line d = 0 when we average ∆ over d = 2, ..., 20.
Figure 5.
Components of ∆ for a the plethysmogram and b the ECG of control n2 of [28], with d from 0.3 to 1.5 s. a from above: τ̃ with overall percentage 46%, β̃ with 25%, γ̃ with 3% (colors scaled by factor 3), and δ̃ with 26%. The remainder is only 0.07% in average. Places with ∆ below the bound were not included in the average. They account for 7% of all places and are marked black in the picture. The vast majority of black spots is caused by artefacts, movements of the patient. In b we have τ̃ with average 31%, β̃ with 35%, γ̃ with 5% (colors scaled by factor 3), δ̃ with 29%, and 0.34% remaining error. The percentage of black spots was 5%. Can you guess REM phases? Solution in Figure 7.
Figure 6.
Places with ∆ below our bound ≈ 0.001 in the EEG record of n2 are marked black. a. For d = 0.25 ... 1.5 s no structure can be seen. b. For d ≤ 0.25 s the light places are related to stages of deep sleep in the figure below.

Figure 7.
Coincidence of mean ∆ of EEG and annotation of sleep stages, for three controls from [28] and one patient suffering from insomnia, with smaller ∆. Data were selected by record structure, not by the coincidence! There are commercial systems for automatic sleep scoring. But ∆ is such a simple, transparent quantity that this coincidence is worthwhile to study. The mean was taken over d = 2, ..., 20, that is, 4 to 40 ms for 512 Hz records.
3. A Gaussian model process
Figure 8.
Places with ∆ below our bound for samples of the AR2 process. For large d they are everywhere, for small d they become concentrated on lines where τ(d) ≈ 0.

AR2 processes belong to the simplest stationary Gaussian processes. Autocorrelation and persistence can be determined analytically [6, 17]. Here we use the same procedures as for our data, to find out whether τ̃ will really be 100%. We take an AR2 process X_t = a_1 X_{t-1} - a_2 X_{t-2} + W_t with Gaussian noise W_t, whose autocorrelation oscillates and does not decrease too rapidly (Figure 9). Dependencies for d > 100 are very small, so we consider d from 1 to 50 or 100. Figure 8 shows that places with small ∆ form horizontal stripes where τ(d) ≈ 0. We determine τ̃ on all places and on large places for several thousand samples of two sizes n.

window      delays    ∆ below bound    τ̃, all    τ̃, corrected
n = 10000   d ≤ 100   49%              86%        98.4%
            d ≤ 50    20%              95%        98.9%
n = 2000    d ≤ 100   76%              73%        97.7%
            d ≤ 50    55%              87%        97.9%

Table 2. The partition formula is not useful for processes with small ∆.

Increasing our bound would only slightly improve τ̃, at the cost of excluding many other places. Thus our partition formula seems not so useful for small average ∆, in this case about 0.01. Here τ is better to work with than τ̃.

Remark.
At this point, let us briefly explain the type of symmetry given when ordinal functions are zero. We take a random process (X_t) with stationary increments, fix a delay d and consider the two-dimensional distribution of Z_1 = X_{t+d} − X_t and Z_2 = X_{t+2d} − X_{t+d}. The frequencies of order patterns are probabilities of sectors in the z_1 z_2-plane:

p_{123}(d) = P(Z_1 > 0, Z_2 > 0), p_{321}(d) = P(Z_1 < 0, Z_2 < 0)

and

p_{132}(d) = P(Z_1 + Z_2 > 0, Z_2 < 0), p_{213}(d) = P(Z_1 < 0, Z_1 + Z_2 > 0),
p_{231}(d) = P(Z_1 > 0, Z_1 + Z_2 < 0), p_{312}(d) = P(Z_1 + Z_2 < 0, Z_2 > 0).

When (Z_1, Z_2) has a density function f which is symmetric to the origin, i.e. f(z_1, z_2) = f(−z_1, −z_2), then β(d) = δ(d) = 0. If f is symmetric to the line z_1 + z_2 = 0, i.e. f(z_1, z_2) = f(−z_2, −z_1), then γ(d) = β(d) = δ(d) = 0. Ordinal functions measure deviations from this kind of symmetry.
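This symmetry can be observed in a simulation. The MATLAB fragment below uses the illustrative coefficients 1.3 and -0.7 (a stable AR2 with oscillating autocorrelation, not necessarily the parameters used for the figures) and shows that β, γ, δ are close to zero while τ̃ is close to 100%:

% ordinal functions of a Gaussian AR2 sample
T = 100000; W = randn(1,T); X = zeros(1,T);
for t = 3:T, X(t) = 1.3*X(t-1) - 0.7*X(t-2) + W(t); end
d = 10;
y0 = X(1:T-2*d); y1 = X(d+1:T-d); y2 = X(2*d+1:T);
s = 2*((y0>y1)+(y0>y2)) + (y1>y2);
q = zeros(1,6); for k = 0:5, q(k+1) = mean(s==k); end
tau = q(1)+q(6)-1/3;        beta = q(1)-q(6);
gam = q(3)+q(4)-q(2)-q(5);  del = q(2)+q(3)-q(4)-q(5);
Delta = sum((q-1/6).^2);
[beta gam del]                            % all close to zero
3*tau^2/(4*Delta)                         % tilde tau, close to 1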
In the figure below, we study how autocorrelation and persistence of the above AR2 process will change when we add various disturbances to the data. While autocorrelation works better when data are perturbed by white noise, persistence performs better for other perturbations which occur in practice. In particular, data from a sensor with nonlinear characteristics need not be calibrated when we use ordinal functions. We note that for cases c, d, and e in Figure 9, the values of τ̃ from Table 2 remain above 97%.

Figure 9. Autocorrelation and persistence for an AR2 process with additive disturbances. In each case 10 samples of 2000 points were processed, to determine statistical variation. On the left, 300 points of the respective time series are sketched. a. Original signal. b. Additive Gaussian white noise with signal-to-noise ratio 1 divides autocorrelation by 2 and makes persistence very flat. c. 1% outliers with average amplitude 20 times σ of the original signal will harm the autocorrelation but not persistence. d. An added low-frequency sine component. e. A monotonous exponential transformation applied to the values: autocorrelation is distorted, persistence unchanged.
4. Speech and music
For music data, the autocorrelogram sometimes shows the melody better thanthe spectrogram. Persistence behaves similarly.
Figure 10.
The first 12 seconds of the song "Hey Jude" of The Beatles in sliding window analysis. Autocorrelation and persistence were calculated for overlapping windows of 0.25 seconds length, n = 11025. The reversed y-axis helps recognize the melody, d = 49, ... The panel of τ̃ at the bottom has quite a number of blue spots where β̃ and δ̃ yield the main contribution to ∆.

If we like to recognize not only the melody, but also the text, we have to work with short windows, as shown in Figure 11. The ordinal structure of speech seems an interesting object for its own sake, as well as for applications. In Figure 12, places with small ∆ indicate unvoiced sounds when they are arranged in a vertical pattern. When they form horizontal patterns, which are somewhat wavy in case of a song, they indicate lines with τ(d) ≈ 0. The vowels show characteristic β, δ, and γ-shapes, as shown in Figure 13.

Figure 11.
The first 12 seconds of "Hey Jude" as partition of ∆ with τ̃, β̃, γ̃, and δ̃ (from above). A window of length 50 ms was shifted in steps of 10 ms. The y-axis is in normal orientation, d runs from 1 to 294, that is, 0.02 to 6.7 ms. The short window reveals a lot of detail structure. All vowels show clear β and δ-patterns, even γ is relevant at several places. The overall percentage is 57% for τ̃, 15% for β̃, 11% for γ̃, and 16% for δ̃. The mean of the rest, i.e. the error of the partition formula, is 0.77%, which is good for the short window. Places with small ∆ excluded from this calculation are discussed below.

Figure 12.
In the previous figure, 26% of the places (t, d) have ∆(d) below our bound ≈ 0.007, so that the signal does not significantly differ from white noise. They form the dark spots in b. For comparison, a shows the mean absolute amplitude of the signal. We have small ∆ for the two breaks with small amplitude, but much more pronounced for the unvoiced sounds, notably 's' in 'sad' and 'song'. In the voiced sounds, places with small ∆ appear for d with persistence zero.
Figure 13.
Detail from Figure 11. The vowels of 'Jude', 'don't', and the second syllable in 'better' provide stationary parts of the time series lasting for 0.3, 0.1, and 0.3 s. Only 20 ms of each signal are shown in the top panel. Ordinal autocorrelation functions were calculated for six resp. two disjoint windows of length 50 ms and drawn for one pitch period, which equals 4.5 ms for all three sounds.

The study of these patterns can improve methods of speech synthesis, speech analysis, and speaker authentication, but this is not an easy matter. There is a huge variety of β, δ, and γ shapes which have to be collected, classified and understood. And there are open questions. Is there an 'inverse transform' which reconstructs the time series from its ordinal autocorrelation functions? What are the connections between power spectrum, cepstrum and ordinal functions? Can a speaker reliably reproduce the β-shape? Can we hear the ordinal functions? If not, will the computer understand human speech better than humans themselves?
5. Weather, dust, and tides
We conclude with some everyday data. The German Weather Service (Climate and Environment, Climate Data) provides hourly values of earth temperature, at depth 5 cm, starting 1978. Can I learn something about my town? Figure 14 shows the last 2000 values of the year 2013. In autumn, there is still a daily rhythm, in winter this is rarely the case. We determine the persistence of the dataset with sliding windows of length 720, that is, one month. The effect of summer and winter can be seen, and we also see some irregularity: in the first years the data were sampled every 3 hours (which one can guess from the picture), then three times a day, and since 2001 every hour. For such plausibility checks even the short window is enough. However, there is no chance to see details in different months. The resolution of the data is not sufficient, even for this short window which already causes statistical inaccuracy.
Figure 14.
Hourly temperature of earth in Greifswald, Germany. a. The last 2000 values of 2013 show that the daily cycle is weak in winter. b. Persistence for d = 1, ..., 50 h shows irregularities in the data.

Hourly data of PM10 particulate measurements were already used in Figure 3. These are notoriously noisy data, even if we consider them on a logarithmic scale, as seen in the upper part of Figure 15. Particulates in the air are not uniformly distributed, they come in fractal clusters. Actually these data, together with the EEG, have the smallest ∆ values in our studies. In the dataset of San Bernardino from [21] (station 3215 Trona-Athol; the other station 3500 shows different characteristics) the mean ∆ is 0.003. For our window length 1200, there were 96% small ∆ values according to our bound. Moreover, there were 13% missing values. For screening the data, we omit all time points with missing values. This is not quite correct, but worth a trial, and the whole financial world works with disrupted time series. Persistence and up-down balance in Figure 15 show that in summer, but not in winter, there is a daily rhythm in the data. It is similar to the temperatures above, but not obvious in the data. So the curves in Figure 3b are an effect of summer only.
Figure 15.
Hourly PM10 measurements at San Bernardino, California [21]. The upper panel demonstrates the irregular character of the time series. Ignoring missing data, a persistence τ and b up-down balance β were determined with a sliding window of length 1200. Years are marked at June 1. There is a daily rhythm in summer which is hardly visible in the data or in autocorrelation, cf. Figure 3. Hourly resolution does not allow going further into detail.

Based on this impression we can now decide how to process the data further. It is not possible, however, to go into detail in Figure 15. With a window of 50 days we cannot see changes between weeks. The sensor measurements are done every few seconds, however, and only the hourly averages are kept. With data of finer resolution, say every 6 minutes, much more detail could be studied. This is demonstrated in Figure 16 for tidal data from Anchorage, providing one more verification for our partition formula. Of course the tides form a very regular process, with the highest ∆ of all our examples. The possibility to work with a window of 5 days instead of several weeks, and the resulting details in the figure, are due to the high resolution of the data, however. So this note should send a message to colleagues who design measuring equipment: Do not preselect data. Give users an option to study the wealth of all measured values and make their own choices.

Figure 16.
We used tidal data to show that there are interesting β-shapes for different sites, by averaging over a whole month. The main actors in the tides' play are moon and sun, however, and the data have a multi-scale structure, shown in the upper panel. It seems interesting that ordinal functions make monthly and yearly cycles more transparent than the series itself, its autocorrelation and its spectrum. The partition of ∆ for water levels from Anchorage was restricted here to the rather short time from January 2013 to September 2014, and to delays d between 10.5 and 14.5 hours, around half a lunar day. A window of 1242 values, 5 lunar days, was shifted in steps of 12.4 hours. There are only 2% of places with ∆ below our bound. If they are excluded, the average value of τ̃ is 76%, followed by β̃ with 15%, δ̃ with 6%, and γ̃ with 2%. The average error is 0.54%.
6. A fast laser experiment
The problem in this last section is rather simple: find the period of a very noisy rhythmic process with extremely short windows. We include this example because the data show what can be measured with current technology within parts of a millisecond.
Figure 17.
Data for this figure come from the laser experiment of Soriano et al. [2]. A laser with wavelength 1542 nm and 11.7 mA threshold current was excited just above the threshold with 12 mA. The sampling interval was 25 ps, which means that 240000 values were measured during 6 µs. The optical feedback was arranged so that the signal has a period of L = 1544 sampling steps. We try to find this period with a window of 2.3 periods, or 88 ns, just the length of the series shown in a, where periodicity can hardly be seen. In b and c we study autocorrelation and persistence around d = L. The vertical interruptions show low-frequency disturbances, the laser does not run smoothly. d and e show ρ and τ around d = L/2. While ρ is a bit better at the full wavelength, τ indicates the period more clearly at half the wavelength.