AUTOCORRELATION TYPE FUNCTIONS FOR BIG AND DIRTY DATA SERIES
CHRISTOPH BANDT
One form of big data are signals - time series of consecutive values. In physical experiments, billions of values can now be measured within a second [1, 2]. Signals of heart and brain in intensive care, as well as seismic waves, are measured with 100 up to 1000 Hz over hours, days or even years [3, 4]. A song of 3 minutes on CD comprises 16 million values. Music and seismic vibrations basically consist of harmonic oscillations for which classical tools like autocorrelation and spectrogram work well [5, 6, 7]. This note presents similar tools for all kinds of rhythmic processes, with non-linear distortion, artefacts, and outliers. Permutation entropy [8, 9, 10] has been used in physics [2, 11], medicine [12, 13, 14], and engineering [15, 16]. Now ordinal patterns [17, 18, 19] are studied in detail for big data. As a new version of permutation entropy, we define a distance to white noise consisting of four curious components. Applications to a variety of medical and sensor data are discussed.
Figure 1.
Nose breathing of a healthy volunteer in normal life. a: One minute of data without artefacts. b: Mean absolute flow for each 30 seconds of a 24 hours measurement. c: The new function τ̃(d) for each minute shows more structure and needs no calibration of the data.
Classical time series from quarterly reports of companies, monthly unemployment figures, daily statistics of accidents etc. consist of 20 up to a few thousand values which were determined with great care. Machine-generated data amount to millions and are not double-checked. Usually it is easy to let the machine run faster and longer, but processing goes the other way: data are smoothed out by downsampling. This note shows how high time resolution can be exploited with simple ordinal autocorrelation functions. To demonstrate the remarkable effect of these tools, they will be applied to raw data: no preprocessing, no filters, no detection of outliers, no removal of artefacts, and no polishing of the results.

Figure 1 comes from ongoing work with Achim Beule (Department ENT, Head and Neck Surgery, University of Greifswald) on respiration of healthy volunteers in everyday life. Sensors measuring air flow with sampling frequency 50 Hz were fixed in both nostrils, gently enough to ensure comfort for a 24 hours measurement. Mouth breathing was not controlled, so the signals contain lots of artefacts. Traditional analysis takes averages over 30 seconds to obtain a better signal. More information is found in a function τ̃(d) for each minute of the dirty signal. As explained below, τ̃ measures the percentage of "elliptical symmetry" of the signal, depending on time and the delay d which runs from 0.02 to 5 seconds. The collection of these functions, visualized like a spectrogram, shows phases of activity and sleep, various interruptions of sleep, inaccurate measurements around 8 am, and a little nap after 2 pm. The frequency of respiration can be read from the lower dark red stripe, which marks half of the wavelength, while the upper red and yellow stripe marks the full wavelength (4 seconds in sleep, less than 3 in daily life).
1. Up-down balance.
Only the order relation between the values of the time series x = (x_1, ..., x_T) will be used, not the values themselves. In the simplest case, the question is whether the time series goes more often upwards than downwards. For a delay d between 1 and T/2, let n_{12}(d) and n_{21}(d) denote the number of time points t for which x_t < x_{t+d} and x_t > x_{t+d}, respectively. We determine the relative frequencies of increase and decrease over d steps: p_{12}(d) = n_{12}(d)/(n_{12}(d) + n_{21}(d)) and p_{21}(d) = 1 − p_{12}(d). Ties are disregarded. The up-down balance

β(d) = p_{12}(d) − p_{21}(d) for d = 1, 2, ...

is a kind of autocorrelation function. It reflects the dependence structure of the underlying process and has nothing to do with the size of the data. For illustration we take water levels from the database of the National Ocean Service [20]. Since water tends to come fast and disappear more slowly, we could expect β(d) to be negative, at least for small d. This is mostly true for water levels from lakes, like Milwaukee in Figure 2. For sea stations, tidal influence makes the data almost periodic, with a period slightly smaller than 25 hours. The data have 6 minute intervals and we measure d in hours, writing d = 250 as d = 25 h. A strictly periodic time series with period L fulfils β(L − d) = −β(d), which in the case L = 25 h implies β(12.5 h) = 0, visible at all sea stations in Figure 2. Otherwise there are big differences: at Honolulu and Baltimore the water level is more likely to fall within the next few hours, at San Francisco it is more likely to increase, and at Anchorage there is a change at 6 hours.
Figure 2.
Water levels at 6 minute intervals from [20]. Original data shown for Sep 1-2, 2014. Functions β are given for September of the 18 years 1997-2014, and also for January in the case of Anchorage and San Francisco. d runs from 6 min to 27 hours.

Each station has its specific β-profile, almost unchanged during 18 years, which characterizes its coastal shape. β can also change with the season, but these differences are smaller than those between stations. Figure 2 indicates that β, as well as related functions below, can solve some basic problems of statistics: describe, distinguish, and classify objects. Cf. Figure 13.
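As an illustration of the definition, here is a minimal MATLAB sketch (my illustration, not the program used for the figures) which computes β(d) for a range of delays. It assumes a series x without missing values, simply ignores ties, and uses a random walk as stand-in data:

% up-down balance beta(d) = p12(d) - p21(d) for d = 1, ..., dmax
x = cumsum(randn(1,10000));              % example series (random walk)
dmax = 100; beta = zeros(1,dmax);
for d = 1:dmax
    up   = sum(x(1:end-d) < x(d+1:end)); % n12(d): increases over d steps
    down = sum(x(1:end-d) > x(d+1:end)); % n21(d): decreases over d steps
    beta(d) = (up - down)/(up + down);   % ties drop out of both counts
end
plot(1:dmax, beta)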
2. Sliding windows and statistical variation.
The time series is divided into pieces, so-called windows, and β is determined for every window. If we have some hundred windows, which may overlap, we can compose them in a map like Figure 1 or 4, where each time point corresponds to a window, the β-functions are represented by vertical lines, and the values β(d) are color coded. Such maps show whether the regime of the underlying process changes in time, and indicate transitions between different regimes: activity and rest, or sleep stages.

If there are no systematic changes in the appearance of the functions, we can consider the underlying process as stationary and estimate statistical fluctuations. Mathematical arguments say that the standard deviation of β(d) for fixed d is c/√n, where n is the window length and c some constant. In our data, c varied up to 4 (Methods 2). For the data of Figure 2, the underlying tidal process is certainly not stationary - but the influence of the moon was reduced by calculating β over one month, and the annual component was minimized by taking only September. We can roughly estimate σ, with deviations still depending on d.
For medical data, errors were obtained from stationary segments, e.g. deep sleep. Another error, of order d/n, is observed for very large delays d (Methods 2, 3). The error estimate σ ≈ c/√n can be used to determine the appropriate window length. To have the 2σ-radius of the 95% confidence interval for β(d) smaller than 0.01, we must take n of the order 10^5, and for Figures 2 and 3 n is indeed of this order of magnitude. Thus our method works only for big data.

A simple exact method checks whether β(d) significantly differs from zero, or whether the β-curves of two sites in Figure 2 significantly differ at some d. When windows do not overlap, the median test is applicable. If, for instance, β(20) has the same sign in all 11 non-overlapping windows, then β(20) for the underlying process cannot be zero, since the p-value for 11 heads in 11 coin tosses is less than 0.001. Thus, except for γ, most values of the functions in Figure 3 differ significantly from zero.
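A sliding-window map as in Figures 1 and 4 can be sketched along the following lines. This is an illustrative MATLAB fragment, not the original program; the window length n = 3000, the delay range dmax = 250 and the simulated data are arbitrary choices:

% beta(d) for non-overlapping windows, displayed as a color-coded map
x = cumsum(randn(1, 600000));            % stand-in for a long measured signal
n = 3000; dmax = 250;                    % window length and largest delay
nwin = floor(length(x)/n);
B = zeros(dmax, nwin);
for w = 1:nwin
    xw = x((w-1)*n+1 : w*n);
    for d = 1:dmax
        up   = sum(xw(1:end-d) < xw(d+1:end));
        down = sum(xw(1:end-d) > xw(d+1:end));
        B(d,w) = (up - down)/(up + down);
    end
end
imagesc(B); axis xy; colorbar            % columns = time (windows), rows = delay d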
3. Patterns of length 3.
Three equidistant values x_t, x_{t+d}, x_{t+2d} without ties can realize six order patterns. 213 denotes the case x_{t+d} < x_t < x_{t+2d}. For each pattern π and each d we determine the number n_π(d) of appearances in the same way as n_{12}(d). In case π = 312 we count all t = 1, ..., T − 2d with x_{t+d} < x_{t+2d} < x_t. Let S be the sum of the six numbers. Patterns with ties are not counted. Next, we compute the relative frequencies p_π(d) = n_π(d)/S. For white noise it is known that all p_π(d) are 1/6 [8].

As autocorrelation type functions, we now define certain sums and differences of the p_π. The function

τ(d) = p_{123}(d) + p_{321}(d) − 1/3

will be called persistence, since it measures whether an increase or decrease x_{t+d} − x_t persists when we go d time steps ahead. The largest possible value of τ(d) is 2/3, assumed for monotone time series. The minimal value is −1/3. The constant 1/3 was chosen so that white noise has persistence zero. The letter τ indicates that this is one way to transfer Kendall's tau to an autocorrelation function. Another version was studied in [22].

Similar to the classical autocorrelation function ρ, a period of length L in a signal is indicated by minima of τ at d = L/2, 3L/2, 5L/2, ... For these d we have x_t ≈ x_{t+2d}, so that the patterns 123 and 321 are rare. Near to d = L, 2L, 3L, ... the function τ is large, as is ρ. For a noisy sequence, however, a local minimum appears directly at d = L, 2L, ..., since patterns tend to have equal probability there. The larger the noise, the deeper the bump. In Figure 3b, τ at d = 12 and d = 24 shows exactly this appearance and proves the existence of a 24 hour rhythm better than ρ.

Beside τ and β we define two other autocorrelation type functions. For convenience, we drop the argument d. The function

γ = p_{213} + p_{231} − p_{132} − p_{312}

is a measure of time irreversibility of the process, and

δ = p_{132} + p_{213} − p_{231} − p_{312}

describes up-down scaling, since it approximately fulfils δ(d) = β(2d) − β(d) (Methods 3).
Figure 3.
Autocorrelation ρ and ordinal functions ∆, τ, β, γ, δ for a: the almost periodic series of water levels at Los Angeles (September 1997-2014, data from [20]), b: the noisy series of hourly particulate values at nearby San Bernardino 2000-2011 from [21], with weak daily rhythm, see Figure 15. The functions on the left are about three times larger, for ∆ nine times, but fluctuations are of the same size.

Like β, the functions γ and δ measure certain symmetry breaks in the distribution of the time series. They all have strong invariance properties (cf. Figure 9): When a nonlinear monotonous transformation is applied to the data - many sensors measure proxy effects which are monotonously, but not linearly related to the target variable - all ordinal functions remain unchanged. They are not influenced by low frequency components with wavelength much larger than d, which often appear as artefacts in data. Since τ, β, γ, and δ are defined by assertions like X_{t+d} − X_t > 0, they do not require full stationarity of the underlying process. Stationary increments suffice - Brownian motion for instance has τ(d) = 1/6 for all d [17]. Finally, ordinal functions are not influenced much by a few outliers. One wrong value, no matter how large, can change each count n_π(d) by at most 3.
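The invariance under monotone transformations is easy to check numerically. The following MATLAB fragment (mine, not the author's; the transformation exp(x/2) is just one example of a strictly increasing map) verifies that the pattern symbols of Methods 1 do not change:

% order patterns are unchanged by a monotone transformation of the values
x = cumsum(randn(1,5000)); d = 10;
y = exp(x/2);                            % nonlinear but strictly increasing
s  = 2*((x(1:end-2*d) > x(d+1:end-d)) + (x(1:end-2*d) > x(2*d+1:end))) ...
       + (x(d+1:end-d) > x(2*d+1:end));
sy = 2*((y(1:end-2*d) > y(d+1:end-d)) + (y(1:end-2*d) > y(2*d+1:end))) ...
       + (y(d+1:end-d) > y(2*d+1:end));
isequal(s, sy)                           % true: identical pattern symbols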
4. Permutation entropy.

Permutation entropy is the Shannon entropy of the distribution of order patterns:

H = − Σ_π p_π log p_π.

H can be defined for any level n, using the vectors (x_t, x_{t+d}, ..., x_{t+(n−1)d}) and their n! order patterns π [8]. In practice we hardly go beyond n = 7. Used as a measure of complexity and disorder, H can be calculated for time series of less than a thousand values, since statistical inaccuracies of the p_π are smoothed out by averaging.
Applications include EEG data [12, 13, 18, 14], optical experiments [2, 1, 11], river flow data [19], control of rotating machines [15, 16], economic applications (cf. [9]), and the theory of dynamical systems [23, 24]. Recent surveys on H are [9, 10]. As a measure of disorder, H assumes its maximum log n! for white noise. D = log n! − H is called divergence or Kullback-Leibler distance to the uniform distribution p_π = 1/n! of white noise.

In this note all functions, including autocorrelation, measure the distance of the data from white noise. For this reason, we take divergence rather than entropy, and we replace −p_π log p_π by p_π². As before, we drop the argument d. The function

∆ = Σ_π (p_π − 1/6)² = Σ_π p_π² − 1/6,

where the sum runs over the six order patterns π of length 3, will be called the distance of the data from white noise. Of course, we can define ∆ for any length n, replacing 1/6 by 1/n!. More precisely, ∆ is the squared Euclidean distance between the observed order pattern distribution and the order pattern distribution of white noise. Considering white noise as complete disorder, ∆ measures the amount of rule and order in the data. The minimal value 0 is obtained for white noise, and the maximum 5/6 for monotone time series.

From a practical viewpoint, ∆ is just a rescaling of H, related to the quadratic Taylor approximation of H around the white noise parameters p_π = 1/6: H ≈ log 6 − 3∆. For our data these two functions can hardly be distinguished by eyesight.
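The relation between ∆ and H can be checked with a few lines of MATLAB; the frequencies below are hypothetical numbers, not taken from any dataset:

% Delta, permutation entropy H, and the Taylor approximation H = log 6 - 3*Delta
q = [0.21 0.15 0.17 0.16 0.14 0.17];     % six pattern frequencies, sum = 1
Delta = sum((q - 1/6).^2);               % squared distance to white noise
H = -sum(q .* log(q));                   % permutation entropy of order 3
[H, log(6) - 3*Delta]                    % the two values nearly coincide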
5. Partition of the distance to white noise.
A Pythagoras type formula combines ∆ with the ordinal functions:

4∆ = 3τ² + 2β² + γ² + δ².

This holds for each d = 1, 2, ... The equation is exact for random processes with stationary increments as well as for cyclic time series. The latter means that we calculate p_π(d) from the cyclically extended series (x_1, x_2, ..., x_T, x_1, x_2, ..., x_{2d}), where t runs from 1 to T. For real data we go only to T − 2d and have a boundary effect which causes the equation to be only approximately fulfilled (Methods 4). The difference is smaller than 1% in most of our data, see Figures 5, 11, and 16.

This partition is related to orthogonal contrasts in the analysis of variance. When ∆(d) is significantly different from zero, we can define new functions of d:

τ̃ = 3τ²/(4∆), β̃ = 2β²/(4∆), γ̃ = γ²/(4∆), δ̃ = δ²/(4∆).

By taking squares, we lose the sign of the values, but we gain a natural scale. τ̃, β̃, γ̃ and δ̃ lie between 0 and 1 = 100%, and they sum up to 1. For each d, they describe the percentage of order in the data which is due to the corresponding difference of patterns. Figure 1 shows τ̃ for one-minute non-overlapping sliding windows as color code on vertical lines.
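A minimal sketch of the computation of these percentages, assuming the six frequencies are given in the order 123, 132, 213, 231, 312, 321 of the code in Methods 1 (the frequencies themselves are made up):

% percentages tilde tau, tilde beta, tilde gamma, tilde delta
q = [0.21 0.15 0.17 0.16 0.14 0.17];
tau   = q(1) + q(6) - 1/3;               % persistence
beta  = q(1) - q(6);                     % up-down balance
gam   = q(3) + q(4) - q(2) - q(5);       % time irreversibility
del   = q(2) + q(3) - q(4) - q(5);       % up-down scaling
Delta = sum((q - 1/6).^2);
parts = [3*tau^2, 2*beta^2, gam^2, del^2] / (4*Delta)
sum(parts)                               % close to 1; exact when the boundary term vanishes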
For Gaussian and elliptically symmetric processes, the functions β, γ, and δ are all zero, and τ̃ is 1, for every d [17, Section 5]. An image of τ̃, as in Figure 1, shows to which extent the data come from a Gaussian process and where the symmetry is broken. We get one percentage for each time and delay, 300000 values in Figure 1, and an overall average. As a rule of thumb, we exclude points for which ∆(d) lies below a bound of order 1/n, where n is the window length (Methods 2). This bird's-eye view method is a big data counterpart of rigorous tests for elliptical symmetry based on the covariance matrix [25, 26, 27]. For ARMA processes, the average τ̃ is above 97%, as it should be, for EEG data about 80%, for music above 50%, and for heart data only 30-50%. The functions β and δ, and less frequently γ, can represent the main part of ∆, but usually only for special values of d. Details are discussed in the supplement.
Figure 4.
Sleep stages of healthy subject n3 from [28]. a: 10 seconds of data from EEG channel Fp2-F4, ECG and plethysmogram. b: Expert annotation of sleep depth from [28] agrees with the distance ∆. c: Function β̃ of the plethysmogram describes changes including REM.
6. Sleep data.
As an application, we study data from the CAP sleep database by Terzano et al. [28], available at physionet [3]. Sleep stages S1-S4 and R for REM sleep were already annotated by experts, mainly using the EEG channel Fp2-F4 and the oculogram. Figure 4 demonstrates that permutation entropy ∆ of that EEG channel, averaged over d = 2, ..., 20, gives an almost identical estimate of sleep depth. In [13, 29] permutation entropy was already recommended as an indicator of sleep stages. We verified this coincidence in all data of the database with a normal EEG, which seems magic since ∆ was introduced as a simple quantity, without any regard to sleep data. For 512 Hz sampling frequency, d = 2, ..., 20 corresponds to the time scale between 4 and 40 ms. Thus our ∆ is a measure of smoothness of data at small scales which increases when high-frequency oscillations disappear. This seems a complementary viewpoint to the classical R&K rules for sleep stage classification and their recent modification [30], which refer to the appearance of low-frequency waves and patterns.
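The sleep-depth estimate amounts to a small extension of the sliding-window computation. The following MATLAB sketch is my illustration, with simulated data in place of the EEG channel; it averages ∆(d) over d = 2, ..., 20 for 30 second windows:

% mean Delta over d = 2..20 per 30 s window, as a sleep depth curve
fs = 512; n = 30*fs;
eeg = cumsum(randn(1, 120*n));           % stand-in for one EEG channel
nwin = floor(length(eeg)/n); depth = zeros(1,nwin);
for w = 1:nwin
    xw = eeg((w-1)*n+1 : w*n); D = zeros(1,19);
    for d = 2:20
        y0 = xw(1:end-2*d); y1 = xw(d+1:end-d); y2 = xw(2*d+1:end);
        s = 2*((y0>y1)+(y0>y2)) + (y1>y2);
        q = zeros(1,6); for k = 0:5, q(k+1) = mean(s==k); end
        D(d-1) = sum((q-1/6).^2);
    end
    depth(w) = mean(D);
end
plot(depth)                              % sleep-depth estimate, one value per 30 s window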
REM phases are more difficult to detect. The oculogram was used for annotation. Figure 4c presents β̃ for the plethysmogram, measured with an optical sensor at the fingertip. We can see all interruptions of sleep and a lot of breakpoints which coincide with changes detected by the annotator. The REM phases are characterized by high values of β̃(d), as well as an increased heart rate and a strongly increased heart rate variability: the characteristic wavelength d goes down. These variations influence ordinal functions so strongly that changes become more apparent than in a plot of heart rate.

Oximeters measuring the plethysmogram are cheap and easy to apply, in bed at home and possibly in daily life. Many similar devices - wristbands, shirts, smartphone apps - are currently being developed. Our ordinal functions can evaluate and visualize their high-resolution data.
7. Conclusion.
On the basis of order patterns, error-resistant autocorrelation functions for big data were introduced. For patterns of length 3, the distance to white noise was divided into four interesting components. Various extensions seem possible. To develop a coherent theory of ordinal time series remains a challenge, despite groundbreaking work by Marc Hallin and co-authors [31, 22, 25] on rank statistics. On the practical side, a spectrogram-like visualization was introduced which allows one to see at a glance the course of heart and respiration data over 24 hours (Figures 1, 4). Simple as it is, this technique may apply to data of sensors on satellites, weather stations, factory chimneys etc., as well as to experiments on the nanoscale [1, 11] which may lay the ground for future generations of sensors. In any application, the methods presented here have to be modified and combined with established techniques, specific machine-learning and imaging tricks. Mathematical extraction of information must keep pace with the revolution in sensor and computer technology: this seems an emerging field of study.
References

[1] A. Aragoneses, N. Rubido, J. Tiana-Alsina, M.C. Torrent & C. Masoller, Distinguishing signatures of determinism and stochasticity in spiking complex systems, Scientific Reports 3, Article 1778 (2012)
[2] M.C. Soriano, L. Zunino, O.A. Rosso, I. Fischer & C.R. Mirasso, Time scales of a chaotic semiconductor laser with optical feedback under the lens of a permutation information analysis, IEEE J. Quantum Electronics 47(2), 252-261 (2011)
[3] A.L. Goldberger et al., PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals, Circulation 101(23), e215-e220 (2000). Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215. Data at: physionet.org/physiobank/database/capslpdb
[4] GFZ Seismological Data Archive: geofon.gfz-potsdam.de/waveform
[5] P.J. Brockwell & R.A. Davis, Time Series: Theory and Methods, 2nd ed., Springer, New York 1991
[6] R.H. Shumway & D.S. Stoffer, Time Series Analysis and Its Applications, 2nd ed., Springer, New York 2006
[7] A. Davis, M. Rubinstein, N. Wadhwa, G.J. Mysore, F. Durand & W.T. Freeman, The visual microphone: passive recovery of sound from video, SIGGRAPH 2014
[8] C. Bandt & B. Pompe, Permutation entropy: a natural complexity measure for time series, Phys. Rev. Lett. 88, 174102 (2002)
[9] M. Zanin, L. Zunino, O.A. Rosso & D. Papo, Permutation entropy and its main biomedical and econophysics applications: a review, Entropy 14, 1553-1577 (2012)
[10] J. Amigo, K. Keller & J. Kurths (eds.), Recent progress in symbolic dynamics and permutation entropy, Eur. Phys. J. Special Topics 222 (2013)
[11] J.P. Toomey & D.M. Kane, Mapping the dynamical complexity of a semiconductor laser with optical feedback using permutation entropy, Optics Express 22(2), 1713-1725 (2014)
[12] M. Staniek & K. Lehnertz, Symbolic transfer entropy, Phys. Rev. Lett. 100, 158101 (2008)
[13] N. Nicolaou & J. Georgiou, The use of permutation entropy to characterize sleep encephalograms, Clin. EEG Neurosci. 42, 24 (2011)
[14] E. Ferlazzo et al., Permutation entropy of scalp EEG: a tool to investigate epilepsies, Clinical Neurophysiology 125, 13-20 (2014)
[15] U. Nair, B.M. Krishna, V.N.N. Namboothiri & V.P.N. Nampoori, Permutation entropy based real-time chatter detection using audio signal in turning process, Int. J. Adv. Manuf. Technol. 46, 61-68 (2010)
[16] R. Yan, Y. Liu & R.X. Gao, Permutation entropy: a nonlinear statistical measure for status characterization of rotary machines, Mechan. Syst. Signal Processing 29, 474-484 (2012)
[17] C. Bandt & F. Shiha, Order patterns in time series, J. Time Series Analysis 28, 646-665 (2007)
[18] G. Ouyang, C. Dang, D.A. Richards & X. Li, Ordinal pattern based similarity analysis for EEG recordings, Clinical Neurophysiology 121, 694-703 (2010)
[19] H. Lange, O.A. Rosso & M. Hauhs, Ordinal pattern and statistical complexity analysis of daily stream flow time series, Eur. Phys. J. Special Topics 222, 535-552 (2013)
[20] The National Water Level Observation Network
[21] California Environmental Protection Agency, Air Resources Board
[22] T.S. Ferguson, C. Genest & M. Hallin, Kendall's tau for serial dependence, Canadian J. Stat. 28, 587-604 (2000)
[23] J. Amigo, Permutation complexity in dynamical systems, Springer 2010
[24] A.M. Unakafov & K. Keller, Conditional entropy of ordinal patterns, Physica D 269, 94-102 (2014)
[25] M. Hallin & D. Paindaveine, Semiparametrically efficient rank-based inference for shape I, Ann. Statistics 34, 2707-2756 (2008)
[26] L. Sakharenko, Testing for ellipsoidal symmetry: a comparison study, Computational Stat. Data Analysis 53, 565-581 (2008)
[27] A. Onatski, M.J. Moreira & M. Hallin, Asymptotic power of sphericity tests for high-dimensional data, Ann. Statistics 41, 1204-1231 (2013)
[28] M.G. Terzano et al., Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep, Sleep Med. 2(6), 537-553 (2001)
[29] C.-E. Kuo & S.-F. Liang, Automatic stage scoring of single-channel sleep EEG based on multiscale permutation entropy, Biomedical Circuits and Systems Conference (BioCAS), IEEE 2011, 448-451
[30] C. Iber, S. Ancoli-Israel, A. Chesson & S.F. Quan, The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications, Westchester: American Academy of Sleep Medicine (2007)
[31] M. Hallin & M.L. Puri, Aligned rank tests for linear models with autocorrelated error terms, J. Multivariate Analysis 50, 175-237 (1994)
Methods
1. The program code.
A few lines of MATLAB code provide all our functions. x(1:T) denotes the given time series, d is between 1 and dmax.

y0=x(1:T-2*d); y1=x(d+1:T-d); y2=x(2*d+1:T);
s=2*((y0>y1)+(y0>y2))+(y1>y2);

s is a vector which contains the symbols 0, 1, ..., 5 for the order patterns of (x_t, x_{t+d}, x_{t+2d}), t = 1, ..., T-2*d. We correct for missing values NaN and ties by assigning symbols 6 to 11:

h=isnan(y0)|isnan(y1)|isnan(y2)|(y0==y1)|(y0==y2)|(y1==y2);
s=s+6*h;
Here | means 'or'. If patterns with equality are to be counted, this line has to be modified. We now use the histogram function with bin centers 0, ..., 11 to determine the frequencies of the six patterns.

p=hist(s,0:11)/(T-2*d-sum(h));   % relative frequencies of the symbols 0..11
q=p(1:6);                        % the six valid pattern frequencies

This will determine all our functions, for example:

beta=p(1)-p(6);                  % up-down balance
tau=p(1)+p(6)-1/3;               % persistence
h2=(q-1/6)*(q-1/6)';             % Delta, squared distance to white noise
H=-log(q)*q';                    % permutation entropy
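The two remaining functions and the percentages of Section 5 can be obtained from the same vector. The following continuation is my addition to the listing, not part of the original code; gam and del were chosen as names to avoid MATLAB's built-in gamma:

gam=p(3)+p(4)-p(2)-p(5);                       % time irreversibility gamma
del=p(2)+p(3)-p(4)-p(5);                       % up-down scaling delta
parts=[3*tau^2,2*beta^2,gam^2,del^2]/(4*h2);   % tilde tau, beta, gamma, delta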
On a PC, the algorithm obtains Figure 3 within a few seconds and Figure 1 within a few minutes. In some papers [10], permutation entropy of scale d uses patterns for sums of d consecutive terms of the x_t, which makes sense if the x_t represent a density function, like precipitation or workload on a server. This version is easily implemented by cumulative sums, adding x=cumsum(x) as the first line of the program.
2. Statistical accuracy.
We consider a fixed d and windows of length n. Then p_{12}(d) is a random average of n terms 1, 0 which say whether x_t < x_{t+d} is true or false. According to the binomial model, the standard deviations of p_{12}(d) and β(d) = 2p_{12}(d) − 1 are 1/(2√n) and 1/√n, respectively. In reality, the variation is bigger because the terms in the sum are correlated. The correlations between the differences Y_t = X_{t+d} − X_t are relevant. As a simple model, we take the sum of n/k independent terms 1, 0, each of which is repeated k times. For β(d) this gives the standard deviation c/√n with c = √k. An estimate of c can only come from data; in our data it did not exceed 4. For τ the variation is a bit smaller, for γ and δ a bit larger. Note that correlations can increase with d. The partition formula and some assumptions give a rough estimate of order 1/n for the standard deviation of ∆. Twice that value is taken as a kind of confidence bound for the hypothesis ∆ = 0. This is just a guideline: for the data in the supplement it works well.
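A possible way to estimate c from a stationary segment is to compute β(d) over many non-overlapping windows and rescale the empirical standard deviation. The following MATLAB sketch uses simulated data in place of a real segment; window length, delay and the number of windows are arbitrary choices:

% empirical constant c in the error model sigma = c/sqrt(n)
n = 5000; d = 20; nwin = 100;
xs = cumsum(randn(1, nwin*n));           % stand-in for a stationary segment
b = zeros(1,nwin);
for w = 1:nwin
    xw = xs((w-1)*n+1 : w*n);
    up   = sum(xw(1:end-d) < xw(d+1:end));
    down = sum(xw(1:end-d) > xw(d+1:end));
    b(w) = (up - down)/(up + down);
end
c = std(b)*sqrt(n)                       % compare with the binomial value c = 1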
3. Identities for pattern frequencies and boundary effect.
We fix d and consider the p_π for the whole time series. The equation

p_{12} = p_{123} + p_{132} + p_{231}

holds since on the right we count all t = 1, ..., T − 2d with x_t < x_{t+d}. Similarly,

p_{12} = p_{123} + p_{213} + p_{312},

because on the right we count all t = d + 1, ..., T − d with x_t < x_{t+d}. Strictly speaking, both relations are not exactly correct, since the count for p_{12} is over t = 1, ..., T − d. The largest possible difference is d/(T − d) when we exclude ties for simplicity. This upper bound is negligible for large T and comparably small d, and random fluctuations usually make the difference much smaller than the upper bound. Taking the difference of the two equations we see that the quantity

ε = p_{132} + p_{231} − p_{213} − p_{312}

equals zero, with the same degree of accuracy. In other words, the numbers of local maxima and of local minima in a time series coincide. Adding both identities for p_{12} and subtracting 1 = Σ_π p_π, we obtain

β = p_{123} − p_{321},

which has been used throughout the paper to calculate β, see the code above. Another relation of this type is p_{12}(2d) = p_{123}(d) + p_{132}(d) + p_{213}(d). As a consequence, we get

β(2d) = 2p_{12}(2d) − 1 = β(d) + δ(d),

which says that the functions β and δ are tightly connected. For the mathematical model of processes with stationary increments, as well as for the cyclic time series mentioned in Section 5, all these equations are exact [17].
4. Proof of the ∆ partition formula.

As in the code, we set q_1 = p_{123} − 1/6, q_2 = p_{132} − 1/6, q_3 = p_{213} − 1/6, q_4 = p_{231} − 1/6, q_5 = p_{312} − 1/6, q_6 = p_{321} − 1/6. By high-school algebra,

4 Σ q_k² = 2(q_1 + q_6)² + 2(q_1 − q_6)² + (q_2 + q_3 + q_4 + q_5)² + (q_2 − q_3 − q_4 + q_5)² + (q_2 + q_3 − q_4 − q_5)² + (q_2 − q_3 + q_4 − q_5)².

Since Σ q_k = 0, the third square on the right is the same as the first. Using the definitions of ∆, τ, γ, δ and the identities above for β and ε, we get

4∆ = 3τ² + 2β² + δ² + γ² + ε²,

which implies the formula because ε = 0. The equation is exact for cyclic time series, and for random processes with stationary increments. For ordinary time series there is a small error, as discussed above, and we must check the accuracy of the equation from the data, see Figures 5, 11, and 16 where differences are below 1%.
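The algebraic identity can be confirmed numerically for arbitrary frequencies. In the following MATLAB sketch eps0 denotes the quantity ε above; the random p is not meant to resemble real data:

% check of 4*Delta = 3*tau^2 + 2*beta^2 + gamma^2 + delta^2 + eps^2
p = rand(1,6); p = p/sum(p);             % arbitrary pattern distribution
q = p - 1/6;
tau = q(1)+q(6);           beta = q(1)-q(6);
gam = q(3)+q(4)-q(2)-q(5); del = q(2)+q(3)-q(4)-q(5);
eps0 = q(2)+q(4)-q(3)-q(5);
4*sum(q.^2) - (3*tau^2 + 2*beta^2 + gam^2 + del^2 + eps0^2)   % zero up to rounding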
5. Contents of supplement.
We give numerical and visual evidence for our partition formula and check our bound for small ∆. We explore the potential of our methods for various applications - medical data, speech, weather and ecological data - and test how far we can decrease the window size. Autocorrelation and persistence are compared in Figure 9 for a simple AR2-process, in Figure 10 for a song of The Beatles, and in Figure 17 for a laser experiment of Soriano et al. [2]. Further details of ordinal autocorrelation functions are discussed.
Acknowledgement.
I am grateful to Bernd Pompe, Marcus Vollmer, Luciano Zunino, Mathias Bandt, Petra Gummelt and Helena Peña for suggestions and comments on drafts of this paper.

Institute of Mathematics, University of Greifswald, 17487 Greifswald, Germany
[email protected]
Supplement
1. Overview
We start with a list of our datasets. The essential parameters are window length and mean ∆, which quantifies the amount of structure in the data. The latter is not so easy to fix since it depends very much on d. For d near zero, autocorrelation, persistence and hence also ∆ assume very large values.

subject / data    window          range of d    mean ∆    motivation / application
plethysmogram     3840 / 30s      0.3-1.5s      0.06      proof of partition formula
ECG               15360 / 30s     0.3-1.5s      0.04      visualize long-term data
EEG               15360 / 30s     0.25-1.5s     0.002     screening for structure
                                  0-0.25s       0.013     formula for sleep scoring
AR2 process       2000            1-100         0.01      theoretical model
music/speech      2205 / 50ms     0-7ms         0.024     speech analysis/synthesis
temperature       720 / 1 month   1-50h         0.03      exploratory data analysis
particulates      1200 / 50 days  1-72h         0.003     notoriously noisy data
tides             1242 / 5 days   10.5-14.5h    0.10      visualization of dynamics
laser data        3520 / 88ns     741-800       0.01      detection of period

Table 1. Data studied in this supplement.
2. Heart and brain data
Figure 5 verifies the partition formula for heart data, and supports our 'confidence bound' for ∆ = 0 (Methods 2). The vast majority of excluded places comes from artefacts, and it would make no sense to exclude still more places since the partition equation holds true with such a small error. We note that in millions of places in Figure 5 and similar data, τ̃ + β̃ + γ̃ + δ̃ was always smaller than 100% plus a tiny rounding error. This inequality seems to be generally true.

Processes with small average ∆ are more difficult to handle. Figure 6 concerns the EEG channel Fp2-F4 of the same record, n2 of [28], sampled with 512 Hz. For d = 0.25, ..., 1.5 s the average ∆ is only 0.0017, and 47% of the places have small ∆. They are spread rather uniformly, so there is little chance to get information from this range of d. We concentrate on d ≤ 0.25 s with average ∆ about 0.013 and only 10% of places below the bound. There seems some structure connected with sleep annotations. We have chosen the white field near the line d = 0 when we average ∆ over d = 2, ..., 20.
Figure 5.
Components of ∆ for a the plethysmogram and b the ECG of control n2 of [28], with d from 0.3 to 1.5 s. a from above: τ̃ with overall percentage 46%, β̃ with 25%, γ̃ with 3% (colors scaled by factor 3), and δ̃ with 26%. The remainder is only 0.07% in average. Places with ∆ below the bound were not included in the average. They account for 7% of all places and are marked black in the picture. The vast majority of black spots is caused by artefacts, movements of the patient. In b we have τ̃ with average 31%, β̃ with 35%, γ̃ with 5% (colors scaled by factor 3), δ̃ with 29%, and 0.34% remaining error. The percentage of black spots was 5%. Can you guess REM phases? Solution in Figure 7.
Figure 6.
Places with ∆ below our bound ≈ 0.001 in the EEG record of n2 are marked black. a. For d = 0.25 ... 1.5 s no structure can be seen. b. For d ≤ 0.25 s the light places are related to stages of deep sleep in the figure below.

Figure 7.
Coincidence of mean ∆ of EEG and annotation of sleep stages, for three controls from [28] and one patient suffering from insomnia, with smaller ∆. Data were selected by record structure, not by the coincidence! There are commercial systems for automatic sleep scoring. But ∆ is such a simple, transparent quantity that this coincidence is worthwhile to study. The mean was taken over d = 2, ..., 20, that is, 4 to 40 ms for 512 Hz records.
3. A Gaussian model process
Figure 8.
Places with ∆ below our bound for samples of the AR2 process. For large d they are everywhere, for small d they become concentrated on lines where τ(d) ≈ 0.

AR2 processes belong to the simplest stationary Gaussian processes. Autocorrelation and persistence can be determined analytically [6, 17]. Here we use the same procedures as for our data, to find out whether τ̃ will really be 100%. We take an AR2 process X_t = a_1 X_{t-1} - a_2 X_{t-2} + W_t with Gaussian noise W_t, whose autocorrelation oscillates and does not decrease too rapidly (Figure 9). Dependencies for d > 100 are very small, so we consider d from 1 to 50 or 100. Figure 8 shows that places with small ∆ form horizontal stripes where τ(d) ≈ 0. We determine τ̃ on all places and on large places for several thousand samples of two sizes n.

window      delays    ∆ below bound    τ̃, all    τ̃, corrected
n = 10000   d ≤ 100   49%              86%        98.4%
            d ≤ 50    20%              95%        98.9%
n = 2000    d ≤ 100   76%              73%        97.7%
            d ≤ 50    55%              87%        97.9%

Table 2. The partition formula is not useful for processes with small ∆.

Increasing our bound would only slightly improve τ̃, at the cost of excluding many other places. Thus our partition formula seems not so useful for small average ∆, in this case about 0.01. Here τ is better to work with than τ̃.

Remark.
At this point, let us briefly explain the type of symmetry given when ordinal functions are zero. We take a random process (X_t) with stationary increments, fix a delay d and consider the two-dimensional distribution of Z_1 = X_{t+d} − X_t and Z_2 = X_{t+2d} − X_{t+d}. The frequencies of order patterns are probabilities of sectors in the z_1 z_2-plane:

p_{123}(d) = P(Z_1 > 0, Z_2 > 0), p_{321}(d) = P(Z_1 < 0, Z_2 < 0)

and

p_{132}(d) = P(Z_1 + Z_2 > 0, Z_2 < 0), p_{213}(d) = P(Z_1 < 0, Z_1 + Z_2 > 0),
p_{231}(d) = P(Z_1 > 0, Z_1 + Z_2 < 0), p_{312}(d) = P(Z_1 + Z_2 < 0, Z_2 > 0).

When (Z_1, Z_2) has a density function f which is symmetric to the origin, i.e. f(z_1, z_2) = f(−z_1, −z_2), then β(d) = δ(d) = 0. If f is symmetric to the line z_1 + z_2 = 0, i.e. f(z_1, z_2) = f(−z_2, −z_1), then γ(d) = β(d) = δ(d) = 0. Ordinal functions measure deviations from this kind of symmetry.
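This symmetry can be observed in a simulation. The MATLAB fragment below uses the illustrative coefficients 1.3 and -0.7 (a stable AR2 with oscillating autocorrelation, not necessarily the parameters used for the figures) and shows that β, γ, δ are close to zero while τ̃ is close to 100%:

% ordinal functions of a Gaussian AR2 sample
T = 100000; W = randn(1,T); X = zeros(1,T);
for t = 3:T, X(t) = 1.3*X(t-1) - 0.7*X(t-2) + W(t); end
d = 10;
y0 = X(1:T-2*d); y1 = X(d+1:T-d); y2 = X(2*d+1:T);
s = 2*((y0>y1)+(y0>y2)) + (y1>y2);
q = zeros(1,6); for k = 0:5, q(k+1) = mean(s==k); end
tau = q(1)+q(6)-1/3;        beta = q(1)-q(6);
gam = q(3)+q(4)-q(2)-q(5);  del = q(2)+q(3)-q(4)-q(5);
Delta = sum((q-1/6).^2);
[beta gam del]                            % all close to zero
3*tau^2/(4*Delta)                         % tilde tau, close to 1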
In the figure below, we study how autocorrelation and persistence of the above AR2 process will change when we add various disturbances to the data. While autocorrelation works better when data are perturbed by white noise, persistence performs better for other perturbations which occur in practice. In particular, data from a sensor with nonlinear characteristics need not be calibrated when we use ordinal functions. We note that for cases c, d, and e in Figure 9, the values of τ̃ from Table 2 remain above 97%.

Figure 9. Autocorrelation and persistence for an AR2 process with additive disturbances. In each case 10 samples of 2000 points were processed, to determine statistical variation. On the left, 300 points of the respective time series are sketched. a. Original signal. b. Additive Gaussian white noise with signal-to-noise ratio 1 divides autocorrelation by 2 and makes persistence very flat. c. 1% outliers with average amplitude 20 times σ of the original signal will harm the autocorrelation but not persistence. d. An added low-frequency sine component. e. A monotonous exponential transformation applied to the values: autocorrelation is distorted, persistence unchanged.
4. Speech and music
For music data, the autocorrelogram sometimes shows the melody better thanthe spectrogram. Persistence behaves similarly.
Figure 10.
The first 12 seconds of the song "Hey Jude" of The Beatles in sliding window analysis. Autocorrelation and persistence were calculated for overlapping windows of 0.25 seconds length, n = 11025. The reversed y-axis helps recognize the melody, d = 49, ... The panel of τ̃ at the bottom has quite a number of blue spots where β̃ and δ̃ yield the main contribution to ∆.

If we like to recognize not only the melody, but also the text, we have to work with short windows, as shown in Figure 11. The ordinal structure of speech seems an interesting object for its own sake, as well as for applications. In Figure 12, places with small ∆ indicate unvoiced sounds when they are arranged in a vertical pattern. When they form horizontal patterns, which are somewhat wavy in case of a song, they indicate lines with τ(d) ≈ 0. The vowels show characteristic β, δ, and γ-shapes, as shown in Figure 13.

Figure 11.
The first 12 seconds of "Hey Jude" as partition of ∆ with τ̃, β̃, γ̃, and δ̃ (from above). A window of length 50 ms was shifted in steps of 10 ms. The y-axis is in normal orientation, d runs from 1 to 294, that is, 0.02 to 6.7 ms. The short window reveals a lot of detail structure. All vowels show clear β and δ-patterns, even γ is relevant at several places. The overall percentage is 57% for τ̃, 15% for β̃, 11% for γ̃, and 16% for δ̃. The mean of the rest, i.e. the error of the partition formula, is 0.77%, which is good for the short window. Places with small ∆ excluded from this calculation are discussed below.

Figure 12.
In the previous figure, 26% of the places (t, d) have ∆(d) below our bound ≈ 0.007, so that the signal does not significantly differ from white noise. They form the dark spots in b. For comparison, a shows the mean absolute amplitude of the signal. We have small ∆ for the two breaks with small amplitude, but much more pronounced for the unvoiced sounds, notably 's' in 'sad' and 'song'. In the voiced sounds, places with small ∆ appear for d with persistence zero.
Figure 13.
Detail from Figure 11. The vowels of 'Jude', 'don't', and the second syllable in 'better' provide stationary parts of the time series lasting for 0.3, 0.1, and 0.3 s. Only 20 ms of each signal are shown in the top panel. Ordinal autocorrelation functions were calculated for six resp. two disjoint windows of length 50 ms and drawn for one pitch period, which equals 4.5 ms for all three sounds.

The study of these patterns can improve methods of speech synthesis, speech analysis, and speaker authentication, but this is not an easy matter. There is a huge variety of β, δ, and γ shapes which have to be collected, classified and understood. And there are open questions. Is there an 'inverse transform' which reconstructs the time series from its ordinal autocorrelation functions? What are the connections between power spectrum, cepstrum and ordinal functions? Can a speaker reliably reproduce the β-shape? Can we hear the ordinal functions? If not, will the computer understand human speech better than humans themselves?
5. Weather, dust, and tides
We conclude with some everyday data. The German Weather Service (Climate and Environment, Climate Data) provides hourly values of earth temperature, at depth 5 cm, starting 1978. Can I learn something about my town? Figure 14 shows the last 2000 values of the year 2013. In autumn, there is still a daily rhythm, in winter this is rarely the case. We determine the persistence of the dataset with sliding windows of length 720, that is, one month. The effect of summer and winter can be seen, and we also see some irregularity: in the first years the data were sampled every 3 hours (which one can guess from the picture), then three times a day, and since 2001 every hour. For such plausibility checks even the short window is enough. However, there is no chance to see details in different months. The resolution of the data is not sufficient, even for this short window which already causes statistical inaccuracy.
Figure 14.
Hourly temperature of earth in Greifswald, Germany. a. The last 2000 values of 2013 show that the daily cycle is weak in winter. b. Persistence for d = 1, ..., 50 h shows irregularities in the data.

Hourly data of PM10 particulate measurements were already used in Figure 3. These are notoriously noisy data, even if we consider them on a logarithmic scale, as seen in the upper part of Figure 15. Particulates in the air are not uniformly distributed, they come in fractal clusters. Actually these data, together with the EEG, have the smallest ∆ values in our studies. In the dataset of San Bernardino from [21] (station 3215 Trona-Athol; the other station 3500 shows different characteristics) the mean ∆ is 0.003. For our window length 1200, there were 96% small ∆ values according to our bound. Moreover, there were 13% missing values. For screening the data, we omit all time points with missing values. This is not quite correct, but worth a trial, and the whole financial world works with disrupted time series. Persistence and up-down balance in Figure 15 show that in summer, but not in winter, there is a daily rhythm in the data. It is similar to the temperatures above, but not obvious in the data. So the curves in Figure 3b are an effect of summer only.
Figure 15.
Hourly PM10 measurements at San Bernardino, California [21]. The upper panel demonstrates the irregular character of the time series. Ignoring missing data, a persistence τ and b up-down balance β were determined with a sliding window of length 1200. Years are marked at June 1. There is a daily rhythm in summer which is hardly visible in the data or in autocorrelation, cf. Figure 3. Hourly resolution does not allow going further into detail.

Based on this impression we can now decide how to process the data further. It is not possible, however, to go into detail in Figure 15. With a window of 50 days we cannot see changes between weeks. The sensor measurements are done every few seconds, however, and only the hourly averages are kept. With data of finer resolution, say every 6 minutes, much more detail could be studied. This is demonstrated in Figure 16 for tidal data from Anchorage, providing one more verification for our partition formula. Of course the tides form a very regular process, with the highest ∆ of all our examples. The possibility to work with a window of 5 days instead of several weeks, and the resulting details in the figure, are due to the high resolution of the data, however. So this note should send a message to colleagues who design measuring equipment: Do not preselect data. Give users an option to study the wealth of all measured values and make their own choices.

Figure 16.
We used tidal data to show that there are interesting β-shapes for different sites, by averaging over a whole month. The main actors in the tides' play are moon and sun, however, and the data have a multi-scale structure, shown in the upper panel. It seems interesting that ordinal functions make monthly and yearly cycles more transparent than the series itself, its autocorrelation and its spectrum. The partition of ∆ for water levels from Anchorage was restricted here to the rather short time from January 2013 to September 2014, and to delays d between 10.5 and 14.5 hours, around half a lunar day. A window of 1242 values, 5 lunar days, was shifted in steps of 12.4 hours. There are only 2% of places with ∆ below our bound. If they are excluded, the average value of τ̃ is 76%, followed by β̃ with 15%, δ̃ with 6%, and γ̃ with 2%. The average error is 0.54%.
6. A fast laser experiment
The problem in this last section is rather simple: find the period of a very noisy rhythmic process with extremely short windows. We include this example because the data show what can be measured with current technology within parts of a millisecond.
Figure 17.
Data for this figure come from the laser experiment of Soriano et al. [2]. A laser with wavelength 1542 nm and 11.7 mA threshold current was excited just above the threshold with 12 mA. The sampling interval was 25 ps, which means that 240000 values were measured during 6 µs. The optical feedback was arranged so that the signal has a period of L = 1544 sampling steps. We try to find this period with a window of 2.3 periods, or 88 ns, just the length of the series shown in a, where periodicity can hardly be seen. In b and c we study autocorrelation and persistence around d = L. The vertical interruptions show low-frequency disturbances, the laser does not run smoothly. d and e show ρ and τ around d = L/2. While ρ is a bit better at the full wavelength, τ indicates the period more clearly at half the wavelength.