Prediction in Projection: A new paradigm in delay-coordinate reconstruction
by J. Garland
B.S., Colorado Mesa University, 2009
M.S., University of Colorado, 2011

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science, 2018

This thesis entitled "Prediction in Projection: A new paradigm in delay-coordinate reconstruction," written by J. Garland, has been approved for the Department of Computer Science by:

Elizabeth Bradley
Prof. James D. Meiss
Prof. James Crutchfield
Prof. Aaron Clauset
Prof. Sriram Sankaranarayanan
Prof. Robert Easton

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Garland, J. (Ph.D., Computer Science)
Prediction in Projection: A new paradigm in delay-coordinate reconstruction
Thesis directed by Prof. Elizabeth Bradley

Delay-coordinate embedding is a powerful, time-tested mathematical framework for reconstructing the dynamics of a system from a series of scalar observations. Most of the associated theory and heuristics are overly stringent for real-world data, however, and real-time use is out of the question due to the expert human intuition needed to use these heuristics correctly. The approach outlined in this thesis represents a paradigm shift away from that traditional approach. I argue that perfect reconstructions are not only unnecessary for the purposes of delay-coordinate based forecasting, but that they can often be less effective than reduced-order versions of those same models. I demonstrate this using a range of low- and high-dimensional dynamical systems, showing that forecast models that employ imperfect reconstructions of the dynamics—i.e., models that are not necessarily true embeddings—can produce surprisingly accurate predictions of the future state of these systems. I develop a theoretical framework for understanding why this is so. This framework, which combines information theory and computational topology, also allows one to quantify the amount of predictive structure in a given time series, and even to choose which forecast method will be the most effective for those data.

Dedication

To Mom. This dissertation—and my entire academic journey—would have been abandoned multiple times, and likely never would have begun, if it were not for your endless love and support.
Acknowledgements
Liz, I have been told that as you look back on your life there will be one or maybe two people who radically alter your entire life course. I have no doubt you are that person in my life. You personally introduced me to complexity science, reinvigorating me as a scientist at a time in my life when I was ready to leave academics for good. You have not only shaped me as an academic but as a better human being. You opened more doors than I can count and did everything in your power to give me every possible opportunity to succeed. You are the epitome of advisor, mentor, colleague, champion and friend.
My Family, without your love and support I am nothing.
Jim Meiss, for all the Trident visits and the willingness to let me bounce ideas off of you over the years, from the beginning to the end of my dissertation. All our conversations about computational topology were invaluable in providing my thesis with a sound theoretical foundation.
Ryan James, for introducing me to, and patiently teaching me, information theory.
Jim Crutchfield, for numerous conversations, at the Santa Fe Institute and at UC Davis, throughout the research process. Your guidance in information theory was invaluable in finalizing the story of my thesis.
Holger Kantz, for hosting me at the Max-Planck-Institut für Physik komplexer Systeme, where I had multiple interactions with you and your students that drastically matured the overall picture of my thesis.
Bob Easton, for all the Trident seminars.
Aaron Clauset, Sriram Sankaranarayanan, Mark Muldoon, Simon DeDeo, and Zach Alexander, for interesting and useful conversations throughout this process.
Santa Fe Institute, my academic home and sanctuary every summer. Your interdisciplinary halls recharged me every year, gave me a place to forge interesting collaborations with people (and in fields) I never would have imagined, and took my career to heights I could have never dreamed.
Contents

Chapter

1  Overview and Motivation

2  Background and Related Work
   2.1  Reconstructing Nonlinear Deterministic Dynamics
        2.1.1  Delay-Coordinate Embedding
        2.1.2  Traditional Methods for Estimating the Embedding Delay τ
        2.1.3  Traditional Methods for Estimating the Embedding Dimension m
        2.1.4  Delay-Coordinate Embedding Reality Check
   2.2  Information Theory Primer
        2.2.1  Entropy
        2.2.2  Mutual Information
        2.2.3  I-Diagrams and Multivariate Mutual Information
        2.2.4  Multivariate Mutual Information
        2.2.5  Estimating Information from Real-Valued Time-Series Data
        2.2.6  Estimating Structural Complexity and Predictability
   2.3  Forecast Methods
        2.3.1  Simple Prediction Strategies
        2.3.2  (ARIMA) A Regression-Based Prediction Strategy
        2.3.3  Lorenz Method of Analogues
   2.4  Assessing Forecast Accuracy

3  Case Studies
   3.1  Synthetic Case Studies
        3.1.1  The Lorenz-96 Model
        3.1.2  Lorenz 63
   3.2  Experimental Case Studies
        3.2.1  Computer Performance

4  Prediction in Projection
   4.1  A Synthetic Example: Lorenz-96
        4.1.1  Comparing ro-LMA and fnn-LMA
        4.1.2  Comparing ro-LMA with Traditional Linear Methods
   4.2  Experimental Data: Computer Performance Dynamics
   4.3  Time Scales, Data Length and Prediction Horizons
        4.3.1  The τ Parameter
        4.3.2  Prediction Horizon
        4.3.3  Data Length
   4.4  Summary

5  Why it Works: A Deeper Understanding of Delay-Coordinate Reconstruction
   5.1  Leveraging Information Storage to Select Reconstruction Parameters
        5.1.1  Shared Information and Delay Reconstructions
        5.1.2  Selecting "Forecast-Optimal" Reconstruction Parameters
        5.1.3  Data Requirements and Prediction Horizons
        5.1.4  Summary
   5.2  Exploring the Topology of Dynamical Reconstructions
        5.2.1  Witness Complexes for Dynamical Systems
        5.2.2  Topologies of Reconstructions
   5.3  Summary

6  Model-Free Quantification of Time-Series Predictability
   6.1  Traditional Methods for Predicting Predictability
   6.2  Predictability, Complexity, and Permutation Entropy
   6.3  Summary

7  Conclusion and Future Directions

Bibliography

Chapter 1

Overview and Motivation
Complicated nonlinear dynamics are ubiquitous in natural and engineered systems. Methods that capture and use the state-space structure of a dynamical system are a proven strategy for forecasting the behavior of systems like this, but use of these methods is not always straightforward. The governing equations and the state variables are rarely known; rather, one has a single (or perhaps a few) series of scalar measurements that can be observed from the system. It can be a challenge to model the full dynamics from data like this, especially in the case of forecast models, which are only really useful if they can be constructed and applied on faster time scales than those of the target system. While the traditional state-space reconstruction machinery is a good way to accomplish the task of modeling the dynamics, it is problematic in real-time forecasting because it generally requires input from and interpretation by a human expert. This thesis argues that that roadblock can be sidestepped by using a reduced-order variant of delay-coordinate embedding to build forecast models: I show that the resulting forecasts can be as good as—or better than—those obtained using complete embeddings, and with far less computational and human effort. I then explore the underlying reasons for this using a novel combination of techniques from computational topology and information theory.

Modern approaches to modeling a time series for forecasting arguably began with Yule's work on predicting the annual number of sunspots [122] through what is now known as autoregression. Before this, time-series forecasting was done mostly through simple global extrapolation [119]. Global linear methods, of course, are rarely adequate when one is working with nonlinear dynamical systems; rather, one needs to model the details of the state-space dynamics in order to make accurate predictions. The usual first step in this process is to reconstruct that dynamical structure from the observed data. The state-space reconstruction techniques proposed by Packard et al. [89] in 1980 were a critical breakthrough in this regard. In 1981, Takens showed that this method, delay-coordinate embedding, provides a topologically correct representation of a nonlinear dynamical system if a specific set of theoretical assumptions are satisfied. I discuss this in detail in Section 2.1.1, alongside the appropriate citations.

A large number of creative strategies have been developed for using the state-space structure of a dynamical system to generate predictions, as discussed in depth in Section 2.3.3. Perhaps the simplest of these is the "Lorenz Method of Analogues" (LMA), which is essentially nearest-neighbor prediction [72]. Even this simple strategy—which builds predictions by looking for the nearest neighbor of a given point and taking that neighbor's observed path as the forecast—works quite well for forecasting nonlinear dynamical systems. LMA and similar methods have been used successfully to forecast measles and chickenpox outbreaks [112], marine phytoplankton populations [112], performance dynamics of a running computer (e.g., [36, 37]), the fluctuations in a far-infrared laser [98, 119], and many more.
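To make the nearest-neighbor idea concrete, here is a minimal Python sketch of LMA on a delay reconstruction. It is an illustration rather than the implementation used in this thesis; the function names, the toy signal, and the parameter choices (m = 2, τ = 15) are all assumptions made for the example.

```python
import numpy as np

def delay_reconstruct(x, m, tau):
    # Rows are the delay vectors [x_j, x_{j-tau}, ..., x_{j-(m-1)tau}].
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[(m - 1 - i) * tau:(m - 1 - i) * tau + n] for i in range(m)])

def lma_forecast(x, m, tau, steps):
    # Lorenz method of analogues: repeatedly find the nearest neighbor of the
    # current reconstructed state and take its observed successor as the forecast.
    x = list(x)
    for _ in range(steps):
        pts = delay_reconstruct(np.asarray(x), m, tau)
        dists = np.linalg.norm(pts[:-1] - pts[-1], axis=1)
        nn = int(np.argmin(dists))            # nearest neighbor of the last state
        x.append(x[(m - 1) * tau + nn + 1])   # follow that neighbor's path one step
    return x[-steps:]

# Toy usage: forecast five steps of a quasi-periodic signal from its first 1900 samples.
t = np.linspace(0, 60, 2000)
signal = np.sin(t) + 0.4 * np.sin(2.1 * t)
print(lma_forecast(signal[:1900], m=2, tau=15, steps=5))
```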
The reconstruction step that is necessary before any of these methods can be applied to scalar time-series data, however, can be problematic. Delay-coordinate embedding is a powerful piece of machinery, but estimating good values for its two free parameters, the time delay τ and the dimension m, is not trivial. A large number of heuristics have been proposed for this task, but these methods, which I cover in depth in Sections 2.1.2 and 2.1.3, are computationally intensive and they require input from—and interpretation by—a human expert. This can be a real problem in a prediction context: a millisecond-scale forecast is not useful if it takes seconds or minutes to produce. If effective forecast models are to be constructed and applied in a manner that outpaces the dynamics of the target system, then, one may not be able to use the full, traditional delay-coordinate embedding machinery to reconstruct the dynamics. And the hurdles of delay-coordinate reconstruction are even more of a problem in nonstationary systems, since the reconstruction machinery is only guaranteed to work for an infinitely long noise-free observation of a single dynamical system. This means that no matter how much effort and human intuition is put into estimating m, or how precise a heuristic is developed for that process, the theoretical constraints of delay-coordinate embedding can never be satisfied in practice: an experimentalist can never guarantee, on any theoretical basis, the correctness of their embedding, no matter their choice of m. In Section 2.1, I provide an in-depth discussion of these issues.

The conjecture that forms the basis for this thesis is that a formal embedding, although mandatory for detailed dynamical analysis, is not necessary for the purposes of prediction—in particular, that reduced-order variants of delay-coordinate reconstructions are adequate for the purposes of forecasting, even though they are not true embeddings [38]. As a first step towards validating that conjecture, I construct two-dimensional time-delay reconstructions from a number of different time-series data sets, both simulated and experimental, and then build forecast models in those spaces. I find that forecasts produced using the Lorenz method of analogues on these reduced-order models of the dynamics are roughly as accurate as—and often even more accurate than—forecasts produced by the same method working in the complete embedding space of the corresponding system. This exploration is detailed in Chapter 4.

Figure 1.1 shows a quick proof-of-concept example: a pair of forecasts of the so-called "Dataset A," a time series from a far-infrared laser from the Santa Fe Institute prediction competition [119]. Even though the low-dimensional reconstruction used to generate the forecast in the right panel of the figure is not completely faithful to the underlying dynamics of this system, it appears to be good enough to support accurate short-term forecast models of nonlinear dynamics. While this example is encouraging, Dataset A is only one time series and it was drawn from a comparatively simple system—one that is well-described by a first-return map (or, equivalently, a one-dimensional surface of section). The examples presented in Chapter 4 offer a broader validation of this thesis's central claim by constructing forecasts using two-dimensional delay reconstructions of ensembles of data sets from a number of different systems whose dynamics are far more complex than the time series in Figure 1.1. Uniformly, the results indicate that the full complexity (and effort) of the delay-coordinate 'unfolding' process may not be strictly necessary to the success of forecast models of real-world nonlinear dynamical systems. Finally, I want to emphasize that this reduced-order strategy is intended as a short-term forecasting scheme.
Dimensional reduction is a double-edged sword; it enables on-the-fly forecasting by eliminating a difficult estimation step, but it effects a guaranteed memory loss in the model. I explore this limitation experimentally in Section 4.3 and theoretically in Section 5.1.3.

While the results in Chapter 4 are interesting from a practical perspective, in that they allow delay-coordinate reconstruction to be used in real time, they are perhaps even more interesting from a theoretical perspective. The central premise of this thesis is a heresy according to the dogma of delay-coordinate embedding, but regardless, it works.

Figure 1.1: Forecasts of SFI Dataset A using Lorenz's method of analogues in (a) a delay-coordinate embedding of the state-space dynamics and (b) a 2D delay reconstruction of those dynamics. Blue circles are the true continuation c_j of the time series and red ×'s (p_j) are the forecasts; black error bars are provided if there is a discrepancy between the two. Reconstruction parameter values for (a) were estimated using standard techniques: the first minimum of the average mutual information [33] for the delay in both images and the false near(est) neighbor (FNN) method of Kennel et al. [62], with a threshold of 10%, for the dimension in the left-hand image. Even though the 2D reconstruction used in (b) is not faithful to the underlying topology, it enables successful forecasting of the time series.

This naturally leads to the need for a deeper exploration into why such a deviation from theory provides so much practical traction. That exploration is precisely the focus of Chapter 5, where I provide two disjoint explanations of why prediction in projection—my reduced-order strategy—works. The first is from an information-theoretic perspective; the second utilizes computational topology. These two disjoint branches of mathematics offer two very different, but quite complementary, tools for exploring this discontinuity between theory and practice. The former, the subject of Section 5.1, provides a framework for understanding how information is stored and transmitted from past to future in delay-coordinate reconstructions. Building upon ideas from this field, I develop a novel method called time-delayed active information storage (A_τ) that can be used to select forecast-optimal parameters for delay-coordinate reconstructions [42]. Using A_τ, I show that for noisy finite-length time series, a two-dimensional projection (i.e., m = 2) often provides as much—or more—information about the future than a traditional embedding. This further corroborates the central premise of this thesis. This counter-intuitive result, its source, and its implications are discussed in depth in Section 5.1.3. Section 5.2 offers an alternative view of the reconstruction process—one based on topology. As I discuss in Section 2.1.1, the theoretical restrictions of delay-coordinate embedding are intended to ensure a diffeomorphic reconstruction, something that is required for analysis of dynamical invariants but that is excessive for reconstruction of topology. I conjecture that one of the reasons prediction in projection works is that topology (which is preserved by a homeomorphism) becomes correct at much lower embedding dimensions than what one would expect from the associated theorems.
The results in Section 5.2 confirm this, providing insight into the mechanics of prediction in projection, explaining why this approach exhibits so much accuracy while being theoretically wrong, and offering a deeper understanding of delay-coordinate reconstruction theory in general.

Of course, no forecast model will be ideal for every task. In fact, as a corollary of the undecidability of the halting problem [117], no single forecasting schema is ideal for all noise-free deterministic signals [119]—let alone all real-world time-series data sets. I do not want to give the impression that this reduced-order model will be effective for every time series, but I have shown that it is effective for a broad spectrum of signals. Following this line of reasoning, it is important to be able to determine when prediction is possible at all, and, if so, what forecast model is the best one to use. To this end, I have developed a Shannon information-theory based heuristic for quantifying precisely when a given time series is predictable given the correct model [41]. This heuristic—the focus of Chapter 6—allows for a priori evaluation of when prediction in projection will be effective.

The rest of this thesis is organized as follows. Chapter 2 reviews all the necessary background and related work, including the theory and practice of delay-coordinate embedding, information theory, and the forecast methods, as well as the figure of merit that I use for assessing forecast accuracy. In Chapter 3, I introduce the case studies used in this thesis. In Chapter 4, I demonstrate the effectiveness of this reduced-order forecast strategy on a range of different examples, comparing it to traditional linear and nonlinear forecasting strategies, and exploring some of its limitations. In Chapter 5, I provide a mathematical foundation for why prediction in projection works. In Chapter 6, I describe a measure for quantifying time-series predictability to understand when my reduced-order method—or any forecasting strategy—will be effective. At the end of Chapters 4-6, I discuss specialized avenues of future research directly associated with the specific contribution of that chapter. In Chapter 7, I conclude and outline the next frontier of this work: developing strategies for grappling with nonstationary time series in the context of delay-coordinate based forecasting—which, I believe, will require a combination of all aspects of this thesis to solve.

Chapter 2

Background and Related Work

2.1 Reconstructing Nonlinear Deterministic Dynamics

The term nonlinear deterministic dynamical system describes a set X combined with a deterministic nonlinear evolution or update rule Φ, also called the generating equations. The set X could be as simple as R^n or a similar geometric manifold, or as abstract as a set of symbols [79]. Elements of the set X are referred to as states of the dynamical system; the set X is generally referred to as the state space. The update or evolution rule is a fixed mapping that gives a unique image to any particular element of the set. In the problems treated in this thesis, this update rule is deterministic and fixed: given a particular state, the next state of the system is completely determined. The theory of dynamical systems is both vast and rich. This section of this dissertation is intended to review the subset of this field that is needed to understand the core ideas of my thesis. It is not intended as a general review of this field.
For more complete reviews, see [14, 59, 79].

Dynamical systems can be viewed as falling into one of two categories: those that are discrete in time and those that are continuous in time. The former are referred to as maps and denoted by

    \vec{y}_{n+1} = \Phi(\vec{y}_n), \quad n \in \mathbb{N}    (2.1)

The latter are referred to as flows and are represented by a system of first-order ordinary differential equations

    \frac{d}{dt}\vec{y}(t) = \Phi(\vec{y}(t)), \quad t \in \mathbb{R}^+    (2.2)

When the generating equations Φ of a dynamical system are known, the future state of any particular initial condition can be completely determined. Unfortunately, knowledge of the generating equations (or even the state space) is a luxury that is very rarely afforded to an experimentalist. In practice, the dynamical system under study is a black box that is observed at regular time intervals. In such a situation, one can reconstruct the underlying dynamics using so-called delay-coordinate embedding, the topic of the following section.

2.1.1 Delay-Coordinate Embedding

The process of collecting a time series {x_j}_{j=1}^N or trace is formally the evaluation of an observation function [99] h : X → R at the true system state \vec{y}(t_j) at time t_j, i.e., x_j = h(\vec{y}(t_j)) for j = 1, ..., N. Specifically, h smoothly observes the path of the dynamical system through state space at regular time intervals, e.g., measuring the angular position of a pendulum every 0.01 seconds or measuring the average number of instructions executed by a computer per cycle [6, 83]. Provided that the underlying dynamics Φ and the observation function h are both smooth and generic, Takens [113] formally proves that the delay-coordinate map

    F(h, \Phi, \tau, m)(\vec{y}(t_j)) = [x_j, x_{j-\tau}, \ldots, x_{j-(m-1)\tau}] = \vec{x}_j    (2.3)

from a d-dimensional smooth compact manifold M to R^m is almost always a diffeomorphism on M whenever τ > 0 and m is large enough, i.e., m > 2d.

Definition (Diffeomorphism, Diffeomorphic). A function f : M → N is said to be a diffeomorphism if it is a C^1 bijective correspondence whose inverse is also C^1. Two manifolds M and N are said to be diffeomorphic if there exists a diffeomorphism F that maps M onto N.

What all of this means is that, given an observable deterministic dynamical system—a computer for example, a highly complex nonlinear dynamical system [83] with no obvious (X, Φ)—I can measure a single quantity (e.g., instructions executed per cycle or L2 cache misses) and use that time series to faithfully reconstruct the underlying dynamics up to diffeomorphism. In other words, the true unknown dynamics (X, Φ) and the dynamics reconstructed from this scalar time series have the same topology. Though this is less information than one might like, it is still very useful, since many important dynamical properties (e.g., the Lyapunov exponent that parametrizes chaos) are invariant under diffeomorphism. It is also useful for the purposes of prediction—the goal of this thesis.

The delay-coordinate embedding process involves two parameters: the time delay τ and the embedding dimension m. For notational convenience, I denote the embedding space with dimension m and time delay τ as E[m, τ]. To assure topological conjugacy, the Takens proof [113] requires that the embedding dimension m must be at least twice the dimension d of the ambient space; a tighter bound of m > 2 d_cap, the capacity dimension of the original dynamics, was later established by Sauer et al. [99].
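As a concrete illustration of Equation (2.3), the delay vectors can be assembled directly from a trace; this is a sketch in which an artificial sequence stands in for measured data, and m = 3, τ = 2 are arbitrary choices.

```python
import numpy as np

x = np.arange(10.0)            # stand-in trace x_0, ..., x_9
m, tau = 3, 2
E = np.array([[x[j], x[j - tau], x[j - 2 * tau]]   # [x_j, x_{j-tau}, x_{j-(m-1)tau}]
              for j in range((m - 1) * tau, len(x))])
print(E)                       # the six delay vectors of E[3, 2]
```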
Definition (Capacity Dimension [79]). Let N(ε) denote the minimum number of open sets (ε-balls) of diameter less than or equal to ε that form a finite cover of a compact metric space X. Then the capacity dimension of X is a real number d_cap such that N(ε) ≈ ε^{-d_cap} as ε → 0; explicitly,

    d_{cap} \equiv -\lim_{\epsilon \to 0^+} \frac{\ln N(\epsilon)}{\ln \epsilon}    (2.4)

if this limit exists.

Operationalizing either of these theoretical constraints can be a real challenge. d is not known, and accurate d_cap calculations are not easy with experimental data. And besides, one must first embed the data before performing those calculations.

Apropos of the central claim of this thesis, it is worth considering the intention behind these bounds on m. The worst-case bound of m > 2 d_cap is intended to eliminate all projection-induced trajectory crossings in the reconstructed dynamics. For most systems, and most projections, the dimensions of the subspaces occupied by these false crossings are far smaller than those of the original systems [99]; often, they are sets of measure zero. For the delay-coordinate map to be a diffeomorphism, all of these crossings must be unfolded by the embedding process. This is necessary if one is interested in calculating dynamical invariants like Lyapunov exponents. However, the near-neighbor relationships that most state-space forecast methods use in making their predictions are not invariant under diffeomorphism, so it does not make sense to place that strict condition on a model that one is using for those purposes. False crossings will, of course, cause incorrect predictions, but that is not a serious problem in practice if the measure of that set is near zero, particularly when one is working with noisy, real-world data.

My reduced-order strategy explicitly fixes m = 2. This choice takes care of one of the two free parameters in the delay-coordinate reconstruction process, but selection of a value for the delay, τ, is still an issue. The theoretical constraints in this regard are less stringent: τ must be greater than zero and not a multiple of the period of any orbit [99, 113]. In practice, however, the noisy and finite-precision nature of digital data and floating-point arithmetic combine to make the choice of τ much more delicate [59]. It is to this issue that I will turn next.

2.1.2 Traditional Methods for Estimating the Embedding Delay τ

The τ parameter defines the amount of time separating each coordinate of the delay vectors: \vec{x}_j = [x_j, x_{j-\tau}, \ldots, x_{j-(m-1)\tau}]^T. The theoretical constraints on the time delay are far from stringent, and this parameter does not—in theory [99, 113]—play a role in the correctness of the embedding. However, that assumes an infinitely long noise-free time series [99, 113], a luxury that is rare in practice. As a result of this practical limitation, the time delay τ plays a crucial role in the usefulness of the embedding [17, 18, 33, 59, 67, 68, 95].

The fact that the time delay does not play into the underlying mathematical framework is a double-edged sword. Because there are no theoretical constraints, there is no practical way to derive an "optimal" lag or even know what criterion an "optimal" lag would satisfy [59]. Casdagli et al. [24] provide a theoretical discussion of this, together with some treatment of the impacts of τ on reconstructing an attractor using a noisy observation function. Unfortunately, no practical methods for estimating τ came from that discussion, but it does nicely outline a range of τ between redundancy and irrelevance.
For very small τ, especially with noisy observations, x_j and x_{j-τ} are effectively indistinguishable. In this situation, the reconstruction coordinates are highly redundant [24, 46], i.e., they contain nearly the same information about the system. This is not a good choice for τ because additional coordinates add almost nothing new to the model. Choosing an arbitrarily large τ is undesirable as well. On this end of the spectrum, the coordinates of the reconstruction become causally unrelated, i.e., the measurement of x_{j-τ} is irrelevant in understanding x_j [24]. Useful τ values lie somewhere between these two extrema. In practice, selecting useful τ values can be quite challenging, as demonstrated in the following example.

Example 1.
To explore the effects of τ on an embedding, I first construct an artificial time series by integrating the Rössler system [96]

    \dot{x} = -y - z    (2.5)
    \dot{y} = x + ay    (2.6)
    \dot{z} = b + z(x - c)    (2.7)

with a = 0.2, b = 0.2, and c = 10.0, using a standard fourth-order Runge-Kutta integrator starting from [x(0), y(0), z(0)]^T = [10, 0, 0]^T for 100,000 time steps with a time step of π/100. This results in a trajectory of the form \vec{y}(t_j) = [x(t_j), y(t_j), z(t_j)]^T, where t_j = j(π/100) for j = 1, ..., 100,000. This trajectory is plotted in Figure 2.1(a). To discard transient behavior, I remove the first 1,000 points of this trajectory. I define the observation function as h(\vec{y}(t_j)) = x(t_j) = x_j, resulting in the time series {x_j}_{j=1001}^{100,000}. The first 5,000 points of this time series can be seen in Figure 2.1(b).

(Here, by usefulness I mean that not only are the dynamical invariants (e.g., Lyapunov exponents and fractal dimension) and topological properties (e.g., neighborhood relations) preserved, but also that those quantities are attainable from the reconstructed dynamics. This is made more rigorous in Section 2.2, where I discuss information theory.)
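The construction in Example 1 can be sketched as follows, assuming the parameter values, initial condition, and time step quoted above; this is an illustration, not the code used for the thesis experiments.

```python
import numpy as np

def rossler(v, a=0.2, b=0.2, c=10.0):
    x, y, z = v
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4_step(f, v, dt):
    # One step of the standard fourth-order Runge-Kutta scheme.
    k1 = f(v)
    k2 = f(v + 0.5 * dt * k1)
    k3 = f(v + 0.5 * dt * k2)
    k4 = f(v + dt * k3)
    return v + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

dt, n_steps = np.pi / 100, 100_000
traj = np.empty((n_steps, 3))
v = np.array([10.0, 0.0, 0.0])
for j in range(n_steps):
    v = rk4_step(rossler, v, dt)
    traj[j] = v

x = traj[1000:, 0]            # observation function h picks off x; transient discarded
tau = 46
E2 = np.column_stack([x[tau:], x[:-tau]])   # the 2-D reconstruction E[2, 46]
```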
Figure 2.1: The Rössler attractor and a segment of the time series of its x coordinate: (a) Rössler trajectory; (b) Rössler time series.

To illustrate the role of τ in the delay-coordinate embedding process, I embed {x_j}_{j=1001}^{100,000} using m = 2 and several different choices of τ. These embeddings are shown in Figure 2.2. In theory, each of the choices of τ in Figure 2.2 should yield correct, topologically equivalent embeddings—given the right choice of m. In practice, however, that is not the case.

First consider the top-left panel of Figure 2.2, where τ = 1. Here, the axes are spread apart so little that the embedding appears to be a noisy line. This is because x_j and x_{j-τ} are effectively indistinguishable at this small τ. In the embedding in the bottom-right panel of Figure 2.2, the reconstruction appears to be a ball of noise with only traces of underlying structure. At this large τ, the coordinates of the reconstruction are causally unrelated. This is known as "overfolding." To visualize this concept, consider the progression in Figure 2.2 from τ = 30 to τ = 341. As τ increases, the reconstruction expands away from the diagonal and begins to resemble the original attractor. However, as τ increases past this point, the top corner of the reconstruction is slightly folded over. This "melting" effect is called folding in the literature. "Overfolding" occurs when the reconstructed attractor folds back on itself, completely collapsing back to the diagonal (as can be seen for τ = 101) and then re-expanding away from the diagonal (as can be seen for τ = 341). Overfolding produces an unnecessarily complex (and, in the case of noise, often incorrect) reconstruction [24, 59, 95]. Compare, for example, the bottom-left panel of Figure 2.2 with the actual attractor in Figure 2.1(a). From a theoretical standpoint, given the right choice of m, these two objects are topologically equivalent; from a practical standpoint, however, the embedding is overly complex.

If the time series were noisy, this overfolding would likely introduce additional error. With knowledge of the true attractor, it is easy to say that the τ = 30 and τ = 46 embeddings most closely match its shape; without that knowledge, however, the choice is not obvious. The situation is even more delicate than this. If one knew, somehow, that τ = 30 and τ = 46 were both good reconstructions, how would one know which of these two choices was optimal? With τ = 30, no folding has occurred, which is beneficial because with noisy or projected dynamics (choosing m too small), foldings may cause false crossings. (A false crossing occurs when two trajectories intersect due to projection or measurement error, a phenomenon that cannot happen in a theoretical deterministic dynamical system.) But the trajectory in E[2, 30] is not as "space filling" as the one in E[2, 46], or as spread apart from the diagonal, so the coordinates are most likely more redundant. Weighing the importance of these kinds of criteria is non-trivial and, I believe, application-specific. In the rest of this section, I review heuristics aimed at optimizing the estimation of τ by weighing these different attributes against one another.

Figure 2.2: Delay-coordinate reconstruction of the Rössler time series in Figure 2.1(b) with m = 2 and varying τ.

There are dozens of methods for estimating τ—e.g., [17, 18, 33, 42, 46, 59, 67, 68, 88, 95]. This is a central issue in my thesis, so the following section surveys this literature in some depth.
Choice of τ is application- and system-specific [17, 59, 95]; a τ that works well for Lyapunov exponent calculation may not work well for forecasting. For this reason, Kantz & Schreiber [59] suggest that it may be necessary to perform additional system- and application-specific tuning of τ after using any generic selection heuristic. In my first set of examples, I use the method of mutual information [33, 68], described below in detail. While this is the standard τ-selection method, I will show in Section 4.3.1 that this choice is almost always suboptimal for forecasting. In Section 5.1, I provide a solution to this: an alternative τ-selection method that leverages "active information storage" to select a τ that is optimal for forecasting specific reconstructions [42].

A naïve strategy for selecting the time delay would be to choose a τ that forces the coordinates of the delay vectors to be linearly independent. This is equivalent to choosing the first zero of the autocorrelation function R(τ):

    R(\tau) = \frac{1}{N-\tau} \frac{\sum_j (x_j - \mu_x)(x_{j-\tau} - \mu_x)}{\sigma_x^2}    (2.8)

where N, μ_x and σ_x are, respectively, the length, average and standard deviation of the time series [33, 59]. Several other methods have been proposed that suggest instead choosing τ where the autocorrelation function first drops to a particular fraction of its initial value, or at the first inflection point of that function [59, 95]. An advantage of this class of methods is that its members are extremely computationally efficient; the autocorrelation function, for instance, can be calculated with the fast Fourier transform [95]. However, autocorrelation is a linear statistic that ignores nonlinear correlations. This often yields τ values that work well for some systems and not well for others [33, 59, 95].

Instead of seeking linear independence between delay coordinates, it may be more appropriate to seek general independence—i.e., coordinates that share the least amount of information (also called "redundancy") with one another. The following discussion requires some methods from information theory; for a review of these concepts, please refer to Section 2.2. Fraser & Swinney argue that selecting the first minimum of the time-delayed mutual information will minimize the redundancy of the embedding coordinates, maximizing the information content of the overall delay vector [33]. In that approach, one obtains generally independent delay coordinates by symbolizing the two time series X_j = {x_j}_{j=1}^N and X_{j-τ} = {x_{j-τ}}_{j=1+τ}^N by binning (discussed in Section 2.2.5.1) and then computing the mutual information between X_j and X_{j-τ} for a range of τ; call this I[X_j, X_{j-τ}; τ]. For each τ, I[X_j, X_{j-τ}; τ] is the amount of information shared between the coordinates x_j and x_{j-τ}, i.e., it quantifies how redundant the second axis is [33]. According to [33], choosing a τ that minimizes I[X_j, X_{j-τ}] results in generally independent delay coordinates, i.e., delay coordinates that are minimally redundant.
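A simple way to operationalize both criteria (the first zero of R(τ) and the first minimum of the time-delayed mutual information) is sketched below, with the mutual information estimated from a binned joint histogram. The bin count, lag range, and function names are arbitrary choices for this illustration.

```python
import numpy as np

def autocorrelation(x, tau):
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    return np.mean((x[tau:] - mu) * (x[:-tau] - mu)) / var

def lagged_mutual_information(x, tau, bins=64):
    # I[X_j; X_{j-tau}] from a 2-D histogram of the series against its lagged copy.
    pxy, _, _ = np.histogram2d(x[tau:], x[:-tau], bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz])))

def first_minimum(curve):
    # Index of the first local minimum; None if the curve is monotone.
    for t in range(1, len(curve) - 1):
        if curve[t - 1] > curve[t] <= curve[t + 1]:
            return t
    return None

# ami[k] is the mutual information at lag k + 1, so the selected delay is k + 1:
# ami = [lagged_mutual_information(x, tau) for tau in range(1, 300)]
# tau_star = first_minimum(ami) + 1
```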
The argument for choosing τ in this way applies strictly to two-dimensional embeddings [33, 59], but was extended to work in m dimensions in [68]. To accomplish this, Liebert & Schuster rewrote mutual information in terms of second-order Rényi entropies. This transformation allowed them to show that the minima of I[X_j, X_{j-τ}; τ] agree with the minima of the correlation sum [49], C(ε; m, τ), defined as

    C(\epsilon; m, \tau) = \frac{1}{N^2} \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \Theta[\epsilon - \|\vec{x}_i - \vec{x}_j\|]    (2.9)

where N is the length of the time series, Θ is the Heaviside function, and \vec{x}_i, \vec{x}_j are the i-th and j-th delay vectors in E[m, τ]. In addition to extending the argument of [33] to m dimensions, the modification of [68] allowed for much faster approximations of τ by simply finding the minimum of C(ε; m, τ), which can be done quickly with the Grassberger-Procaccia algorithm [49, 68].

The choice of the first minimum of I[X_j, X_{j-τ}; τ] is intended to avoid the kind of overfolding of the reconstructed attractor and irrelevance between coordinates that was demonstrated in Figure 2.2. This choice is discussed and empirically verified in [68] by showing that the first minimum of C(ε; m, τ) (and so, in turn, of I[X_j, X_{j-τ}; τ]) corresponded to the most reliable calculations of the correlation dimension [49].

Definition (Correlation Dimension). If the correlation sum, C(ε), decreases like a power law, C(ε) ∼ ε^D, then D is called the correlation dimension. Formally,

    D = \lim_{\epsilon \to 0} \frac{\log C(\epsilon)}{\log \epsilon}    (2.10)

if this limit exists. The Grassberger-Procaccia algorithm [49] allows the correlation dimension to be approximated for E[m, τ] as

    D = \lim_{\epsilon \to 0} \frac{\log C(\epsilon; m, \tau)}{\log \epsilon}    (2.11)

(The term Grassberger-Procaccia algorithm is used generically for any algorithm that estimates the correlation dimension, and more generally the correlation integral, from the small-ε behavior of the correlation sum C(ε; m, τ).)

It was not shown in [68], however, that this choice corresponds to the best choice of τ for the purposes of forecasting. In Section 4.3.1, I show that this is in fact not the case for all time series. Even so, it is a reasonable starting point, as this method is the gold standard in the associated literature. In my first round of experiments, and as a point of comparison, I select τ at the first minimum of the mutual information [33, 68] as calculated by mutual in the TISEAN package [53].

There are a few possible drawbacks to this method. For example, there is no guarantee that I[X_j, X_{j-τ}; τ] will ever achieve a minimum; a first-order autoregressive process, for example, does not [59]. Rosenstein et al. [95] argue that calculating mutual information is too computationally expensive. Several papers [51, 75] have argued that mutual information can give inconsistent results, especially with noisy data.

There are several geometric and topological methods for approximating τ that address some of the shortcomings of mutual information, including: wavering product [67], fill factor, integral local deformation [18], and displacement from diagonal [95], among others. Most of these methods have the distinct advantage of attempting to solve for both m and τ simultaneously, albeit at the cost of being more complicated and less computationally efficient. (This additional computational overhead is not a factor in my reduced-order framework, as I explicitly fix m = 2.)

The wavering product of Liebert et al. [67] is a topological method for simultaneously determining embedding dimension and time delay. This approach focuses on detecting when the attractor is properly unfolded, i.e., the situation in which projection-induced overlap disappears.
Liebert et al. focused on preserving neighborhood relations of points in E[m, τ]. When transitioning from E[m, τ] to E[m+1, τ], an embedding preserves the neighborhood relations of every point in E[m, τ]: inner points remain inner points, and analogously for boundary points. If these neighborhood relations are preserved, then m is a sufficient embedding dimension. The so-called "direction of projection" [67] that mitigates false crossings is associated with the best choice of τ, i.e., the τ that yields (for a fixed dimension) the smallest amount of overlap. To this end, they defined two quantities

    Q_1(i, k, m, \tau) = \frac{dist^\tau_{m+1}(i, j(k,m))}{dist^\tau_{m+1}(i, j(k,m+1))}, \qquad Q_2(i, k, m, \tau) = \frac{dist^\tau_{m}(i, j(k,m))}{dist^\tau_{m}(i, j(k,m+1))}    (2.12)

where dist^\tau_{m+1}(i, j(k,m)) is the standard Euclidean distance, measured in E[m+1, τ], between the i-th reference point \vec{x}_i in E[m, τ] and its k-th nearest neighbor \vec{x}_{j(k,m)} in E[m, τ]—and similarly for dist^\tau_{m+1}(i, j(k,m+1)), where \vec{x}_{j(k,m+1)} is the k-th nearest neighbor of \vec{x}_i in E[m+1, τ]. To determine if the neighborhood relations are preserved in the embedding, they defined the wavering product

    W_i(m, \tau) = \left( \prod_{k=1}^{N_{nb}} Q_1(i, k, m, \tau) \, Q_2(i, k, m, \tau) \right)^{1/(2 N_{nb})}    (2.13)

where N_{nb} is the number of neighbors used in each neighborhood. If W_i(m, τ) ≈ 1, then the topological properties are preserved locally by the embedding [67].
In order to compute this globally, Liebert et al. defined the average wavering product as

    \overline{W}(m, \tau) = \frac{1}{N_{ref}} \sum_{i=1}^{N_{ref}} \ln W_i(m, \tau)    (2.14)

where N_{ref} is the number of reference points, typically chosen to be about 10% of the signal. A minimum of \overline{W}(m, \tau)/\tau as a function of τ yields an optimal τ for that choice of m. They also showed that a sufficient embedding dimension can be found when \overline{W}(m, \tau)/\tau converges to zero. Choosing the embedding parameters in this way guarantees that the embedding faithfully preserves neighborhood relations. This is particularly important when forecasting based on neighbor relations. This technique works very well on many systems, including the Rössler system and the Mackey-Glass system. In particular, Liebert et al. showed that choosing m and τ in this way allowed for accurate estimation of the information dimension [49]. Noise, however, is a serious challenge for this heuristic [62], so it may not be useful for real-world datasets.

Integral local deformation, introduced by Buzug & Pfister in [18], attempts to maintain continuity of the dynamics on the reconstructed attractor: viz., neighboring trajectories remain close for small evolution times. The underlying rationale is that false crossings will cause what look like neighboring trajectories to separate exponentially in very short evolution time. Integral local deformation quantifies this. Buzug & Pfister show that choosing m and τ to minimize this quantity gives an approximation of τ that minimizes false crossings created by projection. In my work, I rely strongly on the continuity of the reconstructed dynamics, since I use the image of neighboring trajectories for forecasting. Integral local deformation seems useful at first glance for choosing a τ that helps to preserve the continuity of the underlying dynamics in the face of projection. However, the computational complexity of this measure makes it ineffective for on-the-fly adaptation or selection of τ.

In [18], Buzug & Pfister also introduced a purely geometric heuristic for estimating τ: the fill factor. This method attempts to maximally fill the embedding space by spatially spreading out the points as far as possible. To accomplish this, Buzug & Pfister calculate the average volume of a large number of m-dimensional parallelepipeds, each spanned by a set of m + 1 arbitrarily chosen m-dimensional delay vectors. They then show that the first maximum of the average of these volumes as a function of τ (for a fixed m) maximizes the distance between trajectories. This method is computationally efficient, as no near-neighbor searching is required. However, for any attractor with multiple unstable foci, there is no significant maximum of the fill factor as a function of τ [18, 95]. In addition, this method cannot take into account overfolding, as an overfolded embedding may be more space-filling than the "properly" unfolded counterpart [95]. This consideration is corrected (at the cost of additional computational complexity) in the method described next.
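The fill-factor computation itself is straightforward to sketch: draw many random sets of m + 1 delay vectors, measure the volume of the parallelepiped they span, and average. The trial count, seed, and function name below are arbitrary choices for this illustration.

```python
import numpy as np

def fill_factor(x, m, tau, trials=1000, seed=0):
    # Average volume of m-dimensional parallelepipeds spanned by m + 1
    # randomly chosen delay vectors in E[m, tau].
    rng = np.random.default_rng(seed)
    n = len(x) - (m - 1) * tau
    pts = np.column_stack([x[(m - 1 - i) * tau:(m - 1 - i) * tau + n] for i in range(m)])
    vols = np.empty(trials)
    for t in range(trials):
        idx = rng.choice(n, size=m + 1, replace=False)
        edges = pts[idx[1:]] - pts[idx[0]]      # m edge vectors from a common corner
        vols[t] = abs(np.linalg.det(edges))
    return vols.mean()

# Buzug & Pfister: pick tau at the first maximum of fill_factor(x, m, tau) over tau.
```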
The average displacement method introduced by Rosenstein et al. [95], which is also known as the displacement from diagonal method [59], also seeks a τ that causes the embedded attractor to fill the space as much as possible, while mitigating error caused by overfolding and also addressing some other concerns [18]. Rosenstein et al. define the average displacement (from diagonal) for E[m, τ] as

    \langle S_m(\tau) \rangle = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \sum_{j=1}^{m-1} (x_{i+j\tau} - x_i)^2 }    (2.15)

For a fixed m, \langle S_m(\tau) \rangle increases with increasing τ (at least initially; the attractor may collapse for large τ due to overfolding). Rosenstein et al. suggest choosing τ and m where the slope between successive \langle S_m(\tau_i) \rangle drops to around 40% of the slope between \langle S_m(\tau_1) \rangle and \langle S_m(\tau_2) \rangle, where τ_1 and τ_2 are the first and second choices of τ. In noisy data sets, this leads to consistent and accurate computation of the correlation dimension. However, this—like most heuristics—was developed to correctly approximate dynamical invariants (e.g., correlation dimension), and comes with no guarantees about forecast accuracy.
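Equation (2.15) translates almost verbatim into code; a minimal sketch:

```python
import numpy as np

def average_displacement(x, m, tau):
    # <S_m(tau)>: mean Euclidean displacement of each delay vector
    # from the diagonal point (x_i, x_i, ..., x_i).
    n = len(x) - (m - 1) * tau
    sq = np.zeros(n)
    for j in range(1, m):
        sq += (x[j * tau:j * tau + n] - x[:n]) ** 2
    return np.mean(np.sqrt(sq))

# Rosenstein et al.: increase tau until the local slope of <S_m(tau)>
# falls to roughly 40% of its initial value.
```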
Remark. Several papers (e.g., [64, 77, 85, 95, 105]) have claimed that the emphasis should be placed on the window size τ_w = τ m rather than on τ or m independently. The basic premise behind this idea is that it is more important to choose τ_w to span an important time segment (e.g., the mean orbital period) than to choose either τ or m correctly on its own. This is something I have not found to be the case when choosing parameters for delay reconstruction-based forecasting.
Figure 2.3: An illustration of the utility of higher embedding dimensions to eliminate false crossings in the dynamics: (a) the Rössler time series in E[2, 45]; (b) the Rössler time series in E[3, 45].

2.1.3 Traditional Methods for Estimating the Embedding Dimension m

As the embedding dimension m is not a parameter in my reduced-order algorithm, I only review a few important methods for estimating it. This discussion is important mainly because these conventions are the point of departure (and comparison) for my work.

A scalar time series {x_j}_{j=1}^N measured from a dynamical system is a projection of the original state space onto a one-dimensional sub-manifold. A fundamental concern in the theoretical embedding dimension requirement m > 2 d_cap is to ensure that the embedding has enough dimensions to "spread out" and thus avoid false crossings. Such crossings violate several properties of deterministic dynamical systems, e.g., determinism, uniqueness and continuity. In Figure 2.3(a), for example, the E[2, 45] embedding of the x-coordinate of the Rössler system contains trajectory crossings.
In Figure 2.3(b), however, the top-right region of Figure 2.3(a) has folded under the attractor and the intersections at the top left of Figure 2.3(a) have become a "tunnel." The issue in two dimensions is that the dynamics do not have enough space to spread apart; when the dimension is increased, the attractor can spread out and the intersections disappear. According to [99], choosing m > 2 d_cap ensures that the attractor has enough space to spread out and false crossings will not occur. More precisely, the probability of a crossing occurring in a ball of size ε is p_c ∝ ε^{m - 2 d_cap}. Recall from Section 2.1.1, however, that this holds for an infinitely long noise-free time series and may not hold in practice, as noise can easily cause false crossings and violate the assumptions that went into this estimation. It should also be noted that, even if I knew the capacity dimension d_cap of the system—which is generally not the case—I do not necessarily want to choose m to be 2 d_cap + 1. This is a (generally loose) sufficient bound that should ensure the correctness of the embedding. But it is often the case that the embedding unfolds completely before m = 2 d_cap + 1. The Lorenz system [71] has d_cap = 2.06 ± 0.01, for example.
The bound of [99] would suggest using m = 5, but in fact this system can be embedded properly using m = 3 [62].

Naïvely, it may seem that simply choosing an "extremely large" m would be a simpler and completely reasonable choice. This is not true in practice. First, the complexity of many of the algorithms that deduce information about dynamical systems scales exponentially with m [68]. Worse yet, each noisy data point creates m noisy embedding points in the reconstruction [24]. This amplification of noise quickly destroys the usefulness of an embedding. In light of both of these concerns, good values for the minimal m are highly sought after. For a noisy real-valued time series, this is still an open problem, but there exist several heuristic approximations (e.g., [18, 20, 53, 59, 62, 64, 67]). Recall, too, that several of the methods presented in the previous section for estimating τ—e.g., the wavering product [67] and integral local deformation [18]—simultaneously estimate both the delay τ and the dimension m, the other free parameter in the embedding process. There are two standard classes of methods for estimating the minimal m: the method of dynamical invariants and the method of false neighbors. In the following sections, I review the basics of these two families.

Dynamical invariants, such as the correlation dimension, are topological measures of a system that persist under diffeomorphism. In theory, this means that once a particular choice of embedding dimension, say \hat{m}, yields a topologically valid reconstruction, increasing m should have no impact on these dynamical invariants. This is the case because, in theory, every E[m > \hat{m}, τ] will be topologically conjugate—to one another and to the original dynamics. This implies that dynamical invariants will become persistent for increasing m, once \hat{m} has been reached. Hence, choosing the first m for which dynamical invariants stop changing is a good way to estimate the minimal dimension needed to obtain a topologically valid reconstruction. The class of methods that is the topic of this section follows directly from this logic: to choose m, one approximates some dynamical invariant (e.g., the dominant Lyapunov exponent or the correlation dimension) for a range of embedding dimensions, chooses the first embedding dimension for which it becomes persistent, and then corroborates with other dynamical invariants.

For example, one can approximate the correlation dimension for a range of embedding dimensions using the Grassberger & Procaccia algorithm [49], choosing the first m for which that approximation stops changing. One then corroborates this choice by approximating the dominant Lyapunov exponent for a range of m (using, for example, the algorithm in [120]) and choosing the first m where this result stops changing. Finally, one ensures these two estimates of m are consistent with each other.

Recall, though, that noise in the data can impact any dynamical invariant calculation, and that that impact increases with m [24]. It is more often the case that there is a range of embedding dimensions for which the dynamical invariant being approximated stays "fairly consistent." Ascertaining this is computationally expensive and requires time-intensive post-processing and human interpretation. For these reasons, it is common to use an alternative heuristic, such as those covered in the next section, to narrow down the search to a smaller range of embedding dimensions and then select from this range using the method of dynamical invariants.
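The following sketch shows the skeleton of that procedure for the correlation dimension, using a brute-force O(N²) correlation sum (so suitable only for short series) and assuming the ε range is chosen to lie in the scaling region; all names and defaults are illustrative.

```python
import numpy as np

def delay_reconstruct(x, m, tau):
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[(m - 1 - i) * tau:(m - 1 - i) * tau + n] for i in range(m)])

def correlation_sum(pts, eps):
    # Fraction of distinct pairs closer than eps (Eq. 2.9 up to normalization).
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return (np.count_nonzero(d < eps) - n) / (n * (n - 1))

def correlation_dimension(pts, eps_lo, eps_hi, n_eps=12):
    # Slope of log C(eps) versus log eps over the assumed scaling region.
    eps = np.geomspace(eps_lo, eps_hi, n_eps)
    logc = np.log([correlation_sum(pts, e) for e in eps])
    slope, _ = np.polyfit(np.log(eps), logc, 1)
    return slope

# Method of dynamical invariants, in caricature: estimate the dimension for
# m = 1, 2, 3, ... and take the first m at which the estimate stops changing.
# dims = [correlation_dimension(delay_reconstruct(x, m, tau), 0.05, 0.5) for m in range(1, 8)]
```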
The method of false neighbors was proposed by Kennel et al. in [62]. This heuristic searches for points that appear close only because the embedding dimension is too small. Consider a point on the top of the tunnel in Figure 2.3(b) and a point directly below it on the planar part of the Rössler attractor. These two points are near neighbors in E[2, 45] because the tunnel collapses down onto the planar region; however, they are not near neighbors in E[3, 45] because the embedded attractor inflates, separating points on the top of the tunnel from points on the planar region. Since these two points are neighbors in E[2, 45] and not neighbors in E[3, 45], they are false near(est) neighbors at m = 2. Consider, in contrast, two neighboring points on the top of the tunnel in E[3, 45]: these points are near one another in both E[2, 45] and E[3, 45], so they are true neighbors.

Kennel et al. define the k-th nearest neighbor \vec{x}_{j(k,m)} ∈ E[m, τ] of \vec{x}_i ∈ E[m, τ] as a false near(est) neighbor if

    \left( \frac{dist^\tau_{m+1}(i, j(k,m))^2 - dist^\tau_{m}(i, j(k,m))^2}{dist^\tau_{m}(i, j(k,m))^2} \right)^{1/2} > R_{tol}    (2.16)

where R_{tol} is some tolerance. Recall that dist^\tau_m(i, j(k,m)) is the distance between the i-th point \vec{x}_i ∈ E[m, τ] and its k-th nearest neighbor \vec{x}_{j(k,m)} ∈ E[m, τ]. But notice, for delay vectors, that dist^\tau_{m+1}(i, j(k,m))^2 = |x_{i-m\tau} - x_{j(k,m)-m\tau}|^2 + dist^\tau_m(i, j(k,m))^2, so this condition simplifies to

    \frac{|x_{i-m\tau} - x_{j(k,m)-m\tau}|}{dist^\tau_m(i, j(k,m))} > R_{tol}    (2.17)

In particular, a neighbor is a false neighbor if the distance between the two points in E[m+1, τ] is significantly greater (viz., by a factor of R_{tol}) than the distance between the two neighbors in E[m, τ]. Kennel et al. claim that choosing a single nearest neighbor is sufficient (i.e., k = 1) [62].
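A sketch of the false near(est) neighbor test of Equation (2.17), using Euclidean distances, k = 1, and a brute-force neighbor search; the function name and defaults are illustrative only.

```python
import numpy as np

def fnn_fraction(x, m, tau, r_tol=10.0):
    # Fraction of nearest neighbors in E[m, tau] that are false by Eq. (2.17).
    n = len(x) - m * tau           # keep only vectors extendable to E[m+1, tau]
    full = np.column_stack([x[(m - i) * tau:(m - i) * tau + n] for i in range(m + 1)])
    pts, extra = full[:, :m], full[:, m]   # E[m, tau] part and the coordinate x_{j-m*tau}
    n_false = 0
    for i in range(n):
        d = np.linalg.norm(pts - pts[i], axis=1)
        d[i] = np.inf                      # a point is not its own neighbor
        j = int(np.argmin(d))
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > r_tol:
            n_false += 1
    return n_false / n

# Standard practice: increase m until fnn_fraction(x, m, tau) falls below ~10%.
```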
In addition, they claim that, empirically, R_{tol} ≥ 10 seems to give robust results. This tolerance can be interpreted as defining false neighbors as points that are 10 times farther apart in E[m+1, τ] than in E[m, τ].

This heuristic alone is not enough to distinguish chaos from uniform noise and can incorrectly classify time series constructed from a uniform distribution as having low-dimensional dynamics. Kennel et al. found that for a uniformly distributed random time series, on average, the nearest neighbor of a point is not near at all. Rather, dist^\tau_m(i, j(k,m)) ≈ R_A, where

    R_A = \sqrt{ \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu_x)^2 }

That is, the average distance to the nearest neighbor is on the order of the size of the attractor. To handle this, they defined a secondary heuristic, dist^\tau_{m+1}(i, j(k,m)) / R_A > A_{tol}, where A_{tol} is another free parameter, chosen as 2.0 without justification. I want to note that this heuristic is added to distinguish pure uniform noise from chaotic dynamics, not to aid in estimating embedding dimension for noisy observations of a chaotic system.

For a time series with noise, near-neighbor relations—which are the basis for this class of heuristics—can cause serious problems in practice. For well-sampled, noise-free data, it makes sense to choose m as the first embedding dimension for which the ratio of true to false neighbors goes to zero [62]. For noisy data, however, this is unrealistic; in practice, the standard approach is to choose the first m for which the percentage of false near(est) neighbors drops below 10%. If topological correctness is vitally important for the application, a range of embedding dimensions for which the percentage of false near(est) neighbors drops to around ≈10% is typically chosen, and then this range is refined using the method of dynamical invariants described above.
This 10% cutoff is an arbitrary threshold, however; depending on the magnitude of the noise present in the data, it may need to be adjusted, as may R_tol and A_tol. For example, in the computer performance data presented in Section 3.2.1.2, the percentage of false near(est) neighbors rarely dropped below even 20%.

Recently, Cao [20] proposed an extension of the false near(est) neighbor method that attempts to get around the three different tolerances (R_tol, A_tol, and the percentage of false neighbors) in [62]. Cao points out that the tolerances—in particular, R_tol—need to be specified on a per-time-series and even per-dimension basis. Assigning these tolerances universally is inadvisable and in many cases leads to inconsistent estimates; in [20], he illustrates that different choices of these three tolerances result in very different estimates of m. To get around this, he defines an alternative, tolerance-free statistic:

E(m) = \frac{1}{N - m\tau} \sum_{i=1}^{N - m\tau} \frac{dist^{\tau}_{m+1}(i, j(k,m))}{dist^{\tau}_{m}(i, j(k,m))}   (2.18)

While the summand is very similar to Equation (2.17) of [62], the numerator is slightly different: dist^{\tau}_{m+1}(i, j(k,m)) instead of |x_{i-m\tau} - x_{j(k,m)-m\tau}|. That is, the former measures the distance between an element and its kth-nearest neighbor in E[m, τ], measured in E[m+1, τ], whereas the latter measures the change in the distance between \vec{x}_i, \vec{x}_{j(k,m)} ∈ E[m, τ] and the same vectors extended into E[m+1, τ]. Cao then defines E1(m) = E(m+1)/E(m) and shows that when E1(m) stops changing, a sufficient embedding dimension has been found. He also claims that if E1(m) never stops changing, then one is observing noise rather than deterministic dynamics [20]. Cao does admit that it is sometimes hard to determine whether the E1(m) curve is growing slowly but will eventually plateau (the case of high-dimensional dynamics) or growing without bound (the case of noise); to deal with this, he defines a secondary heuristic to help distinguish these two cases. As this method has been shown to give more consistent estimates of m, I hoped that it could provide a more accurate comparison point. However, I was never able to successfully replicate the results of [20] on any experimental data, so I chose to use the traditional version of this algorithm proposed by Kennel et al. in [62]. A sketch of Cao's statistic is given below.
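A compact sketch of Cao's statistic, under the same conventions as the previous sketch; cao_E1 is a hypothetical name, and the max norm follows [20]:

    import numpy as np

    def cao_E1(x, tau, m_max):
        # E1(m) = E(m+1)/E(m) for m = 1 .. m_max (Eq. 2.18)
        def E(m):
            idx = np.arange(m * tau, len(x))
            emb = np.column_stack([x[idx - i * tau] for i in range(m)])
            extra = x[idx - m * tau]
            ratios = []
            for a in range(len(idx)):
                d = np.max(np.abs(emb - emb[a]), axis=1)  # max norm, as in [20]
                d[a] = np.inf
                b = np.argmin(d)
                # under the max norm, the distance in E[m+1, tau] is the larger
                # of the m-dimensional distance and the new-coordinate gap
                d_up = max(d[b], abs(extra[a] - extra[b]))
                if d[b] > 0:
                    ratios.append(d_up / d[b])
            return np.mean(ratios)
        Es = [E(m) for m in range(1, m_max + 2)]
        return [Es[m] / Es[m - 1] for m in range(1, m_max + 1)]

    # a sufficient embedding dimension is the m at which E1(m) stops changing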
The astute reader may have noticed a similarity between the method of false neighbors [62], the method of wavering products [67], and the method of Cao [20]. These methods are indeed quite similar: almost all of the methods for determining the minimum embedding dimension [18, 20, 53, 59, 62, 64, 67] are based in some way on minimizing the number of false crossings. As the fine distinctions among them are not important for my work, I do not go into these nuances, but simply use the standard false near(est) neighbor approach, to which the rest of these methods are fundamentally related. In particular, I use the TISEAN [53] implementation of this algorithm (false_nearest) to choose m, with a ≈20% threshold on the percentage of false neighbors and the R_tol and A_tol selected by the TISEAN implementation. In later discussion, I refer to the reconstruction produced in this manner as an embedding of the data. This procedure is by no means perfect, but since it is the most widely used method for estimating m, it is the most useful for the purposes of comparison.

As discussed in Section 2.1.1, the theory of delay-coordinate embedding [99, 113] provides beautiful machinery for reconstructing—up to diffeomorphism—the dynamics of a system from a scalar observation. Unfortunately, this theoretical machinery requires infinitely long, noise-free observations of the dynamical system: luxuries that are never afforded to a time-series analyst in practice. While there is a tremendous amount of informative literature on estimating the free parameters of delay-coordinate embedding, at the end of the day these heuristics are just that: empirical estimates with no theoretical guarantees. This means that, even if the most careful and rigorous in-depth analysis is used to estimate τ and m, there is no way to guarantee, in the experimental context, that the reconstructed dynamics are in fact diffeomorphic to the observed dynamics.

Even worse, overestimating m has drastic impacts on the usefulness of the embedding, as it drastically amplifies the noise present in the reconstruction. If little usable structure is present in a time series in the first place, corrupting that precious structure by amplifying noise is something a practitioner can ill afford to do. Moreover, the methods most commonly used to estimate m are based on neighbor relations, which are easily corrupted by noisy data; as a result, these heuristics tend to overestimate m.

In addition to the noise-amplification concerns and the lack of theoretical guarantees, the methods for estimating the minimal embedding dimension are highly subjective, dependent on the estimate of τ, and require a great deal of human intuition to interpret correctly. This time-consuming, error-prone, human-intensive process makes it effectively impossible to use delay-coordinate embedding for automated or 'on-the-fly' forecasting. As stated in Chapter 1, this is unfortunate, because delay-coordinate embedding is such a powerful modeling framework. My reduced-order framework—the foundation of this thesis—will, I hope, at least partially rectify this shortcoming.

In this section, I provide a basic overview of notation and concepts from Shannon information theory [103], as well as a review of some more-advanced topics that are used throughout the thesis. I first cover the basics, which an expert in this field can easily skip. I then move on to non-traditional topics, viz., multivariate information theory (Section 2.2.3), methods for computing information measures on real-valued time series (Section 2.2.5), and measures that quantify the predictability of a real-valued time series (Section 2.2.6).
Perhaps the most fundamental building block of information theory is the Shannon entropy.
Definition ((Shannon) Entropy [103]). Let Q be a discrete random variable with support {q_1, ..., q_n} and a probability mass function p that maps a possible symbol to the probability of that symbol occurring, e.g., p(q_i) = p_i, where p_i is the probability that an observation q is measured to be q_i. The average amount of information gained by taking a measurement of Q, and thereby specifying an observation q, is the Shannon entropy (or simply entropy) H of Q, defined by

H[Q] = -\sum_{i=1}^{n} p(q_i) \log(p(q_i))   (2.19)

Throughout this thesis, log is calculated with base two, so that the information is in bits. The entropy H[Q] can be interpreted as the amount of "surprise" in observing a measurement of a discrete random variable Q, or equivalently the average uncertainty in the outcome of a process, or the amount of "information" in each observation of a process.

Example 2 (Entropy of fair and biased coins). First consider a fair coin: Q = {h, t} with p(h) = p(t) = 1/2. Then

H[Q] = -[p(h) \log(p(h)) + p(t) \log(p(t))]   (2.20)
     = -[\tfrac{1}{2} \log(\tfrac{1}{2}) + \tfrac{1}{2} \log(\tfrac{1}{2})]   (2.21)
     = -(-\tfrac{1}{2} - \tfrac{1}{2}) = 1   (2.22)
At every flip of the coin there is one bit of "new" information, or one bit of surprise. In contrast, consider an extremely biased coin with heads on both sides: Q = {h, t} with p(h) = 1 and p(t) = 0. Then H[Q] = 0; i.e., there are zero bits of "new" information at each toss, as the coin always gives heads. Another way to view this: because H[Q] = 1 for the fair coin, on average one (optimal) yes-no question needs to be asked to determine the outcome of the flip: "Was the coin a head?" With the biased coin, however, the entropy is zero, which means that on average no questions are needed to infer that the observation was a head (it always is!). The following example clarifies this.

Example 3 (Entropy of Animal-Vegetable-Mineral [28]). You may have, at some point during your childhood, played the game "Animal-Vegetable-Mineral." If not, the rules are simple: player one thinks of an object and, by a series of yes-no questions, the other players attempt to guess it. "Is it bigger than a breadbox?" No. "Does it have fur?" Yes. "Is it a mammal?" No. This continues until the players can guess the object.

As anyone who has played this game will attest, some questions are better than others—for example, you usually try to focus on general categories first (hence the name of the game itself—is it an animal?) and then get more specific within that category. Asking on the first round "is it a dissertation?" is likely to waste time—unless, perhaps, you are playing with a graduate student who is about to defend.

If a game lasts too long, you may begin to wonder if there exists an optimal set of questions to ask. "Could I have gotten the answer sooner, if I had skipped that useless question about the fur?" A moment's reflection shows that, in fact, the optimal set of questions depends upon the player: if someone is biased towards automobiles, it is sensible to focus on questions that specify make, model, year, etc. You could then imagine writing down a script for this player: "first ask if it is a car; then if yes, ask if it is domestic; if no, ask if it is a Honda...;" or for my nieces: "first ask if it's Elsa from Frozen." (It almost always is.)

For each player and their preferences (i.e., for every probability distribution over the things that player might choose), there is an optimal script. And for each optimal script for a given person, the game will last five rounds, or ten rounds, or seven, or twenty, depending on what they choose that time. Profoundly, the number of questions you have to ask on average, for a particular person and optimal-script pair, is given by Equation (2.19). In particular, we are measuring information (and uncertainty): the average number of yes-no questions we need to ask to find out the answer. A short numerical check of the coin example appears below.
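The following snippet verifies the coin entropies of Example 2; entropy is a hypothetical helper, not library code:

    import numpy as np

    def entropy(p):
        # Shannon entropy in bits (Eq. 2.19); zero-probability symbols contribute 0
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))  # fair coin       -> 1.0 bit per flip
    print(entropy([1.0, 0.0]))  # two-headed coin -> 0.0 bits per flip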
Understanding entropy is important to the rest of the discussion in this chapter, as it is the fundamental building block of all other information-theoretic quantities.
It is often interesting to consider how knowledge about one thing informs us about another. People carrying umbrellas, for example, tell us something about the weather; the relationship is not perfect, but (informally) if you tell me something about the weather, you also reduce my uncertainty about umbrella-carrying [28]. To constructively introduce the so-called mutual information, I adapt the following example from [28].

How can we quantify the information shared between the weather W and umbrella-carrying U? For simplicity, assume you only get to see one person, who is either carrying (u) or not carrying (ū) an umbrella, i.e., U = {u, ū} with probabilities p(u) and p(ū).

Now assume that there are some finite number of weather types (say "rain", "cloudy", "windy", etc., labeled w_j), each with a probability of occurring, p(w_j). Then, from Section 2.2.1, the uncertainty in the weather is simply

H[W] = -\sum_{j} p(w_j) \log(p(w_j))   (2.23)

We are interested in the probability of seeing a particular weather type given that we see the person carrying an umbrella. For this, consider the conditional probability of weather type j given that you see someone carrying an umbrella, p(w_j | u). Generally, p(w_j | u) will be higher when j labels weather with precipitation than when it does not, so the uncertainty of the weather given that someone is carrying an umbrella is

H[W | u] = -\sum_{j} p(w_j | u) \log(p(w_j | u))   (2.24)

or, in words, "the uncertainty about the weather, given that the person who walked in was carrying an umbrella." Similarly, for the reverse case, we can compute the associated uncertainty H[W | ū]: "the uncertainty about the weather, given that the person who walked in was not carrying an umbrella." Averaging these two quantities, weighted by the probabilities of the two umbrella states, gives the conditional entropy between the two variables (weather type and umbrella-carrying state in this example).

Definition (Conditional Entropy [103]). Define Q and R to be discrete random variables with support {q_1, ..., q_n} and {r_1, ..., r_m}, respectively. Then the conditional entropy is defined as

H[Q | R] = -\sum_{i} p(r_i) \sum_{j} p(q_j | r_i) \log(p(q_j | r_i))   (2.25)

where p(q_j | r_i) is the conditional probability of q_j given r_i.

We can then quantify the "reduction in uncertainty" about the weather, given that someone is carrying an umbrella, by H[W] - H[W | u], and the reverse case with H[W] - H[W | ū]. Note that this reduction can be positive or negative: in some climates, seeing your colleague not carrying an umbrella will make you more uncertain about the weather. Consider, for example, an extremely rainy climate: it is either sunny, cloudy, or rainy, but most often rainy. You are generally quite certain about the weather before you see your colleague (it is raining), so when they walk through the door without their umbrella, you think it is less likely to be raining, and you become more uncertain (the options sunny, cloudy, and rainy are now more evenly balanced).

Now consider the "average reduction in uncertainty" of the weather, given the state of umbrella-carrying:

I[W, U] = p(u)(H[W] - H[W | u]) + p(ū)(H[W] - H[W | ū])   (2.26)
        = H[W] - (p(u) H[W | u] + p(ū) H[W | ū])   (2.27)
        = H[W] - H[W | U]   (2.28)

This is called the mutual information; it tells us how much less uncertain we are, on average, about W given that we know U.

Definition (Mutual Information).
Define Q and R to be discrete random variables with support {q_1, ..., q_n} and {r_1, ..., r_m}, respectively, and let H[Q] be the entropy of Q and H[Q | R] be the conditional entropy. Then the mutual information I between Q and R is defined as

I[Q, R] = \sum_{i,j} p(q_j, r_i) \log \frac{p(q_j, r_i)}{p(q_j)\, p(r_i)}   (2.29)
        = H[Q] - H[Q | R]   (2.30)

Note: I[Q, R] = I[R, Q] [33].

It is natural to want to extend this notion of shared information to more than two variables. In the language of this section, that is the situation where I have two or more colleagues with umbrellas U_{C_1} and U_{C_2} and I want to know the average reduction in uncertainty of the weather given the states of U_{C_1} and U_{C_2}, i.e., I[W, U_{C_1}, U_{C_2}]. This is unfortunately not a straightforward generalization, and there is little agreement in the literature about interpreting, or even defining, multivariate mutual information.

I-Diagrams and Multivariate Mutual Information

The mathematical definitions of multivariate mutual information can be quite confusing to interpret, especially when comparing and contrasting them. To clarify this discussion, I use the I-diagrams of Yeung [121], a highly useful technique for visualizing information-theoretic quantities.

Figure 2.4 shows I-diagrams of some of the important quantities introduced in Sections 2.2.1 and 2.2.2: (a) entropy H[Q], (b) joint entropy H[Q, R], (c) conditional entropy H[Q | R], and (d) mutual information I[Q, R]. In an I-diagram, each circle represents the uncertainty in a particular variable and the shaded region is the information quantity of interest; e.g., in (a) we are interested in H[Q]—the uncertainty in Q—so the entire circle is shaded. Figure 2.4(b) introduces a new measure, the joint entropy H[Q, R] = -\sum_{q,r} p(q, r) \log(p(q, r)): the uncertainty about the processes Q and R together, which is depicted in an I-diagram by simply shading both circles.

The real magic of I-diagrams comes from their ability to depict more-complex information-theoretic measures by simply manipulating shaded regions. For example, recall from Section 2.2.2 that the conditional entropy H[Q | R]—Figure 2.4(c)—is the uncertainty about the process Q given knowledge of R. One way of writing this is H[Q | R] = H[Q, R] - H[R]: i.e., subtracting the circle representing H[R] from the shaded region in (b) produces the shaded region in (c). The same can be done with mutual information: recall from Section 2.2.2 that I[Q, R] is the shared uncertainty between Q and R, or I[Q, R] = H[Q] - H[Q | R]; i.e., subtracting the shaded region in (c) from the shaded region in (a) produces the shaded region in (d). While obviously not a proof, this kind of manipulation allows one to easily build intuition about more complicated identities, e.g., the symmetry of mutual information: I[Q, R] = H[Q] - H[Q | R] = H[R] - H[R | Q] = I[R, Q].

Figure 2.4: I-diagrams of (a) the entropy H[Q]; (b) the joint entropy H[Q, R]; (c) the conditional entropy H[Q | R]; and (d) the mutual information I[Q, R].

In the next section, I use I-diagrams to explore three common interpretations of multivariate mutual information: the interaction information [76] (also commonly called the co-information [10]), the binding information [87] (also called the dual total correlation [50]), and the total correlation [118] (also commonly called the multi-information [111]). First, though, the brief sketch below verifies the bivariate identities numerically.
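In this sketch, the joint distribution is an arbitrary toy example, and H is a hypothetical helper:

    import numpy as np

    def H(p):
        # Shannon entropy, in bits, of any (joint) probability table
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    pqr = np.array([[0.4, 0.1],      # toy joint distribution p(q, r)
                    [0.1, 0.4]])
    pq, pr = pqr.sum(axis=1), pqr.sum(axis=0)

    mi = H(pq) + H(pr) - H(pqr)                       # I[Q, R]
    assert np.isclose(mi, H(pq) - (H(pqr) - H(pr)))   # Eq. 2.30: H[Q] - H[Q|R]
    assert np.isclose(mi, H(pr) - (H(pqr) - H(pq)))   # symmetry: I[Q,R] = I[R,Q]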
When interpreting I[Q, R] using I-diagrams, the situation is quite simple, as there is exactly one region of "shared uncertainty." When generalizing even to three variables, I[Q, R, S], the situation becomes much more confusing—and this is reflected in the mathematics as well. Consider the generic three-variable I-diagram in Figure 2.5(a). Instead of having one region of overlap, as in Figure 2.4(d), there are now four; and there are three standard ways of shading these regions to quantify I[Q, R, S].

One interpretation is the so-called interaction information [10, 76]:

C[Q, R, S] ≡ -(H[Q] + H[R] + H[S]) + (H[Q, R] + H[Q, S] + H[R, S]) - H[Q, R, S]   (2.31)

As depicted in Figure 2.5(b), this is the intersection of H[Q], H[R], and H[S]. It describes the reduction in uncertainty that any two of the processes (e.g., Q and R), together, provide regarding the third (e.g., S). While this may seem like the natural extension of mutual information, it does not take into account the information that is shared between two of the processes but not with the third. One common criticism of this interpretation is that C[Q, R, S] is quite often negative: when the shared information between {Q, R} is due entirely to information in S, for example, the interaction information can be negative as well as positive. Many interpretations of negative information have been offered—e.g., that the variable S inhibits (i.e., accounts for or explains some of) the correlation between {Q, R}—but in general negative information is frowned upon [1].

The next obvious step is to take into account the information that is shared between any two of the processes but not with the third, as well as the information shared among all three. This is called the binding information [50, 87]:

B[Q, R, S] ≡ H[Q, R, S] + \sum_{i=0}^{2} \left( H[X_{(i+1)\%3}, X_{(i+2)\%3}] - H[Q, R, S] \right)   (2.32)

where X_0 = Q, X_1 = R, X_2 = S, and % is the modulus operator. This quantity is depicted in Figure 2.5(c). B[Q, R, S] has the nice feature that it is always positive, but it weights information contained in two variables equally with information contained in all three. The total correlation [111, 118],

M[Q, R, S] ≡ \sum_{i=0}^{2} H[X_i] - H[X_0, X_1, X_2]   (2.33)

depicted in Figure 2.5(d), addresses this shortcoming, but is equally criticized for over-emphasizing information that is shared by all three variables.

Figure 2.5: Generalizations of the mutual information to the multivariate case: (a) the generic three-variable I-diagram; (b) the interaction information (or co-information), C[Q, R, S]; (c) the binding information (or dual total correlation), B[Q, R, S]; and (d) the total correlation (or multi-information), M[Q, R, S]. In (d), the centermost region is more darkly shaded to reflect the extra weight that that region carries in the calculation.

The total correlation and binding information are both always positive, but their relative merits are a subject of contention. The sketch below computes all three measures for a small example distribution.
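In the following sketch, the three-variable joint distribution is an arbitrary toy example, and the code uses a simplified form of Equation (2.32)—the sum of the pairwise joint entropies minus twice the full joint entropy, which is algebraically equivalent:

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    p = np.full((2, 2, 2), 1 / 8)   # toy joint distribution p(q, r, s):
    p[0, 0, 0] += 0.05              # independent, then skewed slightly
    p[1, 1, 1] -= 0.05

    Hq, Hr, Hs = H(p.sum((1, 2))), H(p.sum((0, 2))), H(p.sum((0, 1)))
    Hqr, Hqs, Hrs = H(p.sum(2)), H(p.sum(1)), H(p.sum(0))
    Hqrs = H(p)

    C = -(Hq + Hr + Hs) + (Hqr + Hqs + Hrs) - Hqrs  # Eq. 2.31; may be negative
    B = (Hqr + Hqs + Hrs) - 2 * Hqrs                # Eq. 2.32, simplified
    M = (Hq + Hr + Hs) - Hqrs                       # Eq. 2.33; B, M >= 0
    print(C, B, M)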
For a nice comparison of these quantities and a discussion of the associated issues, please see [1]. The takeaway of this section is that extending mutual information, as defined in Section 2.2.2, to even the three-variable case—let alone beyond it—is non-trivial and not at all well understood. This becomes very important in Section 5.1, where I propose a new information-theoretic method for selecting delay-reconstruction parameters.

Note that all of the information theory discussed thus far operates on discrete random variables. The topic of this thesis, however, involves real-valued time series. To compute any information measure on a real-valued time series, one must "symbolize" the data, i.e., map the real values to a set of discrete symbols. Ideally, this symbolization should preserve the information and/or dynamics of the original time series, but that can be hard to accomplish in practice. The processes by which this is accomplished, and the issues that make it difficult, are the focus of this section.
A common (and by far the simplest) symbolization method is binning. To symbolize a real-valued time series {x_j}_{j=1}^{N} by binning, one breaks the support of the time series into n bins, which need not be equally spaced. One then defines a discrete random variable Q with a symbol for each bin b_i, i.e., Q has support {b_i}_{i=1}^{n}. The associated probability mass function is computed using

p(b_i) = \frac{|\{j \,|\, x_j \in b_i\}|}{N}   (2.34)

For example, consider a time series with support on [0, 1] and bins b_1 = [0, 0.5) and b_2 = [0.5, 1]. One computes the probabilities of b_1 and b_2 by counting the number of time-series elements that appear in each of the subintervals [0, 0.5) and [0.5, 1]. Ideally, one would choose the bins so that they form a generating partition of the dynamics [13, 63].

Definition (Generating Partition). Given a dynamical system f : M → M on a measure space (M, F, µ), a finite partition P = {b_i}_{i=1}^{n} is said to be generating if the union of all images and preimages of P gives the set of all µ-measurable sets F. In other words, the "natural" tree of partitions always generates some sub-σ-algebra, but if it gives the full σ-algebra of all measurable sets F, then P is called generating [97].

Unfortunately, even for most canonical dynamical systems—let alone arbitrary real-valued time series—the generating partition is not known or computable. Even when the partition is known, it can be fractal, as is the case with the Hénon map, for example, and thus useless in practice for constructing a finite symbolization. A minimal sketch of simple binning follows.
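A minimal sketch of binning, assuming equal-width bins over the observed support (Eq. 2.34); bin_probabilities is a hypothetical name:

    import numpy as np

    def bin_probabilities(x, n_bins):
        # equal-width binning over the support of x (Eq. 2.34)
        counts, _ = np.histogram(x, bins=n_bins)
        return counts / len(x)

    x = np.random.default_rng(0).uniform(0, 1, 10_000)
    p = bin_probabilities(x, 2)          # bins [0, 0.5) and [0.5, 1]
    print(p)                             # each probability near 0.5
    print(-np.sum(p * np.log2(p)))       # entropy near 1 bit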
A useful alternative to simple binning is the class of methods known as kernel estimation [34, 100], in which the relevant probability density functions are estimated via a kernel function Θ with a resolution, or bandwidth, ρ that measures the similarity between two points in Q × R space. Given points {q_i, r_i} and {q_{i'}, r_{i'}} in Q × R, one defines

\hat{p}_{\rho}(q_i, r_i) = \frac{1}{N} \sum_{i'=1}^{N} \Theta\!\left( \|(q_i - q_{i'},\, r_i - r_{i'})\| - \rho \right)   (2.35)

where Θ(x > 0) = 0 and Θ(x ≤ 0) = 1. That is, \hat{p}_{\rho}(q_i, r_i) is the proportion of the N pairs of points in Q × R space that fall within the kernel bandwidth ρ of {q_i, r_i}, i.e., the proportion of points similar to {q_i, r_i}. When \| \cdot \| is the max norm, this is the so-called box kernel. This too, however, can introduce bias [70], and it is obviously dependent on the choice of bandwidth ρ. Once these estimates, and/or the analogous estimates for the marginals, are produced, they are used directly to compute local estimates of entropy or mutual information at each point in the space, which are then averaged over all samples to produce the entropy or mutual information of the time series. For more details on this procedure, see [70]. (In the case of delay-coordinate reconstruction, the product space of interest is Q × R = X_j × X_{j-τ}.)

A less biased way to perform kernel estimation, when one is interested in computing mutual information, is the Kraskov-Stögbauer-Grassberger (KSG) estimator [63]. This approach dynamically alters the kernel bandwidth to match the density of the data, thereby smoothing out errors in the probability density function estimation process. In this approach, one first finds the kth nearest neighbor of each sample {q, r} (using max norms to compute distances in q and r), then sets the kernel widths ρ_q and ρ_r accordingly and performs the pdf estimation. There are two algorithms for computing I[Q, R] with the KSG estimator [70]: the first is more accurate for small sample sizes but more biased, while the second is more accurate for larger sample sizes. I use the second of the two in the results reported in this dissertation, as I have fairly long time series. This algorithm sets ρ_q and ρ_r to the q and r distances to the kth nearest neighbor. One then counts the number of neighbors within and on the boundaries of these kernels in each marginal space, calling these sums n_q and n_r, and finally calculates

I[Q, R] = \psi(k) - \frac{1}{k} - \langle \psi(n_q) + \psi(n_r) \rangle + \psi(N)   (2.36)

where ψ is the digamma function and ⟨·⟩ denotes an average over all N samples. (The formula for the other KSG estimation algorithm is subtly different; it sets ρ_q and ρ_r to the maxima of the q and r distances to the k nearest neighbors.) This estimator has been demonstrated to be robust to variations in k. Its run time is O(kN log N), where N is the length of the time series and k is the number of neighbors used in the estimate. While this is more expensive than traditional binning (O(N)), the estimator is bias corrected, allows the kernel bandwidth to adapt to under- and over-sampled regions of space, and is both model and parameter free (aside from k, to which it is very robust). A sketch of the simple box-kernel estimate appears below.
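A minimal sketch of the box-kernel estimate of Equation (2.35), applied to mutual information by averaging local log-ratios of joint to marginal neighborhood proportions; the bias corrections of [70] are omitted, and box_kernel_mi is a hypothetical name:

    import numpy as np

    def box_kernel_mi(q, r, rho):
        # average local log-ratio of joint to marginal box-kernel
        # proportions; a simplified, bias-uncorrected sketch
        q, r = np.asarray(q), np.asarray(r)
        total = 0.0
        for qi, ri in zip(q, r):
            near_q = np.abs(q - qi) <= rho     # max-norm boxes in each marginal
            near_r = np.abs(r - ri) <= rho
            total += np.log2((near_q & near_r).mean() /
                             (near_q.mean() * near_r.mean()))
        return total / len(q)

    rng = np.random.default_rng(1)
    q = rng.normal(size=2000)
    r = q + 0.5 * rng.normal(size=2000)   # strongly related pair
    print(box_kernel_mi(q, r, rho=0.3))   # clearly positive, in bits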
An understanding of the predictive capacity of a real-valued time series—i.e., whether or not it is even predictable—is essential to any forecasting strategy. In joint work with Ryan James, I propose to quantify the complexity of a signal by approximating the entropy production of the system that generated it. In general, estimating the entropy (production) of an arbitrary, real-valued time series is a challenging problem, as discussed above, but recent advances in Shannon information theory—in particular, permutation entropy [9, 32]—have reduced this challenge. I review this class of methods in this section.

For the purposes of this thesis, I view the Shannon entropy—in particular, its growth rate with respect to word length (the Shannon entropy rate)—as a measure of the complexity, and hence the predictability, of a time series. Time-series data consisting of i.i.d. random variables, such as white noise, have high entropy rates, whereas highly structured time series—for example, periodic ones—have very low (or zero) entropy rates. A time series with a high entropy rate is almost completely unpredictable, and conversely. This can be made more rigorous: Pesin's relation [91] states that in chaotic dynamical systems, the Kolmogorov-Sinai (KS) entropy is equal to the sum of the positive Lyapunov exponents λ_i. These exponents directly quantify the rate at which nearby states of the system diverge with time: |Δx(t)| ≈ e^{λt} |Δx(0)|. The faster the divergence, the larger the entropy. The KS entropy is defined as the supremum of the Shannon entropy rates over all partitions—i.e., over all possible choices of binning [92]. (As an aside, an alternative definition of the generating partition introduced above is a partition that achieves this supremum.)

From a different point of view, I can consider the information (as measured by the Shannon entropy) contained in a single observable of the system at a given point in time. This information can be partitioned into two components: the information shared with past observations—i.e., the mutual information between the past and the present—and the information in the present that is not contained in the past (viz., the conditional entropy of the present given the past). The first part is known as the redundancy; the second is the aforementioned Shannon entropy rate. Again working with R. G. James, I establish that the more redundancy a signal contains, the more predictable it is [40, 41]; this is discussed in more detail in Chapter 6.

Previous approaches to measuring temporal complexity via the Shannon entropy rate [74, 102] required categorical data: x_i ∈ S for some finite or countably infinite alphabet S. Data taken from real-world systems are, however, effectively real-valued. (Measurements from finite-precision sensors are technically discrete, but data from modern high-resolution sensors are, for the purposes of entropy calculations, effectively continuous.) For this reason, the time series must be symbolized, as discussed above—but the symbolization methods discussed there are generally biased or fragile in the face of noise.

Bandt and Pompe introduced the permutation entropy (PE) as a "natural complexity measure for time series" [9]. Permutation entropy employs a method for symbolizing real-valued time series that follows the intrinsic behavior of the system under examination. This method has many advantages, including robustness to observational noise, and its application requires no knowledge of the underlying mechanisms of the system. Rather than looking at the statistics of sequences of values, as is done when computing the Shannon entropy directly, permutation entropy looks at the statistics of the orderings of sequences of values, using ordinal analysis. Ordinal analysis maps successive time-ordered elements of a time series to their value-ordered permutation of the same size. By way of example, if (x_1, x_2, x_3) = (9, 1, 7), then its ordinal pattern, φ(x_1, x_2, x_3), is 231, since x_2 ≤ x_3 ≤ x_1; the ordinal pattern of (x_1, x_2, x_3) = (9, 7, 1) is 321.

Definition (Permutation Entropy). Given a time series {x_i}_{i=1,...,N}, define S_ℓ as the set of all ℓ! permutations π of order ℓ. For each π ∈ S_ℓ, define the relative frequency of that permutation occurring in {x_i}_{i=1,...,N}:

p(\pi) = \frac{|\{i \,|\, i \le N - \ell,\ \phi(x_{i+1}, \ldots, x_{i+\ell}) = \pi\}|}{N - \ell + 1}   (2.37)

where p(π) quantifies the probability of an ordinal and |·| is set cardinality. The permutation entropy of order ℓ ≥ 2 is defined as

PE(\ell) = -\sum_{\pi \in S_\ell} p(\pi) \log p(\pi)   (2.38)

Notice that 0 ≤ PE(ℓ) ≤ log(ℓ!) [9]. With this in mind, it is common in the literature to normalize the permutation entropy by dividing by log(ℓ!); with this convention, "low" PE is close to 0 and "high" PE is close to 1. Finally, it should be noted that the permutation entropy has been shown to be identical to the Kolmogorov-Sinai entropy for many large classes of systems [7], as long as observational noise is sufficiently small. As mentioned before, the KS entropy is equal to the Shannon entropy rate of a generating partition of the system. This transitive chain of equalities—from permutation entropy to Shannon entropy rate via the KS entropy—allows one to approximate the redundancy of a signal, the dual of the Shannon entropy rate, by 1 - PE(ℓ)/log(ℓ!).

In this thesis, I utilize a variation of the basic permutation entropy technique: the weighted permutation entropy (WPE), which was introduced in [32]. The intent behind the weighting is to correct for observational noise that is larger than the trends in the data but smaller than its larger-scale features. Consider, for example, a signal that switches between two fixed points and contains some additive noise. The PE is dominated by the noise about the fixed points, driving it to
≈1, which in some sense hides the fact that the signal is actually quite structured. To correct for this, the weight of a permutation is taken into account:

w(x^{\ell}_{i+1}) = \frac{1}{\ell} \sum_{j=i+1}^{i+\ell} \left( x_j - \bar{x}^{\ell}_{i+1} \right)^2   (2.39)

where x^{\ell}_{i+1} denotes the sequence of values x_{i+1}, ..., x_{i+ℓ} and \bar{x}^{\ell}_{i+1} is the arithmetic mean of those values. The weighted probability of a permutation is defined as

p_w(\pi) = \frac{\sum_{i \le N-\ell} w(x^{\ell}_{i+1}) \cdot \delta(\phi(x^{\ell}_{i+1}), \pi)}{\sum_{i \le N-\ell} w(x^{\ell}_{i+1})}   (2.40)

where δ(x, y) is 1 if x = y and 0 otherwise. Effectively, this weighted probability emphasizes permutations that are involved in "large" features and de-emphasizes permutations whose amplitudes are small relative to the features of the time series. The standard form of the weighted permutation entropy is

WPE(\ell) = -\sum_{\pi \in S_\ell} p_w(\pi) \log p_w(\pi)   (2.41)

which can also be normalized, by dividing by log(ℓ!), to make 0 ≤ WPE(ℓ) ≤ 1.

Calculating the permutation entropy requires a choice of the word length ℓ. The primary consideration in that choice is that the value be large enough that forbidden ordinals are discovered, yet small enough that reasonable statistics over the ordinals are gathered: if an average of 100 counts per ordinal is considered to be sufficient, for instance, then one needs N ≳ 100 ℓ!. In the literature, 3 ≤ ℓ ≤ 6 is standard practice. Ideally, one would increase ℓ until the calculated value converges—the persistent permutation entropy—but that can require an arbitrarily long time series, and data-length issues can intrude before convergence is reached. I use this approach to choose ℓ = 6 for the experiments presented in this thesis; this value represents a good balance between accurate ordinal statistics and finite-data effects. A minimal sketch of (weighted) permutation entropy follows.
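A compact sketch of Equations (2.37)-(2.41); permutation_entropy is a hypothetical name, and np.var supplies the population variance of Equation (2.39):

    import numpy as np
    from math import log2, factorial
    from collections import Counter

    def permutation_entropy(x, ell, weighted=False):
        # normalized (weighted) permutation entropy of order ell
        totals = Counter()
        for i in range(len(x) - ell + 1):
            w = x[i : i + ell]
            pattern = tuple(np.argsort(w))   # ordinal pattern of the window
            totals[pattern] += np.var(w) if weighted else 1.0   # Eq. 2.39
        norm = sum(totals.values())
        probs = [v / norm for v in totals.values()]
        pe = -sum(p * log2(p) for p in probs if p > 0)
        return pe / log2(factorial(ell))

    rng = np.random.default_rng(0)
    print(permutation_entropy(rng.normal(size=5000), 4))          # ~1: white noise
    print(permutation_entropy(np.sin(0.1 * np.arange(5000)), 4))  # <<1: periodic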
Any discussion of new prediction technology is incomplete, of course, without a solid comparison to traditional techniques. In this section, I describe the four different forecasting methods used in my thesis as points of comparison:

• The random-walk method, which uses the previous value in the observed signal as the forecast,
• The naïve method, which uses the mean of the observed signal as the forecast,
• The ARIMA (auto-regressive integrated moving average) method, a common linear forecast strategy for scalar time-series data, instantiated via the auto.arima procedure [56], and
• The LMA (Lorenz method of analogues) method, which uses a near-neighbor forecast strategy on a delay-coordinate reconstruction of the signal.

ARIMA, as its name suggests, is based on standard autoregressive techniques. LMA, introduced in Chapter 1, is designed to capture and exploit the deterministic structure of a signal from a nonlinear dynamical system. The naïve and random-walk methods, somewhat surprisingly, often outperform these more-sophisticated prediction strategies in the case of highly complex signals, as discussed briefly below and in depth in Chapter 6.
A random-walk predictor simply uses the last observed measurement as the forecast: that is, the predicted value p_i at time i is calculated using the relation

p_i \equiv x_{i-1}   (2.42)

where {x_j}_{j=1}^{N} is the time-series data. The prediction strategy that I refer to using the term "naïve" averages all prior observations to generate the forecast:

p_i \equiv \mu_{x,i-1} = \frac{1}{i-1} \sum_{j=1}^{i-1} x_j   (2.43)

A simple yet powerful way to capture and exploit the structure of data is to fit a hyperplane to the data set and then use it to make predictions. The roots of this approach date back to the original autoregressive schema of Yule [122], which forecasts the value at the next time step as a weighted average of the past q observations:

p_i \equiv \sum_{j=i-q}^{i-1} a_j x_j   (2.44)

The weighting coefficients a_j are generally computed using either an ordinary least-squares approach or the method of moments, via the Yule-Walker equations. To account for noise in the data, one can add a so-called "moving average" term to the model; to remove nonstationarities, one can detrend the data using a differencing operation. A strategy that incorporates all three of these features is called a nonseasonal ARIMA model. If evidence of periodic structure is present in the data, a seasonal ARIMA model, which adds a sampling operation that filters out periodicities, can be a good choice.

There is a vast amount of theory and literature regarding the construction and use of models of this type; please see [15] for an in-depth exploration. For the purposes of this thesis, I choose seasonal ARIMA models to serve as a good exemplar of the broad class of linear predictors and a useful point of comparison for my work. Fitting such a model to a time series involves choosing values for the various free parameters of the autoregressive, detrending, moving-average, and filtering terms. I employ the automated fitting techniques described in [56] to accomplish this. This procedure uses sophisticated methods—KPSS unit-root tests [65], a customization of the Canova-Hansen test [19], and the Akaike information criterion [2], conditioned on the maximum likelihood of the model fitted to the detrended data—to select good values for these free parameters.

ARIMA forecasting is a common and time-tested procedure. Its adjustments for seasonality, nonstationarity, and noise make it an appropriate choice for short-term predictions of time-series data generated by a wide range of processes. If information is being generated and/or transmitted in a nonlinear way, however, a global linear fit is inappropriate and ARIMA forecasts can be inaccurate. Another weakness of this method is its prediction horizon: an ARIMA forecast is guaranteed to converge to a constant or linear trend after some number of predictions, depending on the model order. To sidestep this issue, and to make the comparisons as fair as possible, I build ARIMA forecasts in a stepwise fashion: i.e., fit the model to the existing data, use that model to perform a one-step prediction, rebuild the model using the latest observations, and iterate until the desired prediction horizon is reached. For consistency, I take the same approach with the other models in this thesis as well, even though doing so amounts to artificially hobbling LMA, the method that is the topic of the next section. Minimal sketches of the two baseline predictors, together with the bare autoregressive core of ARIMA, are given below.
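These sketches assume x is a NumPy array; the AR fit uses ordinary least squares and omits the moving-average, differencing, and seasonal terms of a full ARIMA model:

    import numpy as np

    def random_walk(x):
        # Eq. 2.42: the forecast is the last observed value
        return x[-1]

    def naive(x):
        # Eq. 2.43: the forecast is the mean of all prior observations
        return np.mean(x)

    def ar_forecast(x, q):
        # Eq. 2.44: least-squares AR(q) weights, then a one-step forecast
        X = np.column_stack([x[i : len(x) - q + i] for i in range(q)])
        a, *_ = np.linalg.lstsq(X, x[q:], rcond=None)
        return x[-q:] @ a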
The dynamical systems community has developed a number of methods that leverage delay-coordinate reconstruction for the purposes of forecasting dynamical systems (e.g., [23, 72, 93, 107, 112, 119]). Since the goal of this thesis is to show that incomplete reconstructions—those that are not true embeddings—can give these kinds of methods enough traction to generate useful predictions, I choose one of the oldest and simplest members of that family for my analysis: Lorenz's method of analogues (LMA), which is essentially nearest-neighbor forecasting in reconstruction space.

To apply LMA to a scalar time-series data set {x_j}_{j=1}^{n}, one begins by performing a delay-coordinate reconstruction to produce a trajectory of the form

\{ \vec{x}_j = [x_j,\ x_{j-\tau},\ \ldots,\ x_{j-(m-1)\tau}]^T \}_{j=(m-1)\tau+1}^{n}   (2.45)

using one or more of the heuristics presented in Sections 2.1.2 and 2.1.3 to choose m and τ. Forecasting the next point in the time series, x_{n+1}, amounts to reconstructing the next delay vector \vec{x}_{n+1} in the trajectory. Note that, by the form of delay-coordinate vectors, all but the first coordinate of \vec{x}_{n+1} are already known. To choose that first coordinate, LMA finds the nearest neighbor of \vec{x}_n in the reconstructed space—namely \vec{x}_{j(1,m)}—and maps that vector forward under the delay map, obtaining

\vec{x}_{j(1,m)+1} = [x_{j(1,m)+1},\ x_{j(1,m)+1-\tau},\ \ldots,\ x_{j(1,m)+1-(m-1)\tau}]^T   (2.46)

(\vec{x}_n itself should not be chosen as its own neighbor, as it has no forward image; in some cases, a longer Theiler exclusion may be useful [115].) Using the image of the neighbor, one defines

\vec{p}_{n+1} \equiv [x_{j(1,m)+1},\ x_{n+1-\tau},\ \ldots,\ x_{n+1-(m-1)\tau}]^T   (2.47)

The forecast of x_{n+1} is then p_{n+1} ≡ x_{j(1,m)+1}. If performing multi-step forecasts, one appends the new delay vector

\vec{p}_{n+1} = [x_{j(1,m)+1},\ x_{n+1-\tau},\ \ldots,\ x_{n+1-(m-1)\tau}]^T   (2.48)

to the end of the trajectory and repeats this process as needed.

In my work, I use the LMA algorithm in two ways: first—as a baseline for comparison purposes—on an embedding of each time series, with m chosen using the false near(est) neighbor method [62]; and second, with m fixed at 2. In the rest of this thesis, I refer to these as fnn-LMA and ro-LMA, respectively. The experiments reported in Chapter 4, unless stated otherwise, use the same τ value for both fnn-LMA and ro-LMA, choosing it at the first minimum of the time-delayed mutual information of the time series [33]. In Section 4.3, I explore the effects of varying τ on the accuracy of both methods. In Section 5.1, I show that a time-delayed version of the so-called active information storage is a highly effective method for selecting τ, and m as well, when forecasting is the end goal. A minimal single-neighbor sketch of LMA is given below.
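In this sketch of Equations (2.45)-(2.48), lma_forecast is a hypothetical name, and no Theiler exclusion is applied beyond excluding the current state itself:

    import numpy as np

    def lma_forecast(x, m, tau, steps=1):
        # Lorenz method of analogues: nearest-neighbor forecasting in
        # reconstruction space; the trajectory is rebuilt after each
        # appended prediction for multi-step forecasts
        x = list(x)
        for _ in range(steps):
            xs = np.asarray(x)
            n = len(xs)
            start = (m - 1) * tau
            traj = np.column_stack(
                [xs[start - i * tau : n - i * tau] for i in range(m)])
            d = np.linalg.norm(traj[:-1] - traj[-1], axis=1)
            b = int(np.argmin(d))   # nearest neighbor of the current state;
                                    # the state itself is excluded, as it
                                    # has no forward image
            x.append(float(traj[b + 1, 0]))   # p_{n+1} = x_{j(1,m)+1}
        return x[-steps:]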
Dozens—indeed, hundreds—of more-complicated variants of the LMA algorithm have appeared in the literature (e.g., [23, 107, 112, 119]), most of which involve building some flavor of local-linear model around each delay vector and then using it to make the prediction of the next point. For the purposes of this thesis, I chose to use the basic original LMA because it is dynamically the most straightforward and thus provides a good baseline assessment. While I believe that the claims stated here extend to other state-space-based forecast methods, the pre-processing steps involved in some of those methods make a careful analysis of the results somewhat problematic. One can use GHKSS-based techniques, for instance, to project the full dynamics onto linear sub-manifolds [48] and then use those manifolds to build predictions. While it might be useful to apply a method like that to an incomplete reconstruction, the results would be some nonlinear conflation of the effects of the two different projections, and it would be difficult to untangle and understand the individual effects. (Note that the careful study of the effects of projection on forecasting that is undertaken in this thesis may suggest why GHKSS-based techniques work so well; this point is discussed further in Chapter 4.)

Since LMA does not rest on an assumption of linearity (as ARIMA models do), it can handle both linear and nonlinear processes. If the underlying generating process is nondeterministic, however, it can perform poorly. For an arbitrary real-valued time series, without any knowledge of the generating process and with all of the attendant problems (noise, sampling issues, and so on), answers to the question of which forecast model is best should, ideally, be derived from the data—but that is a difficult task. By quantifying the balance between redundancy, predictive structure, and entropy in these real-valued time series—as I describe in Chapter 6—I can begin to answer these questions in an effective and practical manner.

To assess and compare the prediction methods studied here, I calculate a figure of merit in the following way. I split each N-point time series into two pieces: the first 90%, referred to as the "initial training" signal and denoted {x_j}_{j=1}^{n}, and the last 10%, known as the "test" signal {c_j}_{j=n+1}^{(k+n+1)=N}. Following the procedures described in Section 2.3, I build a model from the initial training signal, use that model to generate a prediction p_{n+1} of the value of x_{n+1}, and compare p_{n+1} to the true continuation, c_{n+1}. I then rebuild the model using {c_{n+1}} ∪ {x_j}_{j=1}^{n} and repeat the process k times, out to the end of the observed time series. This "one-step prediction" process is not technically necessary for the fnn-LMA or ro-LMA methods, which can generate arbitrary-length predictions (although the accuracy of those predictions degrades with prediction horizon in the presence of positive Lyapunov exponents), but the performance of the other three methods used here degrades severely if the associated models are not periodically rebuilt. In order to make the comparison fair, I use the iterative one-step prediction schema for all five methods. This has the slightly confusing effect of causing the "test" signal to be used both to assess the accuracy of each model and for periodic refitting.

As a numerical measure of prediction accuracy, for each h-step forecast I calculate the h-step mean absolute scaled error (h-MASE) between the true and predicted signals, defined as

h\text{-MASE} = \frac{\sum_{j=n+1}^{k+n+1} |p_j - c_j|}{\frac{k}{n-h} \sum_{i=1}^{n-h} \sqrt{\frac{1}{h} \sum_{\iota=1}^{h} (x_i - x_{i+\iota})^2}}   (2.49)

h-MASE is a normalized measure: the scaling term in the denominator is the average h-step in-sample forecast error for a random-walk prediction over the initial training signal {x_i}_{i=1}^{n}. That is, h-MASE < 1 means that the method outperforms, on average, an h-step random-walk forecast on the same data; analogously, h-MASE > 1 means that the method performs worse, on average, than the random-walk method. I choose this error metric because it allows for fair comparison across varying methods, prediction horizons, and signal scales, and because it is a standard error measure in the forecasting literature [57].

To provide insight into interpreting h-MASE values, I refer back to the proof-of-concept example presented in Chapter 1. The one-step forecasts in Figure 1.1, for instance, had 1-MASE values of 0.117 and 0.148—i.e., the fnn-LMA and ro-LMA forecasts of the SFI dataset A were, respectively, 1/0.117 ≈ 8.5 and 1/0.148 ≈ 6.8 times more accurate than an in-sample random-walk forecast of the same data. A short sketch of this error computation follows.
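This sketch follows the reconstruction of Equation (2.49) given above; h_mase is a hypothetical name:

    import numpy as np

    def h_mase(pred, true, train, h):
        # h-step mean absolute scaled error (Eq. 2.49); the scale is the
        # average in-sample h-step random-walk error over the training signal
        pred, true, train = map(np.asarray, (pred, true, train))
        n, k = len(train), len(pred)
        rw = [np.sqrt(np.mean((train[i] - train[i + 1 : i + h + 1]) ** 2))
              for i in range(n - h)]
        return np.sum(np.abs(pred - true)) / (k * np.mean(rw))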
The accuracy of an h-step random-walk forecast degrades as h increases. In general, then, it is to be expected that h-MASE values will decrease drastically with increasing prediction horizon, and so h-MASE scores should not be compared across different h: 10-MASE can be compared to 10-MASE for two different methods or signals, for example, but should not be compared to 1-MASE or 100-MASE, even on the same signal. While its comparative nature may seem odd, this error metric allows for fair comparison across varying methods, prediction horizons, and signal scales, making it a standard error measure in the forecasting literature—and a good choice for this thesis, which involves a number of very different signals.

Chapter 3
Case Studies
I use several different dynamical systems as case studies throughout this thesis, both real and synthetic. Two of them—the Lorenz-96 model and sensor data from a laboratory experiment on computer performance dynamics—persist across all chapters of this document; I use a number of others as well, to drive home different points in different chapters. Each is described in more depth in the following sections.
When developing any new mathematical theory or method, it is important to first explore it in the context of well-understood synthetic examples. This provides a controlled environment in which I can test the boundaries of my theory, e.g., by increasing the data length or adding noise with a controlled signal-to-noise ratio.
The Lorenz-96 model was introduced by Edward Lorenz in [73] to study atmospheric predictability. Lorenz-96 is defined by a set of K first-order differential equations relating the K state variables ξ_1, ..., ξ_K:

\dot{\xi}_k = (\xi_{k+1} - \xi_{k-2})\, \xi_{k-1} - \xi_k + F   (3.1)

for k = 1, ..., K, where F ∈ R is a constant forcing term that is independent of k. In this model, each ξ_k is some atmospheric quantity (such as temperature or vorticity) at a discrete location on a periodic lattice representing a latitude circle of the earth. Following standard practice [61], I enforce periodic boundary conditions and solve Equation (3.1) from several randomly chosen initial conditions using a standard fourth-order Runge-Kutta solver for 60,000 steps, with a fixed step size in normalized time units. I then discard the first 10,000 points of each trajectory in order to eliminate transient behavior. Finally, I create scalar time-series traces by individually "observing" each of the K state variables of the trajectory: i.e., h_i(ξ_i(t_j)) = x_{j,i} for j ∈ {1, 2, ..., 50,000} and for i ∈ {1, ..., K}. I repeat all of this from a number of different initial conditions—seven for the K = 47 Lorenz-96 system and 15 for the K = 22 case—producing a total of 659 traces for my forecasting study.

In [61], Karimi & Paul studied this model extensively, performing and analyzing numerous parameter sweeps and showing that it exhibits a vast array of possible dynamics: everything from fixed points and periodic attractors to low- and high-dimensional chaos. One particularly interesting feature of this dynamical system is the relationship between the state-space dimension and how much of that space is occupied by the dynamics: for different choices of the parameter values, the Lorenz-96 system yields dynamics with low fractal dimension in a large state space as well as large fractal dimension in a large state space. All of these features make this model an ideal candidate for testing and evaluating ro-LMA. For my initial investigation, I fix F = 5 and choose K = 22 and K = 47—choices that yield chaotic trajectories with low and high [61] Kaplan-Yorke (Lyapunov) dimension [60], respectively: a low d_KY for the K = 22 dynamics and d_KY ≈ 19 for K = 47. Projections of trajectories on these attractors can be seen in Figure 3.1.

For each of these time series, I use the procedures outlined in Section 2.1.1 to estimate values for the free parameters of the embedding process, obtaining m = 8 and τ = 26 for all traces in the K = 22 case, and m = 10 and τ = 31 for the K = 47 traces. Table 3.1 tabulates the estimated and theoretical embedding parameter values for these two test cases, derived using the methodologies described in Sections 2.1.1-2.1.3.

It has been shown in [109] that d_KY ≈ d_cap for typical chaotic systems. This suggests that topologically valid embeddings of these time series would require m ≳ 2 d_KY—roughly 38 for the K = 47 case. The values suggested by the false near(est) neighbor method for the K = 22 traces are in line with this bound, but the K = 47 false near(est) neighbor values are far smaller than 2 d_KY. For K = 47, there are two potential causes for this disparity. First, the false near(est) neighbor method does not guarantee m > 2 d_KY; it is simply a heuristic for mitigating false crossings in the dynamics. Second, m > 2 d_KY is a sufficient bound—it could very well be the case that false crossings are eliminated before this bound is reached.

Figure 3.1: The Lorenz-96 attractor (F = 5) with (left) K = 22 and (right) K = 47. Since 22- and 47-dimensional plots are not possible, I plot 3D projections of these systems; each differently colored trajectory represents a different projection, or equivalently a different choice of k: k = 2 is aqua, k = 6 is blue, and k = 18 is red.
Grid Points   m-fnn   τ    m-Embedology   m-Takens
K = 22        8       26   —              45
K = 47        10      31   ≈ 41           95

Table 3.1: Estimated and theoretical embedding parameter values for the Lorenz-96 model. m-fnn is the embedding dimension produced by the false near(est) neighbor method; τ is chosen as the first minimum of the mutual information curve; m-Embedology is derived following [99]; and m-Takens is derived from [113]. A minimal sketch of the Lorenz-96 integration appears below.
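In this sketch, the step size dt = 0.01 and the Gaussian initial condition are assumptions of the illustration, not the values used in the experiments:

    import numpy as np

    def lorenz96(xi, F):
        # right-hand side of Eq. 3.1 with periodic boundary conditions
        return (np.roll(xi, -1) - np.roll(xi, 2)) * np.roll(xi, 1) - xi + F

    def rk4(xi, F, dt, n_steps):
        # standard fourth-order Runge-Kutta integration
        traj = np.empty((n_steps, len(xi)))
        for t in range(n_steps):
            k1 = lorenz96(xi, F)
            k2 = lorenz96(xi + 0.5 * dt * k1, F)
            k3 = lorenz96(xi + 0.5 * dt * k2, F)
            k4 = lorenz96(xi + dt * k3, F)
            xi = xi + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
            traj[t] = xi
        return traj

    rng = np.random.default_rng(0)
    traj = rk4(rng.normal(size=22), F=5, dt=0.01, n_steps=60_000)
    x = traj[10_000:, 5]   # drop the transient, then "observe" xi_6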
Figure 3.2: The classic Lorenz attractor (σ = 10, ρ = 28, β = 8/3). (a) A trajectory in R³ generated using fourth-order Runge-Kutta with a fixed time step. (b) A time-series trace of the x coordinate of that trajectory. (c) A 3D projection of a delay-coordinate embedding of the trajectory in (b), with dimension m = 5 and delay τ = 12.

The now-canonical Lorenz-63 model was introduced by Lorenz in 1963 as a first example of "deterministic nonperiodic flow"—what is now known as chaos (although Lorenz did not coin this term). Lorenz-63 is defined by a system of three first-order differential equations [71]:

\dot{x} = \sigma(y - x)   (3.2)
\dot{y} = x(\rho - z) - y   (3.3)
\dot{z} = xy - \beta z   (3.4)

with the typical chaotic parameter selections: σ = 10, ρ = 28, and β = 8/3. Figure 3.2(a) shows a 50,000-point trajectory in R³ generated using fourth-order Runge-Kutta on those equations, Figure 3.2(b) shows a time-series trace of the x coordinate of that trajectory, and Figure 3.2(c) shows a 3D projection of a delay-coordinate embedding of that trace with dimension m = 5 and delay τ = 12. This canonical example is used in Chapter 5 to establish an explanation of why ro-LMA is able to get traction even though it works with a reconstruction that does not meet the theoretical conditions of an embedding. Table 3.2 tabulates the estimated and theoretical embedding parameter values for this system.

m-fnn   τ     m-Embedology   m-Takens
5       174   —              7

Table 3.2: Estimated and theoretical embedding parameter values for the Lorenz-63 system.

Validation with synthetic data is an important first step in evaluating any new theory, as it provides a controlled, well-defined, and well-understood environment. These are luxuries, however, that are rarely if ever afforded to an experimentalist. Furthermore, experimental data often misbehave more—and in different ways—than synthetic data. For these reasons, it is vital to test a method on experimental time-series data if one is interested in real-world applications.
As has been established in prior work by our group, it is highly effective to treat computers as nonlinear dynamical systems [6, 36-38, 40-42, 83, 84]. In this view, register and memory contents and physical variables like the temperature of different regions of the processor chip define the state of the system. The logic hardwired into the computer, combined with the software executing on that hardware, defines the dynamics of the system. Under the influence of these dynamics, the state of the processor moves along a trajectory through its high-dimensional state space as the clock cycles progress and the program executes. Like Lorenz-96, this system has been shown to exhibit a range of interesting deterministic dynamical behavior, from periodic orbits to low- and high-dimensional chaos [6, 83], making it a good test case for this thesis. It also has important practical implications: these dynamics, which arise from the deterministic, nonlinear interactions between the hardware and the software, have profound effects on execution time and memory use.
For the purposes of this thesis, I will consider a “stored-program computer,” i.e., a standardvon Neumann architecture, as a deterministic nonlinear dynamical system. In a stored-programcomputer, the current state—both instructions and data—are stored in some form of addressablememory. The contents of this memory are, as established in [83], the state space X of the computer.Other components of the computer, such as external memory and video cards, also play roles inits state. Those roles depend on the decisions made by the computer designers—how things areimplemented and connected—almost all of which are proprietary. In order to distinguish knownand unknown effects, I follow [83] and define the state space X as a composition of the addressablememory elements (cid:126)m and the unknown implementation variables (cid:126)u : X = { (cid:126)ξ | (cid:126)ξ = [ (cid:126)m, (cid:126)u ] } (3.5)The distinction between (cid:126)m and (cid:126)u is important because the dynamics of a running computer havetwo distinct sources: a map (cid:126)F code that acts on the addressable memory (cid:126)m directly, as dictatedby the program instructions, and a map (cid:126)F impl that captures how the implementation affects theevolution of the computer state. The overall dynamics of the computer—that is, the mapping from7its state at the j th clock cycle to its state at the j + 1 st clock cycle—is a composition of these twomaps: (cid:126)ξ ( t j +1 ) = Φ( (cid:126)ξ ( t j )) = (cid:126)F P ( (cid:126)ξ ( t j )) = (cid:126)F impl ◦ (cid:126)F code ( (cid:126)ξ ( t j )) (3.6)where (cid:126)F P is the performance dynamics of the computer. An improved design for the processor,for instance—that is, a “better” (cid:126)F P —is a change in (cid:126)F impl . The form of the map (cid:126)F code is dictatedby the combination of the computer’s formal specification ( x86 64 , for the Intel i7 used in theexperiments here) and the software that it is running. Both (cid:126)F impl and (cid:126)F code are nonlinear anddeterministic, and their composed dynamics must be modeled together in order to predict futurecomputer performance.The framework outlined in the previous paragraph lets me use the methods of nonlineardynamics—in particular, delay-coordinate embedding—to model (cid:126)F P , as long as I observe thosedynamics in a way that satisfies the associated theorems (see Section 2.1.1). The hardware per-formance monitor registers (HPMs) that are built into modern processors can be programmed tocount events on the chip: the total number of instructions executed per cycle (IPC), for instance,or the total number of references to the data cache. These are some of the most widely used andsalient metrics in the computer performance analysis literature [3, 66, 81, 86, 104]. IPC is a goodproxy for processor efficiency because most modern microprocessors can execute more than oneinstruction per clock cycle. While this metric may not be an element of the state vector (cid:126)ξ , thefundamental theorems of delay-coordinate embedding only require that one measures a quantitythat is a smooth, generic function of at least one state variable. It was shown in [6] that the trans-formation performed by the HPMs in sampling the state (cid:126)ξ is indeed smooth and generic unlessthose registers overflow—an unlikely event, given that they are 64 bits long and that I read themevery 100 ,
The framework outlined in the previous paragraph lets me use the methods of nonlinear dynamics—in particular, delay-coordinate embedding—to model F_P, as long as I observe those dynamics in a way that satisfies the associated theorems (see Section 2.1.1). The hardware performance monitor registers (HPMs) that are built into modern processors can be programmed to count events on the chip: the total number of instructions executed per cycle (IPC), for instance, or the total number of references to the data cache. These are some of the most widely used and salient metrics in the computer performance analysis literature [3, 66, 81, 86, 104]. IPC is a good proxy for processor efficiency because most modern microprocessors can execute more than one instruction per clock cycle. While this metric may not be an element of the state vector ξ, the fundamental theorems of delay-coordinate embedding only require that one measure a quantity that is a smooth, generic function of at least one state variable. It was shown in [6] that the transformation performed by the HPMs in sampling the state ξ is indeed smooth and generic unless those registers overflow—an unlikely event, given that they are 64 bits long and that I read them every 100,000 instructions.

The choice of that sample interval is important for another reason as well. The HPMs are part of the system under study, so accessing them can disturb the very dynamics that they are sampling. This potential observer problem was addressed in [84] by varying the sample interval and testing to make sure that the sampling was not affecting the dynamics. To further reduce perturbation, the measurement infrastructure used to gather the data for the experiments reported here only monitors events when the target program is running, and not when the operating system (or the monitoring tool itself) has control of the microprocessor. I have completed a careful examination of the impact of interrupt rate on prediction results that corroborates the discussion above; these results are reported in [37]. Finally, I follow best practices from the computer performance analysis community [43] when measuring the system: I use only local disks and limit the number of other processes that are running on the machine (i.e., Linux init level 1).

The next section describes the experimental observation of this system and the different choices of F_code. For an in-depth description of this custom measurement infrastructure, including a deeper discussion of the implications of the sampling interval, please see [6, 37, 83, 84].
The computer performance time-series data sets for the experiments presented in this thesis were collected on an Intel Core® i7-2600-based machine running the 2.6.38-8 Linux kernel. I also carried out experiments on an Intel Core2 Duo. Those Core2 results, reported in [36] but omitted here, are consistent with the results reported in this dissertation.

Figure 3.3: (Left) Time-series data from a computer performance experiment: processor load traces, in the form of instructions executed per cycle (IPC), of a simple program (col_major) that repeatedly initializes a 256 × 256 matrix. Each point is the average IPC over a 100,000-instruction period. (Right) A 3D projection of a 12D embedding of this time series.

[Footnote: This process entails subtracting successive HPM readings, checking for overflow and adjusting accordingly.]
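The adjustment in that footnote amounts to modular subtraction of successive 64-bit counter readings. Here is a minimal sketch of the idea; the counter readings are hypothetical placeholders for values that the real infrastructure obtains through PAPI/libpfm4:

    # Sketch: average IPC per measurement interval from successive
    # (64-bit, wrapping) HPM counts of instructions and cycles.
    WRAP = 2 ** 64  # the HPM registers are 64 bits wide

    def delta(curr, prev):
        """Counter difference, adjusted if the register overflowed."""
        d = curr - prev
        return d if d >= 0 else d + WRAP

    def ipc_trace(instr_counts, cycle_counts):
        """Average IPC in each inter-sample interval."""
        return [delta(i1, i0) / delta(c1, c0)
                for (i0, i1, c0, c1) in zip(instr_counts, instr_counts[1:],
                                            cycle_counts, cycle_counts[1:])]

    # e.g., three successive readings taken every 100,000 instructions:
    print(ipc_trace([0, 100_000, 200_000], [0, 60_000, 130_000]))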
This i7 microprocessor chip has eight processing units, a clock rate of 3.40 GHz, and a cache size of 8192 KB. The experiments in this thesis involve performance traces gathered during the execution of several different programs, beginning with the simple col_major loop whose performance is depicted in the left panel of Figure 3.3, as well as a more-complex program (gcc) from the SPEC CPU2006 benchmark suite [54]. In addition, I also carried out careful analysis of standard computer performance benchmark programs such as those in [54], linear algebra software from LAPACK (Linear Algebra PACKage) such as dgesdd and dgeev [8], and row_major (the row-major analogue of col_major). Many of these experiments are omitted here for brevity; please see [36, 37, 40, 41] for these companion results. I select col_major and gcc from this larger constellation of experiments for the discussion in my thesis because they are informative in their own right and representative of the other results I encountered; col_major is a simple, highly structured chaotic time series, while gcc is a chaotic time series where almost all structure has been consumed by noise.

In all of these experiments, the scalar observation x_j is a measurement of the processor performance at time j during the execution of each program. To record these measurements, I use the libpfm4 library, via PAPI (Performance Application Programming Interface) 5.2 [16], to stop program execution at 100,000-instruction intervals—the unit of time in these experiments—and read off the contents of the CPU's onboard hardware performance monitors, which I programmed to count how many instructions are executed in each clock cycle (IPC). I also recorded and analyzed other metrics, including total L2 cache misses, missed branch predictions, and L2 instruction cache hits. Descriptions of these metrics, as well as the corresponding analysis, are published in [36, 37]. In this thesis, I only report results on IPC, as it is representative of all of these results. For statistical validation, I collect 15 performance traces from each of the programs. These traces, and the processes that generated them, are described in more depth in the rest of this section.

col_major is a simple C program that repeatedly initializes the upper triangle of a 2048 × 2048 matrix in column-major order:

    for (i = 0; i < 2048; i++)
        for (j = i; j < 2048; j++)
            data[j][i] = 0;   /* hypothetical body; the original assignment is elided in the source */

As mentioned in Chapter 1 and shown in Figure 3.3, this simple program exhibits surprisingly complicated behavior. I also collected data from the row-major analogue to col_major. Those time series were very different from col_major's, but the forecasting results were largely the same, so they are omitted here; they can be found in [36, 40].

The SPEC CPU2006 benchmark suite [54] is a collection of complicated programs that are used in the computer-science community to assess and compare the performance of different computers. gcc is a member of that suite. It is a compiler: a program that translates code written in a high-level language (C, in the case of gcc) into a lower-level format that can be executed by the processor chip. Its behavior is far more complicated than that of col_major, as is clear from Figure 3.4. Unlike col_major, where the processor utilization is quite structured, the performance of gcc appears almost random. In addition to gcc, I also studied sphinx3 from this benchmark suite: a speech-recognition tool [54].
Those experiments and the associated results are covered in more depth in [37, 40].

Table 3.3 tabulates the estimated embedding parameters for col_major and gcc. Notice that because these dynamical systems are not understood from a theoretical perspective—i.e., the governing equations and the state-space dimension are unknown—I must rely on the heuristics presented in Section 2.1.3. I should note that it has been estimated that the state space of these systems is at least 2 dimensions [83], which would mean that m-Takens > 4. However, the same paper suggests that the actual fractal dimension of these dynamics is much smaller, due in part to standard programming and design principles, which have the effect of reducing the dimension of the dynamics, and that m-Embedology is probably less than ten.

Table 3.3: Estimated embedding parameter values for the computer performance experiments. τ and m-fnn are chosen as in Table 3.1. m-Embedology and m-Takens are not provided, as these dimensions are unknown for an experimental system like this one.

    Signal       m-fnn   τ    m-Embedology   m-Takens
    col_major    12      2    **             **
    gcc          13      10   **             **

Chapter 4
Prediction in Projection

In this chapter, I demonstrate that the accuracies of forecasts produced by ro-LMA—Lorenz's method of analogues, operating on a two-dimensional time-delay reconstruction of a trajectory from a dynamical system—are similar to, and often better than, forecasts produced by fnn-LMA, which operates on an embedding of the same dynamics. While the brief example in Chapter 1 is a useful first validation of that statement, it does not support the kind of exploration that is necessary to properly evaluate a new forecast method, especially one that violates the basic tenets of delay-coordinate embedding. The SFI dataset A is a single trace from a single system—and a low-dimensional system at that. My goal in this chapter is to show that ro-LMA is comparable to or better than fnn-LMA for a range of systems and parameter values—and to repeat each experiment for a number of different trajectories from each system. This exploration serves as an experimental validation of the central premise of this thesis. And of course, any discussion of new forecasting strategies is incomplete without a solid comparison with traditional methods. To this end, I present results for two dynamical systems, one simulated and one real: the Lorenz-96 model and sensor data from a laboratory experiment on computer performance dynamics. I produce ro-LMA forecasts of these systems and compare them to forecasts using the four traditional strategies presented in Section 2.3.
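Before diving into the experiments, a minimal sketch of the underlying procedure may help (a simplified single-nearest-neighbor variant, not the exact implementation used in these experiments): build delay vectors from the scalar series, find the nearest past analogue of the current state, and predict that analogue's successor.

    import numpy as np

    def delay_reconstruct(x, m=2, tau=1):
        """Rows are delay vectors [x_t, x_{t-tau}, ..., x_{t-(m-1)tau}]."""
        n = len(x) - (m - 1) * tau
        return np.column_stack([x[(m - 1 - k) * tau:(m - 1 - k) * tau + n]
                                for k in range(m)])

    def lma_step(x, m=2, tau=1):
        """One-step Lorenz method of analogues: predict the successor of
        the current delay vector's nearest past analogue."""
        x = np.asarray(x, dtype=float)
        pts = delay_reconstruct(x, m, tau)
        query, past = pts[-1], pts[:-1]
        nn = np.argmin(np.sum((past - query) ** 2, axis=1))
        return x[nn + (m - 1) * tau + 1]   # the value that followed the analogue

    # ro-LMA fixes m = 2; fnn-LMA instead uses the dimension suggested by
    # the false-near-neighbor heuristic.
    signal = np.sin(0.3 * np.arange(500))  # stand-in for a measured trace
    print(lma_step(signal, m=2, tau=8))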
In this example, I perform two sets of forecasting experiments with ensembles of traces from the Lorenz-96 model [73], introduced in Section 3.1.1: one with K = 22 and the other with K = 47.

ro-LMA and fnn-LMA

As I will illustrate in the following discussion, both ro-LMA and fnn-LMA worked quite well for the K = 22 dynamics. See Figure 4.1(a) for a time-domain plot of an ro-LMA forecast of a representative trace from this system and Figures 4.1(b) and (c) for graphical representations of the forecast accuracy on that trace for both methods. In Figures 4.1(b) and (c), the vertical axis is the prediction p_j and the horizontal axis is the true continuation c_j. On this type of plot, a perfect prediction would lie on the diagonal. The diagonal structure of the p_j vs. c_j plots in the Figure indicates that both of these LMA-based methods perform very well on this trace. More importantly—from the standpoint of evaluating my primary claim—the LMA forecasting strategy worked better on a two-dimensional reconstruction of these dynamics than on a full embedding, and by a statistically significant margin: the 1-MASE scores of ro-LMA and fnn-LMA forecasts, computed following the procedures described in Section 2.4, are 0.– ± 0.016 and 0.– ± 0.–, respectively (see Table 4.1).

Figure 4.1: ro-LMA and fnn-LMA forecasts of a representative trace from the Lorenz-96 system with K = 22 and F = 5. (a) A time-domain plot of the first 2,500 points of the ro-LMA forecast; blue circles and red ×s are the true and predicted values, respectively, and vertical bars show where these values differ. (b) The predicted (p_j) vs. true (c_j) values for the ro-LMA forecast of that trace, and (c) the same for the fnn-LMA forecast. On such a plot, a perfect prediction would lie on the diagonal. The 1-MASE scores of the forecasts in (b) and (c) were 0.392 and 0.461, respectively.
With m = 2, ro-LMA falls far short of the requirement for topological conjugacy [99] in this system. Clearly, though, it captures enough structure to allow LMA to generate good predictions.
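For reference, the 1-MASE figure of merit can be sketched as follows (in my reading of the Section 2.4 definition: mean absolute forecast error, scaled by the in-sample error of a random-walk forecast, so scores above one mean worse than predicting no change):

    import numpy as np

    def one_mase(predictions, truth, training):
        """1-MASE: mean absolute forecast error, scaled by the mean
        absolute one-step difference of the training signal (the error a
        random-walk forecaster makes, on average)."""
        p, c, x = (np.asarray(a, dtype=float)
                   for a in (predictions, truth, training))
        scale = np.mean(np.abs(np.diff(x)))
        return np.mean(np.abs(p - c)) / scale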
The K = 47 case is a slightly different story: here, ro-LMA still outperforms fnn-LMA, but not by a statistically significant margin. The 1-MASE scores across all 329 traces were 0.– ± 0.047 and 1.– ± 0.043 for ro-LMA and fnn-LMA, respectively. In view of the higher complexity of the state-space structure of the K = 47 version of the Lorenz-96 system, the overall increase in 1-MASE scores over the K = 22 case makes sense. Recall that d_KY is far higher for the K = 47 case: this attractor fills more of the state space and has many more manifolds that are associated with positive Lyapunov exponents. This has obvious implications for predictability.

Since I use the same traces for both methods, one might be tempted to think that the better performance of ro-LMA is a predictable consequence of data length—simply because filling out a higher-dimensional object like the reconstruction used by the fnn-LMA model requires more data. When I re-ran the experiments with longer traces, the 1-MASE scores for ro-LMA and fnn-LMA did converge, but not until the traces were over 10^6 points long, and at the (significant) cost of near-neighbor searches in a space with many more dimensions. Note, too, that the longer delay vectors used by fnn-LMA span far more of the training set, which at first glance would seem to be a serious advantage from an information-theoretic standpoint (although, as shown later in Section 5.1, this is not always an advantage). In view of this, the comparable performance of ro-LMA is quite impressive. All of these issues are explored at more length in Section 4.3.

Comparing ro-LMA with Traditional Linear Methods

For the K = 22 time series, the LMA-based methods do significantly better than the naïve and ARIMA methods and about twice as well as the random walk method, and this makes sense. Each point in the time series of Figure 4.1(a) is very close to its predecessor and successor, which plays to the strengths of random walk. In contrast, the oscillations of the signal and the inertia of the mean make the naïve method ineffective. The fact that the LMA-based methods outperform the random walk at all, let alone by a factor of two, is quite impressive. Successive points of this signal are so close together that it is really an ideal candidate for random-walk forecasting, leaving little room for another method to be more successful. However, both of the LMA-based techniques successfully meet this challenge: they do about 2.5 times and 1.8 times better than random walk, respectively, for ro-LMA and fnn-LMA. See Figures 4.1(b) and 4.1(c) for a visual comparison.

K = 47 is a very similar story; all of the LMA-based methods outperform naïve and ARIMA by several orders of magnitude, for the same reasons discussed in the previous paragraph. The comparison between the LMA methods and random walk is more interesting. Both ro-LMA and fnn-LMA MASE scores are almost identical to those of random-walk forecasts. For the reasons discussed above, this is not surprising; random walk is very well suited to this signal, leaving a very small margin by which to outperform it. Table 4.1 compares the forecast accuracy on all Lorenz-96 time series with each of the methods discussed above.
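The Lorenz-96 traces used here come from numerical integration of the standard model; a minimal sketch follows (the integration times, tolerances, and observed coordinate are illustrative choices, not necessarily those of Chapter 3):

    import numpy as np
    from scipy.integrate import solve_ivp

    def lorenz96(t, x, F):
        """dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, indices mod K."""
        return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

    K, F = 22, 5.0
    rng = np.random.default_rng(0)
    x0 = F + 0.01 * rng.standard_normal(K)   # perturb the x_k = F fixed point
    t = np.linspace(0.0, 100.0, 10_000)
    sol = solve_ivp(lorenz96, (t[0], t[-1]), x0, t_eval=t,
                    args=(F,), rtol=1e-8, atol=1e-8)
    trace = sol.y[0]                         # scalar observable: x_0(t)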
Validation with synthetic data is an important first step in evaluating any new forecast strategy, but experimental time-series data are the acid test if one is interested in real-world applications. My second set of tests of ro-LMA, and comparisons of its accuracy to that of traditional forecast strategies, involves data from the laboratory experiment on computer performance dynamics that was introduced in Section 3.2.1.

Table 4.1: The average 1-MASE scores of all four forecast methods for the two ensembles of Lorenz-96 time series.

    Parameters          ro-LMA         fnn-LMA        ARIMA           naïve
    {K = 22, F = 5}     0.– ± 0.016    0.– ± 0.033    17.– ± 0.310    17.– ± 0.–
    {K = 47, F = 5}     0.– ± 0.047    1.– ± 0.043    18.– ± 0.583    17.– ± 0.–

Figure 4.2: Predicted values (p_j) versus true values (c_j) for col_major and gcc, generated with four of the forecast methods considered in this thesis.

I have tested ro-LMA on traces of many different processor and memory performance metrics gathered during the execution of a variety of programs on several different computers (see, e.g., [35–37, 40, 41]). Here, for conciseness, I focus on processor performance traces from two different programs, one simple (col_major) and one complex (gcc), running on the same Intel i7-based computer. As discussed in Section 3.2.1, computer performance dynamics result from a composition of hardware and software. These two programs represent two different dynamical systems, even though they are running on the same computer. The dynamical differences are visually apparent from the traces in Figures 3.3 and 3.4; they are also mathematically apparent from nonlinear time-series analysis of embeddings of those data [83], as well as in calculations of the information content of the two signals. Among other things, gcc has much less predictive structure than col_major and is thus much harder to forecast [41]. These attributes make this a useful pair of experiments for an exploration of the utility of reduced-order forecasting.

For statistical validation, I collected 15 performance traces from the computer as it ran each program, calculated embedding parameters as described in Section 2.1.1, and generated forecasts of each trace using ro-LMA and the traditional methods outlined in Section 2.3. Figure 4.2 shows some representative examples. Recall that on such a plot, a perfect prediction would lie on the diagonal.
Table 4.2: The average 1-MASE scores of all four forecast methods for the 15 trials of col_major and gcc.

    Signal      fnn-LMA MASE   ro-LMA MASE    ARIMA MASE     naïve MASE
    col_major   0.– ± 0.002    0.– ± 0.–      0.– ± 0.211    0.– ± 0.–
    gcc         1.– ± 0.021    1.– ± 0.016    1.– ± 0.016    0.– ± 0.–

[...] (e.g., naïve) is used on a non-constant signal. In the case of Figure 4.2, fnn-LMA and ro-LMA both generate very accurate predictions of the col_major trace, while ARIMA does not. Note that the shape of Figure 4.2(b) (ARIMA on col_major) is reminiscent of the projected embedding in the right panel of Figure 3.3. This structure is also present in a p_j vs. c_j plot of a random-walk forecast (not shown) on this same signal. Indeed, for a random-walk predictor, a p_j vs. c_j plot is technically equivalent to a two-dimensional embedding with τ = 1. For ARIMA, the correspondence is not quite as simple, since the p_j values are linear combinations of a number of past values of the c_j, but the effect is largely the same.

The 1-MASE scores for ro-LMA and fnn-LMA across all 15 trials in this set of experiments were 0.– ± 0.002 and 0.– ± 0.–, respectively.
The col_major time series contains plenty of nonlinear structure that the LMA-based methods can capture and utilize, whereas ARIMA cannot. These 1-MASE scores mean that both fnn-LMA and ro-LMA perform roughly 20 times better on col_major than a random-walk predictor, while ARIMA only outperforms random walk by a factor of 1.7. This is in accordance with the visual appearance of the corresponding images in Figure 4.2. For gcc, however, ro-LMA is somewhat more accurate: 1-MASE scores of 1.– ± 0.016 versus fnn-LMA's 1.– ± 0.021. Both scores are far higher than those on col_major, simply because the gcc signal contains less predictive structure [41]. This actually makes the comparison somewhat problematic, as discussed at more length in Section 4.3.

Comparing ro-LMA to the naïve method is illustrative. ro-LMA does significantly better on all signals but gcc. This is reassuring, as gcc has very high complexity, almost no redundancy, and very little predictive structure [41]. With signals like this, simple forecast methods that do not rely on predictive structure tend to do very well; this is discussed in more depth in Chapter 6.

Table 4.2 summarizes all of the computer performance experiments presented in this discussion. Overall, these results are consistent with the Lorenz-96 example in the previous section: the prediction accuracies of ro-LMA and fnn-LMA are quite similar on all traces, despite the former's use of a theoretically incomplete reconstruction. This amounts to a validation of the conjecture on which this thesis is based. And in both the numerical and the experimental examples, ro-LMA actually outperformed fnn-LMA on the more-complex traces (gcc, K = 47). I believe that this is due to the noise mitigation that is naturally effected by a lower-dimensional reconstruction.

In this section, I explore the effects of the values of the τ parameter (Section 4.3.1), the prediction horizon (Section 4.3.2), and the data length (Section 4.3.3) on ro-LMA. For the remainder of this chapter, I discontinue comparing ro-LMA to traditional linear methods, as that comparison would not add anything to the discussion, and instead focus on the direct comparison between ro-LMA and fnn-LMA.

Figure 4.3: The effect of τ on ro-LMA forecast accuracy, for (a) Lorenz-96 with K = 22, (b) Lorenz-96 with K = 47, (c) col_major, and (d) gcc. The blue dashed curves are the average 1-MASE of the ro-LMA forecasts; the red dotted lines show ± the standard deviation. The black vertical dashed lines mark the τ that is the first minimum of the mutual information curve for each time series.

The τ Parameter
The embedding theorems require only that τ be greater than zero and not a multiple of any period of the dynamics. In practice, however, τ can play a critical role in the success of delay-coordinate reconstruction—and of any nonlinear time-series analysis that follows [33, 38, 59, 95]. It follows naturally, then, that τ might affect the accuracy of an LMA-based method that uses the structure of a time-delay reconstruction to make forecasts.

Figure 4.3 explores this effect in more detail. Across all τ values, the 1-MASE of col_major was generally lower than in the other three experiments—again, simply because this time series has more predictive structure. The K = 22 curve is generally lower than the K = 47 one for the same reason, as discussed at the end of the previous section. For both Lorenz-96 traces, prediction accuracy decreases monotonically with τ. It is known that increasing τ can be beneficial for longer prediction horizons [59]. The situation in Figure 4.3 involves short prediction horizons, so it makes sense that my observations are consistent with the contrapositive of that result.

For the experimental traces, the relationship between τ and the 1-MASE score is less simple. There is only a slight upward overall trend (not visible at the scale of the Figure) and the curves are nonmonotonic. This latter effect is likely due to periodicities in the dynamics, which are very strong in the col_major signal (viz., a dominant unstable period-three orbit in the dynamics, which traces out the top, bottom, and middle bands in Figure 3.3). Periodicities can cause obvious problems for delay reconstructions—and for forecast methods that employ them—if the delay is a harmonic or subharmonic of their frequencies, simply because the coordinates of the delay vector are then not independent samples of the dynamics. It is for this reason that Takens mentions this condition in his original paper. Here, the effect is an oscillation in the forecast accuracy vs. τ curve: low when τ is an integer multiple of the period of the dominant unstable periodic orbit in the col_major dynamics, for instance, then increasing with τ as more independence is introduced into the coordinates, then falling again as τ reaches the next integer multiple of the period, and so on.

Figure 4.4: Time-domain plots of ro-LMA forecasts of a K = 47 Lorenz-96 trace with default and best-case τ values: (a) the first minimum of the time-delayed average mutual information and (b) the minimum of the curve in Figure 4.3(b).

This naturally leads to the issue of choosing a good value for the delay parameter. Recall that all of the experiments reported so far used a τ value chosen at the first minimum of the mutual information curve for the corresponding time series. These values are indicated by the black vertical dashed lines in Figure 4.3. This estimation strategy was simply a starting point, chosen here because it is arguably the most common heuristic used in the nonlinear time-series analysis community. As is clear from Figure 4.3, though, it is not the best way to choose τ for reduced-order forecast strategies. Only in the case of col_major is the τ value suggested by the mutual-information calculation optimal for ro-LMA—that is, only there does it fall at the lowest point on the 1-MASE vs. τ curve.
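The first-minimum heuristic mentioned above can be sketched as follows, with a simple histogram estimate of the time-delayed mutual information (the bin count and search range here are arbitrary choices):

    import numpy as np

    def mutual_info(a, b, bins=32):
        """Histogram estimate of I[a; b] in bits."""
        pxy, _, _ = np.histogram2d(a, b, bins=bins)
        pxy /= pxy.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

    def first_minimum_tau(x, max_tau=100, bins=32):
        """First local minimum of I[x_j; x_{j-tau}] over tau = 1, 2, ..."""
        x = np.asarray(x, dtype=float)
        ami = [mutual_info(x[:-tau], x[tau:], bins)
               for tau in range(1, max_tau + 1)]
        for i in range(1, len(ami) - 1):
            if ami[i] < ami[i - 1] and ami[i] < ami[i + 1]:
                return i + 1                # ami[i] corresponds to tau = i + 1
        return int(np.argmin(ami)) + 1      # fallback: global minimum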
This suggests that one can often improve the performance of ro-LMA simply by choosing a different τ—i.e., by adjusting the one free parameter of that reduced-order forecast method. In all cases (aside from col_major, where the default τ was the optimal value), adjusting τ allows ro-LMA to outperform fnn-LMA. The improvement can be quite striking: for visual comparison, Figure 4.4 shows ro-LMA forecasts of a representative K = 47 Lorenz-96 trace using default and best-case values of τ. However, that comparison is not really fair. Recall that the embedding that is used by fnn-LMA, as defined so far, fixes τ at the first minimum of the average mutual information for the corresponding trace. It may well be the case that that τ value is suboptimal for that method as well—as it was for ro-LMA. To test this, I perform an additional set of experiments to find the optimal τ for fnn-LMA. Table 4.3 shows the numerical values of the 1-MASE scores, for forecasts made with default and best-case τ values, for both methods and all traces. In the two simulated examples, best-case ro-LMA significantly outperforms best-case fnn-LMA; in the two experimental examples, fnn-LMA is better, but not by a huge margin.

Table 4.3: The effects of the τ parameter. The “default” value is fixed, for both ro-LMA and fnn-LMA, at the first minimum of the average mutual information for that trace; the “best-case” value is chosen individually, for each method and each trace, from plots like the ones in Figure 4.3.

    Signal            ro-LMA          ro-LMA           fnn-LMA         fnn-LMA
                      (default τ)     (best-case τ)    (default τ)     (best-case τ)
    Lorenz-96 K=22    0.– ± 0.016     0.– ± 0.002      0.– ± 0.033     0.– ± 0.–
    Lorenz-96 K=47    0.– ± 0.047     0.– ± 0.006      1.– ± 0.043     0.– ± 0.–
    col_major         0.– ± 0.003     0.– ± 0.003      0.– ± 0.002     0.– ± 0.–
    gcc               1.– ± 0.016     1.– ± 0.014      1.– ± 0.021     1.– ± 0.–
That is, even if one optimizes τ individually for these two methods, ro-LMA keeps up with, and sometimes outperforms, fnn-LMA. Again, this supports the main point of this thesis: forecast methods based on incomplete reconstructions of time-series data can be very effective—and much less work than those that require a full embedding.

In view of my claim that part of the advantage of ro-LMA stems from the natural noise-mitigation effects of a low-dimensional reconstruction, it may appear somewhat odd that fnn-LMA works better on the experimental time-series data, which certainly contain noise. Comparisons of large 1-MASE scores are somewhat problematic, however. Recall that 1-MASE > 1 means that a forecast is less accurate than a random-walk forecast of the same data; both methods, for all τ values, generate poor predictions for gcc: 24–53% worse, on the average, than simply predicting that the next value will be equal to the previous value. There could be a number of reasons for this poor performance. This signal has almost no predictive structure [41], and fnn-LMA's extra axes may add to its ability to capture what structure there is—in a manner that outweighs the potential noise effects of those extra axes. The dynamics of col_major, on the other hand, are fairly low dimensional and dominated by a single unstable periodic orbit; it could be that the embedding of these dynamics used in fnn-LMA captures their structure so well that fnn-LMA is basically perfect and ro-LMA cannot do any better.

While the plots and 1-MASE scores in this section suggest that ro-LMA forecasts are quite good—better than traditional linear methods, and as good as or better than LMA upon true embeddings—it is important to note that both “default” and “best-case” τ values were chosen after the fact in all of those experiments. This is not useful in practice. A significant advantage of a reduced-order forecast strategy like ro-LMA is its ability to work ‘on the fly’ in situations where one may not have the leisure to run an average mutual information calculation on a long segment of the trace and find a clear minimum—let alone construct a plot like Figure 4.3 and choose an optimal τ from it. (Producing that plot required 3,000 runs involving a total of 22,010,700 forecasted points, which took approximately 44.5 hours on an Intel Core i7.)

The results that I report in Section 5.1 and in [42], however, suggest that it is possible to estimate optimal τ values for delay reconstruction-based forecasting by calculating the value that maximizes the information shared between each delay vector and the future state of the system. For all of the examples in this thesis, that strategy produces the same τ value as the one found with the exhaustive search mentioned above. This is a fairly efficient calculation—O(n log n) in the length n of the time series—although it can still be costly if n is very large. However, this measure can be calculated on very small subsets of the time series and still produce accurate results, which could allow τ to be selected adaptively for the purposes of forecasting nonstationary systems with ro-LMA.

There are fundamental limits on the prediction of chaotic systems. Positive Lyapunov exponents make long-term forecasts a difficult prospect beyond a certain point for even the most sophisticated methods [41, 59, 119]. Note that the coordinates of points in higher-dimensional delay-reconstruction spaces sample wider temporal spans of the time series. In theory, this means that one should be able to forecast further into the future with a higher-dimensional reconstruction without losing memory of the initial condition.
This raises an important concern about ro-LMA: whether its accuracy will degrade with increasing prediction horizon more rapidly than that of fnn-LMA.

Recall that the formulations of both methods, as described and deployed in the previous sections of this chapter, assume that measurements of the target system are available in real time: they “rebuild” the LMA models after each step, adding new time-series points to the embeddings or reconstructions as they arrive. Both ro-LMA and fnn-LMA can easily be modified to produce longer forecasts, however—say, h steps at a time, only updating the model with new observations at h-step intervals. Naturally, one would expect forecast accuracy to suffer as h increases for any non-constant signal. The question at issue in this section is whether the greater temporal span of the data points used by fnn-LMA mitigates that degradation, and to what extent.

In Table 4.4, I provide h-MASE scores, with h ≥ 1, for h-step versions of the different forecast experiments from Sections 4.1 and 4.2. The important comparisons here are, as mentioned above, across the rows of the table. The different methods “reach” different distances back into the time series to build the models that produce those forecasts, of course, depending on their delay and dimension. At first glance, this might appear to make it hard to sensibly compare, say, default-τ ro-LMA and best-case-τ fnn-LMA, since they use different τs and different values of the reconstruction dimension and thus are spanning different ranges of the time series. Because h is measured in units of the sample interval of the time series, however, comparing one h-step forecast to another (for the same h) does make sense.
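The h-step protocol just described can be sketched generically; `one_step` here is any single-step predictor that maps a history to a predicted next value (for instance, a wrapper around the lma_step sketch shown earlier):

    def h_step_forecast(x, split, h, one_step):
        """Forecast x[split:] in blocks of h points: iterate the one-step
        predictor on its own output within a block, then re-sync the
        model with the true observations at h-step intervals."""
        predictions = []
        for block in range(split, len(x), h):
            working = list(x[:block])          # model sees truth up to here
            for _ in range(min(h, len(x) - block)):
                p = one_step(working)
                predictions.append(p)
                working.append(p)              # predict from own output
        return predictions

    # e.g.: h_step_forecast(trace, split=5000, h=50,
    #                       one_step=lambda hist: lma_step(hist, m=2, tau=1))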
There are a number of interesting questions to ask about the patterns in this table, beginning with the one that set off these experiments: how do fnn-LMA and ro-LMA compare if one individually optimizes τ for each method? The numbers indicate that ro-LMA beats fnn-LMA for h = 1 on the K = 22 traces, but then loses progressively badly (i.e., by more σs) as h grows. col_major follows the same pattern, except that ro-LMA is worse even at h = 1. For gcc, fnn-LMA performs better at both τs and all values of h, but the disparity between the accuracy of the two methods does not systematically worsen with increasing h. For K = 47, ro-LMA consistently beats fnn-LMA for both τs for h ≤ 10, but the accuracy of the two methods is comparable for longer prediction horizons. These results suggest that optimizing τ can improve both fnn-LMA and ro-LMA and that, depending on the signal, this optimization can change the relative accuracy of the two methods. This finding catalyzed the development of the forecast-specific parameter-selection framework that is outlined in Section 5.1 and [42].

Another interesting question is whether the assertions in the previous section stand up to increasing prediction horizon. Those assertions are based on the results that appear in the h = 1 rows of the table.

[Footnote: For an explanation of h-MASE, see Section 2.4.]

Table 4.4: h-step mean absolute scaled error (h-MASE) scores for different forecast horizons (h). As explained in Section 2.4, h-MASE scores should not be compared for different h (i.e., down the columns of this table).

    Signal            h     ro-LMA          ro-LMA           fnn-LMA         fnn-LMA
                            (default τ)     (best-case τ)    (default τ)     (best-case τ)
    Lorenz-96 K=22    1     0.– ± 0.016     0.– ± 0.002      0.– ± 0.003     0.– ± 0.–
    Lorenz-96 K=22    10    0.– ± 0.008     0.– ± 0.003      0.– ± 0.011     0.– ± 0.–
    Lorenz-96 K=22    50    0.– ± 0.007     0.– ± 0.008      0.– ± 0.002     0.– ± 0.–
    Lorenz-96 K=22    100   0.– ± 0.005     0.– ± 0.004      0.– ± 0.001     0.– ± 0.–
    Lorenz-96 K=47    1     0.– ± 0.047     0.– ± 0.006      0.– ± 0.053     0.– ± 0.–
    Lorenz-96 K=47    10    0.– ± 0.011     0.– ± 0.005      0.– ± 0.042     0.– ± 0.–
    Lorenz-96 K=47    50    0.– ± 0.011     0.– ± 0.010      0.– ± 0.011     0.– ± 0.–
    Lorenz-96 K=47    100   0.– ± 0.006     0.– ± 0.005      0.– ± 0.005     0.– ± 0.–
    col_major         1     0.– ± 0.003     0.– ± 0.003      0.– ± 0.002     0.– ± 0.–
    col_major         10    0.– ± 0.006     0.– ± 0.003      0.– ± 0.001     0.– ± 0.–
    col_major         50    0.– ± 0.009     0.– ± 0.003      0.– ± 0.003     0.– ± 0.–
    col_major         100   0.– ± 0.004     0.– ± 0.006      0.– ± 0.003     0.– ± 0.–
    gcc               1     1.– ± 0.016     1.– ± 0.014      1.– ± 0.021     1.– ± 0.–
    gcc               10    0.– ± 0.009     0.– ± 0.009      0.– ± 0.007     0.– ± 0.–
    gcc               50    0.– ± 0.003     0.– ± 0.005      0.– ± 0.003     0.– ± 0.–
    gcc               100   0.– ± 0.002     0.– ± 0.003      0.– ± 0.002     0.– ± 0.–

ro-LMA was better than fnn-LMA on the K = 22 Lorenz-96 experiments, for instance, for both τ values. This pattern does not persist for longer prediction horizons: rather, fnn-LMA generally outperforms ro-LMA on the K = 22 traces for h = 10, 50, and 100. The h = 1 comparisons for K = 47 and col_major do generally persist for higher h, however. As mentioned before, gcc is problematic because its 1-MASE scores are so high, but the accuracies of the two methods are similar for all h > 1.

That fnn-LMA generally outperforms ro-LMA for longer prediction horizons makes sense simply because ro-LMA samples less of the time series and therefore has less ‘memory’ about the dynamics. This is a well-known effect [119]. In view of the fundamental limits on prediction of chaotic dynamics, however, it is worth considering whether either method is really making correct long-term forecasts. Indeed, time-domain plots of long-term forecasts (e.g., Figure 4.5) reveal that both fnn-LMA and ro-LMA forecasts have fallen off the true trajectory and onto shadow trajectories—another well-known phenomenon when forecasting chaotic dynamics [98].

In other words, it appears that even a 50-step forecast of these chaotic trajectories is a tall order: i.e., that I am running up against the fundamental bounds imposed by the Lyapunov exponents. In view of this, it is promising that ro-LMA generally keeps up with fnn-LMA in many cases—even when both methods are struggling with the prediction horizon, and even though the model that ro-LMA uses has much less memory about the past history of the trajectory. An important aspect of my future research on this topic will be developing efficient methods for deriving bounds on reasonable prediction horizons purely from the time series, i.e., without using traditional methods such as Lyapunov exponents, which are difficult to estimate from experimental data. See [42] for some of my preliminary results on this line of research.

Figure 4.5: A best-case-τ ro-LMA forecast of a K = 47 Lorenz-96 trace for h = 50. The forecast (red) follows the true trajectory (blue) for a while, falls off onto a shadow trajectory, then gets recorrected when a new set of observations is incorporated into the model after h time steps.

Most real-world data sets are fixed in length, and some are quite short. Moreover, many of the dynamical systems that one might like to predict are nonstationary. For these reasons, it is important to understand the effects of data length upon forecast methods that employ delay reconstructions. For the reconstruction to be an actual embedding that supports accurate calculations of dynamical invariants, the data requirements are fairly dire. Traditional estimates (e.g., by Smith [106] and by Tsonis et al. [116]) suggest that an impractically large number of data points would be required to embed the Lorenz-96 K = 47 data in Section 4.1, where the known d_KY values [61] indicate that one might need at least m = 38 dimensions to properly unfold the dynamics. As described in Section 2.1.1, however, that is only truly necessary if one is interested in preserving the diffeomorphism between true and reconstructed dynamics, down to the last detail. For the purposes of prediction, thankfully, one can make progress with far less data.
For example, Sauer [98] successfully forecasted the continuation of a 16,000-point time series using a 16-dimensional embedding, and Sugihara & May [112] used delay-coordinate embedding with m as large as seven to successfully forecast biological and epidemiological time-series data as short as 266 points.

While the results in the previous sections are based on far longer traces than the examples mentioned at the end of the previous paragraph, it is still worth exploring whether data-length issues are affecting those results—and evaluating whether those effects differentially impact ro-LMA because of its lower-dimensional model. This kind of test can be problematic in practice, of course, since it requires varying the length of the data set. In a synthetic example like Lorenz-96, that is not a problem, since one can just run the ODE solver for more steps.

Figure 4.6: The effects of data length on forecast error for fixed prediction horizon h = 1, for (a) the Lorenz-96 K = 22, F = 5 traces and (b) the Lorenz-96 K = 47, F = 5 traces. The dashed lines show the mean forecast error for each method; the dotted lines indicate the range of the standard deviation.

Figure 4.6 shows 1-MASE scores for ro-LMA and fnn-LMA forecasts of the Lorenz-96 system as a function of data length. For both K = 22 and K = 47, the fnn-LMA error is higher than the ro-LMA error, corroborating the results in the previous sections. Both generally improve with data length, as one would expect—but only up to a point. The 1-MASE scores of the ro-LMA forecasts, in particular, reach a plateau at about 350,000 points in the K = 22 case and 150,000 in the K = 47 case. fnn-LMA, on the other hand, keeps improving out to the end of the Figure. Eventually, there is a crossover for K = 22, but not until 1.6 million points. For K = 47, the curves are still converging out to 4 million points. The difference in crossover points is not surprising, given that the dimension of the K = 47 dynamics is so much higher: d_KY ≲ 3 for the K = 22 traces, versus ≈ 19 for K = 47. What is surprising, and useful, is that for traces with 1 million points—far more data than a practitioner can generally hope for—ro-LMA is still outperforming fnn-LMA. Moreover, it is doing so using fewer dimensions, which makes it computationally efficient.

Plateaus in curves like the ones in Figure 4.6 suggest that the corresponding forecast method has captured all of the information that it can use, so that adding data does not improve the forecast. This effect, which is described at more length in Section 5.1, depends on dimension for the obvious reason that filling out a higher-dimensional object requires more data. This suggests another piece of forecasting strategy: when one is data-rich, it may be wise to choose fnn-LMA over ro-LMA. In making this choice, though, one should also consider the added computational complexity, which will be magnified by the longer data length. When data are not plentiful, though, or the system is nonstationary, my results suggest that it is advantageous to ignore the theoretical bounds of [99] and use low-dimensional reconstructions in forecast models.

In summary, it appears that incomplete embeddings—time-delay reconstructions that do not satisfy the formal conditions of the embedding theorems—are indeed adequate for the purposes of forecasting dynamical systems. Indeed, they appear to offer simple state-space based methods even more traction on this prediction task than full embeddings, with greatly reduced computational effort.

The study in this thesis specifically focuses on the 2D instantiation of the claim above, for two reasons. First, that is the extreme that, in a sense, most seriously violates the basic tenets of the delay-coordinate embedding machinery; second, working in 2D enables the largest reduction in the cost of the near-neighbor searches that are the bulk of the computational effort in most state space-based forecast methods. A number of other issues arise when one considers increasing the reconstruction dimension beyond two, besides raw computational complexity. Among other things, that would introduce another free parameter into the method, thereby requiring some sort of principled strategy for choosing its value (a choice that ro-LMA completely avoids by fixing m = 2). In general, one would expect forecast accuracy to improve with the number of dimensions in the reconstruction, but not without limit. Among other things, noise effects grow with that dimension (simply because every noisy data point affects m points in the reconstructed trajectory). And from an information-theoretic standpoint, one would expect diminishing returns when the span of the delay vector (m × τ) exceeds the “memory” of the system. For all of those reasons, it would seem that there should be a plateau beyond which increasing the reconstruction dimension does not improve the accuracy of forecast methods that use the resulting models. I explore that issue further in [42] and synopsize the relevant material in Section 5.1. In that discussion, I use the measure mentioned in the last paragraph of Section 4.3.1 to derive optimal reconstruction dimensions for near-neighbor forecasting for a broad range of systems, noise levels, and forecast horizons—all of which turn out to be m = 2. In the bulk of the dynamical systems literature on forecasting, however, the optimal reconstruction dimension for the purposes of forecasting was thought to be near the value that provides a true embedding of the data.
The results in this thesis suggest, again, that this is not the case.

I chose the classic Lorenz method of analogues as a good exemplar of the class of state space-based forecast methods, but I believe that my results will hold for other members of that class (e.g., [23, 93, 98, 107, 112, 119]). Working with a low-dimensional reconstruction could potentially reduce the computational search and storage costs of any such method, while also avoiding the so-called “curse of dimensionality” and mitigating the noise multipliers caused by extra embedding dimensions [24]. Reduced-order reconstructions also reduce data requirements, since fewer points are required to fill out a lower-dimensional object. And when one fixes m = 2, there is only a single free parameter τ in the method—one that can be estimated effectively from a short sample of the data set, allowing the reduced-order method to adapt to nonstationary dynamics. There may be some limitations on the class of methods for which these claims hold, of course; matters may get more complicated, and the results less clear, for forecast methods that perform other kinds of projections. On the flip side, however, my results can be viewed as explaining why those methods work so well.

Again, no forecast model will be ideal for all noise-free deterministic signals, let alone all real-world time-series data sets. However, the proof of concept offered in this section is encouraging: prediction in projection—a simple yet powerful reduction of a time-tested method—appears to work remarkably well, even though the models that it uses are not necessarily topologically faithful to the true dynamics.

Chapter 5
Why it Works: A Deeper Understanding of Delay-Coordinate Reconstruction

The experimental validation of ro-LMA provided in Chapter 4 is promising, but that analysis gave rise to several unanswered questions, e.g.,

• Why does ro-LMA work when it is effectively a heresy?
• If m = 2 works so well, why not m = 3?
• How much data is necessary before m > 2 pays off?
• If one wants to forecast two or three steps into the future, is m = 2 still efficient, or should m be increased?
• Can τ be chosen a priori to optimize the accuracy of ro-LMA?
• Can all of these questions be answered strictly by analyzing the data?

This chapter provides a two-part analysis that answers many of these questions. The first part leverages information theory to select forecast-optimal parameters for delay-coordinate reconstruction. The second borrows methods from computational topology to gain new insight into the delay-coordinate embedding theory and machinery.

These two theoretical frameworks—information theory and computational topology—are mathematically disjoint but complementary in terms of developing a complete theory of reconstruction-based forecasting. In particular, the combination of these two tools allows for the construction of a new paradigm in delay-coordinate reconstruction. Section 5.1 offers a novel method, developed in collaboration with R. G. James, for leveraging the information that is stored in delay vectors to perform parameter selection that is tailored to the exact stipulations of the data set at hand (e.g., data length, signal-to-noise ratio, and desired forecast horizon). The traditional approach to this is based on the assumption that the diffeomorphism instantiated by the delay-coordinate map, which is essential for dynamical-invariant calculations, is also optimal for forecasting. As the results in Chapter 4 suggest, however, this may not be the best approach.
Section 5.1 further corroborates these findings and suggests a reason why. Section 5.2 provides a deeper theoretical understanding of delay-coordinate reconstruction through computational topology, offering yet another reason why ro-LMA works. This exploration is based on the assumption that, when forecasting, one might only require knowledge of the topology of the invariant set; in collaboration with J. D. Meiss, I conjecture that the reconstructed dynamics might be homeomorphic to the original dynamics at a lower dimension than that needed for a diffeomorphically correct embedding. This suggests why ro-LMA gets traction despite its use of an incomplete reconstruction.

The combination of these powerful mathematical tools—information theory and computational topology—allows me to construct a deeper and more complete story of reduced-order forecasting with delay-coordinate reconstruction.
As has been discussed throughout this thesis, the task of choosing good values for the free parameters in delay-coordinate reconstruction has been the subject of a large and active body of literature over the past few decades. The majority of these techniques focus on the geometry of the reconstruction, which is appropriate when one is interested in quantities like fractal dimension and Lyapunov exponents. It is not necessarily the best approach when one is building a delay reconstruction for the purposes of prediction, however, as I showed in Section 4.3.1. That issue, which is the focus of this section, has received comparatively little attention in the extensive literature on delay reconstruction-based prediction [23, 72, 93, 107, 112, 119].

In this section, I propose a robust, computationally efficient method that I call time-delayed active information storage, A_τ, which can be used to select parameter values that maximize the information shared between the past and the future—or, equivalently, that maximize the reduction in uncertainty about the future given the current model of the past [42]. The implementation details, and a complexity analysis of the algorithm, are covered in Section 5.1.1. In Section 5.1.2, I show that simple prediction methods working with A_τ-optimal reconstructions—i.e., reconstructions using parameter values that follow from the A_τ calculations—perform better, on both real and synthetic examples, than those same forecast methods working with reconstructions that are built using the traditional parameter-selection heuristics (time-delayed mutual information for τ and false near neighbors for m). Finally, in Section 5.1.3, I explore the utility of A_τ in the face of different data lengths and prediction horizons.

The information shared between the past and the future is known as the excess entropy [25]. I will denote it here by E = I[←X, →X], where I is the mutual information [121] and ←X and →X represent the infinite past and the infinite future, respectively. E is often difficult to estimate from data due to the need to calculate statistics over potentially infinite random variables [58]. While this is possible in principle, it is too difficult in practice for all but the simplest of dynamics [110]. In any case, the excess entropy is not exactly what one needs for the purposes of prediction, since it is not realistic to expect to have the infinite past or to predict infinitely far into the future. For my purposes, it is more productive to consider the information contained in the recent past and determine how much that explains about the not-too-distant future. To that end, I define the state active information storage

    A_S ≡ I[S_j, X_{j+h}]    (5.1)

where S_j is an estimate of the state of the system at time j and X_{j+h} is the state of the system h steps in the future. In the special case where the state estimate S takes the form of a standard m-dimensional delay vector, I will refer to A_S as the time-delayed active information storage

    A_τ ≡ I[[X_j, X_{j−τ}, ..., X_{j−(m−1)τ}], X_{j+h}]    (5.2)

A_τ can be nicely visualized—and compared to traditional methods like time-delayed mutual information—using the I-diagrams of Yeung, introduced in Section 2.2.3. Figure 5.1(a) shows an I-diagram of time-delayed mutual information for a specific τ. Recall that in a diagram like this, each circle represents the uncertainty in a particular variable.
The left circle in Figure 5.1(a), for instance, represents the average uncertainty in observing X_{j−τ} (i.e., H[X_{j−τ}]); similarly, the top circle represents H[X_{j+h}], the uncertainty in the h-th future observation. Also recall that each of the overlapping regions represents shared uncertainty: e.g., in Figure 5.1(a), the shaded region represents the shared uncertainty between X_j and X_{j−τ}—more precisely, the quantity I[X_j, X_{j−τ}]. Notice that minimizing the shaded region in Figure 5.1(a)—that is, rendering X_j and X_{j−τ} as independent as possible—maximizes the total uncertainty that is explained by the combined model [X_j, X_{j−τ}] (the sum of the areas of the two circles). This is precisely the argument made by Fraser and Swinney in [33]; see Section 2.1.2 for a full explanation. However, it is easy to see from the I-diagram that choosing τ in this way does not explicitly take into account explanations of the future—that is, it does not reduce the uncertainty about X_{j+h}. Moreover, this approach to selecting τ does not automatically extend to higher-dimensional embeddings: e.g., minimizing I[X_j, X_{j−τ}] may or may not minimize I[X_j, X_{j−τ}, X_{j−2τ}], and in fact this extension is non-trivial; see Section 2.2.3 for a full discussion of why this is so. The obvious next step would be to explicitly include the future in the estimation procedure.

Figure 5.1: (a) An I-diagram of the time-delayed mutual information. The circles represent uncertainties (H) in different variables; the shaded region represents I[X_j; X_{j−τ}], the time-delayed mutual information between the current state X_j and the state τ time units in the past, X_{j−τ}. Notice that the shaded region is indifferent to H[X_{j+h}], the uncertainty about the future. (b) An I-diagram of A_τ, the quantity proposed in this section: I[[X_j, X_{j−τ}]; X_{j+h}]. This quantity captures the shared information between the past, present, and future independently, as well as the information that the past and present, together, share with the future.

As I discussed in Section 2.2.3, however, explicitly including the future in the calculation—i.e., I[X_j, X_{j−τ}, X_{j+h}]—is not straightforward. The rest of this section discusses some of the common interpretations of this quantity and why they are not appropriate for the task at hand.

The interaction information [10, 76] is one such interpretation of I[X_j, X_{j−τ}, X_{j+h}], depicted in Figure 2.5(b); this is the intersection of H[X_j], H[X_{j−τ}], and H[X_{j+h}]. It describes the reduction in uncertainty that the two past states, together, provide regarding the future. While this is obviously an improvement over the time-delayed mutual information of Figure 5.1(a), it does not take into account the information that is shared between X_j and the future but not shared with the past (i.e., X_{j−τ}), and vice versa. The binding information and total correlation, depicted in Figures 2.5(c) and (d), address this shortcoming, but both also include information that is shared between the past and the present, but not with the future. This is not terribly useful for the purposes of prediction. Moreover, the total correlation overweights information that is shared between all three circles—past, present, and future—thereby artificially over-valuing information that is shared in all delay coordinates.
In the context of predicting X_{j+h}, the provenance of the information is irrelevant, and so the total correlation also seems ill-suited to the task at hand. Note that the total correlation has been used in a manner similar to the time-delayed mutual information method of estimating τ [33]: e.g., minimizing M[X_j; X_{j−τ}; X_{j−2τ}] for a three-dimensional embedding. Minimizing the total correlation is equivalent to maximizing the entropy, making the delay vectors maximally informative because dependencies among the dimensions have been minimized. While on the surface this may seem a boon to prediction, consider the issue of predicting the state of the system at time j + τ: if the coordinates of the delay vector are maximally independent, they will also be independent of the value being predicted. In light of this, the minimal total correlation approach is not well aligned with the goal of prediction.

Time-delayed active information storage addresses all of the issues raised in the previous paragraphs. By treating the generic delay vector as a joint variable, rather than a series of single variables, A_τ captures the shared information between the past, present, and future independently—the left and right colored wedges in Figure 5.1(b)—as well as the information that the past and present, together, share with the future (the center wedge). By choosing delay-reconstruction parameters that maximize A_τ, then, one can explicitly maximize the amount of information that each delay vector contains about the future [42].

That property means that A_τ can be used to select τ for ro-LMA. Specifically, to estimate a “forecast-optimal” τ value for ro-LMA using A_τ, one would simply calculate A_τ = I[[X_j, X_{j−τ}], X_{j+h}] for a range of τ, choosing the first maximum (i.e., minimizing the uncertainty about the h-th future observation). In Section 5.1.2, I explore that claim using ro-LMA and fnn-LMA, but the exploration can easily be extended to any time-delayed state estimator—such as the methods used in [23, 107, 112, 119]—by using the general form of A_τ, viz., the state active information storage A_S. For example, if the time series is pre-processed (e.g., via a Kalman filter [108], a low-pass filter and an inverse Fourier transform [98], or some other local-linear transformation [23, 59, 107, 112, 119]), the state estimator simply becomes S_j = x̂_j, where x̂_j is the processed m-dimensional delay vector.
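A minimal sketch of that recipe appears below. The thesis's estimates use a KSG estimator (Section 2.2.5.2); the simple histogram binning here is only to keep the sketch short, and the bin count is an arbitrary choice:

    import numpy as np

    def a_tau(x, tau, m=2, h=1, bins=16):
        """Histogram estimate of Eq. (5.2): mutual information between the
        joint delay vector [x_j, x_{j-tau}, ..., x_{j-(m-1)tau}] and the
        future value x_{j+h}, in bits."""
        x = np.asarray(x, dtype=float)
        start, stop = (m - 1) * tau, len(x) - h
        past = np.column_stack([x[start - k * tau:stop - k * tau]
                                for k in range(m)])
        future = x[start + h:stop + h]

        def entropy(rows):
            _, counts = np.unique(rows, axis=0, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        def discretize(v):
            return np.digitize(v, np.histogram_bin_edges(v, bins))

        past_d = np.column_stack([discretize(col) for col in past.T])
        future_d = discretize(future)[:, None]
        # I[past; future] = H[past] + H[future] - H[past, future]
        return (entropy(past_d) + entropy(future_d)
                - entropy(np.hstack([past_d, future_d])))

    # forecast-optimal tau for ro-LMA: first maximum over a range of tau,
    # e.g., scores = [a_tau(trace, t, m=2, h=1) for t in range(1, 51)]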
This section demonstrates how to use A_τ to choose parameter values for delay-coordinate reconstructions constructed specifically for the purposes of forecasting, using several of the case studies presented in Chapter 3. For the discussion that follows, the term “A_τ-optimal” refers to the parameter values (m and τ) that maximize A_τ over a range of m and τ. The general parameter-selection framework is presented first, without assuming the use of either fnn-LMA or ro-LMA, and then the A_τ-optimal reconstructions are compared to fnn-LMA and ro-LMA. For simplicity, in this initial discussion, forecast horizons are fixed at h = 1 for each experiment. For the A_τ calculations, this means that A_τ = I[S_j, X_{j+1}], with S_j = [X_j, X_{j−τ}, ..., X_{j−(m−1)τ}]^T. Recall that with one-step forecasts, 1-MASE is the figure of merit. Section 5.1.3.2 considers increasing the prediction horizon, using h-MASE with h > 1 to assess accuracy. Section 5.1.3.1 considers the effects of the length of the traces.

Figure 5.2: The effects of reconstruction parameter values on A_τ and forecast accuracy for the Lorenz-96 system. (a) A_τ values for different delay reconstructions of a representative trace from that system with {K = 22, F = 5}. (b) 1-MASE scores for LMA forecasts on different delay reconstructions of that trace.

The first step in this demonstration uses some standard synthetic examples, both maps (Hénon, logistic) and flows: the classic Lorenz 63 system [71] and the Lorenz 96 atmospheric model [73]. The dynamics of each of these systems are reconstructed from the traces described in Chapter 3 using different values of m and τ. A_τ is computed for each of those reconstructed trajectories using a Kraskov-Stögbauer-Grassberger (KSG) estimator [63], as described in Section 2.2.5.2. LMA is then used to generate forecasts of every trace using each {m, τ} pair, the 1-MASE scores of those forecasts are computed as described in Section 2.4, and the relationship between the 1-MASE scores and the A_τ values for the corresponding time series is discussed.

Flow Examples
Figure 5.2(a) shows a heatmap of the A_τ values for reconstructions of a representative trajectory from the Lorenz 96 system with {K = 22, F = 5}, for a range of m and τ. Not surprisingly, this image reveals a strong dependency between the values of the reconstruction parameters and the reduction in uncertainty about the near future that is provided by the reconstruction. Very low τ values, for instance, produce delay vectors that have highly redundant coordinates, and so provide substantial information about the immediate future. Again, the standard heuristics focus only on minimizing redundancy between coordinates, choosing the τ value that minimizes the mutual information between the first two coordinates in the delay vector. For this Lorenz 96 trajectory, that approach [33] yields τ = 26, while standard dimension-estimation heuristics [62] suggest m = 8. The A_τ value for a delay reconstruction built with those parameter values is 3.– ± 0.–. This is not, however, the A_τ-optimal reconstruction; choosing m = 2 and τ = 1, for instance, results in a higher value (A_τ = 5.– ± 0.–), i.e., significantly more reduction in uncertainty about the future. This may be somewhat counter-intuitive, since each of the delay vectors in the A_τ-optimal reconstruction spans far less of the data set, and thus one would expect points in that space to contain less information about the future. Figure 5.2(a) suggests, however, that this is in fact not the case; rather, the uncertainty increases with both dimension and time delay.

The question at issue in this section is whether that reduction in uncertainty about the future correlates with improved accuracy of an LMA forecast built from that reconstruction. Since the A_τ-optimal choices maximize the shared information between the state estimator and X_{j+1}, one would expect a delay reconstruction model built with those choices to afford LMA the most leverage. To test that conjecture, I perform an exhaustive search with m = 2, …, 15 and τ = 1, …, 50; a sketch of the forecast-and-score loop follows.
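The sketch below shows one way to implement that loop: a single-nearest-analogue version of Lorenz's method of analogues (LMA) plus the 1-MASE score of Section 2.4. It is a simplified stand-in for the implementations described earlier in the thesis; the k = 1 analogue step and the train/test split are my choices, and delay_matrix is repeated from the sketch above.

    import numpy as np
    from scipy.spatial import cKDTree

    def delay_matrix(x, m, tau):  # as in the A_tau sketch above
        start = (m - 1) * tau
        n = len(x) - start
        return np.column_stack([x[start - k * tau: start - k * tau + n] for k in range(m)])

    def lma_one_step(x, m, tau, n_train):
        """One-step forecasts of x[n_train:], rebuilding the model at each step."""
        preds = []
        for end in range(n_train, len(x)):
            S = delay_matrix(x[:end], m, tau)
            _, i = cKDTree(S[:-1]).query(S[-1])      # nearest analogue of current state
            preds.append(x[(m - 1) * tau + i + 1])   # forecast = that analogue's successor
        return np.array(preds)

    def mase(true, pred, train):
        """h=1 MASE: forecast MAE scaled by the in-sample random-walk MAE."""
        return np.mean(np.abs(true - pred)) / np.mean(np.abs(np.diff(train)))

    # exhaustive search over the reconstruction parameters (slow but direct):
    # scores = {(m, t): mase(x[n0:], lma_one_step(x, m, t, n0), x[:n0])
    #           for m in range(2, 16) for t in range(1, 51)}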
For each {m, τ} pair, I use LMA to generate forecasts from the corresponding reconstruction, compute their 1-MASE scores, and plot the results in a heatmap similar to the one in Figure 5.2(a). As one would expect, the 1-MASE and A_τ heatmaps are generally antisymmetric. This antisymmetry breaks down somewhat for low m and high τ, where the forecast accuracy is low even though the reconstruction contains a lot of information about the future.

I suspect that this breakdown is due to a combination of overfolding (too-large values of τ) and projection (low m). Even though each point in an overfolded reconstruction may contain a lot of information about the future, the false crossings created by this combination of effects pose problems for a near-neighbor forecast strategy like LMA. The improvement that occurs if one adds another dimension is consistent with this explanation. Notice, too, that this effect only occurs far from the maximum in the A_τ surface—the area that is of interest if one is using A_τ to choose parameter values for reconstruction models.

In general, though, maximizing the redundancy between the state estimator and the future does appear to minimize the resulting forecast error of LMA. Indeed, the minimum on the surface of Figure 5.2(b) (m = 2, τ = 1) is exactly the maximum on the surface of Figure 5.2(a). The accuracy of this forecast is almost six times higher than that of a forecast built with the heuristically chosen parameters (1-MASE = 0.– ± 0.– versus 0.– ± 0.–). Note, too, that the optima here are not isolated points; i.e., there may be ranges of m and τ for which A_τ and 1-MASE are optimal, and roughly constant. In these cases, it makes sense to choose the lowest m on the plateau, since that minimizes computational effort, data requirements, and noise effects. Notice that in this experiment, m = 2 was actually the A_τ-optimal reconstruction dimension, and that correspondence let me calculate the forecast-optimal τ for ro-LMA without exhaustive search.

While the results discussed in the previous paragraphs do provide a preliminary validation of the claim that one can use A_τ to select good parameter values for delay reconstruction-based forecast strategies, they only involve a single example system. Similar experiments on traces from the Lorenz 96 system with different parameter values {K = 47, F = 5} (not shown) demonstrate identical results—indeed, the heatmaps are visually indistinguishable from the ones in Figure 5.2. Furthermore, for {K = 47, F = 5}, m = 2 is again the A_τ-optimal reconstruction dimension, and A_τ again estimates the forecast-optimal τ for ro-LMA—quickly, without exhaustive search. Figure 5.3 shows heatmaps of A_τ and 1-MASE for similar experiments on the canonical Lorenz 63 system [71]. As in the Lorenz 96 case, the heatmaps are generally antisymmetric, confirming that maximizing A_τ is roughly equivalent to minimizing 1-MASE. Again, though, the antisymmetry is not perfect; for high τ and low m, the effects of projecting an overfolded attractor cause false crossings that trip up LMA. As before, adding a dimension mitigates this effect by removing these false crossings. Both the Lorenz 63 and Lorenz 96 plots show a general decrease in predictability for large m and high τ, with roughly hyperbolic equipotentials dividing the colored regions.
Figure 5.3: The effects of reconstruction parameter values on A_τ and forecast accuracy for the Lorenz 63 system. (a) A_τ values for different delay reconstructions of a representative trace from that system. (b) 1-MASE scores for LMA forecasts on different delay reconstructions of that trace.

The locations and heights of these equipotentials differ because the two signals are not equally easy to predict; this matter is discussed further at the end of this section.

Numerical A_τ and 1-MASE values for LMA forecasts on different reconstructions of both Lorenz systems are tabulated in the top three rows of Table 5.1, along with the reconstruction parameter values that produced those results. These results bring out two important points. First, as suggested by the heatmaps, the m and τ values that maximize A_τ (termed m_{A_τ} and τ_{A_τ} in the table) are close, or identical, to the values that minimize 1-MASE (m_E and τ_E) for all three Lorenz systems. This is notable because—as discussed in Section 5.1.3.1—the former can be estimated quite reliably from a small sample of the trajectory in only a few seconds of compute time, whereas the exhaustive search involved in computing m_E and τ_E for Table 5.1 required close to 30 hours of CPU time per system/parameter-set ensemble. A second important point that is apparent from Table 5.1 is that delay reconstructions built using the traditional heuristics—the values with the H subscript—are comparatively ineffective for the purposes of LMA-based forecasting. This is notable because that is the default approach in the literature on state-space based forecasting methods for dynamical systems. Moreover, in all cases m_E and m_{A_τ} are far lower than what the embedding theory would suggest, further corroborating the basic premise of this thesis.

A close comparison of Figures 5.2 and 5.3 brings up another important point: some time series are harder to forecast than others. Figure 5.4 breaks down the details of the two suites of Lorenz 96 experiments, showing the distribution of A_τ and 1-MASE values for all of the reconstructions. Although there is some overlap in the K = 22 and K = 47 histograms—i.e., best-case forecasts of the former are better than most of the forecasts of the latter—the K = 47 traces generally contain less information about the future and thus are harder to forecast accurately. As discussed in Section 4.1, this is to be expected. (Note that the color-map scales are not identical across all heatmap figures in this thesis; rather, they are chosen individually, to bring out the details of the structure of each experiment.)

Table 5.1: 1-MASE_H is the representative accuracy of LMA forecasts that use delay reconstructions with parameter values (m_H and τ_H) chosen via standard heuristics for the corresponding traces. Similarly, 1-MASE_{A_τ} is the accuracy of LMA forecasts that use reconstructions built with the m and τ values that maximize A_τ, and 1-MASE_E is the error of the best forecasts for each case, found via exhaustive search over the {m, τ} parameter space. ∗∗: on these signals the standard heuristics failed.

Signal           | 1-MASE_H        | 1-MASE_{A_τ}      | 1-MASE_E          | {m_H, τ_H} | {m_{A_τ}, τ_{A_τ}} | {m_E, τ_E}
Lorenz 96 K = 22 | 0.– ± 0.033     | 0.– ± 0.002       | 0.– ± 0.–         | {8, 26}    | {2, 1}             | {2, 1}
Lorenz 96 K = 47 | 1.– ± 0.043     | 0.– ± 0.006       | 0.– ± 0.–         | {–, –}     | {2, –}             | {–, –}
Lorenz 63        | 0.– ± 0.008     | 0.– ± 0.006       | 0.– ± 0.–         | {–, –}     | {–, –}             | {–, –}
Hénon map        | ∗∗              | –×10^– ± –×10^–   | –×10^– ± –×10^–   | {∗∗, ∗∗}   | {2, 1}             | {–, –}
Logistic map     | ∗∗              | –×10^– ± –×10^–   | –×10^– ± –×10^–   | {∗∗, ∗∗}   | {1, 1}             | {–, –}

Figure 5.4: Histograms of A_τ and 1-MASE values for all traces from the Lorenz 96 {K = 22, F = 5} and {K = 47, F = 5} systems, for all {m, τ} values in Figures 5.2 and 5.3: (a) A_τ; (b) 1-MASE.

Map Examples

For maps, the standard heuristics for choosing m and τ often fail. The time-delayed mutual information of [33], for example, may decay exponentially, without showing any clear minimum. And the lack of spatial continuity of the orbit of a map violates the underlying idea behind the method of [62]. State space-based forecasting methods can, however, be very useful in generating predictions of trajectories from systems like this—if one selects the two free parameters properly.

In view of this, it would be particularly useful if one could use A_τ to choose embedding parameter values for maps. This section explores that notion using two canonical examples, shown in the bottom two rows of Table 5.1. For the Hénon map

x_{n+1} = 1 − a x_n² + y_n    (5.3)
y_{n+1} = b x_n    (5.4)

with a = 1.4 and b = 0.3,
the A_τ-optimal parameter values, m = 2 and τ = 1, occur at A_τ = 6.– ± 0.– (1-MASE = –×10^– ± –×10^–). These parameter values make sense, of course; a first-return map of the x coordinate is effectively the Hénon map, so [x_j, x_{j−1}] is a perfect state estimator (up to a scaling term). But in practice, of course, one rarely knows the underlying dynamics of the system that generated a time series, so the fact that one can choose good reconstruction parameter values by maximizing A_τ is notable—especially since the standard heuristics for that purpose fail for this system. The same pattern holds for the logistic map, x_{n+1} = r x_n (1 − x_n), with r = 3.65; both maps are sketched below.
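For concreteness, here is a minimal sketch of the two maps as used in these experiments. The parameter values come from the text; the initial conditions and burn-in length are my choices (the validation experiments use randomly chosen initial conditions).

    import numpy as np

    def henon_series(n, a=1.4, b=0.3, burn=1000):
        """Observe the x coordinate of the Henon map (Equations 5.3-5.4)."""
        x, y, out = 0.1, 0.1, np.empty(n)
        for i in range(n + burn):
            x, y = 1.0 - a * x * x + y, b * x
            if i >= burn:
                out[i - burn] = x
        return out

    def logistic_series(n, r=3.65, burn=1000):
        """The logistic map x_{n+1} = r x_n (1 - x_n)."""
        x, out = 0.4, np.empty(n)   # fixed initial condition, for illustration
        for i in range(n + burn):
            x = r * x * (1.0 - x)
            if i >= burn:
                out[i - burn] = x
        return out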
Again, for validation, I generate 15 trajectories from randomly chosen initial conditions. For this ensemble of experiments, the A_τ-optimal parameter values, which occur at A_τ = 9.– ± 0.– (1-MASE = –×10^– ± –×10^–), are m = 1 and τ = 1. As in the Hénon example, these values make complete sense, given the form of the map. But again, one does not always know the form of the system that generated a given time series. In both of these map examples, the standard heuristics fail, but A_τ clearly indicates that one does not actually need to reconstruct these dynamics—rather, that near-neighbor forecasting on the time series itself is the best approach.

The previous section provided a preliminary verification of the conjecture that parameters that maximize A_τ also maximize forecast accuracy for LMA, for both maps and flows. While experiments with synthetic examples are useful, it is important to show that A_τ is also a useful way to choose parameter values for delay reconstruction-based forecasting of real-world data, where the time series are noisy and perhaps short, and one does not know the dimension of the underlying system—let alone its governing equations. This section extends the exploration of the previous section using experimental data from two different dynamical systems: a far-infrared laser and a laboratory computer-performance experiment.

A Far-Infrared Laser
I begin this discussion by returning to the canonical test case from Chapter 1, SFI dataset A [119], which was gathered from a far-infrared laser. As in the synthetic examples in Section 5.1.2.1, the A_τ and 1-MASE heatmaps (Figure 5.5) are largely antisymmetric for this signal. Again, there is a band across the bottom of each image because of the combined effects of overfolding and projection. Note the similarity between Figures 5.5 and 5.3: the latter resemble "smoothed" versions of the former.

Figure 5.5: The effects of reconstruction parameter values on A_τ and forecast accuracy for SFI dataset A. (a) A_τ values for different delay reconstructions of that signal. (b) 1-MASE scores for LMA forecasts of those reconstructions.

It is well known [119] that SFI dataset A is well described by the Lorenz 63 system with some added noise, so this similarity is reassuring. An LMA forecast using the A_τ-optimal reconstruction of this trace was more accurate than similar forecasts using reconstructions built using traditional heuristics: 1-MASE_{A_τ} = 0.–, versus 1-MASE_H = 0.– and 1-MASE_E = 0.–. Note that {m_{A_τ}, τ_{A_τ}} and {m_E, τ_E} are not identical for this signal. This is because the optima in the heatmaps in Figure 5.5 are bands, rather than unique points—as was the case in the synthetic examples in Section 5.1.2.1. In a situation like this, a range of {m, τ} values are statistically indistinguishable from the standpoint of the forecast accuracy afforded by the corresponding reconstruction. The values suggested by the A_τ calculation (m_{A_τ} = 9 and τ_{A_τ} = 1) and by the exhaustive search (m_E = 7, τ_E = 1) are all on this plateau; those suggested by the traditional heuristics (m_H = 7, τ_H = 3), however, are not. Again, these results suggest that one can use A_τ to choose good parameter values for delay reconstruction-based forecasting, but SFI dataset A is only a single trace from a fairly simple system.

Computer Performance Dynamics
Finally, I will return to the computer performance dynamics of col major and 403.gcc: experiments that involve multiple traces from each system, which allows for statistical analysis. As in the previous examples (Lorenz 63, Lorenz 96, Hénon map, logistic map, SFI dataset A), heatmaps of 1-MASE and A_τ for a representative col major time series—Figure 5.6—are largely antisymmetric. And again, reconstructions using the A_τ-optimal parameter values allowed LMA to produce highly accurate forecasts of this signal: 1-MASE_{A_τ} = 0.– ± 0.–. (Note that the SFI dataset A 1-MASE values in the previous section are not averages, as there is only one trace.)

Figure 5.6: The effects of reconstruction parameter values on A_τ and forecast accuracy for a representative trace of col major. (a) A_τ values for different delay reconstructions of that trace. (b) 1-MASE scores for LMA forecasts on those reconstructions.

This is close to the optimal 1-MASE_E = 0.– ± 0.–. A new feature in these heatmaps is the periodic striping, which stems from the dominant unstable three-cycle in the col major dynamics. When τ is a multiple of this period (τ = 3κ), the coordinates of the delay vector are not independent, which lowers A_τ and makes forecasting more difficult. (There is a nice theoretical discussion of this effect in [99].) Conversely, A_τ spikes and 1-MASE plummets when τ = 3κ − 1,
since the coordinates in such a delay vector cannot share any prime factors with the period of the orbit. The band along the bottom of both images is, again, due to a combination of overfolding and projection. The other 14 traces in this experiment yield structurally identical heatmaps, and the variance between these trials was only ±0.037 on average.

Another difference between the col major heatmaps and the ones in Figures 5.2, 5.3, and 5.5 is the apparent overall trend: the "good" regions (low 1-MASE and high A_τ) are in the lower-left quadrants of those heatmaps, but in the upper-right quadrants of Figure 5.6. This is partly an artifact of the difference in the color-map scale, which is chosen here to bring out some important details of the structure, and partly due to that structure itself. Specifically, the optima of the col major heatmaps—the large dark red and blue regions in Figures 5.7(a) and (b), respectively—are much broader than the ones in the earlier discussion of this section, perhaps because the signal is so close to periodic. (This is also the case, to some extent, in the SFI dataset A example, for the same reason.) This geometry makes precise comparisons of A_τ-optimal and 1-MASE-optimal parameter values somewhat problematic, as the exact optima on two almost-flat but slightly noisy landscapes may not be in the same place. Indeed, the A_τ values at {m_{A_τ}, τ_{A_τ}} and {m_E, τ_E} are within a standard error across all 15 traces of col major.

And that brings up an interesting tradeoff. For practical purposes, what one wants is {m_{A_τ}, τ_{A_τ}} values that produce a 1-MASE value that is close to the optimum 1-MASE_E. However, the algorithmic complexity of most nonlinear time-series analysis and prediction methods scales badly with m. In cases where the A_τ maximum is broad, then, one might want to choose the lowest value of m on that plateau—or even a value that is on the shoulder of that plateau, if one needs to balance efficiency against accuracy. As I showed in Chapter 4, ro-LMA does just that, and that appears to work quite well for the col major data. This amounts to marginalizing the heatmaps in Figure 5.6 with m = 2, which produces cross sections like the ones shown in Figure 5.7.

Figure 5.7: 1-MASE and A_τ for ro-LMA forecasts of all 15 col major traces, plotted as a function of τ. The blue dashed curves show the averages across all trials; the red dotted lines are that average ± the standard deviation. (a) A_τ values for delay reconstructions of these traces with m = 2 and a range of values of τ. (b) 1-MASE scores for ro-LMA forecasts of those reconstructions.

The antisymmetry between A_τ and 1-MASE is quite apparent in these plots; the global maximum of the former coincides with the global minimum of the latter (0.– ± 0.–) at τ = 2. This is not much higher than the overall optimum of 0.– ± 0.–. To select parameters via A_τ in this setting, one simply fixes m = 2, as is done with ro-LMA, then selects τ by calculating A_τ across a range of τs, rather than across a 2D {m, τ} space.

The correspondence between 1-MASE and A_τ also holds true for other marginalizations: i.e., the minimum 1-MASE and the maximum A_τ occur at the same τ value for all m-wise slices of the col major heatmaps, to within statistical fluctuations. The methods of [33] and [62], incidentally, suggest τ_H = 2 and m_H = 12 for these traces; the average 1-MASE of an LMA forecast on such a reconstruction is 0.– ± 0.–, worse than the m = 2 marginalization, although still short of the overall optimum. The correspondence between τ_H and τ_{A_τ} is a coincidence; for this particular signal, maximizing the independence of the coordinates happens to maximize the information about the future that is contained in each delay vector. This is most likely due to the strength of the unstable three-cycle present in these dynamics.
(In this case, the coordinates would be maximally independent, and contain the most information about the future, when τ = ρ − 1, where ρ is the period of the dynamics.) The m = 12 result is not a coincidence—and it is quite interesting, in view of the fact that the m = 2 forecast is so good. It is also surprising in view of the huge number of transistors—potential state variables—in a modern computer. As described in [83], however, the hardware and software constraints in these systems confine the dynamics to a much lower-dimensional manifold.

The col major program is what is known in the computer-performance literature as a "microkernel": an extremely simple example that is used in proof-of-concept testing. The fact that its dynamics are so rich speaks to the complexity of the hardware (and the hardware-software interactions) in modern computers; again, see [83,84] for a much deeper discussion of these issues. Modern computer programs are far more complex than this simple microkernel, of course, which raises the question: what does A_τ tell us about the dynamics of truly complex systems, like the memory or processor usage patterns of sophisticated programs—which the computer performance community models as stochastic systems?

For 403.gcc, the answer is, again, that A_τ appears to be an effective and efficient way to assess predictability. As shown in [41] and synopsized in Chapter 6, this time series shares little to no information with the future: i.e., it cannot be predicted using delay reconstruction-based forecasting methods, regardless of τ and m values. The experiments in [41] required dozens of hours of CPU time to establish that conclusion; A_τ gives the same results in a few seconds, using much less data. The structure of the heatmaps for this experiment, as shown in Figure 5.8, is radically different. The patterns visible in the previous 1-MASE plots, and the antisymmetry between the A_τ and 1-MASE plots, are absent from this pair of images, reflecting the lack of predictive content in this signal. Note, too, that the color-map scales are different in this figure. This reflects the much lower values of A_τ for this signal: over all 15 experiments of 403.gcc, for the parameter range in Figure 5.8, A_τ reached an absolute maximum of 0.7722, compared to the absolute maximum of 5.3026 for all experiments of the Lorenz 96 with K = 22. Indeed, the 1-MASE surface in Figure 5.8(b) never dips below 1.0; Figure 5.2, in contrast, never exceeds ≈ 0.3. That is, for all 15 traces of 403.gcc, regardless of parameter choice, LMA forecasts are no better than simply using the prior value of this scalar time series as the prediction. In comparison, in every experiment with Lorenz 96 K = 22—regardless of parameter choice—the 1-MASE for LMA generally stays below 0.3: more than twice as good as a random walk. The uniformly low A_τ values in Figure 5.8(a) are an effective indicator of this—and, again, they can be calculated quickly, from a relatively small sample of the data. It is to that issue that I turn next.

In some real-world situations, it may be impractical to rebuild forecast models at every step, as I have done in the previous sections of this thesis—because of computational expense, for instance, or because the data rate is very high. In these situations, one may wish to predict h time steps into the future, then stop and rebuild the model to incorporate the h points that have arrived during that period, and repeat.
Figure 5.8: The effects of reconstruction parameter values on A_τ and forecast accuracy for a representative trace from a computer-performance dynamics experiment using the 403.gcc benchmark. (a) A_τ values for different delay reconstructions of this trace. (b) 1-MASE scores for LMA forecasts on those reconstructions.

In chaotic systems, of course, there are fundamental limits on the prediction horizon, even if one is working with infinitely long traces of all state variables. A key question at issue in this section is how that effect plays out in forecast models that use delay reconstructions of scalar time-series data. I explore that issue in Section 5.1.3.2. And since real-world data sets are not infinitely long, it is also important to understand the effects of data length on the estimation of A_τ. I explore this question in the following section, using one-step-ahead forecasts so that I can compare the results to those in the previous sections.

A_τ Estimation
The quantity of data used in a delay reconstruction directly impacts the usefulness of that reconstruction. If one is interested in approximating the correlation dimension via the Grassberger-Procaccia algorithm, for instance, it has been shown that one needs 10^(2+0.4m) data points [106,116]. Those bounds are overly pessimistic for forecasting, however, as mentioned in Section 4.3.3. A key challenge, then, is to determine whether one's time series really calls for as many dimensions and data points as the theoretical results require, or whether one can get away with fewer dimensions—and how much data one needs in order to figure all of that out.

A_τ is a useful solution to those challenges. As established in the previous sections, calculations of this quantity can reveal what dimension is required for delay reconstruction-based forecasting of dynamical systems. And, as alluded to there, A_τ can be estimated accurately from a surprisingly small number of points. The experiments in this section explore that intertwined pair of claims in more depth by increasing the length of the Lorenz 96 traces and testing whether the information content of the state estimator derived from standard heuristics converges to that of the A_τ-optimal estimator. (This kind of experiment is not possible in practice, of course, when the time series is fixed, but it can be done in the context of this synthetic example.)

Figure 5.9 shows the results. When the data length is short, the m = 2 state estimator has the most information about the future. This makes perfect sense; a short time series cannot fully sample a complicated object, and when an ill-sampled high-dimensional manifold is projected into a low-dimensional space, infrequently visited regions of that manifold can act effectively like noise. From an information-theoretic standpoint, this would increase the effective Shannon entropy rate of each of the variables in the delay vector. In the I-diagram in Figure 5.1(b), this would manifest as a drifting apart of the two circles, decreasing the size of the shaded region that one needs to maximize for effective forecasting.

If that reasoning is correct, longer data lengths should fill out the attractor, thereby mitigating the spurious increase in the Shannon entropy rate and allowing higher-dimensional reconstructions to outperform lower-dimensional ones. This is indeed what happens, as shown in Figure 5.9.

Figure 5.9: Average optimal A_τ versus data length for all 15 traces from the Lorenz 96 system, using τ = 1 in all cases. Blue circles correspond to m = 2, purple diamonds to m = 4, and red ×s to m = 8. (a) Optimal A_τ for traces from the {K = 22, F = 5} system. (b) Optimal A_τ for traces from the {K = 47, F = 5} system.

For both the K = 22 and K = 47 traces, once the signal is 2 million points long, the four-dimensional estimator stores more information about the future than the two-dimensional one. Note, though, that the optimal A_τ of the m = 8 reconstruction model is still lower than in the m = 2 or m = 4 cases, even at the right-hand limit of the plots in Figure 5.9. That is, even with a time series that contains 4 × 10^6 points, it is more effective to use a lower-dimensional reconstruction to make an LMA forecast. (A sketch of this data-length experiment follows.)
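The following is a sketch of that experiment, assuming the active_info (KSG) function from the earlier sketch in this section. The prefix-length grid is my choice, and in practice one would subsample or parallelize, since KSG estimation on multi-million-point series is expensive.

    import numpy as np

    def atau_vs_length(x, ms=(2, 4, 8), tau=1, n_lengths=12):
        """Estimate A_tau on prefixes of increasing length, for several m."""
        lengths = np.geomspace(10_000, len(x), n_lengths).astype(int)
        curves = {m: np.array([active_info(x[:n], m, tau) for n in lengths])
                  for m in ms}
        return lengths, curves   # plot curves[m] vs. lengths, as in Figure 5.9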
But the really important message here is that A_τ allows one to determine the best reconstruction parameters for the available data, which is an important part of the answer to the challenges outlined at the beginning of this chapter.

Something very interesting happens in the m = 2 results for the Lorenz 96 model with K = 47: the A_τ curve reaches a maximum value around 100,000 points and stops increasing, regardless of data length. What this means is that this two-dimensional reconstruction contains as much information about the future as can be ascertained from the ro-LMA state estimator, suggesting that increasing the length of the training set would not improve forecast accuracy. To explore this, I construct LMA forecasts of different-length traces (100,000–2.2 million points) from this system, then reconstruct their dynamics with different m values and the appropriate τ_{A_τ} for each case, and—again—repeat this full experiment 15 times for statistical validation. For m = 2, both the A_τ and 1-MASE results did indeed plateau at 200,000 points—at 5.– ± 0.– and 0.– ± 0.–, respectively—while the m = 4 forecast accuracy surpassed m = 2 at around 2 million points. In neither case, by the way, did m = 8 catch up to either m = 2 or m = 4, even at 4 million data points. Of course, one must consider the cost of storing the additional variables in a higher-dimensional model, particularly in data sets this long, so it may be worthwhile in practice to settle for the m = 2 forecast—which is only slightly less accurate and requires only 200,000 points. This has another major advantage as well: if the time series is non-stationary, a forecast strategy that requires fewer points is particularly useful because it can adapt more quickly.

So far in this section, I have considered forecasts that were constructed one step at a time and studied the correspondence of their accuracy with one-step-ahead calculations of A_τ. Here, I consider longer prediction horizons (h) and explore whether one can use an h-step-ahead version of A_τ—i.e., I[S_j, X_{j+h}], with h > 1—to select reconstruction parameter values that minimize the uncertainty about the observation h steps in the future.

Of course, one would expect the A_τ-optimal {m, τ} values for a given time series to depend on the prediction horizon. It has been shown, for instance, that longer-term forecasts generally do better with larger τ [59], and conversely [38]. It also makes sense that one might need to reach different distances into the past (via the span of the delay vector) in order to reduce the uncertainty about events that are further into the future [119]. All of these effects are corroborated by A_τ. Figure 5.10 demonstrates this with a representative trace of the K = 22 Lorenz 96 system, focusing on m = 2 for simplicity. The topmost dashed curve in this figure is for the h = 1 case—i.e., a horizontal slice of Figure 5.2(a) made at m = 2. The maximum of this curve is the optimal τ value (τ_{A_τ}) for this reconstruction. The overall shape of this curve reflects the monotonic increase in the uncertainty about the future with τ that is noted on page 58. The other curves in Figure 5.10 show A_τ as a function of τ for h = 2, …, down to h = 100. Note that the lower curves do not decrease monotonically; rather, there is a slight initial rise. This is due to the issue raised above about the span of the delay vector: if one is predicting further into the future, it may be useful to reach further into the past.
In general, this causes the optimal τ to shift to the right as the prediction horizon increases, going down the plot—i.e., longer prediction horizons require larger τs (cf. [59]).

Figure 5.10: The effects of prediction horizon (h) on A_τ for a representative time series of the K = 22 Lorenz 96 system for a fixed reconstruction dimension (m = 2). The curves in the image, from top to bottom, correspond to prediction horizons of h = 1 to h = 100.

Figure 5.11: The effects of prediction horizon (h) on average A_τ (over 22 trials) of the K = 22 Lorenz 96 system for a fixed time delay (τ = 1) and two different reconstructions of the system. The red line represents m = 2; the blue represents m_H = 8, the value suggested for this signal by the technique of false neighbors.

For very long horizons, the choice of τ appears to matter very little. In particular, A_τ is fairly constant (and quite low) for 5 < τ < 50 when h > 30; i.e., regardless of the choice of τ, there is very little information about the h-distant future in any delay reconstruction of this signal for h >
30. This effect should not be surprising, and it is well corroborated in the literature. However, it can be hard to know a priori, when one is confronted with a data set from an unknown system, what prediction horizon makes sense. A_τ offers a computationally efficient way to answer that question from not very much data.

Figure 5.11 shows a similar exploration, but considers the effects of the reconstruction dimension on A_τ as the forecast horizon increases, this time fixing τ = 1 for simplicity. These results indicate that the m = 2 state estimator contains more information about the future for short prediction horizons.

Figure 5.12: The effects of prediction horizon (h) on average A_τ over the 15 trials of col major for a fixed time delay (τ = 1) and two different reconstruction dimensions. The red line represents m = 2; the blue represents m_H = 12, the value suggested for this signal by the technique of false neighbors.

This ties back to a central theme of this thesis: low-dimensional reconstructions can work quite well. Unsurprisingly, that does not always hold for arbitrary prediction horizons; Figure 5.11 shows that the full reconstruction is better for longer horizons. This is to be expected, since a higher reconstruction dimension allows the state estimator to capture more information about the past. Finally, note that A_τ decreases monotonically with prediction horizon for both m = 2 and m_H. This, too, is unsurprising. Pesin's relation [91] says that the sum of the positive Lyapunov exponents is equal to the entropy rate, and if there is a non-zero entropy rate, then generically observations will become increasingly independent the further apart they are. This explanation also applies to Figure 5.10, of course, but it does not hold for signals that are wholly (or nearly) periodic.

Recall that the col major dynamics in Section 5.1.2.2 are chaotic, but with a dominant unstable periodic orbit—which has a variety of interesting effects on the results. Figure 5.12 explores the effects of prediction horizon on those results. Not surprisingly, there is some periodicity in the A_τ versus h relationships, but not for the same reasons that caused the stripes in Figure 5.6(b). Here, the peaks in A_τ do occur at multiples of the period. That is, the m = 2 state estimator can forecast with the most success when the value being predicted is in phase with the delay vector. Note that this effect is far stronger for m = 2 than for m_H, simply because of the instability of that periodic orbit; the visits made to it by the chaotic trajectory are more likely to be short than long. As expected, A_τ decays with prediction horizon—but only at first, after which it begins to rise again, peaking at h = 69 and h = 71. This may be due to a second, higher-order unstable periodic orbit in the col major dynamics.

In theory, one can derive rigorous bounds on the prediction horizon. The time at which S_j will no longer have any information about the future can be determined by considering

R(h) = I[S_j, X_{j+h}] / H[X_{j+h}]    (5.5)

i.e., the percentage of the uncertainty in X_{j+h} that can be reduced by the delay vector. Generically, this will limit to some small value equal to the amount of information that the delay vector contains about any arbitrary point on the attractor. (A crude sketch of this calculation follows.)
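Here is a crude, self-contained plug-in version of Equation (5.5): discretize the series into bins and estimate both I[S_j; X_{j+h}] and H[X_{j+h}] from the resulting counts. The estimators used elsewhere in this chapter are KSG-style; the binning here, and the bin count, are simplifications of mine.

    import numpy as np

    def r_of_h(x, m, tau, h, nbins=32):
        """R(h) = I[S_j; x_{j+h}] / H[x_{j+h}], from a global discretization of x."""
        edges = np.linspace(x.min(), x.max(), nbins + 1)
        s = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1)
        start = (m - 1) * tau
        n = len(x) - start - h
        cols = np.stack([s[start - k * tau: start - k * tau + n] for k in range(m)])
        past = np.ravel_multi_index(cols, (nbins,) * m)  # one symbol per delay vector
        fut = s[start + h: start + h + n]                # nbins**m must fit an int64,
                                                         # so keep m modest
        def H(labels):
            p = np.unique(labels, return_counts=True)[1] / len(labels)
            return -(p * np.log(p)).sum()

        mi = H(past) + H(fut) - H(past * nbins + fut)
        return mi / H(fut)

    # maximum practical horizon, given a "background" criterion (threshold is mine):
    # horizon = max(h for h in range(1, 200) if r_of_h(x, 2, 1, h) > 0.05)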
Given some criterion regarding how much information above the "background" is required of the state estimator, one could use an R(h)-versus-h curve to determine the maximum practical horizon.

In practice, one can select parameters for delay reconstruction-based forecasting by explicitly including the prediction horizon in the A_τ function: fixing the horizon h at the required value, performing the same search as in the earlier sections over a range of m and τ, and then choosing a point on (or near) the optimum of that A_τ surface. The computational and data requirements of this calculation, as shown in Section 5.1.3.1, are far superior to those of the standard heuristics used in delay reconstructions.

A_τ is a novel metric for quantifying how much information about the future is contained in a delay reconstruction. Using a number of different dynamical systems, I demonstrated a direct correspondence between the A_τ value for different delay reconstructions and the accuracy of forecasts made with Lorenz's method of analogues on those reconstructions. Since A_τ can be calculated quickly and reliably from a relatively small amount of data, without any knowledge about the governing equations or the state-space dynamics of the system, that correspondence is a major advantage, in that it allows one to choose parameter values for delay reconstruction-based forecast models without doing an exhaustive search on the parameter space. Significantly, A_τ-optimal reconstructions are better, for the purposes of forecasting, than reconstructions built using standard heuristics like mutual information and the method of false neighbors, which can require large amounts of data, significant computational effort, and expert human interpretation. Perhaps most importantly, A_τ allows one to answer other questions regarding forecasting with theoretically unsound models—e.g., why it is possible to obtain a better forecast using a low-dimensional reconstruction than with a true embedding. It also allows one to understand bounds on the prediction horizon without having to estimate Lyapunov spectra or Shannon entropy rates, both of which are difficult to obtain for arbitrary real-valued time series. That, in turn, allows one to tailor one's reconstruction parameters to the amount of available data and the desired prediction horizon—and to know if a given prediction task is just not possible.

The experiments reported in this section involved a simple near-neighbor forecast strategy and state estimators that are basic delay reconstructions of raw time-series data. The definition and calculation of A_τ do not involve any assumptions about the state estimator, though, so the results presented here should also hold for other state estimators. For example, it is common in forecasting applications to pre-process the time series: low-pass filtering, say, or interpolating to produce additional points. Calculating A_τ after performing such an operation will accurately reflect the amount of information in that new time series—indeed, it would reveal if that pre-processing step destroyed information, as the sketch below illustrates.
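The following sketch implements that pre-processing check: compare A_τ on the raw series and on a low-pass-filtered copy. The Butterworth filter and cutoff are arbitrary choices of mine, and active_info is the KSG sketch from earlier in this section.

    from scipy.signal import butter, filtfilt

    def filtering_cost(x, m, tau, cutoff=0.2):
        """A_tau lost by low-pass filtering; a positive gap suggests the filter
        destroyed predictive information."""
        b, a = butter(4, cutoff)        # 4th-order low-pass, normalized cutoff in (0, 1)
        x_filtered = filtfilt(b, a, x)  # zero-phase filtering
        return active_info(x, m, tau) - active_info(x_filtered, m, tau)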
I believe that the basic conclusions in this section extend to other state-space based forecast schemas besides LMA, such as those used in [23,98,107,112,119]—although A_τ may not accurately select optimal parameter values for strategies that involve post-processing the data (e.g., GHKSS [48]).

There are many other interesting potential ways to leverage A_τ in the practice of forecasting. If the A_τ-optimal τ is 1, that may be a signal that the time series is undersampling the dynamics and that one should increase the sample rate. One could use the more general form A_S at a finer grain to optimize τ individually for each dimension, as suggested in [85,90,105], where optimal values are selected based on criteria that are not directly related to prediction. To do this, one could define S_j = [X_j, X_{j−τ_1}, X_{j−τ_2}, …, X_{j−τ_{m−1}}] and then simply maximize A_S using that state estimator, constrained over {τ_i}, i = 1, …, m − 1.

Topology is of particular interest in forecasting dynamics, since many properties—the existence of periodic orbits, recurrence, entropy, etc.—depend only upon topology. However, computing topology from time series can be a real challenge—even with the aid of delay-coordinate reconstruction. As I have mentioned repeatedly throughout this thesis, the success of this reconstruction procedure depends heavily on the choice of the two free parameters, but the embedding theorems provide little guidance regarding how to choose good values for these parameters. The delay-coordinate reconstruction machinery (both theorems and heuristics) targets the computation of dynamical invariants like the correlation dimension and the Lyapunov exponent. However, if one just wants to extract the topological structure of an invariant set—as is the case with forecasting—a scaled-back version of that machinery may be sufficient. In the following discussion, I adopt the philosophy that one might only desire knowledge of the topology of the invariant set. In collaboration with J. D. Meiss, I conjecture that this might be possible with a lower reconstruction dimension than that needed to obtain a true embedding. That is, the reconstructed dynamics might be homeomorphic to the original dynamics at a lower dimension than that needed for a diffeomorphically correct embedding [39]. This is an alternative validation of the central premise of my thesis.

To compute topology from data, one can use a simplicial complex—e.g., the witness complex of [27]. To construct such a complex, one chooses a set of "landmarks," typically a subset of the data, that become the vertices of the complex. The connections between the landmarks are determined by their nearness to the rest of the data—the "witnesses." Two landmarks in the complex are joined by an edge, for instance, if they share at least one witness.

My initial work on this approach [39] suggests that the witness complex correctly resolves the homology of the underlying invariant set—viz., its Betti numbers—even if the reconstruction dimension is well below the thresholds for which the embedding theorems assure smooth conjugacy between the true and reconstructed dynamics.
This means that some features of the large-scale topology are present even if the reconstruction dimension does not satisfy the associated theorems. I conjecture that this structure affords ro-LMA the means necessary to generate accurate forecasts. The witness complex is covered in more depth in Section 5.2.1, which also describes the notion of persistence and demonstrates how that idea is used to choose scale parameters for a complex built from reconstructed time-series data. In Section 5.2.2, I explore how the homology of such a complex changes with reconstruction dimension.
To compute the topology of data that sample an invariant set of a dynamical system, one needs a complex that captures the shape of the data but is robust with respect to noise and other sampling issues. A witness complex is an ideal choice for these purposes. Such a complex is determined by the reconstructed time-series data W ⊂ R^m—the witnesses—and an associated set L ⊂ R^m, the landmarks, which can (but need not) be chosen from among the witnesses. The landmarks form the vertex set of the complex; the connections between them are dictated by the geometric relationships between W and L. In a general sense, a witness complex can be defined through a relation R(W, L) ⊂ W × L. As Dowker noted [29], any relation gives rise to a pair of simplicial complexes. In the one used here, a point w ∈ W is a witness to an abstract k-dimensional simplex σ = ⟨l_{i_1}, l_{i_2}, …, l_{i_{k+1}}⟩ ⊂ L whenever {w} × σ ⊂ R(W, L). The collection of simplices that have witnesses is a complex relative to the relation R. For example, two landmarks are connected if they have a common witness—this is a one-simplex. Similarly, if three landmarks have a common witness, they form a two-simplex, and so on.

Figure 5.13: Illustration of the fuzzy witness relation of Equation (5.6). The closest landmark to witness w_a is l_1, and since ‖w_a − l_2‖ < D_a + ε, the simplex ⟨l_1, l_2⟩ is in the complex. Similarly, w_b witnesses the edge ⟨l_2, l_3⟩.

There are many possible definitions for a witness relation R; see [39] for a discussion. A relation that is particularly useful for analyzing noisy real data [4,39] is the ε-weak witness [21], or what is called a "fuzzy" witness [4]: a point witnesses a simplex if all the landmarks in that simplex are within ε of the closest landmark to the witness.

Definition (Fuzzy Witness). The fuzzy witness set for a point l ∈ L is the set of witnesses

W_ε(l) = { w ∈ W : ‖w − l‖ ≤ min_{l′ ∈ L} ‖w − l′‖ + ε }    (5.6)

In this case, the relation consists of the collections R = ∪_{l ∈ L} (W_ε(l) × {l}), and a simplex σ is in the complex whenever ∩_{l ∈ σ} W_ε(l) ≠ ∅—that is, when all of its vertices share a witness. This relation is illustrated in Figure 5.13.

The fuzzy witness complex reduces to the "strong witness complex" of de Silva and Carlsson [27] when ε = 0. In such a complex, an edge exists between two landmarks iff there exists a witness that is exactly equidistant from those landmarks. This is not a practical notion of shared closeness for finite, noisy data sets. The ε in Equation (5.6) allows for some amount of immunity to finite-data and noise effects, but it must be chosen correctly, as I discuss below. A simpler implementation of the fuzzy witness complex consists of simplices whose pairs of vertices have a common witness; this implementation gives a "clique" or "flag" complex, analogous to the Rips complex [45]. This is called a "lazy" complex in [27] and is instantiated as the LazyWitnessStream class in the javaPlex [114] software. In the notation introduced above, the complex is

K_ε(W, L) = { σ ⊂ L : W_ε(l) ∩ W_ε(l′) ≠ ∅, ∀ l, l′ ∈ σ }    (5.7)

Following [4], I use this particular construction in the following discussion; a sketch of its one-skeleton appears below.
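The following sketch implements the fuzzy witness relation of Equation (5.6) and the shared-witness edge rule of Equation (5.7), down to the one-skeleton that the figures in this section display. It is not the javaPlex pipeline used for the actual computations; the Euclidean norm is assumed, and the names are mine.

    import numpy as np
    from itertools import combinations
    from scipy.spatial import cKDTree

    def witness_edges(W, L, eps):
        """One-skeleton of K_eps(W, L): landmark index pairs sharing a witness."""
        tree = cKDTree(L)
        d_near = tree.query(W)[0]   # distance from each witness to its nearest landmark
        edges = set()
        for w, d in zip(W, d_near):
            # landmarks l with ||w - l|| <= d + eps, i.e., w is in W_eps(l)  (Eq. 5.6)
            close = sorted(tree.query_ball_point(w, d + eps))
            edges.update(combinations(close, 2))  # all pairs witnessed by w  (Eq. 5.7)
        return edges

In the lazy/clique convention of Equation (5.7), the higher simplices are filled in from these pairwise relations, so the one-skeleton determines the whole complex.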
My goal is to study the topology of witness complexes of delay-coordinate reconstructions and to determine whether that topology is resolved correctly when the reconstruction dimension is low.

Figure 5.14 shows four witness complexes built from the 100,000-point trajectory of the Lorenz 63 system that is shown in Figure 3.2(a), for varying values of the fuzziness parameter ε. The landmarks (red dots) consist of ℓ = 201 points equally spaced along the trajectory, i.e., every ∆t = 500th point of the time series. There are many ways to choose landmarks; this particular strategy distributes them according to the invariant measure of the attractor. One could also choose landmarks randomly from the trajectory, or via the "max-min" selector of [27] (choose the first landmark at random and, given a set of landmarks, choose the next to be the data point farthest away from the current set); each of these gives results similar to those shown. For a deeper discussion of the number of landmarks to use and of landmark selection, see [39]. Both selectors are sketched below.

Figure 5.14: "One-skeletons" of witness complexes constructed from the trajectory of Figure 3.2(a) using the fuzzy witness relation depicted in Figure 5.13, for (a) ε = 0.001, (b) ε = 0.01, and (c) ε = 0.6. In each one-skeleton, the red dots are the ℓ = 201 equally spaced landmarks. A black edge between two landmarks l_i and l_j signifies the existence of a one-simplex ⟨l_i, l_j⟩ in the complex, i.e., that l_i and l_j shared at least one witness. As ε increases, more landmarks satisfy ‖w_a − l_k‖ < D_a + ε for each w_a, and the one-skeleton fills in.

When ε is small, very few witnesses fall in the thin regions required by Equation (5.6), so the resulting complex does not have many edges and is thus not a good representation of the shape of the data. As ε grows, more witnesses fall in the "shared" regions and the complex fills in, revealing the basic homology of the attractor of which the trajectory is a sample. There is an obvious limit to this, however: when ε is very large, even the largest holes in the complex are obscured.

In order to evaluate the topology of incomplete reconstructions, one needs to ensure the correctness of the topology. However, as Figure 5.14 illustrates, the complex from which one estimates the topology depends on the choice of ε, and choosing the right ε for that job is non-trivial. One can do so using the progression of images in Figure 5.14 and the notion of persistence: studying the change in homology under changing scale parameters is a well-established notion in computational topology.
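As a concrete illustration, the sketch below implements both landmark-selection strategies just described, plus a union-find count of connected components (β_0) of the one-skeleton over a range of ε—a bare-bones version of the β_0 persistence scan that the barcodes below formalize. It assumes witness_edges from the previous sketch; everything else is my naming.

    import numpy as np

    def equally_spaced_landmarks(W, n):
        """Every k-th point: landmarks distributed by the invariant measure."""
        return W[:: max(1, len(W) // n)][:n]

    def maxmin_landmarks(W, n, seed=0):
        """The "max-min" selector of [27]: greedily add the farthest data point."""
        rng = np.random.default_rng(seed)
        idx = [int(rng.integers(len(W)))]
        d = np.linalg.norm(W - W[idx[0]], axis=1)
        for _ in range(n - 1):
            idx.append(int(np.argmax(d)))   # farthest point from the current set
            d = np.minimum(d, np.linalg.norm(W - W[idx[-1]], axis=1))
        return W[idx]

    def beta0(n_landmarks, edges):
        """Number of connected components of the one-skeleton (union-find)."""
        parent = list(range(n_landmarks))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i
        for i, j in edges:
            parent[find(i)] = find(j)
        return len({find(i) for i in range(n_landmarks)})

    # beta_0 as a function of the scale parameter, as in the barcodes below:
    # L = equally_spaced_landmarks(W, 201)
    # curve = [beta0(len(L), witness_edges(W, L, e)) for e in np.linspace(0, 0.1, 100)]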
Figure 5.15: Persistence barcodes, computed using javaPlex, for an ℓ = 201 witness complex of the trajectory of Figure 3.2(a). Each plot tabulates the two lowest Betti numbers of the complex for 100 values of the scale parameter ε. The left panel shows the behavior for ε ≤ 0.1; the right panel zooms out to the range ε ≤ 0.7.

The standard tool for such a study is the barcode persistence diagram [45]. Figure 5.15 shows barcodes of the first two Betti numbers for the witness complexes of Figure 5.14. Each horizontal line in the barcode is the interval in ε for which there exists a particular non-bounding cycle; thus the number of such lines is the rank of the homology group—a Betti number. The values for β_0 and β_1 are computed using javaPlex [114] over the range 0 ≤ ε ≤ 0.7, using the ExplicitMetricSpace class to choose the equally spaced points and the LazyWitnessStream class to obtain a clique complex from the ℓ = 201 landmarks. There are no three-dimensional voids in the results, i.e., β_2 was always zero for this range of ε—a reasonable implication for this 2.06-dimensional attractor. When ε is very small, as in Figure 5.14(a), the witness complex has many components and the β_0 barcode shows a large number of entries. As ε grows, the spurious gaps between these components disappear, leaving a single component that persists; i.e., complexes with ε > 0.014 correctly capture the connectedness of the underlying attractor. The β_1 barcode plot shows a similar pattern: there are many holes for small ε that are successively filled in as that parameter grows, leaving the two main holes (i.e., β_1 = 2) for ε > 0.01. Above that range, even these holes are eventually obscured; for ε > 0.
05, the complex becomes topologically trivial. Above this value, the resulting complexes—recall Figure 5.14(d)—have no non-contractible loops and are homologous to a point (acyclic).

As alluded to above, this notion of persistence can be turned around and used to select good values for the parameters that play a role in topological data analysis—e.g., looking for the ε value at which the homology stabilizes, or selecting the number of landmarks that is necessary to construct a topologically faithful complex. However, definitions of what constitutes stabilization are subjective and can be problematic. Even so, persistence is a powerful technique, and I make use of it in a number of ways in the rest of this section.

The examples in Figures 5.14 and 5.15 involve a full trajectory from a dynamical system. This thesis focuses on reconstructions of scalar time-series data—structures whose topology is guaranteed to be identical to that of the underlying dynamics if the reconstruction process is carried out properly. But what if the dimension m does not satisfy the requirements of the theorems? Can one still obtain useful results about the topology of that underlying system, even if those dynamics are not properly unfolded in the sense of [89,99,113]? Throughout this thesis, I have argued that in the context of forecasting, the answer to that question is yes. In the next section, I take a step away from forecasting and examine whether the answer is also yes in the case of topology—specifically homology—and discuss the implications of that answer for the central theme of this thesis.

As discussed in Section 2.1.1, a scalar time series of a dynamical system is a projection of the d-dimensional dynamics onto R—an action that does not automatically preserve the topology of the object. Delay-coordinate embedding allows one to reconstruct the underlying dynamics, up to diffeomorphism, if the reconstruction dimension is large enough. The question at issue in this section is whether one can use the witness complex to obtain a useful, coarse-grained description of the topology—namely the homology, e.g., the basic connectivity of the invariant set, or the number of holes in it that are larger than a certain scale—from lower-dimensional reconstructions. The answer to this question can provide a deeper understanding of the mechanics of ro-LMA.

The short answer is yes. Figure 5.16 shows a side-by-side comparison of witness complexes and barcode diagrams for the Lorenz 63 trajectory of Figure 3.2(a) and a two-dimensional reconstruction (m = 2) using the x coordinate of that trajectory. The full 3D trajectory on the left and the 2D reconstruction on the right have the same homology. In other words, the correct large-scale homology is accessible from a witness complex of a 2D reconstruction, even though the reconstruction does not satisfy the conditions of the associated theorems.

And that leads to a fundamental question for this thesis: how does the homology of a delay-coordinate reconstruction change with the dimension m? The standard answer in the delay-coordinate embedding literature is that the topology should change at first, then stabilize when m becomes large enough to correctly unfold the topology of the underlying attractor. In practice, however, a too-large m will invoke the curse of dimensionality and destroy the fidelity of the reconstruction.
Moreover, increasing m exacerbates both noise effects and computational expense. (For example, recall the method of dynamical invariants.) For all of these reasons, it would be a major advantage if one could obtain useful information about the homology of the underlying attractor—even if not the full topology—from a low-dimensional delay-coordinate reconstruction.

Again, it appears that this is possible. Figure 5.17 shows witness complexes for m = 2 and m = 3 reconstructions of the Lorenz time series of Figure 3.2(b). The barcodes for the first two Betti numbers of these two complexes, as computed using javaPlex, have similar structure: the complexes become connected (β_0 = 1) at a small value of ε, and the dominant, persistent homology corresponds to the two primary holes (β_1 = 2) in the attractor. Note, by the way, that Figure 5.17(a) is not simply a 2D projection of Figure 5.17(b); the edges in each complex reflect the geometry of the witness relationships in different spaces, and so may differ. Higher-dimensional reconstructions—not easily displayed—have the same homology for suitable choices of ε, though for m > 5, it is necessary to increase the number of landmarks to obtain a persistent β_1 = 2.
That brings up an important point: if one wants to sensibly compare witness complexes constructed from different reconstructions of a single data set, one has to think carefully about the ℓ and ε parameters. Here, I use persistence to choose a good value of ℓ: I find that the results are robust with respect to changes in that value, across all reconstruction dimension values in this study, so I fix ℓ ≈ 200 for all the experiments reported in this section. (The precise value varies slightly because the length of a trajectory reconstructed from a fixed-length data set decreases with increasing m, since one needs a full span of m × τ data points to construct a point in the reconstruction space.)

Figure 5.16: One-skeletons of the witness complexes (top row) and barcode diagrams for β_1 (bottom row) of the Lorenz system. The plots in the left-hand column are computed from the three-dimensional (x, y, z) trajectory of Figure 3.2(a); those in the right-hand column are computed from a two-dimensional (m = 2) delay-coordinate reconstruction from the x coordinate of that trajectory with τ = 174. In both cases, ℓ = 201 equally spaced landmarks (red ×s) are used. Both complexes have two persistent non-bounding cycles (green and blue edges), but the 2D reconstruction requires only ε ≈ 0.2, while the 3D trajectory requires ε ≈ 1.2, to eliminate spurious loops.
In the experiments in the previous section, the scale parameter ε was given in absolute units. To generalize this approach across different examples and different reconstruction dimensions, it makes sense to compare reconstructions with ε chosen to be a fixed fraction ξ of the diameter diam(W) of the set W:

ε = ξ diam(W)    (5.8)

For example, for the full 3D attractor in Figure 3.2(a),

diam(W_xyz) = √[(x_max − x_min)² + (y_max − y_min)² + (z_max − z_min)²] = 75.–    (5.9)

Figure 5.17: The effects of reconstruction dimension: one-skeletons of witness complexes of different reconstructions of the scalar time series of Figure 3.2(b), for (a) m = 2 and (b) m = 3. Both reconstructions use τ = 174, the first minimum of the average time-delayed mutual information, ℓ = 198 equally spaced landmarks (red dots), and ξ = 0.–. (The ε values used in Figure 5.15, 0 ≤ ε ≤ 0.7, correspond to – × 10^– ≤ ξ ≤ 0.023 in this diameter-scaled measure.)

The diameter of the reconstruction varies in a natural way with the dimension m. Since delay-coordinate reconstruction of scalar data unfolds the full range of those data along every added dimension, the diameter of an m-dimensional reconstruction will be

diam(W_m) = √[m (x_max − x_min)²] = 37.– √m    (5.10)

for this dataset, where x represents the scalar time-series data. Since this unfolding will change the geometry of the reconstruction, I need to scale ε accordingly. The witness complexes in Figure 5.17 are constructed with a fixed value of ξ = 0.–; for the m = 3 complex, for instance, this works out to ε = 0.283 in absolute units, while for the scalar data of Figure 3.2(b), diam(W) = 37.– and ε = 0.–. This diameter-scaled ε—which is used throughout the rest of this section—should allow the witness complex to adapt appropriately to the effects of changing reconstruction dimension and finite data; a small helper expressing this scaling follows.
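This is a minimal sketch of Equations (5.8) and (5.10), under the stated assumption that each added delay dimension unfolds the full data range:

    import numpy as np

    def scaled_eps(x, m, xi):
        """epsilon = xi * diam(W_m) for an m-dimensional reconstruction of x."""
        diam = np.sqrt(m) * (x.max() - x.min())   # Equation (5.10)
        return xi * diam                          # Equation (5.8)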
To formalize the exploration of the reconstruction homology and to extend that study across multiple dimensions, one can use a variant of the classic barcode diagram that shows, for each simplex, the reconstruction dimension values at which it appears in and vanishes from the complex. Figure 5.18(a) shows such a plot for edges that involve the first landmark on the reconstructed trajectory. A number of interesting features are apparent in this image. Unsurprisingly, most of the one-simplices that exist in the m = 1 witness complex—many of which are likely due to the strong effects of the projection of the underlying R^d trajectory onto R—vanish when one moves to m = 2. There are other short-lived edges in the complex as well: e.g., an edge that is born at m = 2 and dies at m = 3. The sketch in Figure 5.18(b) demonstrates how edges can be born as the dimension increases: in the m = 2 reconstruction, two landmarks share a witness (the green square); when one moves to m = 3, spreading all of the points out along the added dimension, that witness is moved far from the first of those landmarks—and into the shared region between the second and a third. There are also long-lived edges in the complex of Figure 5.18(a). One that persists from m = 1 to m = 8 is particularly interesting: this pair of landmarks has shared witnesses in the scalar (m = 1) reconstruction and in all higher-dimensional reconstructions. Possible causes for this are explored in more depth below. All of these effects depend on ξ, of course; lowering ξ will decrease both the number and average length of the edge persistence bars.

Figure 5.18: (a) Dimension barcode for edges in the witness complex of the reconstructed scalar time series of Figure 3.2(b) that involve the first landmark, for reconstructions with m = 1, …, 8. A tick mark at an m − 1 → m transition on the horizontal axis indicates the transition at which an edge between that landmark and l_i is born; a square indicates the transition at which that edge vanishes from the complex. An arrow at the right-hand edge of the plot indicates an edge that was still stable when the algorithm completed. For all reconstructions, τ = 174, ℓ = 198, and ξ = 0.–. (b) A sketch of how an edge can be born in the m = 2 → 3 transition.

While this ∆m barcode image is interesting, the amount of detail that it contains makes it somewhat unwieldy. To study the m-persistence of all ℓ × ℓ edges in a witness complex, one would need to examine ℓ of these plots—or condense them into a single plot with ℓ² entries on the vertical axis. Instead, one can plot what I call an edge lifespan diagram: an ℓ × ℓ matrix whose (i, j)th pixel is colored according to the maximum range of m for which an edge exists in the complex between the ith and jth landmarks; see Figure 5.19. If the edge {l_i, l_j} existed in the complex for 2 ≤ m < 5, for instance, ∆m would be three and the (i, j)th pixel would be coded in cyan. Edges that do not exist for any dimension are coded white. (A sketch of this ∆m computation follows.)
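Here is a sketch of that ∆m computation, assuming delay_matrix, witness_edges, and scaled_eps from the earlier sketches. The landmark stride stands in for the equally spaced selection, and note a caveat from the footnote above: the reconstructed trajectory, and hence the landmark set, shrinks slightly as m grows.

    from collections import defaultdict

    def edge_lifespans(x, tau, xi, stride=500, m_max=8):
        """Map each landmark pair to its longest consecutive-m run in the complex."""
        present = defaultdict(list)
        for m in range(1, m_max + 1):
            W = delay_matrix(x, m, tau)
            L = W[::stride]
            for e in witness_edges(W, L, scaled_eps(x, m, xi)):
                present[e].append(m)           # dimensions in which edge e exists
        spans = {}
        for e, dims in present.items():
            best = run = 1
            for a, b in zip(dims, dims[1:]):
                run = run + 1 if b == a + 1 else 1
                best = max(best, run)
            spans[e] = best                    # the Delta-m of Figure 5.19
        return spans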
If the edge {ℓ_i, ℓ_j} existed in the complex for 2 ≤ m < 5, for instance, Δm would be three and the (i, j)th pixel would be coded in cyan. Edges that do not exist for any dimension are coded white.

A prominent feature of Figure 5.19 is a large number (683) of edges with a lifespan of 1 (blue). Of these edges, 463 exist for m = 1 but not for m = 2, and thus reflect the anomalous behavior of projecting a 2.06-dimensional object onto a line. This is also seen, as described above, in the barcode of Figure 5.18.
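A sketch of the bookkeeping behind such a diagram follows. It is illustrative only: the data structures are my assumptions, not the thesis's implementation, and it takes one set of witness-complex edges per reconstruction dimension.

import numpy as np

def edge_lifespan_matrix(edge_sets, n_landmarks):
    """Edge lifespan (Delta-m persistence) matrix.

    edge_sets: a list indexed by reconstruction dimension m = 1, ..., M,
    whose m-th entry is the set of landmark-index pairs that appear as
    edges in the witness complex built at that dimension. The (i, j)
    entry of the result is the longest run of consecutive dimensions
    for which the edge {l_i, l_j} exists.
    """
    lifespan = np.zeros((n_landmarks, n_landmarks), dtype=int)
    for i in range(n_landmarks):
        for j in range(i + 1, n_landmarks):
            run = best = 0
            for edges in edge_sets:
                run = run + 1 if ((i, j) in edges or (j, i) in edges) else 0
                best = max(best, run)
            lifespan[i, j] = lifespan[j, i] = best
    return lifespan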
Another interesting set of features in the lifespan diagram is the diagonal line segments. Note that the color of the pixels in these segments varies, though most of them correspond to edges with longer lifespans. These segments indicate the existence of Δm-persistent edges {ℓ_i, ℓ_j}, {ℓ_{i+1}, ℓ_{j+1}}, {ℓ_{i+2}, ℓ_{j+2}}, and so on. This is likely due to the continuity of the dynamics [5]. Recall that the landmarks are evenly spaced in time, so ℓ_{i+1} is the Δt-forward image of ℓ_i. Thus a diagonal segment may indicate that the Δt-forward image of (at least one) witness that is shared between ℓ_i and ℓ_j is shared between ℓ_{i+1} and ℓ_{j+1}, and so on. The lengths of the longer line segments suggest that this continuity fails after 5-10 Δt steps, probably because of the positive Lyapunov exponents on the attractor. As a simple check on this reasoning, one can compute an edge lifespan diagram for a dynamical system with a limit cycle. The structure of such a plot (not shown) is dominated by diagonal lines of high Δm-persistence, with a few other scattered one-persistent edges.
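Under the same assumed representation, those diagonal segments can be measured directly from the lifespan matrix. The sketch below (my own helper, operating on the output of edge_lifespan_matrix above) reports the run lengths whose distribution should, by the reasoning above, reflect how long continuity carries shared witnesses along the flow.

def diagonal_run_lengths(lifespan, min_lifespan=2):
    """Lengths of diagonal segments in an edge lifespan matrix.

    A chain of persistent edges {l_i, l_j}, {l_i+1, l_j+1}, ... appears
    as a diagonal segment; its length estimates how many Delta-t steps
    the continuity of the dynamics carries shared witnesses forward.
    Expects a square integer array such as edge_lifespan_matrix returns.
    """
    n = lifespan.shape[0]
    runs = []
    for offset in range(1, n):          # walk each diagonal j = i + offset
        run = 0
        for i in range(n - offset):
            if lifespan[i, i + offset] >= min_lifespan:
                run += 1
            else:
                if run:
                    runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return runs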
Figure 5.19: Edge lifespan diagram: the pixel at (i, j) on this image is color-coded according to the maximum range Δm of dimension for which an edge exists between landmarks ℓ_i and ℓ_j in the witness complex of the reconstructed scalar time series of Figure 3.2(b), for m = 1, ..., 8. For all reconstructions, τ = 174 and ℓ = 198 landmarks, with the fixed ξ used throughout this section.
The notion of maximal m-lifespan goes back to one of the basic premises of persistence: that features that persist for a wide range of parameter values are in some sense meaningful. To explore this, Figure 5.20 shows the witness complex of Figure 5.17(a), highlighting the long-lived edges (Δm ≥ 3): those that are born by m = 2 and persist at least to m = 4. There exists a fundamental core to the complex that persists as the dimension grows and thus is robust to geometric distortion, but there are also short-lived edges that fill in the complex in accord with the local geometric structure of the reconstruction. Indeed, when m = 2, the projection artificially compresses the attractor near the origin; small simplices fill in this region due to the landmark clustering there. However, in the transition to m = 3—viz., Figure 5.17(b)—this region stretches away from the origin, spreading the landmarks out. There is a similar cluster of "fragile" edges near the lower left corner of the complex.

Even though geometric evolution with increasing reconstruction dimension leads to the death of many local edges, the large-scale homology is correct in both complexes of Figure 5.17, although the fine-scale topology is resolved differently by the dimension-dependent geometry. So while the edges with longer lifespans are indeed more important to the core structure, the short-lived edges are also important because they allow the complex to adapt to the geometric evolution of the attractor and fill in the details of the skeleton that are necessary and meaningful in that dimension.

In the spirit of the false near-neighbor method [62], one might be tempted to take the short-lived edges as an indication that the reconstruction dimension is inadequate. However, one computes homology from the overall complex. As the example above shows, homology is relatively robust with respect to individual edges. The moral of this story is that the lifespan of an edge is not necessarily an obvious indication of its importance to the homology of the complex; Δm-persistence plays a different role here than the abscissa of traditional barcode persistence plots.

Figure 5.20: Witness complex of Figure 5.17(a), with the long-lived (Δm ≥ 3) edges highlighted and the short-lived m = 1 edges drawn as (red) dashed lines.

The results in this section show that it is possible to compute the homology of an invariant set of a dynamical system using a simplicial complex built from a low-dimensional reconstruction of a scalar time series. These results have a number of interesting implications. Among other things, they suggest that the traditional delay-coordinate reconstruction process may be excessive if one is only interested in large-scale topological structure such as the homology. This is directly apropos of the central claim of this thesis, as it explains why it is possible to construct accurate predictions of the future state of a high-dimensional dynamical system using a two-dimensional delay-coordinate reconstruction. The delay-coordinate machinery strives to obtain a diffeomorphism—not a homeomorphism—between the true and reconstructed attractors. However, many of the properties of attractors that are important for forecasting (continuity, recurrence, entropy, etc.) are topological, so requiring only a homeomorphism is not only natural, but also more efficient [80].

This section uses a single example—the Lorenz 63 system—but I believe that the approach will work on other dynamical systems, and I plan in the future to do a careful exploration of additional systems, both maps and flows.
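For readers who want to reproduce the homology step, the Betti numbers of a small complex can be computed from the ranks of its boundary matrices over GF(2). The following is a self-contained sketch, not the software used for the experiments above.

import numpy as np
from itertools import combinations

def betti_numbers(simplices, max_dim=2):
    """Betti numbers of a simplicial complex over GF(2).

    simplices: iterable of vertex-id tuples, closed under taking faces.
    Returns [beta_0, ..., beta_max_dim], using the standard identity
    beta_k = dim C_k - rank d_k - rank d_{k+1}.
    """
    by_dim = {k: sorted({tuple(sorted(s)) for s in simplices if len(s) == k + 1})
              for k in range(max_dim + 2)}
    index = {k: {s: i for i, s in enumerate(by_dim[k])} for k in by_dim}

    def boundary_rank(k):
        # Rank over GF(2) of the boundary map from k- to (k-1)-simplices.
        rows, cols = by_dim[k - 1], by_dim[k]
        if not rows or not cols:
            return 0
        D = np.zeros((len(rows), len(cols)), dtype=np.uint8)
        for j, s in enumerate(cols):
            for face in combinations(s, k):
                D[index[k - 1][face], j] = 1
        r = 0                                   # Gaussian elimination mod 2
        for c in range(D.shape[1]):
            pivot = np.nonzero(D[r:, c])[0]
            if pivot.size == 0:
                continue
            p = pivot[0] + r
            D[[r, p]] = D[[p, r]]
            mask = D[:, c].copy()
            mask[r] = 0
            D[mask.astype(bool)] ^= D[r]
            r += 1
            if r == D.shape[0]:
                break
        return r

    betti = []
    for k in range(max_dim + 1):
        rank_dk = boundary_rank(k) if k > 0 else 0
        betti.append(len(by_dim[k]) - rank_dk - boundary_rank(k + 1))
    return betti

For instance, the hollow triangle {(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)} yields [1, 1, 0]: one connected component and one loop, as expected for a circle.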
Chapter 4 offered experimental validation of my proposed paradigm shift in the practice of delay-coordinate reconstruction, but left many unanswered questions about the theoretical underpinnings (and implications) of this approach. This chapter answered those questions by drawing on the complementary theoretical frameworks of information theory and computational topology. The novel, computationally efficient metric (A_τ) explicitly leveraged information stored in a delay vector to select forecast-optimal parameters for delay-coordinate reconstruction. This metric allowed for tailoring of reconstruction parameters to available data length, the signal-to-noise ratio in the time series, and the desired prediction horizon. In addition, this information-theoretic approach gave me the language and methods to answer difficult questions like the ones posed on page 53. Perhaps most importantly, the results in Section 5.1 validated that the state estimator of ro-LMA often has more information about the near future than a traditional embedding—a fact that is completely counter to all the current theory.

Section 5.2 deviated significantly from the tone of the rest of this thesis. Instead of discussing delay-coordinate reconstruction purely from a forecasting perspective, it turned a critical eye toward the theoretical foundation and assumptions of this powerful framework, which is the basis for all of nonlinear time-series analysis. The consistent story throughout the previous chapters is that the theoretical requirements of delay-coordinate reconstruction are not necessary—and can indeed be overkill—when one wishes to use it for the purposes of short-term forecasting. But why is this the case? Is it simply because more information is present in lower-dimensional state estimators, as established in Section 5.1, or is there something deeper underpinning this method from a theoretical perspective? In Section 5.2, I used the canonical Lorenz 63 system to argue that large-scale homology is attainable at much lower dimensions than the theory might suggest. I believe this in turn suggests that a homeomorphism, in the form of the delay-coordinate map, can be achieved at lower dimensions than what is needed for a diffeomorphism. This insight suggested an alternative explanation as to why ro-LMA gets traction before it should: specifically, that a homeomorphic reconstruction of a dynamical system may be sufficient for short-term forecasting of dynamical systems. However, further work will be required to rigorously prove this broader claim.

Chapter 6

Model-Free Quantification of Time-Series Predictability

Time-series data can span a wide range of complexities, and no single forecast algorithm can be expected to handle this entire spectrum effectively. This poses an interesting problem when one is developing any new forecasting technology, such as the reduced-order framework outlined in this thesis. In particular, given an arbitrary time series—with an undefined level of complexity—can one expect that method to be effective? The first step to answering this question is to define a spectrum of predictive complexity [41].

On the low end of this spectrum are time series that exhibit perfect predictive structure, i.e., signals whose future values can be perfectly predicted from past values. Signals like this can be viewed as the product of an underlying process that generates information and/or transmits it from the past to the future in a perfectly predictable fashion. Constant or periodic signals, for example, fall in this class.
On the opposite end of this spectrum are signals that are—from a forecasting perspective—fully complex, where the underlying generating process transmits no information at all from the past to the future. White noise processes fall in this class. In fully complex signals, knowledge of the past gives no insight into the future, regardless of what model one chooses to use. Signals in the midrange of this spectrum, e.g., deterministic chaos, pose interesting challenges from a modeling perspective. In these signals, enough information is being transmitted from the past to the future that an ideal model—one that captures the generating process—can forecast the future behavior of the observed system with high accuracy.

This leads naturally to an important and challenging question, to which I alluded in the first paragraph of this chapter: given a noisy real-valued time series from an unknown system, does there exist any forecast model that can leverage the information (if any) that is being transmitted forward in time by the underlying generating process? A first step in answering this question is to reliably quantify where on the complexity spectrum a given time series falls; a second step is to determine how complexity and predictability are related in these kinds of data sets. With these answers in hand, one can develop a practical strategy for assessing the appropriateness of forecast methods for a given time series. If the forecast produced by ro-LMA is poor, for example, but the time series contains a significant amount of predictive structure, one can reasonably conclude that ro-LMA is inadequate to the task and that one should seek another method.

The goal of this chapter is to develop effective heuristics to put that strategy into practice. Recall that Chapter 4 of this thesis demonstrated that ro-LMA can be effective, and Chapter 5 provided reasons why and how this is the case. The heuristic proposed in this chapter goes one step further and addresses when ro-LMA—and indeed any forecast algorithm—can be expected to be effective.

The information in an observation can be partitioned into two pieces: redundancy and entropy generation [25]. The approach exploits this decomposition in order to assess how much predictive structure is present in a signal—i.e., where it falls on the complexity spectrum mentioned above. I define complexity as a particular approximation of the Kolmogorov-Sinai entropy [69]. That is, I view a random-walk time series (which exhibits high entropy) as purely complex, whereas a low-entropy periodic signal is on the low end of the complexity spectrum. This differs from the notion of complexity used by, e.g., [101], which would consider a time series without any statistical regularities to be non-complex. In collaboration with R. G. James, I argue that an extension of permutation entropy [9]—a method for approximating the entropy through ordinal analysis—is an effective way to assess the complexity of a given time series. Permutation entropy, introduced in Section 2.2.6, is ideal for this purpose because it works with real-valued data and is known to converge to the true entropy value. Other existing techniques either require specific knowledge of the generating process or produce biased values of the entropy [13].

I focus on real-valued, scalar, time-series data from physical experiments. I do not assume any knowledge of the generating process or its properties: whether it is linear, nonlinear, deterministic, stochastic, etc.
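To make the ordinal approach concrete, here is a minimal sketch of a weighted permutation entropy estimate in the spirit of Bandt and Pompe [9], with variance-based weights along the lines of [32]. The function name and default parameters are illustrative assumptions, not the thesis's implementation.

import numpy as np
from math import factorial, log

def weighted_permutation_entropy(x, word_length=4, delay=1):
    """Normalized weighted permutation entropy of a scalar series.

    Each window of `word_length` values (spaced `delay` apart) is mapped
    to the permutation that sorts it; patterns are tallied with weights
    equal to the window variance, so large-amplitude structure counts
    for more than noise-scale wiggles. Returns a value in [0, 1].
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (word_length - 1) * delay
    weights = {}
    for i in range(n):
        window = x[i : i + word_length * delay : delay]
        pattern = tuple(np.argsort(window))
        weights[pattern] = weights.get(pattern, 0.0) + np.var(window)
    total = sum(weights.values())
    if total == 0.0:           # constant signal: zero entropy production
        return 0.0
    probs = [w / total for w in weights.values() if w > 0]
    h = -sum(p * log(p, 2) for p in probs)
    return h / log(factorial(word_length), 2)  # normalize by log2(m!)

A constant signal returns 0 and white noise approaches 1, matching the two ends of the complexity spectrum described above.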
To explore the relationship between complexity, predictive structure, and actual predictability, I generate forecasts for several experimental computer performance time-series datasets using the five different prediction strategies discussed in this thesis, then compare the accuracy of those predictions to the permutation entropy of the associated signals. This results in two primary findings:

(1) The permutation entropy of a noisy real-valued time series from an unknown system is correlated with the accuracy of an appropriate predictor.

(2) The relationship between permutation entropy and prediction accuracy is a useful empirical heuristic for identifying mismatches between prediction models and time-series data.

There has, of course, been a great deal of good work on different ways to measure the complexity of data, and previous explorations have confirmed repeatedly that complexity is a challenge to prediction. It is well known that the way information is generated and processed internally by a system plays a critical role in the success of different forecasting methods—and in the choice of which method is appropriate for a given time series. This constellation of issues has not been properly explored, however, in the context of noisy, poorly sampled, real-world data from unknown systems. That exploration, and the development of strategies for putting its results into effective practice, is the primary contribution of this chapter. The empirical results in Section 6.2 not only elucidate the relationship between complexity and predictability, but also provide a practical strategy to aid practitioners in assessing the appropriateness of a prediction model for a given real-world noisy time series from an unknown system—a challenging task for which little guidance is currently available. In the context of this thesis, the value of this is that it provides a general framework for assessing whether or not ro-LMA is an appropriate choice for a given time series, or whether a more sophisticated—or even a simpler—strategy is required.

The rest of this chapter is organized as follows. Section 6.1 discusses previous results on generating partitions, local modeling, and error distribution analysis, and situates this work in that context. In Section 6.2, I estimate the complexity of a number of time-series traces and compare that complexity to the accuracy of various prediction models operating on those time series. In Section 6.3, I discuss these results and their implications, and consider future areas of research.
Hundreds, if not thousands, of strategies have been developed for a wide variety of prediction tasks. The purpose of this chapter is not to add a new weapon to this arsenal, nor to do any sort of theoretical assessment or comparison of existing methods. In the spirit of this thesis, the goals here are focused more on the practice of prediction: (i) to empirically quantify the predictive structure that is present in a real-valued scalar time series and (ii) to explore how the performance of prediction methods is related to that inherent complexity. It would, of course, be neither practical nor interesting to report results for every existing forecast strategy; instead, I use the same representative set of methods that appear throughout this thesis, as described in Section 2.3.

Quantifying predictability, which is sometimes called "predicting predictability," is not a new problem. Most of the corresponding solutions fall into two categories that I call model-based error analysis and model-free information analysis. The first class focuses on errors produced by a specific forecasting schema. This analysis can proceed locally or globally. The local version approximates error distributions for different regions of a time-series model using local ensemble in-sample forecasting. These distributions are then used as estimates of out-of-sample forecast errors in those regions. For example, Smith et al. make in-sample forecasts using ensembles around selected points in order to predict the local predictability of a time series [107]. This approach can be used to show that different portions of a time series exhibit varying levels of local predictive uncertainty.

Local model-based error analysis works quite well, but it only approximates the local predictive uncertainty in relation to a fixed model. It cannot quantify the inherent predictability of a time series and thus cannot be used to draw conclusions about predictive structure that other forecast methods may be able to leverage. Global model-based error analysis moves in this direction. It uses out-of-sample error distributions, computed post facto from a class of models, to determine which of those models was best. After building an autoregressive model, for example, it is common to calculate forecast errors and verify that they are normally distributed. If they are not, that suggests that there is structure in the time series that the model-building process was unable to recognize, capture, and exploit. The problem with this approach is lack of generality. Normally distributed errors indicate that a model has captured the structure in the data insofar as is possible, given the formulation of that particular model (viz., the best possible linear fit to a nonlinear dataset). This gives no indication as to whether another modeling strategy might do better.

A practice known as deterministic vs. stochastic modeling [22, 44] bridges the gap between local and global approaches to model-based error analysis. The basic idea is to construct a series of local linear fits, beginning with a few points and working up to a global linear fit that includes all known points, and then analyze how the average out-of-sample forecast error changes as a function of the number of points in the fit. The shape of such a "DVS" graph indicates the amounts of determinism and stochasticity present in a time series; a sketch of the construction follows.
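This is my own simplified variant of the DVS construction (one-step local linear forecasts with k nearest neighbors, swept from local to nearly global), not the original algorithm of [22].

import numpy as np

def delay_vectors(x, m, tau):
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[k * tau : k * tau + n] for k in range(m)])

def dvs_curve(x, m=2, tau=1, k_values=(4, 8, 16, 32, 64)):
    """Average one-step forecast error of local linear fits versus the
    neighborhood size k. Small k probes the deterministic end of the
    spectrum; k comparable to the data length approaches a single
    global linear (stochastic) model.
    """
    V = delay_vectors(np.asarray(x, dtype=float), m, tau)
    X, y = V[:-1], V[1:, -1]        # state -> next scalar observation
    errors = {}
    for k in k_values:
        errs = []
        for i in range(len(X) // 2, len(X)):   # forecast the back half
            d = np.linalg.norm(X[:i] - X[i], axis=1)
            nbrs = np.argsort(d)[:k]           # k nearest past states
            A = np.column_stack([X[nbrs], np.ones(len(nbrs))])
            coef, *_ = np.linalg.lstsq(A, y[nbrs], rcond=None)
            errs.append(abs(np.append(X[i], 1.0) @ coef - y[i]))
        errors[k] = float(np.mean(errs))
    return errors

Plotting errors against k gives the DVS graph: an error minimum at small k suggests low-dimensional determinism, while monotone improvement toward large k suggests a stochastic linear process.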
The model-based error analysis methods described in the previous three paragraphs are based on specific assumptions about the underlying generating process, and on knowledge about what will happen to the error if those assumptions hold or fail. Model-free information analysis moves away from those restrictions. My approach falls into this class: I wish to measure the inherent complexity of an arbitrary empirical time series, then study the correlation of that complexity with the predictive accuracy of forecasts made using a number of different methods. (A note on terminology: the terms "in sample" and "out of sample" are used in different ways in the forecasting community. Here, I distinguish those terms by the part of the time series that is the focus of the prediction: the observed data for the former and the unknown future for the latter. In-sample forecasts—comparisons of predictions generated from part of the observed time series—are useful for assessing model error and prediction horizons, among other things.)

I build on the notion of redundancy that was introduced on page 83, which formally quantifies how information propagates forward through a time series: i.e., the mutual information between the past n observations and the current one. The redundancy of i.i.d. random processes, for instance, is zero, since all observations in such a process are independent of one another. On the other hand, deterministic systems, including chaotic ones, have high redundancy—in fact, maximal redundancy in the infinite limit—and thus they can be perfectly predicted if observed for long enough [44].
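In standard information-theoretic notation (my transcription; the thesis's own symbols may differ), this redundancy is the mutual information between a block of past observations and the present one:

    R_n = I[X_{t−n}, ..., X_{t−1}; X_t] = H[X_t] − H[X_t | X_{t−n}, ..., X_{t−1}]

A value of zero recovers the i.i.d. case just described; a value approaching H[X_t] corresponds to the maximal-redundancy deterministic limit.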
In practice, it is quite difficult to estimate the redundancy of an arbitrary, real-valued time series. Doing so requires knowing either the Kolmogorov-Sinai entropy or the values of all positive Lyapunov exponents of the system. Both of these calculations are difficult, the latter particularly so if the data are very noisy or the generating system is stochastic.

Using entropy and redundancy to quantify the inherent predictability of a time series is not a new idea. Past methods for this, however (e.g., [74, 102]), have hinged on knowledge of the generating partition of the underlying process, which lets one transform real-valued observations into symbols in a way that preserves the underlying dynamics [69]. Using a partition that is not a generating partition—e.g., simply binning the data—can introduce spurious complexity into the resulting symbolic sequence and thus misrepresent the entropy of the underlying system [13]. Generating partitions are luxuries that are rarely, if ever, afforded to an analyst, since one needs to know the underlying dynamics in order to construct one. And even if the dynamics are known, these partitions are difficult to compute and often have fractal boundaries [31]. (See Section 2.2.5.1 for a review of these issues.)

In the development described in the following section, I sidestep these issues by using a variant of the permutation entropy of Bandt and Pompe [9] to estimate the value of the Kolmogorov-Sinai entropy of a real-valued time series—and thus the redundancy in those data, which my results confirm to be an effective proxy for predictability. This differs from existing approaches in a number of ways. It does not rely on generating partitions—and thus does not introduce bias into the results if one does not know the dynamics or cannot compute the partition. Permutation entropy makes no assumptions about, and requires no knowledge of, the underlying generating process: whether it is linear or nonlinear, what its Lyapunov spectrum is, etc. These features make my approach applicable to noisy real-valued time series from all classes of systems, deterministic and stochastic.

In this section, I offer an empirical validation of the two findings introduced on page 84, namely:

(1) The weighted permutation entropy (WPE) of a noisy real-valued time series from an unknown system is correlated with prediction accuracy—i.e., the predictable structure in an empirical time-series data set can be quantified by its WPE.

(2) The relationship between WPE and mean absolute scaled error (1-MASE) is a useful empirical heuristic for identifying mismatches between prediction models and time-series data—i.e., when there is structure in the data that the model is unable to exploit.

The experiments below involve four different prediction methods—fnn-LMA, naïve, ARIMA, and random walk—applied to time-series data from eight different experimental systems: col_major, 403.gcc, and six different segments of a computer performance experiment that I have not yet discussed in this thesis, called dgesdd: a Fortran program from the LAPACK linear algebra package [8] that calculates the singular value decomposition of a rectangular M by N matrix with real-valued entries. For my experiments, I choose M = 750 and N = 1000 and generate the matrix entries randomly.

Figure 6.1: A processor performance trace of instructions per cycle (IPC) during the execution of dgesdd. The colors (also separated by vertical dashed lines) identify the different segments of the signal that are discussed in the text.

The behavior of this program as it computes the singular values of this matrix is complex and interesting, as is clearly visible in Figure 6.1. As the code moves through its different phases—diagonalizing the matrix, computing its transpose, multiplying, etc.—the processor utilization patterns change quite radically. The col_major and 403.gcc traces in Figures 3.3 and 3.4, in contrast, appear to be far more consistent over time—probably the result of a single generating process with consistent complexity. dgesdd, on the other hand, has multiple regimes, each probably the result of a different generating process. To take advantage of this rich experimental data set, I split the signal into six different segments, thereby obtaining an array of examples for the analyses in the following sections. For notational convenience, I refer to these 90 time-series data sets (15 runs, each with six regimes) as dgesdd_i, with i ∈ {1, ..., 6}, where i corresponds to one of the six segments of the signal, ordered from left to right. These segments, which were determined visually, are shown in different colors in Figure 6.1. Visual decomposition is subjective, of course, particularly since the regimes exhibit some fractal structure. Thus, it may be the case that more than one generating process is at work in each of our segments. This is a factor in the discussion that follows.

The objective of these experiments is to explore how prediction accuracy is related to WPE. Working from the first 90% of each signal, I generate a prediction of the last 10% using all four
prediction methods, then calculate the 1-MASE values of those predictions. I also calculate the WPE of each time series, using a word length chosen via the procedure described at the end of Section 2.2.6. In order to assess the run-to-run variability of these results, I repeat all of these calculations on 15 separate trials, i.e., 15 different runs of each program.

Figure 6.2: Weighted permutation entropy versus 1-MASE of the best prediction of a number of different time series (col_major, 403.gcc, and dgesdd_1 through dgesdd_6). The solid curve is a least-squares log fit of these points. The dashed curves reflect the standard deviation of the model in its parameter space. Points that lie below and to the right of the shaded region indicate that the time series has more predictive structure than the forecast strategy is able to utilize.
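For reference, the error metric used here works as follows. This sketch assumes the standard definition of the mean absolute scaled error, with the in-sample one-step random walk as the scaling baseline; the function name is mine.

import numpy as np

def one_mase(y_train, y_test, y_pred):
    """Mean absolute scaled error for a block forecast of y_test.

    The out-of-sample mean absolute error is scaled by the in-sample
    mean absolute error of a one-step random-walk forecast, so a score
    of 1.0 means "no better than a random walk on the training data."
    """
    scale = np.mean(np.abs(np.diff(np.asarray(y_train, dtype=float))))
    err = np.mean(np.abs(np.asarray(y_test) - np.asarray(y_pred)))
    return err / scale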
Figure 6.2 plots the WPE values versus the corresponding 1-MASE values of the best prediction for each of the 120 time series in this study. The obvious upward trend is consistent with the notion that there is a pattern in the WPE-MASE relationship. However, a simple linear fit is a bad idea here: any signal with zero entropy should be perfectly predictable (i.e., 1-MASE ≈ 0), so the fitted curve should pass through the origin. I therefore fit a function of the form y = a log(bx + 1) to these points, with y = WPE and x = 1-MASE. (The specific values of the fitted coefficients are a = 7.… × 10^−… and b = 1.… × 10^….) The solid curve in the figure shows this fit; the dashed curves show the standard deviation of this model in its parameter space, i.e., y = a log(bx + 1) with ± one standard deviation on each of the two parameters. Points that fall within this deviation volume (light grey) correspond to predictions that are comparable to the best ones found in this study; points above that volume (dark grey) are better still. I choose to truncate the shaded region because of a subtle point regarding the 1-MASE of an ideal predictor, which should not be larger than 1 unless the training and test signals are different. This is discussed at more length below.

Table 6.1: Mean ± standard deviation, over the 15 trials, of the 1-MASE scores of the fnn-LMA, naïve, ARIMA, and random-walk predictions—together with the WPE values—for col_major, 403.gcc, and the six dgesdd segments.

The curves and regions in Figure 6.2 are a graphical representation of the first finding introduced on page 84. This representation is, I believe, a useful heuristic for determining whether a given prediction method is well matched to a particular time series. If your point is outside the grey region, then that particular model is not capturing all the available structure of the time series. This is not, of course, a formal result. The forecast methods and data sets used here were chosen to span the space of standard prediction strategies and the range of dynamical behaviors, but they do not cover those spaces exhaustively. My goal here is an empirical assessment of the relationship between predictability and complexity, not formal results about a "best" predictor for a given time series. There may be other methods that produce lower 1-MASE values than those in Figure 6.2, but the sparseness of the points above and below the one-σ region about the dashed curve in this plot strongly suggests a pattern of correlation between the underlying predictability of a time series and its WPE. The rest of this section describes these results and claims in more detail—including the measures taken to assure meaningful comparisons across methods, trials, and programs—elaborates on the meaning of the different curves and limits in the figure, and ties these results into the overall thesis goals.

Figure 6.3 shows WPE vs. 1-MASE plots for the full set of experiments; Table 6.1 contains all the associated numerical values. There are 15 points in each cluster, one for each trial. (The points in Figure 6.2 are the leftmost of the points for the corresponding trace in any of the four plots in Figure 6.3.) The WPE values do not vary very much across trials. For most traces, the variance in 1-MASE scores is low as well, resulting in small, tight clusters. In some cases—ARIMA predictions of col_major, for instance—the 1-MASE variance is larger, which spreads out the clusters horizontally. The mean 1-MASE scores of predictions generated with fnn-LMA are generally closer to the dashed curve; the ARIMA method clusters are more widely spread, the naïve clusters even more so. A few of the clusters have very high variance; these are discussed later in this section.

Figure 6.3: WPE vs. 1-MASE for all trials, methods, and systems—with the exception of three of the dgesdd segments, which are omitted from the top-right plot for scale reasons, as described in the text. Numerical values, including means and standard deviations of the errors, can be found in Table 6.1. The curves and shaded regions are the same as in the previous figure.

The main thing to note here, however, is not the details of the shapes of the clusters, but rather their positions in the four plots: specifically, the fact that many of them are to the right of and/or below the dashed curve that identifies the boundary of the shaded region. These predictions are not as good as my heuristic suggests they could be.
Focusing in on any single signal makes this clear: fnn-LMA works best for dgesdd, for instance, followed by the random-walk prediction method, then ARIMA and naïve. Again, this provides some practical leverage: if one calculates a WPE vs. 1-MASE value that is outside the shaded region, that suggests that the prediction method is not well matched to the task at hand—that is, the time series has more predictive structure than the method is able to use. The results of ARIMA on dgesdd, for instance, suggest that one should try a different method. The position of the fnn-LMA cluster for dgesdd, on the other hand, reflects the ability of that method to capture and exploit the structure that is present in this signal. WPE vs. 1-MASE values like this, which fall in the shaded region, indicate to the practitioner that the prediction method is well suited to the task. The following discussion uses a number of examples to lay out the details that underlie these claims.

Though col_major is a very simple program, its dynamics are actually quite complicated, as discussed in Section 3.2.1.2. Recall from Figure 4.2 and Table 4.2 that the naïve, ARIMA, and (especially) random-walk prediction methods do not perform very well on this signal; the 1-MASE scores of those predictions, given in Table 6.1, are all well above that of fnn-LMA. The corresponding col_major clusters are far to the right of and/or below the dashed curve. Again, this indicates that these methods are not leveraging the available information in the signal. The dynamics of col_major may be complicated, but they are not unstructured. This signal is nonlinear and deterministic [83], and if one uses a prediction technique that is based on a nonlinear model (fnn-LMA)—rather than a method that simply predicts the running mean (naïve) or the previous value (random walk), or one that uses a linear model (ARIMA)—the 1-MASE score is much improved. The 1-MASE scores of random-walk predictions of this signal are all ≈ 1.

The col_major example also brings out some of the shortcomings of automated model-building processes. Note that the + points are clustered very tightly in the lower left quadrant of the naïve, random-walk, and fnn-LMA plots in Figure 6.3, but spread out horizontally in the ARIMA plot. This is because of the way the auto.arima process—the fitting procedure that I use for ARIMA models—works [56]. If a KPSS test of the time series in question indicates that it is nonstationary, the ARIMA recipe adds an integration term to the model. This test gives mixed results in the case of the col_major process, flagging five of the 15 trials as stationary and ten as nonstationary. I conjectured that ARIMA models without an integration term perform more poorly on these five signals, which increases the error and thereby spreads out the points. I tested this hypothesis by forcing the inclusion of an integration term in the five cases where a KPSS test indicated that such a term was not needed. This action removes the spread, pushing all 15 of the col_major ARIMA points in Figure 6.3 into a tight cluster.

The discussion in the previous paragraph highlights the second finding of this chapter: the ability of the graphical heuristic of Figure 6.2 to flag inappropriate models. auto.arima is an automated, mechanical procedure for choosing modeling parameters for a given data set.
While the tests and criteria employed by this algorithm (Section 2.3.2) are sophisticated, the results can still be sub-optimal—if the initial space of models being searched is not broad enough, for instance, or if one of the preliminary tests gives an erroneous result. Moreover, auto.arima always returns a model, and it can be very hard to detect when that model is bad. The results discussed in this chapter suggest a way to do so: if the 1-MASE score of an auto-fitted model like an auto.arima result is out of line with the WPE value of the data, that can be an indication of inappropriateness. (A Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [65]—one of many tests performed by auto.arima to choose the ARIMA parameters—is used for testing whether an observed time series is stationary around a deterministic trend.)
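The integration-forcing experiment is easy to replicate in outline. The thesis uses R's auto.arima; the sketch below instead uses the Python pmdarima package, which mimics that interface, and a synthetic random walk standing in for the col_major traces.

import numpy as np
import pmdarima as pm  # Python analogue of R's forecast::auto.arima

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))   # synthetic nonstationary series

model_auto = pm.auto_arima(y)         # d chosen via a stationarity test
model_forced = pm.auto_arima(y, d=1)  # force one order of differencing
print(model_auto.order, model_forced.order)  # (p, d, q) of each fit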
The WPE of dgesdd (0.… ± 0.…) is higher than that of col_major. This indicates that the rate of forward information transfer of the underlying process is lower, but that observations from this system still contain a significant amount of structure that can, in theory, be used to predict the future course of the time series. The 1-MASE scores of the naïve and ARIMA predictions for this system are 20.… ± 0.192 and 2.… ± …, respectively—that is, many times worse than random-walk forecasts of the training-set portions of the same signals. As before, the positions of these points on a WPE vs. 1-MASE plot—significantly below and to the right of the shaded region—should suggest to a practitioner that a different method might do better.
Indeed, for dgesdd, fnn-LMA produces a 1-MASE score of 0.… ± 0.048 and a cluster of results that lies largely within the shaded region on the WPE-MASE plot. This is consistent with the second finding of this chapter: the fnn-LMA method can capture and reproduce the way in which the dgesdd system processes information, but the naïve and ARIMA prediction methods cannot.
The WPE of 403.gcc is higher still (0.… ± 0.…): on this signal, even the nonlinear method—fnn-LMA—cannot get much traction. Since fitting a hyperplane using least squares should filter out some of the noise in the signal, the fact that fnn-LMA outperforms ARIMA (1.… ± 0.021 vs. 1.… ± …; cf. [83]) suggests that the signal has some nonlinear structure, and fnn-LMA is designed to capture and exploit that kind of structure. Note that all four clusters in Figure 6.3 are outside the shaded region; in the case of the random-walk prediction, for instance, the 1-MASE value is 1.… ± …. Similar effects are visible in some of the dgesdd results, for the same reasons—and visibly so, judging by the red segment of Figure 6.1, where the period and amplitude of the oscillations are decreasing.

dgesdd—the dark blue (first) segment of Figure 6.1—behaves very differently than the other seven systems in this study. Though its weighted permutation entropy is very high (0.… ± 0.…), the 1-MASE scores are comparatively low: 0.… ± 0.075 (ARIMA), 0.… ± 0.076 (fnn-LMA), and 0.… ± 0.095 (random walk). This pushes the corresponding clusters of points in Figure 6.3 well above the trend followed by the other seven signals. The reasons for this are discussed below. The 1-MASE scores of the predictions that are produced by the naïve method for this system, however, are highly inconsistent. The majority of the blue diamond-shaped points on the top-right plot in Figure 6.3 are clustered near a 1-MASE score of 0.6, which is better than the other three methods. In five of the 15 dgesdd trials, however, there are step changes in the signal. This is a different nonstationarity than in the case of col_major—large jump discontinuities rather than small shifts in the baseline—and not one that I am able to handle by simply forcing the ARIMA model to include a particular term. The naïve method has a very difficult time with signals like this, particularly if there are multiple step changes. That raised the 1-MASE scores of these trials, pushing the corresponding points to the right, and in turn raising both the mean and variance of this set of trials.

(The naïve 1-MASE score is large because of the bimodal nature of the distribution of the values of the signal, which makes guessing the mean a particularly bad strategy; the same is true of the dgesdd signal. The inconsistent set includes the cluster of three points near 1-MASE ≈ ….5, as well as two points that are beyond the domain of the graph, at 1-MASE ≈ … and …, for the dgesdd time series.)
The effects described in the previous paragraph are also exacerbated by the way 1-MASE is calculated. Recall that 1-MASE scores are scaled relative to a random-walk forecast. This creates several issues. Since random-walk prediction works very badly on signals with frequent, large, rapid transitions (even simple periodic signals), this class of signals can exhibit low WPE and high 1-MASE. This is because the random-walk forecast of such a signal will be 180 degrees out of phase with the true continuation. This effect can shift points leftwards on a WPE vs. 1-MASE plot, and that is exactly why the dgesdd clusters in Figure 6.3 are above the dashed curve. This time series, part of which is shown in closeup in Figure 6.4, is not quite the worst-case signal for a random-walk prediction, but it still poses a serious challenge: it is dominated by a noisy regime. Moreover, since 1-MASE normalizes errors by those of a random-walk forecast on the training signal, differences between the test signal and the training signal can create issues. This is why the 1-MASE values in Table 6.1 are not identically one for every random-walk forecast of every time series: the last 10% of these signals is significantly different from the first 90%. The deviation from 1.00 will depend on the process that generated the data—whether it has multiple regimes, what those regimes look like, and how it switches between them—as well as the experimental setup (e.g., sensor precision and data length). For the processes studied here, these effects do not cause the 1-MASE values to exceed 1.15, but pathological situations (e.g., a huge switch in scale right at the training/test signal boundary, or a signal that simply grows exponentially) could produce higher values. This suggests another potentially useful heuristic: if the 1-MASE of a random-walk prediction of a time series is significantly different from 1, it could be an indication that the signal is nonstationary.

The curves in Figures 6.2 and 6.3 are determined from finite sets of methods and data. I put a lot of thought and effort into making these sets representative and comprehensive. The forecast methods involved ranged from the simple to the sophisticated; the time-series data analyzed in this section are sampled from systems whose behavior spans the dynamical behavior space. While I am cautiously optimistic about the generality of my conclusions, more exploration will be required before I can make definitive or general conclusions. However, my preliminary exploration shows that data from the Hénon map [55], the Lorenz 63 system [71], the SFI dataset A [119], and a random-walk process all fall within the one-σ region of the fit in Figures 6.2 and 6.3, as do various nonlinear transformations of three of the dgesdd segments, so I am optimistic.

In this chapter, I strictly used fnn-LMA as the nonlinear exemplar to explore how prediction accuracy is related to WPE. With this relationship established, applying this heuristic to assess the appropriateness of ro-LMA for a given signal is straightforward: one simply compares the accuracy of ro-LMA, tabulated in Tables 4.1 and 4.2, with the WPE of that signal, given in Table 6.1, using the graphical heuristic in Figure 6.2.

Note that there has been prior work under a very similar title to our paper on this topic [41], but there are only superficial similarities between the two research projects.
Haven et al. [52] utilize the relative entropy to quantify the difference in predictability between two distributions: one evolved from a small ensemble of past states using the known dynamical system, and the other the observed distribution. My work quantifies the predictability of a single observed time series using weighted permutation entropy and makes no assumptions about the generating process. More closely related is the work of Boffetta et al. [12], who investigated the scaling behavior of finite-size Lyapunov exponents (FSLE) and ε-entropy for a wide variety of deterministic systems with known dynamics and additive noise. While the scaling of these measures acts as a general proxy for predictability bounds, this approach differs from my work in a number of fundamental ways. First, [12] is a theoretical study that does not involve any actual predictions. I focus on real-world time-series data, where one does not necessarily have the ability to perturb or otherwise interact with the system of interest, nor can one obtain or manufacture the (possibly large) number of points that might be needed to estimate the ε-entropy for small ε. Second, I do not require a priori knowledge about the noise and its interaction with the system. Third, I tie information—in the form of the weighted permutation entropy—directly to prediction error via calculated values of a specific error metric. Though FSLE and ε-entropy allow for the comparison of predictability between systems, they do not directly provide an estimate of prediction error. Finally, my approach also holds for stochastic systems, where neither the FSLEs nor their relationship to predictability are well defined.

Forecast strategies that are designed to capture predictive structure are ineffective when signal complexity outweighs information redundancy. This poses a number of serious challenges in practice. Without knowing anything about the generating process, it is difficult to determine how much predictive structure is present in a noisy, real-world time series. And even if predictive structure exists, a given forecast method may not work, simply because it cannot exploit the structure that is present (e.g., a linear model of a nonlinear process). If a forecast model is not producing good results, a practitioner needs to know why: is the reason that the data contain no predictive structure—i.e., that no model will work—or is the model that s/he is using simply not good enough?

In this chapter, I have argued that redundancy is a useful proxy for the inherent predictability of an empirical time series. To operationalize that relationship, I used an approximation of the Kolmogorov-Sinai entropy, estimated using a weighted version of the permutation entropy of [9]. This WPE technique—an ordinal calculation of forward information transfer in a time series—is ideal for my purposes because it works with real-valued data and is known to converge to the true entropy value. Using a variety of forecast models and more than 150 time-series data sets from experiments and simulations, I have shown that prediction accuracy is indeed correlated with weighted permutation entropy: the higher the WPE, in general, the higher the prediction error. The relationship is roughly logarithmic, which makes theoretical sense, given the nature of WPE, predictability, and 1-MASE.
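A sketch of the fitting step follows, with made-up (1-MASE, WPE) pairs standing in for the measured values of Table 6.1.

import numpy as np
from scipy.optimize import curve_fit

def log_fit(x, a, b):
    # y = a*log(b*x + 1) passes through the origin, matching the
    # requirement that zero prediction error accompany zero entropy.
    return a * np.log(b * x + 1.0)

# Hypothetical per-trace values, for illustration only.
mase = np.array([0.05, 0.2, 0.6, 1.0, 2.0, 3.0])
wpe = np.array([0.10, 0.35, 0.60, 0.75, 0.90, 0.95])

(a, b), cov = curve_fit(log_fit, mase, wpe, p0=(1.0, 1.0))
a_sd, b_sd = np.sqrt(np.diag(cov))       # one-sigma parameter deviations
band_hi = log_fit(mase, a + a_sd, b + b_sd)   # dashed-curve analogues
band_lo = log_fit(mase, a - a_sd, b - b_sd)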
An important practical corollary to this empirical correlation of predictability and WPE is a practical strategy for assessing the appropriateness of forecast methods. If the forecast produced by a particular method is poor but the time series contains a significant amount of predictive structure, one can reasonably conclude that that method is inadequate to the task and that one should seek another method. fnn-LMA, for instance, performed better in most cases because it is more general. (This is particularly apparent in the col_major and dgesdd examples.) The naïve method, which simply predicts the mean, can work very well on noisy signals because it effects a filtering operation. The simple random-walk strategy outperforms fnn-LMA, ARIMA, and the naïve method on the 403.gcc signal, which is extremely complex—i.e., has extremely low redundancy.

The curves and shaded regions in Figures 6.2 and 6.3 operationalize the discussion in the previous paragraph. These geometric features are a preliminary, but potentially useful, heuristic for knowing when a model is not well matched to the task at hand: a point that is below and/or to the right of the shaded regions on a plot like Figure 6.3 indicates that the time series has more predictive structure than the forecast model can capture and exploit—and that one would be well advised to try another method. In the context of this thesis, this heuristic allows a practitioner to know if ro-LMA is capturing all the available predictive structure in a time series or if another method such as fnn-LMA should be used instead.

These curves were determined empirically using a specific error metric and a finite set of forecast methods and time-series traces. If one uses a different error metric, the geometry of the heuristic may be different—and may not even make sense, if one uses a metric that does not support comparison across different time series. And while the methods and traces used in this study were chosen to be representative of the practice, they are of course not completely comprehensive. It is certainly possible, for instance, that the nonlinear dynamics of computer performance is subtly different from the nonlinear dynamics of other systems. My preliminary results on other systems, not shown here (e.g., Hénon, Lorenz 63, a random-walk process, SFI dataset A, and more), lead me to believe that the results of this chapter will generalize beyond the examples presented.

Chapter 7

Conclusion and Future Directions
Delay-coordinate embedding, the bedrock of nonlinear time-series analysis, has been the foundation of forecasting techniques for nonlinear dynamical systems. A significant hurdle in the application of these techniques in a real-time fashion, or to nonstationary systems, is proper estimation of the embedding dimension. The source of this difficulty is rooted in trying to obtain a diffeomorphic reconstruction of the observed system, i.e., a topologically perfect representation. This thesis presented a paradigm shift away from that traditional approach, showing that for short-term delay-coordinate based forecasting of limited noisy data, perfection is not necessary, and can even be detrimental.

As a first step along this path, I introduced a novel forecasting schema, ro-LMA, that sidesteps the difficult parameter estimation step by simply fixing m = 2. For a range of low- and high-dimensional synthetic and experimental systems, I showed that ro-LMA produced short-term predictions on par with or exceeding the accuracy of traditional embeddings, even though ro-LMA employed incomplete reconstructions of the dynamics—i.e., models that are not necessarily true embeddings. This effected an experimental validation of the central premise of this thesis, viz., that the current paradigm in delay-coordinate embedding may be overly stringent for short-term forecasts of real-world data.

The utility of incomplete reconstructions is in stark contrast to traditional views of delay-coordinate embedding, so experimental validation of this bold claim is insufficient. In order to present a complete validation of this methodology, I provided two complementary theoretical frameworks, based in information theory and computational topology, to explain why and how this reduced-order modeling strategy works.

The information-theoretic analysis focused on understanding how information about the future is stored in delay-coordinate vectors. Leveraging this knowledge, in collaboration with R. G. James, I constructed a novel metric, time-delayed active information storage (A_τ), for selecting forecast-specific reconstruction parameters. This approach to parameter selection is drastically different than standard approaches. Instead of focusing on the calculation of dynamical invariants as the end goal, A_τ maximizes the information about the future stored in each delay vector—explicitly optimizing the reconstruction for the purposes of forecasting. A_τ allows one to select parameter values that are tailored specifically to the quantity of data available, the signal-to-noise ratio of the time series, and the required forecast horizon—and does so quickly, efficiently, and directly from the data. A_τ independently corroborated the central claims of this thesis, showing that for noisy, limited datasets, the state estimator used in ro-LMA often contained as much—or more—information about the near future as a full embedding. This result is counter-intuitive; one would think that, up to some limit, each new dimension would add information to the model. The fact that more information is not necessarily gained in each new dimension changes the way delay-coordinate based forecasting should be approached, and offers a fundamental explanation of why prediction in projection is effective in practice.
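The flavor of the A_τ computation can be conveyed with a coarse histogram estimator of the mutual information between a delay vector and a future observation. This is a sketch under my own simplifying assumptions (naive uniform binning, no bias correction), not the estimator used in the thesis.

import numpy as np

def time_delayed_ais(x, m, tau, horizon=1, bins=8):
    """Coarse histogram estimate (in bits) of A_tau: the mutual
    information between the delay vector [x_t, x_{t-tau}, ...,
    x_{t-(m-1)tau}] and the future observation x_{t+horizon}.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange((m - 1) * tau, len(x) - horizon)
    data = np.column_stack([x[t - k * tau] for k in range(m)]
                           + [x[t + horizon]])

    joint, _ = np.histogramdd(data, bins=bins)
    p = joint / joint.sum()
    p_state = p.sum(axis=-1)                 # delay-vector marginal
    p_future = p.sum(axis=tuple(range(m)))   # future marginal

    nz = p > 0
    denom = p_state[..., None] * p_future    # independence baseline
    return float(np.sum(p[nz] * np.log2(p[nz] / denom[nz])))

Sweeping m and tau and keeping the values that maximize this quantity is the forecast-oriented parameter selection described above, in contrast to invariant-oriented heuristics.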
The topological analysis in Section 5.2 questioned the fundamental need for a diffeomorphic reconstruction—especially when one is not interested in calculating dynamical invariants. In collaboration with J. D. Meiss, I conjectured that it may be possible to ascertain information about the large-scale topology of the invariant set—specifically, the homology—with a lower reconstruction dimension than that needed to obtain an embedding. Using a simple canonical example, I showed that the witness complex correctly resolved the homology of the underlying invariant set, viz., its Betti numbers, even when the reconstruction dimension was well below the thresholds for which the embedding theorems assure smooth conjugacy between the true and reconstructed dynamics. Since many properties that one cares about for forecasting—the existence of periodic orbits, recurrence, entropy, etc.—depend only upon topology, the stabilization of large-scale topology at low dimensions effects an alternative validation of the central premise of this thesis. I further conjectured—but did not prove—that this unexpected resolution of large-scale topology at low dimensions may be due to the existence of a homeomorphism between the original and reconstructed dynamics. Proving the broader claim—that the delay-coordinate map is a homeomorphism at lower embedding dimensions than required for a diffeomorphism, and that a homeomorphic reconstruction is sufficient for short-term forecasting—will be a key future direction of research.

While I illustrated that ro-LMA works for a broad spectrum of signals and provided sound theoretical evidence supporting why it works, it is important to remember that ro-LMA—or any forecast model, for that matter—will not be ideal for all tasks. Following this line of reasoning, it was vital to understand when ro-LMA would be effective. In this capacity, again in collaboration with R. G. James, I developed a model-free framework for quantifying when a time series exhibits predictable features, viz., bounded information production. This heuristic allows one to know a priori whether ro-LMA—or any forecast method—is appropriate for forecasting a given time series.

There are a number of important avenues for future work associated with each topic proposed in this thesis; those avenues were all discussed individually in their respective chapters. However, the next frontier of this work, which draws upon all aspects of the research described in this dissertation, is developing strategies for grappling with nonstationary time series—a serious challenge in any time-series modeling problem—in the context of delay-coordinate based forecasting. We live in a nonlinear and nonstationary world. Real-world systems change over time: bearings break down, computer systems get updated, transistors wear out, the climate moves between glacial and interglacial periods, etc. The nonlinear and nonstationary nature of systems like these highlights the need for adaptive models, built on the fly, that require little to no human interaction.

My first experience with nonstationarity involved a performance trace of row_major running on an Intel Core Duo that exhibited an interesting phenomenon termed "ghost triangles" (as can be seen in Figure 7.1) [5]. After a routine (automatic) operating system update, not only were the "ghost triangles" gone, but the entire triangular structure present in the two-dimensional reconstruction had been lost. I thought I had learned my lesson the hard way about unforeseen change, but the dynamics were even more sensitive than I originally thought. As a result of the auto-update fiasco, our group purchased a new computer and disabled its update scheduler. However, I soon learned this was not enough.
When I went back to repeat some experiments, the computer crashed with a standard kernel panic() halt. After this, the traces were never the same: the system halt caused a bifurcation in the performance dynamics. Figure 7.2 shows a before-and-after example involving another SPEC benchmark, 482.sphinx. According to computer design theory, this halt should not have changed anything. Nonetheless, it actually caused a fundamental shift in the performance and a change in the dynamical structure.

Figure 7.1: A two-dimensional reconstruction of L2 cache misses of row_major on an Intel Core Duo, with τ = 100,000 instructions.
These examples drive home the fact that real-world nonlinear systems routinely undergo bifurcations in their dynamics. This means that a forecast model should not be trained once and then used for all time. Instead, it should be constantly adapted to the current dynamics. For the fnn-LMA method, this is next to impossible due to the human-intensive parameter selection process. The agility of ro-LMA (no need to estimate m) should make it possible to tackle delay-coordinate based forecasting of nonstationary time series. Detecting regime shifts—and adapting the time delay accordingly—is the first step in this important area of future work.

To accomplish this, I plan to study whether the methods described in this thesis can signal regime shifts in a time series. Fuzzy witness complexes may be a useful strategy in this vein of research, by detecting and characterizing bifurcations [11]. Suppose, for example, that the first part of the data set corresponds trivially to an equilibrium—that is, to a set with β_k = 0 for all k > 0. The appearance of a nonzero β_1 then signals a regime change.

Not all regime shifts are the result of a bifurcation, however. In such cases, a change in information mechanics—e.g., information storage (A_τ) or information production (WPE)—could signal a regime shift, even if the topology of the new regime was too similar (or identical) to the old regime. My WPE vs. 1-MASE results could be particularly powerful in this scenario, as their values could not only help with regime-shift detection, but also suggest what kind of model might work well in each new regime. Similarly, changes in information storage could also be quite useful. A change in A_τ suggests a regime shift has occurred and simultaneously suggests a forecast-optimal parameter set for the new regime. Even more importantly, it indicates whether new parameter values are even necessary in the new regime!

Of particular interest in this new frontier of research will be the class of so-called hybrid systems [47], which exhibit discrete transitions between different continuous regimes—e.g., a lathe that has an intermittent instability, or traffic at an internet router, whose characteristic normal traffic patterns shift radically during an attack. Traded financial markets, too, are highly susceptible to jump processes. Effective modeling and prediction of these kinds of systems is quite difficult; doing so adaptively and automatically is an important and interesting challenge.

Figure 7.2: Processor load (IPC) traces [Top] and 3D projections of the respective reconstructed dynamics [Bottom] before [(a),(c)] and after [(b),(d)] a kernel panic() halt.

The elements of this thesis could be combined to form a complete nonstationary forecasting framework that could address that challenge. The method would work as follows: use A_τ to select forecast-optimal parameters and start forecasting using ro-LMA. While forecasting on a small buffer of new data, monitor information production (WPE), information propagation and storage (A_τ), and the homology (witness complex). If any of these drastically change, a regime shift has occurred and the model should be rebuilt. Nicely, since A_τ and WPE are already being monitored on the new data buffer, the new regime is already modeled and forecasting can begin immediately.
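As a schematic of that workflow, one monitoring step might look like the following. It reuses the hypothetical helpers sketched in earlier examples (weighted_permutation_entropy, time_delayed_ais), and the tolerances are illustrative placeholders, not values from the thesis.

def monitor_buffer(buffer, baseline_wpe, baseline_ais, m, tau,
                   wpe_tol=0.1, ais_tol=0.1):
    """One step of the regime-shift watchdog sketched above."""
    wpe = weighted_permutation_entropy(buffer)
    ais = time_delayed_ais(buffer, m=m, tau=tau)
    shifted = (abs(wpe - baseline_wpe) > wpe_tol or
               abs(ais - baseline_ais) > ais_tol)
    # If shifted, rebuild the ro-LMA model; wpe and ais already
    # characterize the new regime, so forecasting can resume at once.
    return shifted, wpe, ais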
This thesis bridged the gap between rigorous nonlinear mathematical models—which are ineffective in real time—and naïve methods that are agile enough for adaptive modeling of nonstationary processes. This in and of itself has real practical utility for a wide spectrum of forecasting tasks, as a simple, agile, noise-resilient forecasting strategy for nonlinear systems, but this thesis went far beyond practical optimizations made at the sacrifice of theoretical rigor. Specifically, the theoretical analysis outlined here offered a deeper understanding of delay-coordinate embedding—an understanding that suggests (and justifies) the need for a new paradigm in delay-reconstruction theory, which was the overall goal of this research project.

Bibliography

[1] S. A. Abdallah and M. D. Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Physics Letters A, 376(4):275–281, 2012.
[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[3] A. Alameldeen and D. Wood. IPC considered harmful for multiprocessor workloads. IEEE Micro, 26(4):8–17, 2006.
[4] Z. Alexander, E. Bradley, J. D. Meiss, and N. F. Sanderson. Simplicial multivalued maps and the witness complex for dynamical analysis of time series. SIAM Journal on Applied Dynamical Systems, 14(3):1278–1307, 2015.
[5] Z. Alexander, J. D. Meiss, E. Bradley, and J. Garland. Iterated function system models in data analysis: Detection and separation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 22(2):023103, 2012.
[6] Z. Alexander, T. Mytkowicz, A. Diwan, and E. Bradley. Measurement and dynamical analysis of computer performance data. In Advances in Intelligent Data Analysis IX, volume 6065. Springer Lecture Notes in Computer Science, 2010.
[7] J. Amigó. Permutation complexity in dynamical systems: Ordinal patterns, permutation entropy and all that. Springer, 2012.
[8] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
[9] C. Bandt and B. Pompe. Permutation entropy: A natural complexity measure for time series. Physical Review Letters, 88(17):174102, 2002.
[10] A. J. Bell. The co-information lattice. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation, volume 2003, pages 921–926, 2003.
[11] J. Berwald, M. Gidea, and M. Vejdemo-Johansson. Automatic recognition and tagging of topologically different regimes in dynamical systems. Discontinuity, Nonlinearity, and Complexity, 3(4):413–426, 2015.
[12] G. Boffetta, M. Cencini, M. Falcioni, and A. Vulpiani. Predictability: A way to characterize complexity. Physics Reports, 356(6):367–474, 2002.
[13] E. Bollt, T. Stanford, Y. C. Lai, and K. Życzkowski. What symbolic dynamics do we get with a misplaced partition?: On the validity of threshold crossings analysis of chaotic time-series. Physica D: Nonlinear Phenomena, 154(3):259–286, 2001.
[14] E. Bradley and H. Kantz. Nonlinear time-series analysis revisited. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(9):097610, 2015.
[15] P. Brockwell and R. Davis. Introduction to Time Series and Forecasting. Springer-Verlag, New York, 2002.
[16] S. Browne, C. Deane, G. Ho, and P. Mucci. PAPI: A portable interface to hardware performance counters. In Proceedings of the Department of Defense HPCMP Users Group Conference, 1999.
[17] Th. Buzug and G. Pfister. Comparison of algorithms calculating optimal embedding parameters for delay time coordinates. Physica D: Nonlinear Phenomena, 58(1-4):127–137, 1992.
[18] Th. Buzug and G. Pfister. Optimal delay time and embedding dimension for delay-time coordinates by analysis of the global static and local dynamical behavior of strange attractors. Physical Review A, 45(10):7073–7084, 1992.
[19] F. Canova and B. Hansen. Are seasonal patterns constant over time? A test for seasonal stability. Journal of Business & Economic Statistics, 13(3):237–252, 1995.
[20] L. Cao. Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1-2):43–50, 1997.
[21] G. Carlsson. Topological pattern recognition for point cloud data. Acta Numerica, 23:289–368, 2014.
[22] M. Casdagli. Chaos and deterministic versus stochastic non-linear modelling. Journal of the Royal Statistical Society, Series B, 54:303–328, 1992.
[23] M. Casdagli and S. Eubank. Nonlinear Modeling and Forecasting. Addison Wesley, 1992.
[24] M. Casdagli, S. Eubank, J. D. Farmer, and J. F. Gibson. State space reconstruction in the presence of noise. Physica D: Nonlinear Phenomena, 51(1-3):52–98, 1991.
[25] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science, 13(1):25–54, 2003.
[26] R. L. Davidchack, Y. C. Lai, E. M. Bollt, and M. Dhamala. Estimating generating partitions of chaotic systems by unstable periodic orbits. Physical Review E, 61(2):1353–1356, 2000.
[27] V. de Silva and G. Carlsson. Topological estimation using witness complexes. In Eurographics Symposium on Point-Based Graphics (2004), pages 157–166. The Eurographics Association, Zurich, 2004.
[28] S. DeDeo. Information theory for intelligent people. Available at http://tuvalu.santafe.edu/~simon/it.pdf, 2015.
[29] C. H. Dowker. Homology groups of relations. Annals of Mathematics, 56(1):84–95, 1952.
[30] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. In IEEE Symposium on Foundations of Computer Science, pages 454–463, 2000.
[31] M. Eisele. Comparison of several generating partitions of the Hénon map. Journal of Physics A, 32(9):1533–1545, 1999.
[32] B. Fadlallah, B. Chen, A. Keil, and J. Príncipe. Weighted-permutation entropy: A complexity measure for time series incorporating amplitude information. Physical Review E, 87(2):022911, 2013.
[33] A. Fraser and H. Swinney. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134–1140, 1986.
[34] S. Frenzel and B. Pompe. Partial mutual information for coupling analysis of multivariate time series. Physical Review Letters, 99(20):204101, 2007.
[35] J. Garland. Prediction in projection: Computer performance forecasting, a dynamical systems approach. M.S. thesis, Department of Applied Mathematics, University of Colorado at Boulder, 2011.
[36] J. Garland and E. Bradley. Predicting computer performance dynamics. In Advances in Intelligent Data Analysis X, volume 7014. Springer Lecture Notes in Computer Science, 2011.
[37] J. Garland and E. Bradley. On the importance of nonlinear modeling in computer performance prediction. In Advances in Intelligent Data Analysis XII, volume 8207, pages 210–222. Springer Lecture Notes in Computer Science, 2013.
[38] J. Garland and E. Bradley. Prediction in projection. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25:123108, 2015.
[39] J. Garland, E. Bradley, and J. D. Meiss. Exploring the topology of dynamical reconstructions. Physica D: Nonlinear Phenomena, 334:49–59, 2016.
[40] J. Garland, R. G. James, and E. Bradley. Determinism, complexity, and predictability of computer performance. arXiv:1305.5408, 2013.
[41] J. Garland, R. G. James, and E. Bradley. Model-free quantification of time-series predictability. Physical Review E, 90(5):052910, 2014.
[42] J. Garland, R. G. James, and E. Bradley. Leveraging information storage to select forecast-optimal parameters for delay-coordinate reconstructions. Physical Review E, 93(2):022221, 2016.
[43] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 57–76, 2007.
[44] N. Gershenfeld and A. Weigend. The future of time series. In Time Series Prediction: Forecasting the Future and Understanding the Past. Santa Fe Institute Studies in the Sciences of Complexity, Santa Fe, NM, 1993.
[45] R. Ghrist. Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.
[46] J. Gibson, J. Farmer, M. Casdagli, and S. Eubank. An analytic approach to practical state space reconstruction. Physica D: Nonlinear Phenomena, 57(1-2):1–30, 1992.
[47] R. Goebel, R. G. Sanfelice, and A. Teel. Hybrid dynamical systems. IEEE Control Systems Magazine, 29(2):28–93, 2009.
[48] P. Grassberger, R. Hegger, H. Kantz, C. Schaffrath, and T. Schreiber. On noise reduction methods for chaotic data. Chaos: An Interdisciplinary Journal of Nonlinear Science, 3(2):127–141, 1993.
[49] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(1-2):189–208, 1983.
[50] T. S. Han. Nonnegative entropy measures of multivariate symmetric correlations. Information and Control, 36(2):133–156, 1978.
[51] C. Hasson, R. Van Emmerik, G. Caldwell, J. Haddad, J. Gagnon, and J. Hamill. Influence of embedding parameters and noise in center of pressure recurrence quantification analysis. Gait & Posture, 27(3):416–422, 2008.
[52] K. Haven, A. Majda, and R. Abramov. Quantifying predictability through information theory: Small sample estimation in a non-Gaussian framework. Journal of Computational Physics, 206(1):334–362, 2005.
[53] R. Hegger, H. Kantz, and T. Schreiber. Practical implementation of nonlinear time series methods: The TISEAN package. Chaos: An Interdisciplinary Journal of Nonlinear Science, 9(2):413–435, 1999.
[54] J. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Computer Architecture News, 34(4):1–17, 2006.
[55] M. Hénon. A two-dimensional mapping with a strange attractor. Communications in Mathematical Physics, 50(1):69–77, 1976.
[56] R. Hyndman and Y. Khandakar. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3):1–22, 2008.
[57] R. Hyndman and A. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006.
[58] R. G. James, J. R. Mahoney, C. J. Ellison, and J. P. Crutchfield. Many roads to synchrony: Natural time scales and their algorithms. Physical Review E, 89(4):042135, 2014.
[59] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge University Press, Cambridge, 1997.
[60] J. L. Kaplan and J. A. Yorke. Chaotic behavior of multidimensional difference equations. In Functional Differential Equations and Approximation of Fixed Points, volume 730 of Lecture Notes in Mathematics, pages 204–227. Springer Berlin Heidelberg, 1979.
[61] A. Karimi and M. Paul. Extensive chaos in the Lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4):043105, 2010.
[62] M. Kennel, R. Brown, and H. Abarbanel. Determining minimum embedding dimension using a geometrical construction. Physical Review A, 45(6):3403–3411, 1992.
[63] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
[64] D. Kugiumtzis. State space reconstruction parameters in the analysis of chaotic time series—the role of the time window length. Physica D: Nonlinear Phenomena, 95(1):13–28, 1996.
[65] D. Kwiatkowski, P. Phillips, P. Schmidt, and Y. Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1-3):159–178, 1992.
[66] A. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenburg. A large, fast instruction window for tolerating cache misses. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 59–70, 2002.
[67] W. Liebert, K. Pawelzik, and H. Schuster. Optimal embeddings of chaotic attractors from topological considerations. Europhysics Letters, 14(6):521–526, 1991.
[68] W. Liebert and H. Schuster. Proper choice of the time delay for the analysis of chaotic time series. Physics Letters A, 142(2-3):107–111, 1989.
[69] D. Lind and B. Marcus. An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, 1995.
[70] J. T. Lizier. JIDT: An information-theoretic toolkit for studying the dynamics of complex systems. Frontiers in Robotics and Artificial Intelligence, 1(11):1–20, 2014.
[71] E. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
[72] E. Lorenz. Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences, 26(4):636–646, 1969.
[73] E. Lorenz. Predictability: A problem partly solved. In Predictability of Weather and Climate, pages 40–58. Cambridge University Press, 2006.
[74] R. Mantegna, S. Buldyrev, A. Goldberger, S. Havlin, C. Peng, M. Simons, and H. Stanley. Linguistic features of noncoding DNA sequences. Physical Review Letters, 73(23):3169–3172, 1994.
[75] J. Martinerie, A. Albano, A. Mees, and P. Rapp. Mutual information, strange attractors, and the optimal estimation of dimension. Physical Review A, 45(10):7058–7064, 1992.
[76] W. J. McGill. Multivariate information transmission. Psychometrika, 19(2):97–116, 1954.
[77] J. McNames. A nearest trajectory strategy for time series prediction. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 112–128, 1998.
[78] R. Meese and K. Rogoff. Empirical exchange rate models of the seventies: Do they fit out of sample? Journal of International Economics, 14(1):3–24, 1983.
[79] J. D. Meiss. Differential Dynamical Systems. Monographs on Mathematical Modeling and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2007.
[80] K. Mischaikow, M. Mrozek, J. Reiss, and A. Szymczak. Construction of symbolic dynamics from experimental time series. Physical Review Letters, 82(6):1144–1147, 1999.
[81] T. Moseley, J. Kihm, D. Connors, and D. Grunwald. Methods for modeling resource contention on simultaneous multithreading processors. In Proceedings of the International Conference on Computer Design, 2005.
[82] J. Mućk and P. Skrzypczyński. Can we beat the random walk in forecasting CEE exchange rates? National Bank of Poland Working Papers 127, National Bank of Poland, Economic Institute, 2012.
[83] T. Mytkowicz, A. Diwan, and E. Bradley. Computers are dynamical systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 19(3):033124, 2009.
[84] T. Mytkowicz. Supporting Experiments in Computer Systems Research. PhD thesis, University of Colorado, November 2010.
[85] C. Nichkawde. Optimal state-space reconstruction using derivatives on projected manifold. Physical Review E, 87(2):022905, 2013.
[86] S. Nussbaum and J. Smith. Modeling superscalar processors via statistical simulation. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 15–24, 2001.
[87] E. Olbrich, N. Bertschinger, N. Ay, and J. Jost. How should complexity scale with system size? The European Physical Journal B, 63(3):407–415, 2008.
[88] E. Olbrich and H. Kantz. Inferring chaotic dynamics from time-series: On which length scale determinism becomes visible. Physics Letters A, 232(1-2):63–69, 1997.
[89] N. Packard, J. P. Crutchfield, J. Farmer, and R. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712–716, 1980.
[90] L. Pecora, L. Moniz, J. Nichols, and T. Carroll. A unified approach to attractor reconstruction. Chaos: An Interdisciplinary Journal of Nonlinear Science, 17(1):013110, 2007.
[91] Y. B. Pesin. Characteristic Lyapunov exponents and smooth ergodic theory. Russian Mathematical Surveys, 32(4):55–114, 1977.
[92] K. Petersen. Ergodic Theory. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1989.
[93] A. Pikovsky. Noise filtering in the discrete time dynamical systems. Soviet Journal of Communications, Technology and Electronics, 31(5):911–914, 1986.
[94] V. Robins. Computational topology for point data: Betti numbers of α-shapes.