A Statistical Investigation of Long Memory in Language and Music
Alexander Greaves-Tunnell and Zaid Harchaoui
University of Washington
{alecgt, zaid}@uw.edu
June 10, 2019

Abstract
Representation and learning of long-range dependencies is a central challenge confronted in modern applications of machine learning to sequence data. Yet despite the prominence of this issue, the basic problem of measuring long-range dependence, either in a given data source or as represented in a trained deep model, remains largely limited to heuristic tools. We contribute a statistical framework for investigating long-range dependence in current applications of deep sequence modeling, drawing on the well-developed theory of long memory stochastic processes. This framework yields testable implications concerning the relationship between long memory in real-world data and its learned representation in a deep learning architecture, which are explored through a semiparametric framework adapted to the high-dimensional setting.
Introduction

Advances in the design and optimization of deep recurrent neural networks (RNNs) have led to significant breakthroughs in the modeling of complex sequence data, including natural language and music. An omnipresent challenge in these sequence modeling tasks is to capture long-range dependencies between observations, and a great variety of model architectures have been developed with this objective explicitly in mind. However, it can be difficult to assess whether and to what extent a given RNN has learned to represent such dependencies, that is, whether it has long memory.

Currently, if a model's capacity to represent long-range dependence is measured at all, it is typically evaluated heuristically against some task or tasks in which success is taken as an indicator of "memory" in a colloquial sense. Though undoubtedly helpful, such heuristics are rarely defined with respect to an underlying mathematical or statistical property of interest, nor do they necessarily have any correspondence to the data on which the models are subsequently trained. In this paper, we pursue a complementary approach in which long-range dependence is assessed as a quantitative and statistically accessible feature of a given data source. Consequently, the problem of evaluating long memory in RNNs can be re-framed as a comparison between a learned representation and an estimated property of the data.

The main contribution is the development and illustration of a methodology for the estimation, visualization, and hypothesis testing of long memory in RNNs, based on an approach that mathematically defines and directly estimates long-range dependence as a property of a multivariate time series. We thus contextualize a core objective of sequence modeling with deep networks against a well-developed but as-yet-unexploited literature on long memory processes. We offer extensive validation of the proposed approach and explore strategies to overcome problems with hypothesis testing for long memory in the high-dimensional regime. We report experimental results obtained on a wide-ranging collection of real-world music and language data, confirming the (often strong) long-range dependencies that are observed by practitioners. However, we find that this property is not adequately captured by a variety of RNNs trained to benchmark performance on a language dataset. Code corresponding to these experiments, including an illustrative Jupyter notebook, is available for download at https://github.com/alecgt/RNN_long_memory.
Related work. Though a formal connection to long memory processes has been lacking thus far, machine learning applications to sequence modeling have long been concerned with the capture of long-range dependencies. The development of RNN models has been strongly influenced by the identification of the "vanishing gradient problem" in Bengio et al. (1994). More complex recurrent architectures, such as long short-term memory (Hochreiter and Schmidhuber, 1997a), gated recurrent units (Cho et al., 2014), and structurally constrained recurrent networks (Mikolov et al., 2015) were designed specifically to alleviate this problem. Alternative approaches have pursued a more formal understanding of RNN computation, for example through kernel methods (Lei et al., 2017), by means of ablative strategies clarifying the computation of the RNN hidden state (Levy et al., 2018), or through a dynamical systems approach (Miller and Hardt, 2019). A modern statistical perspective on nonlinear time series analysis is provided in (Douc et al., 2014).

Long-range dependence is most commonly evaluated in RNN models by test performance on a synthetic classification task. For example, the target may be the parity of a binary sequence (so-called "parity" problems), or it may be the class of a sequence whose most recent terms are replaced with white noise ("2-sequence" or "latch" problems) (Bengio et al., 1994; Bengio and Frasconi, 1994; Lin et al., 1996). A simple demonstration relatively early in RNN history by Hochreiter and Schmidhuber (1997b) showed that such tasks can often be solved quickly by random parameter search, casting doubt on their informativeness. Whereas the authors proposed a different heuristic, we seek to re-frame the problem of long memory evaluation so that it is amenable to statistical analysis.

Classical constructions of long memory processes (Mandelbrot and Van Ness, 1968; Granger and Joyeux, 1980; Hosking, 1981) laid the foundation for statistical methods to estimate long memory from time series data. See also (Moulines et al., 2008; Reisen et al., 2017) for recent works in this area. The multivariate estimator of Shimotsu (2007) is the foundation of the methodology we develop here. It is by now well understood that failure to properly account for long memory can severely diminish performance in even basic estimation or prediction tasks. For example, the sample variance is both biased and inefficient as an estimator of the variance of a stationary long memory process (Percival and Guttorp, 1994). Similarly, failure to model long memory has been shown to significantly harm the predictive performance of time series models, particularly in the case of multi-step forecasting (Brodsky and Hurvich, 1999).
Long memory in stochastic processes.
Long memory has a simple and intuitive definition in terms of the autocovariance sequence of a real, stationary stochastic process X_t ∈ ℝ, t ∈ ℤ. The process X_t is said to have long memory if the autocovariance γ_X(k) = Cov(X_t, X_{t+k}), k ∈ ℤ, satisfies

γ_X(k) ∼ L_γ(k) |k|^{−(1−2d)}  as k → ∞,   (1)

for some d ∈ (0, 1/2), where a(k) ∼ b(k) indicates that a(k)/b(k) → 1 as k → ∞ and L_γ(k) is a slowly varying function at infinity (refer to Appendix A for an introduction to slowly varying functions). The term "long memory" is justified by the slow (hyperbolic) decay of the autocovariance sequence, which enables meaningful information to be preserved between distant observations in the series. As a consequence of this slow decay, the partial sums of the absolute autocovariance sequence diverge. This can be directly contrasted with the "short memory" case, in which the autocovariance sequence is absolutely summable. Moreover, we note that the parameter d allows one to quantify the memory by controlling the strength of long-range dependencies.

In the time series literature, a spectral definition of "memory" is preferred, as it unifies the long and short memory cases. A second-order stationary time series can be represented in the frequency domain by its spectral density function

f_X(λ) = (1/2π) Σ_{k=−∞}^{∞} γ(k) e^{−ikλ}.

If X_t has a spectral density function that satisfies

f_X(λ) = L_f(λ) |λ|^{−2d},   (2)

where L_f(λ) is slowly varying at zero, then X_t has long memory if d ∈ (0, 1/2), short memory for d = 0, and "intermediate memory" or "antipersistence" if d ∈ (−1/2, 0). The two definitions of long memory are equivalent when L_f(λ) is quasimonotone (Beran et al., 2013).

We summarize the complementary time and frequency domain views of long memory with a simple illustration in Figure 1, which contrasts a short memory autoregressive (AR) process of order 1 with its long memory counterpart, the fractionally integrated AR process. The autocovariance series is seen to converge rapidly for the AR process, whereas it diverges for the fractionally integrated AR process. Meanwhile, Eq. (2) implies that the long memory parameter d has a geometric interpretation in the frequency domain as the slope of log f_X(λ) versus −2 log(λ) as λ → 0 (i.e., as −2 log(λ) → ∞).

Finally, the relevance of this topic in machine learning can be motivated through its consequences for prediction. Again, the contrast between the AR and fractionally integrated AR processes provides a simple and concrete illustration. Writing each process in terms of its Wold decomposition (Brockwell and Davis, 2013)

X_t = Σ_{j=0}^{∞} a_j ε_{t−j},

the optimal predictor X̂_{t+h} of X_{t+h} for the prediction horizon h ≥ 1 is defined as the minimizer of the mean squared error MSE(h) = E[(X̂ − X_{t+h})²],

X̂_{t+h} = argmin_{X̂} E[(X̂ − X_{t+h})²] = Σ_{j=0}^{∞} a_{j+h} ε_{t−j},

and analyzed in terms of the proportion of variance explained, R²(h) = 1 − MSE(h)/Var(X_{t+h}). In the short memory AR case, the terms |a_j|, and therefore R²(h), decay exponentially (see Appendix B); by contrast, when X_t is fractionally integrated, R²(h) decays only hyperbolically (Beran et al., 2013). Therefore, in the long memory regime, past observations can retain significant explanatory power with respect to future prediction targets, and informative forecasts are available over horizons extending well beyond that of an analogous short memory process.
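The contrast can be made concrete numerically. The sketch below is our own minimal illustration (not taken from the paper's released code): it simulates an AR(1) process and a fractionally integrated AR(1) counterpart, using the binomial expansion of (1 − B)^{−d} for the fractional integration, and compares the partial sums of their empirical autocovariances.

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients of (1 - B)^{-d}: psi_0 = 1, psi_k = psi_{k-1} * (d + k - 1) / k."""
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (d + k - 1) / k
    return w

def simulate(T, d=0.0, phi=0.5, burn=1000, seed=0):
    """AR(1) driven by (fractionally integrated) Gaussian noise; d = 0 gives a plain AR(1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T + burn)
    if d != 0.0:
        w = frac_diff_weights(d, T + burn)
        eps = np.convolve(eps, w)[: T + burn]        # apply (1 - B)^{-d} to the noise
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = phi * x[t - 1] + eps[t]
    return x[burn:]

def autocov(x, max_lag):
    x = x - x.mean()
    return np.array([np.mean(x[: len(x) - k] * x[k:]) for k in range(max_lag + 1)])

T, K = 100_000, 250
for d in (0.0, 0.4):
    gamma = autocov(simulate(T, d=d), K)
    print(f"d = {d}: partial sum of |gamma(k)| up to lag {K} = {np.abs(gamma).sum():.2f}")
```

The partial sums level off quickly in the short memory case and keep growing with the lag in the fractionally integrated case, which is the behavior plotted in Figure 1.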
Figure 1: Time and frequency domain views of an AR(1) process (blue, d = 0) and its long memory counterpart obtained by fractional differencing (orange, d = 0.…). Left: Autocorrelation sequences (solid lines) of the two processes, along with their partial sums (dotted lines). Right: Plot of log spectral density versus −2 times log frequency.
Many common models do not have long memory. Despite the appeal and practicality of long memory for modeling complex time series, we emphasize that it is absent from nearly all common statistical models for sequence data. We offer a short list of examples, with proofs deferred to Appendix B of the Supplement.

• Markov models. If X_t is a Markov process on a finite state space 𝒳, and Y_t = g(X_t) for any function g : 𝒳 → ℝ, then Y_t has short memory. We show that this property holds even in a complex model with Markov structure, the mixture transition distribution model for high-order Markov chains (Raftery, 1985). In light of this, long memory processes are sometimes called "non-Markovian".

• Autoregressive moving average (ARMA) models. ARMA models, a ubiquitous tool in time series modeling, likewise have exponentially decaying autocovariances and thus short memory (Brockwell and Davis, 2013). This may be somewhat surprising, as causal ARMA models with nontrivial moving average components are equivalent to linear autoregressive models of infinite order.

• Nonlinear autoregressions. Finally, and most importantly for our present focus, nonlinearity of the state transition function is no guarantee of long memory. We show that a class of autoregressive processes in which the state is subject to iterated nonlinear transformations still fails to achieve a slowly decaying autocovariance sequence (Lin et al., 1996; Gourieroux and Jasiak, 2005).
Semiparametric estimation of long memory.
Methods for the estimation of the long memory parameter d have been developed and analyzed under increasingly broad conditions. Here, we focus on semiparametric methods, which offer consistent estimation of the long memory without the need to estimate or even specify a full parametric model. The term "semiparametric" refers to the fact that the estimation problem involves a finite-dimensional parameter of interest (the long memory vector) and an infinite-dimensional nuisance parameter (the spectral density).

Semiparametric estimation in the Fourier domain leverages the implication of Eq. (2) that

f_X(λ) ∼ c_f |λ|^{−2d}   (3)

as λ → 0, with c_f a nonzero constant. Estimators are constructed directly from the periodogram using only terms corresponding to frequencies near the origin. The long memory parameter d is estimated either by log-periodogram regression, which yields the Geweke-Porter-Hudak (GPH) estimator (Geweke and Porter-Hudak, 1983), or through a local Gaussian approximation, which gives the Gaussian semiparametric estimator (GSE) (Robinson, 1995). The GSE offers greater efficiency, requires weaker distributional assumptions, and can be defined for both univariate and multivariate time series; therefore it will be our main focus.
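In the univariate case, the log-periodogram idea can be written down in a few lines. The sketch below is an illustration we add here (the paper's own estimator is the multivariate GSE described next): it regresses log I(λ_j) on −2 log λ_j over the first m Fourier frequencies and reads off the slope as an estimate of d, following the local approximation in Eq. (3).

```python
import numpy as np

def gph_estimate(x, m):
    """Log-periodogram regression estimate of d for a scalar series x (local approximation of Eq. (3))."""
    T = len(x)
    lam = 2 * np.pi * np.arange(1, m + 1) / T
    y = np.fft.fft(x - x.mean())[1 : m + 1] / np.sqrt(2 * np.pi * T)
    log_I = np.log(np.abs(y) ** 2)
    design = np.column_stack([np.ones(m), -2.0 * np.log(lam)])   # log I ~ const + d * (-2 log lambda)
    coef, *_ = np.linalg.lstsq(design, log_I, rcond=None)
    return coef[1]
```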
Multivariate long memory processes. Analysis of long memory in multivariate stochastic processes is a topic of more recent investigation in the time series literature. The common underlying assumption in multivariate semiparametric estimation of long memory is that the real, vector-valued process X_t ∈ ℝ^p can be written as

(1 − B)^{d_i} X_{ti} = U_{ti},  i = 1, ..., p,   (4)

where X_{ti} is the i-th component of X_t, U_t ∈ ℝ^p is a second-order stationary process with spectral density function bounded and bounded away from zero at zero frequency, B is the backshift-in-time operator, and |d_i| < 1/2 for every i = 1, ..., p (Shimotsu, 2007). The backshift operation B^j X_t = X_{t−j}, j ∈ ℤ, is extended to non-integer orders via

(1 − B)^{−d} = Σ_{k=0}^{∞} [Γ(d + k) / (k! Γ(d))] B^k,

and thus X_t is referred to as a fractionally integrated process when d ≠ 0. Fractionally integrated processes are the most commonly used models for data with long-range dependencies, encompassing parametric classes such as the vector autoregressive fractionally integrated moving average (VARFIMA), a multivariate and long memory extension of the popular ARMA family of time series models.

If X_t is defined as in Eq. (4), then its spectral density function f_X(λ) satisfies (Hannan, 2009)

f_X(λ) = Φ(λ, d) f_U(λ) Φ*(λ, d),

where x* denotes the complex conjugate of x, f_U(λ) is the spectral density function of U_t at frequency λ, and Φ(λ, d) = diag((1 − e^{iλ})^{−d_i})_{i=1,...,p}.

Given an observed sequence (x_1, ..., x_T) = x^T with discrete Fourier transform

y_j = (1/√(2πT)) Σ_{t=1}^{T} x_t e^{−iλ_j t},  λ_j = 2πj/T,

the spectral density is estimated at frequency λ_j by the periodogram I(λ_j) = y_j y_j^*. Under the assumption that f_U(λ) ∼ G as λ → 0 for some real, symmetric, positive definite G ∈ ℝ^{p×p}, the local behavior of f_X(λ) around the origin is governed only by d and G:

f(λ_j) ∼ Φ(λ_j, d) G Φ*(λ_j, d).   (5)
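A simple special case of Eq. (4), with U_t taken to be i.i.d. Gaussian, can be simulated coordinate-wise by fractional integration. The sketch below is our own illustration and reuses the frac_diff_weights helper from the earlier sketch; it is useful later as a test bed for the estimator.

```python
import numpy as np

def simulate_fi(T, d, Sigma, burn=1000, seed=0):
    """Simulate X_t with (1 - B)^{d_i} X_{ti} = U_{ti}, U_t ~ N(0, Sigma) i.i.d.
    (a simple special case of Eq. (4)); frac_diff_weights is the helper defined earlier."""
    rng = np.random.default_rng(seed)
    p = len(d)
    U = rng.multivariate_normal(np.zeros(p), Sigma, size=T + burn)
    X = np.empty_like(U)
    for i in range(p):
        w = frac_diff_weights(d[i], T + burn)              # weights of (1 - B)^{-d_i}
        X[:, i] = np.convolve(U[:, i], w)[: T + burn]      # fractional integration per coordinate
    return X[burn:]

p = 4
d_true = np.array([0.1, 0.2, 0.3, 0.4])
Sigma = 0.5 * np.eye(p) + 0.5                              # unit variances, cross-correlation 0.5
X = simulate_fi(2 ** 14, d_true, Sigma)
```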
The Gaussian semiparametric estimator. The Gaussian semiparametric estimator of d (Shimotsu, 2007) is computed from a local, frequency-domain approximation to the Gaussian likelihood based on Eq. (5). The approximation is valid under restriction of the likelihood to a range of frequencies close to the origin. Using the identity 1 − e^{−iλ} = 2 sin(λ/2) e^{i(π−λ)/2}, we have the approximation

Φ(λ, d) ≈ diag(λ^{−d_i} e^{i(π−λ)d_i/2}) =: Λ(d),

which is valid up to an error term of order O(λ²).

The Gaussian log-likelihood is written in the frequency domain as (Whittle, 1953)

L_m = (1/m) Σ_{j=1}^{m} { log det f_X(λ_j) + Tr[f_X(λ_j)^{−1} y_j y_j^*] }
    ≈ (1/m) Σ_{j=1}^{m} { log det[Λ_j(d) G Λ_j^*(d)] + Tr[(Λ_j(d) G Λ_j^*(d))^{−1} I(λ_j)] }.

Validity of the approximation is ensured by restriction of the sum to the first m Fourier frequencies, with m = o(T).

Solving the first-order optimality condition

∂L_m/∂G = (1/m) Σ_{j=1}^{m} [ (G^T)^{−1} − (G^{−1} Λ_j(d)^{−1} I(λ_j) Λ_j^*(d)^{−1} G^{−1})^T ] = 0

for G yields

Ĝ(d) = (1/m) Σ_{j=1}^{m} Re[ Λ_j(d)^{−1} I(λ_j) Λ_j^*(d)^{−1} ].

Substitution back into the objective results in the expression

L_m(d) = log det Ĝ(d) − (2/m) Σ_{i=1}^{p} d_i Σ_{j=1}^{m} log λ_j,   (6)

and the Gaussian semiparametric estimator is obtained as the minimizer

d̂_GSE = argmin_{d ∈ Θ} L_m(d),   (7)

over the feasible set Θ = (−1/2, 1/2)^p.

A key result due to Shimotsu (2007) establishes that the estimator d̂_GSE is consistent and asymptotically normal under mild conditions, with

√m (d̂_GSE − d) →_d N(0, Ω^{−1}),   (8)

where

Ω = 2 [ I_p + G ⊙ G^{−1} + (π²/4)(G ⊙ G^{−1} − I_p) ],

d is the true long memory, and ⊙ denotes the Hadamard product.
Optimization. Relatively little discussion of optimization procedures for the problem in Eq. (7) is available in the time series literature. For instance, we are not aware of any proof that the objective is convex in the multivariate setting. To compute the estimator d̂_GSE, we apply L-BFGS-B, a quasi-Newton algorithm that handles box constraints (Byrd et al., 1995). L-BFGS-B is an iterative algorithm requiring the gradient of the objective; this is derived in Appendix C of the Supplement.
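As a concrete sketch of Eqs. (6)-(7), the code below builds Ĝ(d) from the first m periodogram ordinates and minimizes the concentrated objective with scipy's L-BFGS-B (using a numerical gradient rather than the analytic gradient of Appendix C). This is an unofficial illustration written under the conventions above; the released RNN_long_memory package may differ in details.

```python
import numpy as np
from scipy.optimize import minimize

def local_periodogram(X, m):
    """Periodogram matrices I(lambda_j) = y_j y_j^* at the first m Fourier frequencies."""
    T, p = X.shape
    Y = np.fft.fft(X - X.mean(axis=0), axis=0)[1 : m + 1] / np.sqrt(2 * np.pi * T)
    lam = 2 * np.pi * np.arange(1, m + 1) / T
    I = np.einsum('ja,jb->jab', Y, Y.conj())
    return lam, I

def gse_objective(d, lam, I):
    """Concentrated local Whittle objective L_m(d) of Eq. (6)."""
    # Lambda_j(d)^{-1} is diagonal with entries lam_j^{d_i} * exp(-i (pi - lam_j) d_i / 2)
    Linv = np.exp(np.outer(np.log(lam), d) - 0.5j * np.outer(np.pi - lam, d))
    G = np.mean(np.real(Linv[:, :, None] * I * Linv[:, None, :].conj()), axis=0)   # G_hat(d)
    _, logdet = np.linalg.slogdet(G)
    return logdet - 2.0 * d.sum() * np.mean(np.log(lam))

def gse_estimate(X, m):
    lam, I = local_periodogram(X, m)
    p = X.shape[1]
    res = minimize(gse_objective, x0=np.zeros(p), args=(lam, I),
                   method='L-BFGS-B', bounds=[(-0.49, 0.49)] * p)
    return res.x

d_hat = gse_estimate(X, m=int(np.sqrt(len(X))))   # X: the simulated series from the previous sketch
print(d_hat.round(2))
```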
Bandwidth selection
The choice of the bandwidth parameter m determines the tradeoff between bias and variance in the estimator: at small m the variance may be high due to few data points, while setting m too large can introduce bias by accounting for the behavior of the spectral density function away from the origin.

When it is possible to simulate from the target process, as will be the case when we evaluate criteria for long memory in recurrent neural networks, we can naturally control the variance simply by simulating long sequences and computing a dense estimate of the periodogram. Without knowledge of the shape of the spectral density function, however, it is difficult to know how to set the bandwidth to avoid bias, and thus we prefer the relatively conservative setting of m = √T. This choice is justified by a bias study for the bandwidth parameter, which is given in Appendix D of the Supplement.
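A simple practical check, which we suggest here as a complement to the bias study in Appendix D rather than as a procedure from the paper, is to recompute the estimate over a small grid of bandwidths and look for a plateau. The gse_estimate function from the preceding sketch is assumed.

```python
import numpy as np

T, p = X.shape                                    # X: the multivariate series under study
for expo in (0.5, 0.6, 0.7, 0.8):
    m = int(T ** expo)
    d_hat = gse_estimate(X, m=m)
    print(f"m = T^{expo} = {m:6d}   normalized total memory = {d_hat.sum() / p:.3f}")
```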
RNN hidden state as a nonlinear model for a long memory process. The standard tool for statistical modeling of multivariate long memory processes is the vector autoregressive fractionally integrated moving average (VARFIMA) model, which represents the process X_t ∈ ℝ^p with long memory parameter d as

Φ(B)(1 − B)^d X_t = Θ(B) Z_t,

where Z_t is a white noise process and (1 − B)^d = diag((1 − B)^{d_i}), i = 1, ..., p (Lobato, 1997; Sowell, 1989). Under the standard stationarity and invertibility conditions on the matrix polynomials Φ(B) and Θ(B), respectively, the process can be represented as

X_t = (1 − B)^{−d} Φ^{−1}(B) Θ(B) Z_t,

which shows that X_t has a composite representation in terms of linear "features" of the input sequence and an explicit fractional integration step ensuring that it satisfies the definition of multivariate long memory in Eq. (4).

We extend this view to deep network models for sequences with long-range dependencies. The key difference is that RNN models are not constrained to work with a linear representation of the data, nor do they explicitly contain a step that guarantees the long memory of X_t. To evaluate long memory in an RNN model, we study the stochastic process

X_t = Ψ(Z_t),   (9)

where Z_t is again a white noise, and the nonlinear transformation Ψ describes the RNN transformation of inputs to the hidden state. In a typical RNN model, a decision rule is learned by linear modeling of the hidden state; this framework thus aligns with a broader theoretical characterization of deep learning as approximate linearization of complex decision boundaries in input space by means of a learned nonlinear feature representation (Bruna and Mallat, 2013; Mairal et al., 2014; Jones et al., 2019; Bietti and Mairal, 2019).
Testable criteria for RNN capture of long-range dependence. The complexity of Ψ(·) corresponding to even the most basic RNN sequence models precludes a fully theoretical treatment of long memory in processes described by Eq. (9). Nonetheless, this characterization suggests an approach for the statistical evaluation of long memory in RNNs, as it establishes testable criteria under which a model of the form Eq. (9) describes a process X_t with long memory. In particular, to satisfy the definition in Eq. (4) we must have

X_t = Ψ(Z_t) = (1 − B)^{−d} Ψ̃(Z_t)

for some d ≠ 0 and process Ψ̃(Z_t) with bounded and nonzero spectral density at zero frequency. Semiparametric estimation of d in the frequency domain provides a means to evaluate this condition such that the results are agnostic to the behavior of Ψ̃(Z_t) at higher frequencies. If Ψ(Z_t) admits a representation in terms of an explicit fractional integration step, then this can be investigated in two complementary experiments:

1. Integration of fractionally differenced input.
Define

X̃_t = (1 − B)^d Z_t,

where Z_t is a standard Gaussian white noise and d is the long memory parameter corresponding to the source X_t on which the model was trained. If the sequence x̃^T is drawn from X̃_t, then we expect to find that

d̂_GSE(h̃^T) ≈ 0,

where h̃^T = Ψ(x̃^T) is the RNN hidden representation of the simulated input. On the other hand, nonzero long memory in the hidden state indicates a mismatch between fractional integration learned by the RNN and long memory of the data X_t.

2. Long memory transformation of white noise.
Conversely, we expect to find that the RNN hidden representation of a white noise sequence has a nonzero long memory parameter. White noise has a constant spectrum and thus a long memory parameter equal to zero. If Ψ(·) performs both the feature representation and fractional integration functions that are handled separately and explicitly in the VARFIMA model, then a zero-memory input will be transformed to a nonzero-memory sequence of hidden states.

Total memory. It is common for sequence embeddings and RNN hidden layers to have hundreds of dimensions, and thus long memory estimation for these sequences naturally occurs in a high-dimensional setting. This topic is virtually unexplored in the time series literature, where multivariate studies tend to have modest dimension. Practically, this raises two main issues. First, if p ≈ m for dimension p and bandwidth m, then the approximation of the test statistic distribution by its asymptotic limit will be of poor quality, and the resulting test is likely to be miscalibrated. Second, it becomes difficult to interpret the long memory vector d, particularly when the coordinates of the corresponding time series are not meaningful themselves.

We resolve both issues by considering the total memory statistic d̄, defined as

d̄ = 1^T d̂_GSE.   (10)

Computation of the total memory is no more complex than that of the GSE, and it has an intuitive interpretation as the coordinate-wise aggregate strength of long memory in a multivariate time series.
Asymptotic normality of the total memory estimator. The total memory is a simple linear functional of the GSE, and thus its consistency and asymptotic normality can be established by a simple argument. In particular, defining d̄ = g(d) := 1^T d, we see that ∇g(d) = 1, which is clearly nonzero, so that by Eq. (8) and the delta method we have

√m (1^T d̂_GSE − d̄) →_d N(0, 1^T Ω^{−1} 1),   (11)

where d̄ is the true total memory of the observed process.

To validate this proposed estimator, we provide a "sanity check" on simulated high-dimensional data with known long memory in Appendix E of the Supplement.
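The resulting one-sided test is straightforward to compute. The sketch below is our own illustration of Eq. (11): given the GSE estimate, the matrix Ĝ(d̂) obtained during the estimation, and the bandwidth m, it forms Ω as in the main text and reports the total memory together with a one-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def total_memory_test(d_hat, G_hat, m, alternative='greater'):
    """One-sided test of H0: total memory = 0, using the limit in Eq. (11).
    d_hat is the GSE estimate, G_hat is G evaluated at d_hat, m is the bandwidth."""
    p = len(d_hat)
    GG = G_hat * np.linalg.inv(G_hat)                                  # Hadamard product G o G^{-1}
    Omega = 2.0 * (np.eye(p) + GG + (np.pi ** 2 / 4.0) * (GG - np.eye(p)))
    var = np.ones(p) @ np.linalg.solve(Omega, np.ones(p)) / m          # 1' Omega^{-1} 1 / m
    d_bar = d_hat.sum()
    z = d_bar / np.sqrt(var)
    return d_bar, (norm.sf(z) if alternative == 'greater' else norm.cdf(z))
```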
Visualizing and testing for long memory in high dimensions. The visual time-domain summary of long memory in Figure 1 can be extended to the multivariate setting. In this case, the autocovariance γ(k) = Cov(X_t, X_{t+k}) is matrix-valued, which for the purpose of evaluating long memory can be summarized by the scalar Tr(|γ(k)|), where the absolute value is taken element-wise. Recall that a sufficient condition for short memory is the absolute convergence of the autocovariance series, whereas this series diverges for long memory processes.

From a testing perspective, a statistical decision rule for the presence of long memory can be derived from the asymptotic distribution of the corresponding estimator. However, when the dimension p is large and we conservatively set the bandwidth m = √T, we may have m ≈ p even when the observed sequence is relatively long. The classical approach to testing for the multivariate Gaussian mean is based on the Wald statistic

m (d̂ − d_0)^T Ω (d̂ − d_0),

which has a χ²(p) distribution under the null hypothesis H_0 : d = d_0. In Appendix F of the Supplement, we give a demonstration that the standard Wald test can be seriously miscalibrated when m ≈ p, whereas testing for long memory with the total memory statistic remains well-calibrated in this setting. These results are consistent with previous observations that the Wald test for long memory can have poor finite-sample performance even in low dimensions (Shimotsu, 2007; Hurvich and Chen, 2000), though these studies suggest no alternative.

Experiments
Much of the development of deep recurrent neural networks has been motivated by the goal of finding good representations and models for text and audio data. Our results in this section confirm that such data can be considered as realizations of long memory processes. A full summary of results is given in Table 1, and autocovariance partial sums are plotted in Figure 2. To facilitate comparison of the estimated long memory across time series of different dimension, we report the normalized total memory d̄/p = (1^T d̂_GSE)/p in all tables.

For all experiments, we test the null hypothesis H_0 : d̄ = 0 against the one-sided alternative of long memory, H_1 : d̄ > 0. We set the level of the test at a fixed α and compute the corresponding critical value c_α from the asymptotic distribution of the total memory estimator. Given an estimate of the total memory d̄(x^T), a p-value is computed as P(d̄ > d̄(x^T) | d̄ = 0); note that a p-value less than α corresponds to rejection of the null hypothesis in favor of the long memory alternative.
Table 1: Total Memory in Natural Language and Music Data.

Data                             Norm. total memory    p-value          Reject H_0?
Natural language
  Penn TreeBank                  0.163                 < 1 × 10^{−…}    ✓
  Facebook CBT                   0.0636                < 1 × 10^{−…}    ✓
  King James Bible               0.192                 < 1 × 10^{−…}    ✓
Music
  J.S. Bach                      0.0997                < 1 × 10^{−…}    ✓
  Miles Davis                    0.322                 < 1 × 10^{−…}    ✓
  Oum Kalthoum                   0.343                 < 1 × 10^{−…}    ✓
Natural language data. We evaluate long memory in three different sources of English language text data: the Penn TreeBank training corpus (Marcus et al., 1993), the training set of the Children's Book Test from Facebook's bAbI tasks (Weston et al., 2016), and the King James Bible. The Penn TreeBank corpus and King James Bible are considered as single sequences, while the Children's Book Test data consists of 98 books, which are considered as separate sequences. We require that each sequence be of length at least T = 2^{…}, which ensures that the periodogram can be estimated with reasonable density near the origin. Finally, we use GloVe embeddings (Pennington et al., 2014) to convert each sequence of word tokens to an equal-length sequence of real vectors of dimension k = 200; a sketch of this embedding step is given below. (Code for all results in this section is available at https://github.com/alecgt/RNN_long_memory.)

The results show significant long memory in each of the text sources, despite their apparent differences. As might be expected, the children's book data from the Facebook bAbI dataset demonstrate the weakest long-range dependencies, as is evident both from the value of the total memory statistic and the slope of the autocovariance partial sum.
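The embedding step can be sketched as follows. This is an illustration under our own assumptions: the GloVe file name and corpus path are hypothetical placeholders, and tokens absent from the vocabulary are simply dropped, which may differ from the preprocessing actually used.

```python
import numpy as np

def load_glove(path):
    """Parse the GloVe text format: one token per line, followed by its embedding vector."""
    vecs = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
    return vecs

glove = load_glove('glove.6B.200d.txt')                  # hypothetical local copy of 200-d vectors
tokens = open('corpus.txt', encoding='utf-8').read().lower().split()
X = np.stack([glove[t] for t in tokens if t in glove])   # T x 200 series passed to the estimator
```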
Figure 2: Partial sums of the autocovariance trace for embedded natural language and music data. Left: Natural language data. For clarity we include only the longest of the 98 books in the Facebook bAbI training set. Right: Music data. Each of the five tracks from both Miles Davis and Oum Kalthoum is plotted separately, while the Bach cello suite is treated as a single sequence.
Music data.
Modeling and generation of music has recently gained significant visibility in the deep learning community as a challenging set of tasks involving sequence data. As in the natural language experiments, we seek to evaluate long memory in a broad selection of representative data. To this end, we select a complete Bach cello suite consisting of 6 pieces from the MusicNet dataset (Thickstun et al., 2017), the jazz recordings from Miles Davis' Kind of Blue, and a collection of the most popular works of the famous Egyptian singer Oum Kalthoum.

For the Bach cello suite, we embed the data from its raw scalar wav file format using a reduced version of a deep convolutional model that has recently achieved near state-of-the-art prediction accuracy on the MusicNet collection of classical music (Thickstun et al., 2018). Details of the model training, including performance benchmarks, are provided in Appendix H of the Supplement. We are not aware of a prominent deep learning model for either jazz music or vocal performances. Therefore, for the recordings of Miles Davis and Oum Kalthoum, we revert to a standard method and extract mel-frequency cepstral coefficients (MFCC) from the raw wav files at a sample rate of … Hz (Logan et al., 2000); a sketch of this extraction step is given at the end of this subsection. A study of the impact of embedding choice on estimated long memory, including a long memory analysis of the Bach data under MFCC features, is provided in Appendix G.

The results show that long memory appears to be even more strongly represented in music than in text. We find evidence of particularly strong long-range dependence in the recordings of Miles Davis and Oum Kalthoum, consistent with their reputation for repetition and self-reference in their music.

Overall, while the results of this section are unlikely to surprise practitioners familiar with the modeling of language and music data, they are scientifically useful for two main reasons: first, they show that our long memory analysis is able to identify well-known instances of long-range dependence in real-world data; second, they establish quantitative criteria for the successful representation of this dependency structure by RNNs trained on such data.
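For reference, MFCC frames of the kind used above can be extracted with a standard audio library. The sketch below is our own illustration; the file name is a placeholder, and the number of coefficients and hop length are illustrative choices rather than the configuration used in the experiments.

```python
import librosa

# Load a recording at its native sample rate and extract MFCC frames;
# n_mfcc and hop_length are illustrative, not the paper's configuration.
y, sr = librosa.load('track.wav', sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=512)
X = mfcc.T        # frames x coefficients, analyzed as a multivariate time series
```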
Long memory analysis of language model RNNs

We now turn to the question of whether RNNs trained on one of the datasets evaluated above are able to represent the long-range dependencies that we know to be present. We evaluate the criteria for long memory on three different RNN architectures: long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997a), memory cells (Levy et al., 2018), and structurally constrained recurrent networks (SCRN) (Mikolov et al., 2015). Each network is trained on the Penn TreeBank corpus as part of a language model that includes a learned word embedding and linear decoder of the hidden states; the architecture is identical to the "small" LSTM model in (Zaremba et al., 2014), which is preferred for the tractable dimension of the hidden state. Note that our objective is not to achieve state-of-the-art results, but rather to reproduce benchmark performance in a well-known deep learning task. Finally, for comparison, we will also include an untrained LSTM in our experiments; the parameters of this model are simply set by random initialization.
Table 2: Language Model Performance by RNN Type
Model            Test Perplexity
Zaremba et al.   114.5
LSTM             114.5
Memory cell      119.0
SCRN             124.3
RNN integration of fractionally differenced input.
Having estimated the long memory parameter d corresponding to the Penn TreeBank training data in the previous section, we simulate inputs x̃^T with T = 2^{…} by fractional differencing of a standard Gaussian white noise and evaluate the total memory of the corresponding hidden representation Ψ(x̃^T) for each RNN. Results from n = 100 trials are compiled in Table 3 (standard error of total memory estimates in parentheses). We test the null hypothesis H_0 : d̄ = 0 against the one-sided alternative H_1 : d̄ < 0, which corresponds to the model's failure to represent the full strength of fractional integration observed in the data. A sketch of this procedure is given below.
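The sketch below illustrates one way this experiment could be run for a single trained network. It is our own reconstruction under stated assumptions: the checkpoint name is hypothetical, the sequence length is illustrative rather than the paper's exact setting, d_ptb denotes a previously computed GSE estimate for the embedded Penn TreeBank data, and frac_diff_weights and gse_estimate come from the earlier sketches.

```python
import numpy as np
import torch

def frac_difference(z, d):
    """Apply (1 - B)^{d_i} to coordinate i of z (T x p); reuses frac_diff_weights from the first sketch."""
    T, p = z.shape
    out = np.empty_like(z)
    for i in range(p):
        w = frac_diff_weights(-d[i], T)               # (1 - B)^{d_i} has the weights of (1 - B)^{-(-d_i)}
        out[:, i] = np.convolve(z[:, i], w)[:T]
    return out

T = 2 ** 17                                           # illustrative length, not the paper's exact setting
m = int(np.sqrt(T))
rng = np.random.default_rng(0)
x_tilde = frac_difference(rng.standard_normal((T, len(d_ptb))), d_ptb)   # d_ptb: GSE estimate from PTB

rnn = torch.load('lstm_ptb.pt')                       # hypothetical checkpoint of the trained recurrent layer
rnn.eval()
with torch.no_grad():
    inp = torch.as_tensor(x_tilde, dtype=torch.float32).unsqueeze(1)     # (T, batch=1, input_dim)
    h, _ = rnn(inp)                                   # hidden-state sequence Psi(x_tilde)
H = h.squeeze(1).numpy()

d_hidden = gse_estimate(H, m=m)
print('normalized residual total memory:', d_hidden.sum() / H.shape[1])
```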
Table 3: Residual Total Memory in RNN Representations of Fractionally Differenced Input.

Model              Norm. total memory          p-value          Reject H_0?
LSTM (trained)     −… × 10^{−…} (0.00475)      … × 10^{−…}      ✓
LSTM (untrained)   −… × 10^{−…} (0.00387)      < 1 × 10^{−…}    ✓
Memory cell        −… × 10^{−…} (0.00539)      … × 10^{−…}      ✓
SCRN               −… × 10^{−…} (0.00631)      … × 10^{−…}      ✓

RNN transformation of white noise.
For a complementary analysis, we evaluate whether the RNNs can impart nontrivial long-range dependency structure to white noise inputs. In this case, the input sequence z^T is drawn from a standard Gaussian white noise process, and we test the corresponding hidden representation Ψ(z^T) for nonzero total memory. As in the previous experiment, we select T = 2^{…}, choose the bandwidth parameter m = √T, and simulate n = 100 trials for each RNN. Results are detailed in Table 4. We test H_0 : d̄ = 0 against H_1 : d̄ > 0; here, the alternative corresponds to successful transformation of white noise input to long memory hidden state.
Table 4: Total Memory in RNN Representations of White Noise Input.

Model              Norm. total memory          p-value   Reject H_0?
LSTM (trained)     −… × 10^{−…} (0.00405)      0.583     ✗
LSTM (untrained)   −… × 10^{−…} (0.00223)      0.572     ✗
Memory cell        −… × 10^{−…} (0.00452)      0.552     ✗
SCRN               … × 10^{−…} (0.00522)       0.324     ✗

Discussion.
We summarize the main experimental result as follows: there is a statistically well-defined and practically identifiable property, relevant for prediction and broadly represented in language and music data, that is not present according to two fractional integration criteria in a collection of RNNs trained to benchmark performance.

Tables 3 and 4 show that each evaluated RNN fails both criteria for representation of the long-range dependency structure of the data on which it was trained. The result holds despite a training protocol that reproduces benchmark performance, and for RNN architectures specifically engineered to alleviate the gradient issues typically implicated in the learning of long-range dependencies.
We have introduced and demonstrated a framework for the evaluation of long memory in RNNs that proceeds from a well-known definition in the time series literature. Under this definition, long memory is the condition enabling meaningful autocovariance at long lags in a multivariate time series. Of course, for sufficiently complex processes, this will not fully characterize the long-range dependence structure of the data generating process. Nonetheless, it represents a practical and informative foundation upon which to develop a statistical toolkit for estimation, inference, and hypothesis testing, which goes beyond the current paradigm of heuristic checks.

The long memory framework makes possible a formal investigation of specific and quantitative hypotheses concerning the fundamental issue of long-range dependencies in deep sequence learning. The experiments presented in this work investigate this phenomenon in natural language and music data, and in the learned representations of RNNs themselves, using the total memory statistic as an interpretable quantity that avoids the challenges associated with high-dimensional testing. The results identify long memory as a broadly prevalent feature of natural language and music data, while showing evidence that benchmark recurrent neural network models designed to capture this phenomenon may in fact fail to do so. Finally, this work suggests future topics in both time series, particularly concerning long memory analysis in high dimensions, and in deep learning, as a challenge to learn long memory representations in RNNs.
Acknowledgments
This work was supported by the Big Data for Genomics and Neuroscience Training Grant 8T32LM012419, NSF TRIPODS Award CCF-1740551, the program "Learning in Machines and Brains" of CIFAR, and faculty research awards.
References
Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropagation. In Adv. NIPS, 1994.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
J. Beran, Y. Feng, S. Ghosh, and R. Kulik. Long-Memory Processes: Probabilistic Properties and Statistical Methods. Springer, 2013.
A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. The Journal of Machine Learning Research, 20(1):876-924, 2019.
R. C. Bradley et al. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107-144, 2005.
P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer, 2013.
J. Brodsky and C. M. Hurvich. Multi-step forecasting for long-memory processes. Journal of Forecasting, 18(1):59-75, 1999.
J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872-1886, 2013.
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208, 1995.
K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
R. Douc, E. Moulines, and D. Stoffer. Nonlinear Time Series: Theory, Methods and Applications with R Examples. Chapman and Hall/CRC, 2014.
J. Geweke and S. Porter-Hudak. The estimation and application of long memory time series models. Journal of Time Series Analysis, 4(4):221-238, 1983.
C. Gourieroux and J. Jasiak. Nonlinear innovations and impulse responses with application to VAR sensitivity. Annales d'Economie et de Statistique, pages 1-31, 2005.
C. W. Granger and R. Joyeux. An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1):15-29, 1980.
E. J. Hannan. Multiple Time Series. John Wiley & Sons, 2009.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997a.
S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. In Adv. NIPS, 1997b.
J. R. M. Hosking. Fractional differencing. Biometrika, 68(1):165-176, 1981.
C. M. Hurvich and W. W. Chen. An efficient taper for potentially overdifferenced long-memory time series. Journal of Time Series Analysis, 21(2):155-180, 2000.
I. Ibragimov and Y. V. Linnik. Independent and stationary dependent variables, 1965.
C. Jones, V. Roulet, and Z. Harchaoui. Kernel-based translations of convolutional networks. arXiv preprint arXiv:1903.08131, 2019.
T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. Deriving neural architectures from sequence and graph kernels. In ICML, 2017.
O. Levy, K. Lee, N. FitzGerald, and L. Zettlemoyer. Long short-term memory as a dynamically computed element-wise weighted sum. In ACL, 2018.
T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329-1338, 1996.
I. N. Lobato. Consistency of the averaged cross-periodogram in long memory series. Journal of Time Series Analysis, 18(2):137-155, 1997.
B. Logan et al. Mel frequency cepstral coefficients for music modeling. In ISMIR, 2000.
J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Adv. NIPS, 2014.
B. B. Mandelbrot and J. W. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422-437, 1968.
M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.
S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. In ICLR, 2015.
J. Miller and M. Hardt. Stable recurrent models. In ICLR, 2019.
E. Moulines, F. Roueff, and M. S. Taqqu. A wavelet Whittle estimator of the memory parameter of a nonstationary Gaussian time series. The Annals of Statistics, 36(4):1925-1956, 2008.
J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
D. B. Percival and P. Guttorp. Long-memory processes, the Allan variance and wavelets. In Wavelet Analysis and its Applications, volume 4, pages 325-344. Elsevier, 1994.
V. Pipiras and M. S. Taqqu. Long-Range Dependence and Self-Similarity, volume 45. Cambridge University Press, 2017.
A. E. Raftery. A model for high-order Markov chains. Journal of the Royal Statistical Society. Series B (Methodological), pages 528-539, 1985.
V. A. Reisen, C. Lévy-Leduc, and M. S. Taqqu. An M-estimator for the long-memory parameter. Journal of Statistical Planning and Inference, 187:44-55, 2017.
P. M. Robinson. Gaussian semiparametric estimation of long range dependence. The Annals of Statistics, 23(5):1630-1661, 1995.
K. Shimotsu. Gaussian semiparametric estimation of multivariate fractionally integrated processes. Journal of Econometrics, 137(2):277-310, 2007.
F. Sowell. Maximum likelihood estimation of fractionally integrated time series models. Working paper, 1989.
J. Thickstun, Z. Harchaoui, and S. Kakade. Learning features of music from scratch. In ICLR, 2017.
J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade. Invariances and data augmentation for supervised music transcription. In ICASSP, 2018.
D. Tjøstheim. Non-linear time series and Markov chains. Advances in Applied Probability, 22(3):587-611, 1990.
J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.
P. Whittle. Estimation and information in stationary time series. Arkiv för Matematik, 2(5):423-434, 1953.
W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. CoRR abs/1409.2329, 2014.

Appendix

A Slowly varying functions
Definition A.1 (Slowly varying at infinity / zero). A positive function L is said to be slowly varying at infinity if for any u > 0, L(ux) ∼ L(x) as x → ∞. A function L is slowly varying at zero if L(1/u) is slowly varying at infinity.

Slowly varying functions serve an important purpose in the study of long memory processes by significantly expanding the class of autocovariance (for slowly varying at infinity) or spectral density (slowly varying at zero) functions described, without changing their relevant asymptotic behavior. In particular, if we define R(u) = u^ρ L(u), with ρ ∈ ℝ, then R(au)/R(u) → a^ρ as u → ∞, so that the slowly varying function can be ignored in the limit. Properties of slowly varying functions are also used to show equivalence between various notions of long memory, or establish conditions under which such notions are equivalent. For example, the following proposition relates the time-domain definition of long memory to the summability of the autocovariance function:

Proposition A.2 (Prop. 2.2.1, Pipiras and Taqqu (2017)). Let L be slowly varying at infinity and p > −1. Then

γ_k = L(k) k^p,  k ≥ 1,

implies that

Σ_{k=1}^{n} γ_k ∼ L(n) n^{p+1} / (p + 1),  as n → ∞.

The equivalence between the time and spectral domain definitions of long memory can be established under the condition that the slowly varying part is quasi-monotone.
Definition A.3 (Quasi-monotone). A slowly varying function L on [0, ∞) is quasi-monotone if it is of bounded variation on any compact interval and if there exists some δ > 0 such that

∫_0^x u^δ |dL(u)| = O(x^δ L(x)),  as x → ∞.

For a proof we refer to Pipiras and Taqqu (2017), Section 2.2.4. A thorough treatment of slowly varying functions in the context of long memory is available in Pipiras and Taqqu (2017), Sections 2.1-2.2, and Beran et al. (2013), Section 1.3.
B Short memory of common time series models
We first note that in order to show that the class of processes described by a given time series model has short memory, it is sufficient to show

Σ_{k=−∞}^{∞} |γ_X(k)| < ∞   (12)

for each process X_t belonging to the parametric family. The property Σ_k |γ_X(k)| = ∞ is implied by both the frequency and time domain definitions of long memory for a scalar process, which are themselves equivalent under the condition that the slowly varying part of the spectral density near zero is quasimonotone. Therefore, establishing (12) for a given class of models implies that they do not satisfy the definition of a long memory process.

Proposition B.1.
Let X_t be an irreducible and aperiodic Markov chain on a finite state space 𝒳 such that its corresponding transition matrix P has distinct eigenvalues. Let g : 𝒳 → ℝ, and define Y_t = g(X_t). Then Y_t is a short memory process.

Proof. Computation of the autocovariance for a finite state Markov model is classical, but we include it here for completeness. Let X_t be an irreducible and aperiodic Markov chain on the finite space 𝒳 = {1, ..., m}, and suppose that the transition matrix P (where P_ij = p(X_{t+1} = i | X_t = j)) has distinct eigenvalues. Then X_t has a unique stationary distribution, and we denote its elements p(X_t = i) = π_i.

Let X_0 have the distribution π, and define Y_t = g(X_t) for t ≥ 0 and some g : 𝒳 → ℝ. Note that Y_t is stationary since X_t is stationary. We will show that the scalar process Y_t has short memory. Write the autocovariance

γ_Y(k) = E[Y_k Y_0] − (E[Y_0])²
       = Σ_{i=1}^{m} Σ_{j=1}^{m} g(i) g(j) p^{(k)}_{ij} π_j − Σ_{i=1}^{m} Σ_{j=1}^{m} g(i) g(j) π_i π_j
       = Σ_{i=1}^{m} Σ_{j=1}^{m} g(i) g(j) π_j (p^{(k)}_{ij} − π_i),

where p^{(k)}_{ij} = p(X_{t+k} = i | X_t = j).

Since P has distinct eigenvalues, it is similar to a diagonal matrix Λ: P = Q Λ Q^{−1}, so that

P^k = Q Λ^k Q^{−1} = Σ_{i=1}^{m} λ_i^k q_i q̃_i',

where q_i denotes the i-th column of Q, q̃_i' the i-th row of Q^{−1}, and x' the transpose of x. Furthermore, from the existence of the unique stationary distribution π = (π_1, ..., π_m)' we have that Pπ = π, so that λ_1 = 1, and since P is a stochastic matrix, the corresponding left eigenvector is the all-ones row vector: [1 ... 1] P = [1 ... 1]. Thus

λ_1^k q_1 q̃_1' = π [1 ... 1] = Π,

the matrix whose i-th row has all entries equal to π_i. It follows that

P^k = Π + Σ_{i=2}^{m} λ_i^k q_i q̃_i',

and since the λ_i are distinct, |λ_i| < 1 for i = 2, ..., m. Therefore, |p^{(k)}_{ij} − π_i| < C s^k for some s ∈ (0, 1), which from above implies that |γ_Y(k)| < C' s^k. The absolute convergence of the autocovariance series then follows by comparison to the dominating geometric series.

Furthermore, as we next show, neither extension of the Markov chain to higher (finite) order nor taking (finite) mixtures of Markov chains is sufficient to obtain a long memory process. We provide a novel proof that the mixture transition distribution (MTD) model (Raftery, 1985) for high-order Markov chains defines a short memory process under conditions similar to those of the proof above.
Proposition B.2.
Let X_t be an order-p Markov chain whose transition tensor is parameterized by the MTD model

p(X_t = i | X_{t−1} = j_1, ..., X_{t−p} = j_p) = Σ_{ℓ=1}^{p} λ_ℓ Q^{(ℓ)}_{i j_ℓ},   (13)

where each Q^{(ℓ)} is a column-stochastic matrix, λ_ℓ > 0 for each ℓ = 1, ..., p, and Σ_ℓ λ_ℓ = 1. Suppose that the state space 𝒳 is finite with |𝒳| = m, and we define Y_t = g(X_t) for some g : 𝒳 → ℝ. Then Y_t is a short memory process.

Proof. In order to write the autocovariance sequence of an MTD process, we must first establish its stationary distribution. Let Q ∈ ℝ^{m^p × m^p} denote the multivariate Markov transition matrix, whose entries, indexed by the tuples (s_0, ..., s_{p−1}) and (s'_1, ..., s'_p), are

q_{(s_0,...,s_{p−1}),(s'_1,...,s'_p)} = p(X_t = s_0, ..., X_{t−p+1} = s_{p−1} | X_{t−1} = s'_1, ..., X_{t−p} = s'_p)
 = Σ_{ℓ=1}^{p} λ_ℓ Q^{(ℓ)}_{s_0 s'_ℓ}  if s_t = s'_t for t = 1, ..., p − 1, and 0 otherwise.

We make the following assumptions on Q:
• Q has distinct eigenvalues.
• Each Q^{(ℓ)} has strictly positive elements on the diagonal.

Each state of Q is reachable from all others, so Q is irreducible. The second assumption above shows that the states corresponding to the m nonzero diagonal elements of Q are aperiodic, and thus Q is aperiodic. The transition matrix Q therefore specifies an ergodic Markov chain and hence has a unique stationary distribution ξ ∈ ℝ^{m^p}; we denote by π ∈ ℝ^m its univariate marginal.

Now let (X_{−p}, ..., X_{−1}) have the distribution ξ, and define X_t according to (13) for t = 0, 1, 2, .... Then both X_t and Y_t = g(X_t) are stationary. The autocovariance γ_Y(k) can be written as

γ_Y(k) = Σ_{i=1}^{m} Σ_{j=1}^{m} g(i) g(j) π_j (p^{(k)}_{ij} − π_i),

where p^{(k)}_{ij} = p(X_{t+k} = i | X_t = j).

Observe that the transition probability p^{(k)}_{ij} can be obtained from the k-step multivariate transition matrix Q^k via

p^{(k)}_{ij} = p(X_{t+k} = i | X_t = j)
 = Σ_{s_1,...,s_{p−1}} p(X_{t+k} = i, X_{t+k−1} = s_1, ..., X_{t+k−p+1} = s_{p−1} | X_t = j)
 = Σ_{s_1,...,s_{p−1}} Σ_{s'_1,...,s'_{p−1}} q^{(k)}_{(i,s_1,...,s_{p−1}),(j,s'_1,...,s'_{p−1})} p(X_{t−1} = s'_1, ..., X_{t−p+1} = s'_{p−1}),

where q^{(k)} denotes an entry of Q^k. We note that the summation over s_1, ..., s_{p−1} is precisely the marginalization required to obtain π_i from ξ. Therefore, we can write

|p^{(k)}_{ij} − π_i| ≤ Σ_{s_1,...,s_{p−1}} | Σ_{s'_1,...,s'_{p−1}} q^{(k)}_{(i,s_1,...,s_{p−1}),(j,s'_1,...,s'_{p−1})} p(X_{t−1} = s'_1, ..., X_{t−p+1} = s'_{p−1}) − ξ_{(i,s_1,...,s_{p−1})} |.

However, for each entry of Q^k we have

|q^{(k)}_{(s_0,...,s_{p−1}),(s'_1,...,s'_p)} − ξ_{(s_0,...,s_{p−1})}| < C s^k

for some s ∈ (0, 1), by an argument analogous to the Markov chain example.
This implies

| Σ_{s'_1,...,s'_{p−1}} q^{(k)}_{(i,s_1,...,s_{p−1}),(j,s'_1,...,s'_{p−1})} p(X_{t−1} = s'_1, ..., X_{t−p+1} = s'_{p−1}) − ξ_{(i,s_1,...,s_{p−1})} | < C s^k,

since the left-hand side compares a convex combination of elements obeying the same bound with the corresponding entry of ξ. Therefore, we have

|p^{(k)}_{ij} − π_i| < Σ_{s_1,...,s_{p−1}} C s^k = C̃ s^k,

and short memory follows as in the proof of Proposition B.1.

Proposition B.3.
Define the ARMA process X_t by

φ(B) X_t = θ(B) Z_t,

where Z_t is a white noise process with variance σ² and φ(z) ≠ 0 for all z ∈ ℂ such that |z| = 1. Then X_t is a short memory process.

Proof. As in the Markov chain case, the proof is classical but included for completeness. Let X_t be defined as in the statement above. Then X_t has the representation

X_t = Σ_{j=−∞}^{∞} ψ_j Z_{t−j},

where the coefficients ψ_j are given by

θ(z) φ(z)^{−1} = ψ(z) = Σ_{j=−∞}^{∞} ψ_j z^j,

with the above series absolutely convergent on r^{−1} < |z| < r for some r > 1 (cf. Brockwell and Davis (2013), Chapter 3). Absolute convergence implies that there exist some ε > 0 and L < ∞ such that

Σ_{j=−∞}^{∞} |ψ_j| (1 + ε)^{|j|} = L,

so that there exists a K < ∞ for which |ψ_j| < K (1 + ε)^{−|j|}. The autocovariance can be expressed as

γ_X(k) = σ² Σ_{j=−∞}^{∞} ψ_j ψ_{j+|k|},

so that

|γ_X(k)| ≤ σ² Σ_{j=−∞}^{∞} |ψ_j| |ψ_{j+|k|}| ≤ σ² K² Σ_{j=−∞}^{∞} (1 + ε)^{−|j|} (1 + ε)^{−|j+|k||} ≤ C s^{|k|}

for some s ∈ (0, 1) and C < ∞. Therefore, as with the Markov models, the autocovariance sequence of an ARMA process is not only absolutely summable but also dominated by an exponentially decaying sequence.

Finally, we show that in general nonlinear state transitions are not sufficient to induce long range dependence, a point particularly relevant to the analysis of long memory in RNNs.

Proposition B.4.
Define the scalar nonlinear autoregressive process

X_{t+1} = f(X_t) + ε_t,

where {ε_t} is a white noise sequence with positive density with respect to Lebesgue measure and satisfying E|ε_t| < ∞, while f : ℝ → ℝ is bounded on compact sets and satisfies

sup_{|x| > r} |f(x)/x| < 1

for some r > 0. Then X_t has a unique stationary distribution π, and the sequence of random variables {X_t, t ≥ 0} initialized with X_0 ∼ π is strictly stationary and geometrically ergodic. Furthermore, if E|X_t|^{2+δ} < ∞ for some δ > 0, then {X_t} is a short memory process.

Proof. The proof proceeds by analysis of X_t as a Markov chain on a general state space (ℝ, ℬ), where ℬ is the standard Borel sigma algebra on the real line. Define the transition kernel P(x, B) = P(X_t ∈ B | X_{t−1} = x) for any x ∈ ℝ and B ∈ ℬ.

We first establish that X_t is aperiodic. A d-cycle is defined by a collection of disjoint sets {D_i}, i = 0, ..., d − 1, such that

1. For x ∈ D_i, P(x, D_{i+1}) = 1, i = 0, ..., d − 1 (mod d).
2. The set [∪_i D_i]^C has measure zero.

The period is defined as the largest d for which {X_t} has a d-cycle (Meyn and Tweedie, 2012). Clearly, however, since ε_t has positive density with respect to Lebesgue measure, P(x, D) = 1 only if D = ℝ up to null sets. Thus the period is d = 1, so {X_t} is aperiodic.

Strict stationarity and geometric ergodicity are established by showing that the aperiodic chain {X_t} satisfies a strengthened version of the Tweedie criterion (Meyn and Tweedie, 2012), which requires the existence of a measurable non-negative function g : ℝ → ℝ, ε > 0, R > 0, and M < ∞ such that

E[g(X_{t+1}) | X_t = x] ≤ g(x) − ε,  x ∈ K^c,
E[g(X_{t+1}) 1{X_{t+1} ∈ K^c} | X_t = x] ≤ M,  x ∈ K,

for some set K satisfying inf_{x ∈ K} Σ_{n=1}^{m} P^n(x, B) > 0. Under the conditions on f and ε_t assumed above, this criterion is established for the process X_t in Tjøstheim (1990) (Thm 4.1), with g(x) = |x|.

Geometric ergodicity implies that

‖λP^n − π‖_TV ≤ C ρ^n,

with C < ∞, ρ ∈ (0, 1), and where ‖·‖_TV denotes the total variation distance between measures. A well-known result in the theory of Markov chains (Bradley et al., 2005) establishes that geometric ergodicity is equivalent to absolute regularity, which is parameterized by

β(k) = sup (1/2) Σ_{i=1}^{I} Σ_{j=1}^{J} |P(A_i ∩ B_j) − P(A_i) P(B_j)|,

where the supremum is taken over all finite partitions {A_1, ..., A_I} and {B_1, ..., B_J} of the sigma fields 𝒜 = σ(X_t) and ℬ = σ(X_{t+k}). In particular, β(k) decays at least exponentially fast. Furthermore, for any two sigma fields 𝒜 and ℬ we have

β(𝒜, ℬ) ≥ 2 sup_{A ∈ 𝒜, B ∈ ℬ} |P(A ∩ B) − P(A) P(B)| = 2 α(𝒜, ℬ),

so that the α-mixing parameter is also bounded by an exponentially decaying sequence.

Finally, if E|X_t|^{2+δ} < ∞ for some δ > 0, then the absolute covariance obeys (Ibragimov and Linnik (1965), Thm. 17.2.2)

|γ(k)| = σ² |ρ(k)| ≤ C α(k)^{δ/(2+δ)},

which completes the proof.

C Gradient of the GSE objective
Recall that the objective function is given by

L_m(d) = log det Ĝ(d) − (2/m) Σ_{i=1}^{p} d_i Σ_{j=1}^{m} log λ_j,

with

Ĝ(d) = (1/m) Σ_{j=1}^{m} Re[ Λ_j(d)^{−1} I_{T,X}(λ_j) Λ_j^*(d)^{−1} ],
Λ_j(d) = diag(λ_j^{−d_i} e^{i(π−λ_j)d_i/2}),
I_{T,X}(λ_j) = y_j y_j^*,  y_j = (1/√(2πT)) Σ_{t=1}^{T} x_t e^{−iλ_j t},  λ_j = 2πj/T.

The derivative with respect to the element d_ℓ of the long memory vector d, for any ℓ = 1, ..., p, is

∂L_m(d)/∂d_ℓ = Tr[ Ĝ(d)^{−1} ∂Ĝ(d)/∂d_ℓ ] − (2/m) Σ_{j=1}^{m} log λ_j.

Note that the Fourier frequencies λ_j are strictly positive for j ≥ 1, so that log λ_j is well defined. For the term ∂Ĝ(d)/∂d_ℓ, note that the (h, k) element of the matrix Ĝ(d) can be written as

Ĝ(d)_{h,k} = (1/m) Σ_{j=1}^{m} Re[ I(λ_j)_{h,k} exp( (d_h + d_k) log λ_j − i (π − λ_j)(d_h − d_k)/2 ) ],

and therefore the derivative ∂Ĝ(d)/∂d_ℓ is given by

(∂Ĝ(d)/∂d_ℓ)_{h,k} =
  (1/m) Σ_{j=1}^{m} Re[ I(λ_j)_{ℓ,k} c⁻_j exp(c⁺_j d_k) exp(c⁻_j d_ℓ) ]   for h = ℓ, h ≠ k,
  (1/m) Σ_{j=1}^{m} Re[ I(λ_j)_{h,ℓ} c⁺_j exp(c⁻_j d_h) exp(c⁺_j d_ℓ) ]   for k = ℓ, h ≠ k,
  (1/m) Σ_{j=1}^{m} Re[ 2 I(λ_j)_{ℓ,ℓ} log λ_j exp(2 d_ℓ log λ_j) ]        for ℓ = h = k,
  0                                                                        otherwise,

where

c⁻_j = log λ_j − i (π − λ_j)/2,   c⁺_j = log λ_j + i (π − λ_j)/2.

Figure 3: Spectral density function of an ARFIMA(1, d, 1) process (left) and smoothed estimates of the periodogram for the first coordinate of the embedded Bible text and Bach cello suite (center and right, respectively). Cutoff points associated with four choices of the bandwidth m are plotted as vertical dashed lines; the semiparametric estimate of the long memory for each sequence is essentially a measure of the slope based on the subset of points (−2 log λ, log I(λ)) to the right of this line.

D Bias study for bandwidth parameter
Figure 3: Spectral density function of an ARFIMA(1, d, 1) process (left) and smoothed estimates of the periodogram for the first coordinate of the embedded Bible text and Bach cello suite (center and right, respectively). Cutoff points associated with four choices of the bandwidth m are plotted as vertical dashed lines; the semiparametric estimate of the long memory for each sequence is essentially a measure of the slope based on the subset of points (−log λ, log I(λ)) to the right of this line.

D Bias study for bandwidth parameter

We demonstrate the potential for semiparametric estimation to incur bias when the bandwidth parameter m is set too high relative to the overall length T of the observed sequence. The bias results from the inclusion of periodogram ordinates in the long memory estimator that capture behavior of the spectral density function away from the origin.

We give an illustration for univariate time series, which allows us to take advantage of a convenient visual interpretation of the long memory as the slope of log I(λ_j) against −log λ_j as λ_j → 0. Figure 3 shows the spectral density function corresponding to three scalar processes: an ARFIMA(1, d, 1) process with d = 0.·, a univariate projection of the embedded text from the King James Bible, and a univariate projection of the embedded Bach cello suite. For the ARFIMA process, the spectral density function can be computed exactly; for the other two sequences, it is estimated by the smoothed periodogram.

By marking the cutoff points −log λ_m associated with different choices of m, we indicate the subset of points (−log λ_j, log I(λ_j)) to the right of this cutoff that is used to compute the semiparametric estimate of d. In the scalar case, the estimate is essentially the slope of the log spectral density as λ approaches zero; it thus becomes clear that bias can be introduced when points sufficiently far from the origin are included. On the other hand, choosing m too small introduces the risk of high variance in the estimator; note for example that the estimate with m = T^{0.·} for the Bach cello suite would be strongly influenced by a single point just to the left of −log λ = 15.
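The bandwidth trade-off can also be illustrated numerically with a simple log-periodogram slope estimate, which shares the "slope near the origin" reading discussed above (this GPH-style estimator is only an illustration; the analyses in the paper use the multivariate GSE).

```python
# Illustration of the bandwidth trade-off with a GPH-style log-periodogram regression.
# The series is a short memory AR(1) process, so the true d is 0: a large m picks up
# the AR spectral peak near the origin and biases the estimate upward, while a very
# small m leaves few points and inflates the variance.
import numpy as np

def gph_estimate(x, m):
    """Slope of log I(lambda_j) on -2*log(lambda_j) over the first m Fourier frequencies."""
    T = len(x)
    lam = 2.0 * np.pi * np.arange(1, m + 1) / T
    t = np.arange(1, T + 1)
    y = np.exp(-1j * np.outer(lam, t)) @ (x - x.mean()) / np.sqrt(2.0 * np.pi * T)
    log_I = np.log(np.abs(y) ** 2)
    reg = -2.0 * np.log(lam)
    reg_c = reg - reg.mean()
    return float(reg_c @ (log_I - log_I.mean()) / (reg_c @ reg_c))

rng = np.random.default_rng(1)
T = 4096
eps = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.7 * x[t - 1] + eps[t]          # AR(1): short memory, true d = 0

for m in (16, int(np.sqrt(T)), 256, 1024):
    print(m, round(gph_estimate(x, m), 3))  # small m: noisy; large m: biased upward
```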
E Validation of total memory estimator

We compute the total memory statistic d̄ = (1/p) 1^⊤ d̂_GSE for simulated fractionally differenced Gaussian white noise sequences of dimension p = 200. We simulate four different settings for the long memory parameter:

• Zero: Each coordinate of d is equal to zero.
• Constant: Each coordinate of d is set to the same value, d = 0.25.
• Subset: 90% of the coordinates are set to zero, while the remaining 10% are set to have strong long memory with d = 0.4.
• Range: The elements of d are drawn from a scaled Beta distribution with support on (0, 0.·) and centered at 0.1.

For each setting, we simulate n = 100 sequences and compute the total memory. Results are plotted in Figure 4, while in Table 5 we compare the sample mean and variance of the estimator with the true total memory of the generating process and the asymptotic variance stated in the main paper.
Figure 4: Sample distribution of the total memory estimator d̄ in four different simulation settings (Zero d: true total memory = 0.000; Constant d: true total memory = 0.250; Subset d: true total memory = 0.040; Variable d: true total memory = 0.103).

Table 5: Comparison of the sample mean and variance of the total memory estimator with the true total memory of the generating process and the asymptotic variance of the total memory estimator (both given in parentheses).

Setting     Mean                Variance
Zero        ·×10^{−·} (0.0)     0.00801 (0.00698)
Constant    0.249 (0.25)        0.00793 (0.00698)
Subset      0.0382 (0.04)       0.00804 (0.00698)
Range       0.101 (0.1029)      0.00696 (0.00698)

In each of these four diverse simulation settings, the total memory estimator accurately recovers the true underlying parameter of the data generating process.
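A minimal simulation sketch in the spirit of this validation study is given below (assumptions: fractionally differenced noise is generated by a truncated MA(∞) filter, and the total memory is the average of the coordinates of d; dimensions are scaled down from p = 200 and T = 2^16 to keep the example fast). The estimate d̂_GSE itself would be obtained by minimizing the GSE objective sketched in Appendix C.

```python
# Sketch: simulate fractionally differenced Gaussian noise with a chosen memory
# vector d (a scaled-down "Subset" setting) via a truncated MA(infinity) filter;
# psi_k are the coefficients of (1 - B)^{-d}.
import numpy as np

def frac_diff_noise(T, d, rng, n_lags=500):
    psi = np.ones(n_lags)
    for k in range(1, n_lags):
        psi[k] = psi[k - 1] * (k - 1 + d) / k          # psi_k = Gamma(k + d) / (Gamma(d) k!)
    eps = rng.normal(size=T + n_lags)
    return np.convolve(eps, psi, mode="full")[n_lags:n_lags + T]

rng = np.random.default_rng(2)
p, T = 20, 4096                                        # the study above uses p = 200, T = 2**16
d_true = np.zeros(p)
d_true[: p // 10] = 0.4                                # 10% of coordinates with strong memory
X = np.column_stack([frac_diff_noise(T, di, rng) for di in d_true])
print("true total memory:", d_true.mean())             # 0.04
# d_hat would come from minimizing the GSE objective on X; the statistic is d_hat.mean().
```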
F Calibration of total memory vs. Wald test in high dimensions
Here we demonstrate that the standard Wald test can be badly miscalibrated in the high-dimensional regime, whereas testing for long memory with the total memory statistic remains well calibrated. Recall that, given an estimate d̂_GSE of the multivariate long memory parameter, the Wald statistic for the null hypothesis H_0 : d = 0 is computed as

t_Wald = d̂_GSE^⊤ (Ω/m)^{−1} d̂_GSE.

This quantity is distributed as a χ²(p) random variable under H_0.
Figure 5: Sample distribution of the test statistic over n = 100 trials for m = √T = 256 (top row), m = 512 (middle), and m = 1280 (bottom), for the total memory test and the Wald test. Empirical type-I errors are computed using the critical value corresponding to a nominal type-I error of 0.05; they are 0.1 (total memory) vs. 0.76 (Wald) for m = 256, 0.05 vs. 0.41 for m = 512, and 0.04 vs. 0.11 for m = 1280.
For the total memory, we compute d̄ = (1/p) 1^⊤ d̂_GSE, and in the main paper we have shown that this quantity is asymptotically distributed as N(0, 1^⊤ Ω 1 / (p² m)) when the true total memory d̄ = 0.

We simulate n = 100 realizations of length T = 2^16 from a standard Gaussian process (thus d = 0) of dimension p = 200, computing both the Wald and total memory test statistics. In Figure 5, we plot a comparison of the sample distribution of each test statistic against its asymptotic distribution over a range of values for m. For values of m close to p, we see that the empirical type-I error of the Wald test is severely inflated relative to the nominal level α = 0.05; in other words, the test spuriously rejects the null and claims to find long memory when none exists at a rate much higher than the nominal level allows. The total memory test, by contrast, largely avoids this issue, even in the case where there are barely more observations than dimensions.

Of course, with enough data, the Wald test becomes increasingly well calibrated, but this is not an easy condition to satisfy while maintaining the integrity of the statistical analysis. We have already seen in Appendix D that simply increasing m is not an option for real-world data, as this is likely to induce significant bias. On the other hand, the length T of the observed sequence would have to be enormous, even by machine learning standards, to achieve m ≫ p with the reasonable choice m = √T when the dimension p is large. Finally, even if such data were available, we would likely prefer a method that allows valid inference at lower m for computational reasons.
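For concreteness, a sketch of both test statistics is given below, assuming an estimate d_hat and the asymptotic covariance Ω/m of d_hat are available as inputs (the exact form of Ω follows the main paper and Shimotsu (2007); the two-sided normal test for the total memory is our own simplification).

```python
# Sketch of the Wald and total memory test statistics compared above; Omega is a
# placeholder input for the asymptotic covariance scaling of d_hat.
import numpy as np
from scipy import stats

def wald_and_total_memory_tests(d_hat, Omega, m, alpha=0.05):
    p = len(d_hat)
    cov = Omega / m                                   # asymptotic covariance of d_hat
    t_wald = d_hat @ np.linalg.solve(cov, d_hat)      # ~ chi^2(p) under H0: d = 0
    reject_wald = t_wald > stats.chi2.ppf(1 - alpha, df=p)

    ones = np.ones(p) / p
    d_bar = ones @ d_hat                              # total memory: average of d_hat
    var_bar = ones @ cov @ ones                       # asymptotic variance of d_bar under H0
    reject_total = abs(d_bar) > stats.norm.ppf(1 - alpha / 2) * np.sqrt(var_bar)
    return t_wald, reject_wald, d_bar, reject_total
```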
G Impact of embedding choice on long memory

We evaluate the impact of embedding choice on estimated long memory from two perspectives. First, we include a re-analysis of the Bach cello suite data using the same MFCC features as used for the Miles Davis and Oum Kalthoum recordings. This allows us to state results for long memory estimation uniformly across a single choice of embedding, and to evaluate the impact of embedding choice on the long memory analysis across two very different but informative representations of the raw time series. The results (see Table 6) show that the Bach data has long memory under both representations, though the average strength as measured by normalized total memory is somewhat variable.
Table 6: Long memory of Bach data by choice of embedding.
Embedding                             Norm. total memory    p-value        Reject H_0?
Mel-frequency cepstral coefficients   0.308                 0.003          ✓
Convolutional features                0.0997                < ·×10^{−·}    ✓
Figure 6: Histogram of normalized total memory computed from n = 100 permutations of the Penn TreeBank training data.

Second, we consider a "negative control" experiment in which we re-estimate the long memory vector for the embedded Penn TreeBank training set after permuting the sequential ordering of the data. This addresses the question of whether our positive result truly captures a sequence-dependent property of the data, or whether it could have been produced spuriously as a consequence of other decisions related to the data analysis (including, for example, the choice of embedding). We compute the total memory statistic for n = 100 random permutations of the Penn TreeBank training data. The results (see Figure 6) show that the total memory of the permuted data is concentrated near zero, with a sample mean of ·×10^{−·} and standard error 0.·; a one-sample test of the mean correspondingly fails to reject the null hypothesis H_0 : d̄ = 0, with p = 0.· .
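A sketch of this negative control is below; `estimate_total_memory` is a placeholder for the GSE-based total memory estimate described above, not a function from the paper's code.

```python
# Permutation "negative control": shuffling the time ordering of the embedded
# sequence destroys any genuine long-range dependence, so the total memory of
# permuted data should concentrate near zero.
import numpy as np

def permutation_null(X, estimate_total_memory, n_perm=100, seed=0):
    """X is a (T, p) embedded sequence; returns the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    null_stats = []
    for _ in range(n_perm):
        Xp = X[rng.permutation(X.shape[0])]   # permute time steps, keep dimensions intact
        null_stats.append(estimate_total_memory(Xp))
    return np.array(null_stats)               # compare against the statistic on unpermuted X
```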
H Classical music features from a MusicNet convolutional model

The reduced version of the MusicNet model of Thickstun et al. (2018) used to obtain an embedding for the Bach cello suite is derived from the convolutional model implemented in musicnet module.ipynb, a PyTorch interface to MusicNet available at https://github.com/jthickstun/pytorch_musicnet. We reduce the number of hidden states to 200 (this corresponds to setting k = 200 in the notebook), both for computational tractability in the optimization procedure and to achieve consistency with the embedding dimension for our natural language experiments. A schematic of how a trained network's hidden layer can serve as such an embedding is sketched after Table 7.

The model is trained on the MusicNet training corpus with no further modification of the tutorial notebook. Successful training and an informative feature mapping are indicated by the competitive average precision of the model's predictions on the test set (see Table 7), despite the reduced dimension of the hidden representation. Results for our trained model (longmem-embed) are favorable in comparison to both short-time Fourier transform (STFT) and commercial software (Melodyne) baselines, while approaching the quality of the fully learned filterbank (Learned filterbank; Thickstun et al. (2017)) and the state-of-the-art translation-invariant network (Wide-translation-invariant; Thickstun et al. (2018)).

Table 7: Performance Comparison for Models of MusicNet Data
Model Avg. Precision
STFT                         60.4
Melodyne                     58.8
longmem-embed                65.1
Learned filterbank           67.8
Wide-translation-invariant   77.3
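The general pattern of extracting a 200-dimensional hidden representation from a trained network, which is then treated as a multivariate time series for the long memory analysis, can be sketched as follows; the model below is a hypothetical stand-in, not the MusicNet architecture from the notebook.

```python
# Hypothetical sketch: map audio frames through a trained encoder's 200-unit hidden
# layer to obtain the multivariate series used for long memory estimation.
import torch
import torch.nn as nn

class TinyAudioEncoder(nn.Module):
    """Stand-in encoder producing one 200-dimensional feature vector per audio frame."""
    def __init__(self, k=200):
        super().__init__()
        self.conv = nn.Conv1d(1, k, kernel_size=128, stride=64)
        self.pool = nn.AdaptiveAvgPool1d(1)          # average over time within each frame

    def forward(self, frames):                       # frames: (n_frames, frame_len)
        h = torch.relu(self.conv(frames.unsqueeze(1)))
        return self.pool(h).squeeze(-1)              # (n_frames, k)

@torch.no_grad()
def embed_sequence(model, frames):
    model.eval()
    return model(frames).cpu().numpy()               # (n_frames, 200) embedded series

frames = torch.randn(16, 2048)                       # placeholder audio windows
X = embed_sequence(TinyAudioEncoder(), frames)
print(X.shape)                                       # (16, 200)
```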