Prediction of Hilbertian autoregressive processes: a Recurrent Neural Network approach
André Mas* and Clément Carré
Bionomeex, Montpellier, France
IMAG, CNRS, Univ. Montpellier, France
*Corresponding author: [email protected]
Abstract
The autoregressive Hilbertian model (ARH) was introduced in the early 90's by Denis Bosq. It was the subject of a vast literature and gave birth to numerous extensions. The model generalizes the classical multidimensional autoregressive model, widely used in time series analysis. It was successfully applied in numerous fields such as finance, industry and biology. We propose here to compare the classical prediction methodology, based on the estimation of the autocorrelation operator, with a neural network learning approach. The latter is based on a popular version of Recurrent Neural Networks: the Long Short-Term Memory networks. The comparison is carried out through simulations and real datasets.
In memory of Besnik Pumo
The contribution of Denis Bosq to functional data analysis and modeling is major for several reasons. First of all, his work has the flavor of pioneering steps. The article Bosq (1991) dates back to the very beginning of functional data. Of course, some earlier papers investigated functions as statistical observations, such as Dauxois et al. (1982) or Kleffe (1973), but these authors usually confined themselves to inference without modeling, restricting attention to first and second order moment estimation or to correlation analysis. The second interesting point comes from the model itself: the ARH(1) as defined in Bosq (1991) and soon studied by the late Besnik Pumo and Tahar Mourid in their PhD theses. It is seemingly the first model acting on functional data. The linear regression model appears only a few years later in Cardot et al. (1999). The ARH(1) paved the way towards a specific domain of FDA which has proved fruitful: functional time series and processes. The reader interested in this topic is referred for instance to Hörmann and Kokoszka (2010), Aue et al. (2015) and the references therein. Finally, the works by Denis Bosq had a clear methodological impact by introducing tools from fundamental mathematics and connecting statistics with functional analysis and operator theory.

The ARH model has been widely investigated and generalized in several directions. The underlying Hilbert space was replaced by a space of continuous functions in Pumo (1998), then generalized to Banach spaces in Mourid (2002). The autoregressive operator was extended to linear processes, with early results in Merlevède (1996). The celebrated monograph Bosq (2000) sums up the main results on the topic (see also the later Bosq (2007)). Finally, some authors proposed variants of the original ARH, by including derivatives (see the ARHD model of Mas and Pumo (2009)) or by adding exogenous variables (Damon and Guillas (2002) and their ARHX model).

The outline of the paper is the following. The ARH(1) model is introduced in the next section, as well as the classical estimation procedure. Then we summarize the Long Short-Term Memory blocks that were selected for a numerical comparison. The results of our experiments are given in the last section. We first treated a large simulated dataset, then compared two temperature datasets and finally focused on a synthetic nonlinear process.
Let $H$ be a real separable Hilbert space endowed with an inner product $\langle \cdot, \cdot \rangle$ and the derived norm $\| \cdot \|$. In the rest of the paper the space $H$ is the function space $L^2(\Omega)$, where $\Omega$ is assumed to be a real compact interval, usually $[0, T]$ for $T > 0$. The space $H$ could as well be $W^{m,2}(\Omega)$, a Sobolev space with regularity index $m$:
$$W^{m,2} = \left\{ f \in L^2(\Omega) : f^{(m)} \in L^2(\Omega) \right\}.$$

We will consider in the sequel a sample $(X_1, \ldots, X_n) \in H^n$. When $X$ is of functional nature its whole path is assumed to be observed. The expectation $\mathbb{E}X$ is a vector of $H$ whenever it exists. The covariance operator of $X$ is denoted $\Gamma$. It is a positive, symmetric linear operator from $H$ to $H$ defined by
$$\Gamma = \mathbb{E}\left[ (X - \mathbb{E}X) \otimes (X - \mathbb{E}X) \right],$$
where $u \otimes v = \langle u, \cdot \rangle v$ is the tensor product notation for rank-one operators. The operator $\Gamma$ is trace-class and self-adjoint whenever $\mathbb{E}\|X\|^2 < +\infty$.

The centered autoregressive Hilbertian model reads
$$X_{n+1} = \rho(X_n) + \varepsilon_{n+1}, \quad n \in \mathbb{Z}, \qquad (1)$$
where $(\varepsilon_n)_{n \in \mathbb{Z}}$ is a Hilbertian strong white noise and $\rho$ is a bounded linear operator acting from $H$ to $H$. The model is studied in detail in Bosq (2000). Let $\|\rho\|_{\mathcal{L}}$ denote the classical (uniform) operator norm of $\rho$. We remind the reader of this basic but crucial fact (see ibidem Theorem 3.1 p. 74): if $\|\rho\|_{\mathcal{L}} < 1$, the process $X_n$ solution of (1) is uniquely defined and stationary. In the sequel we assume that $(X_n)_{n \in \mathbb{Z}}$ is both stationary and centered.

Estimation of $\rho$ is a difficult problem. Due to the functional framework, likelihood approaches are intractable in a truly infinite-dimensional setting. It can be shown that $\rho$ is the solution of a specific inverse problem. Namely, if $D = \mathbb{E}\left[ \langle X_n, \cdot \rangle X_{n+1} \right]$ is the cross-covariance operator of order 1 of the process,
$$D = \rho \cdot \Gamma. \qquad (2)$$
The trouble with the above equation is that $\Gamma^{-1}$ does not exist unless $\Gamma$ is one-to-one. Even then it is an unbounded linear operator, though measurable, defined on a domain $\mathcal{D} \subsetneq H$. This domain is Borel-measurable and dense in $H$, but neither open nor closed. As a consequence, deriving from (2) that $\rho = D \cdot \Gamma^{-1}$ is not correct, since $\rho$ is defined on the whole of $H$ whereas $\Gamma^{-1}$ is not.

Any reasonable estimation procedure should simultaneously estimate $D$ and $\Gamma$ and regularize the latter in order to define, say, $\hat{\Gamma}^{\dagger}$, an approximation of $\Gamma^{-1}$. The estimation of $D$ and $\Gamma$ is usually simple through their empirical versions:
$$\hat{\Gamma}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \otimes X_i, \qquad \hat{D}_n = \frac{1}{n-1} \sum_{i=1}^{n-1} X_i \otimes X_{i+1}.$$
At this point note that two smoothing strategies may be applied to stabilize the previous estimates: either smoothing the data (e.g. spline smoothing or decomposition in a basis of a smooth function space) or smoothing the covariance operators only.

Approximating $\Gamma^{-1}$ is usually trickier and requires the computation of a regularized inverse, denoted $\hat{\Gamma}^{\dagger}$ above. This may be done directly by methods that are classical in inverse problem solving. For instance a ridge estimate provides $\hat{\Gamma}^{\dagger}_n = (\hat{\Gamma}_n + T_n)^{-1}$, where $T_n$ is a regularizing (Tikhonov) matrix, usually taken as $\alpha_n I$ with $\alpha_n > 0$, $\alpha_n \downarrow 0$, and $I$ denotes the identity matrix. Spectral (PCA-based) regularization involves the random eigenelements of $\hat{\Gamma}_n$, denoted $(\lambda_{i,n}, \phi_{i,n}) \in \mathbb{R}^+ \times H$, where $\hat{\Gamma}_n \phi_{i,n} = \lambda_{i,n} \phi_{i,n}$. A classical output is then
$$\hat{\Gamma}^{\dagger}_n = \sum_{i=1}^{k_n} \lambda_{i,n}^{-1} \, \phi_{i,n} \otimes \phi_{i,n},$$
where $k_n$ must be selected accurately. Following again (2), an estimate of $\rho$ then writes
$$\hat{\rho}_n = \hat{D}_n \cdot \hat{\Gamma}^{\dagger}_n.$$
The predictor is $\hat{\rho}_n(X_{n+1})$ and stems from the preceding equation. Note that the evaluation of $\hat{\rho}_n$ at $X_{n+1}$ simplifies the object under concern (the predictor is in $H$ whereas $\rho$ is an operator on $H$) and has a smoothing effect on the inverse problem mentioned above. Other results and further details may be found in Bosq (2000); Mas (2007).
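For readers who prefer code to operators, the following minimal NumPy sketch illustrates the spectral-cut (PCA) estimator described above, on curves discretized over a common grid. The function name, the choice of k and the plain discretized inner product are our own illustrative assumptions; the paper's classical predictor actually relies on the far R package mentioned later.

    import numpy as np

    def arh_spectral_predictor(X, k):
        """Spectral-cut (PCA) estimate of rho and one-step-ahead prediction.

        X : array of shape (n, T) -- n centered curves on a common grid of T points.
        k : number of eigendirections kept (the k_n of the text).
        Returns the predicted next curve, of shape (T,).
        """
        n, T = X.shape
        # Empirical covariance and lag-1 cross-covariance operators (T x T matrices).
        Gamma = X.T @ X / n
        D = X[1:].T @ X[:-1] / (n - 1)        # sum of X_{i+1} (x) X_i terms
        # Eigendecomposition of the symmetric covariance operator.
        eigval, eigvec = np.linalg.eigh(Gamma)
        order = np.argsort(eigval)[::-1][:k]  # k leading eigenelements
        lam, phi = eigval[order], eigvec[:, order]
        # Regularized inverse of Gamma restricted to the first k eigendirections.
        Gamma_dag = phi @ np.diag(1.0 / lam) @ phi.T
        # Estimated autocorrelation operator, applied to the last observed curve.
        rho_hat = D @ Gamma_dag
        return rho_hat @ X[-1]

On a uniform grid the quadrature weight of the discretized inner product cancels out in the product $\hat{D}_n \hat{\Gamma}^{\dagger}_n$, which is why it does not appear explicitly above.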
The question of predicting time series with neural networks is by no means new, see Bengio et al. (1995). When addressing the specific issue of prediction in time series, especially functional time series, Recurrent Neural Networks (RNN) appear as a natural and potentially effective solution. The basic RNN architecture links a sequential input $X_n$ (typically with a stochastic dependence between $X_n$ and its past) to an output $Y_n$ through a hidden layer $H_n$. The sequence $H_n$ is often compared with the hidden state in hidden Markov chain modeling. We refer to the beginning of Li et al. (2018) for a nice presentation of RNNs. The system is driven by the two following equations:
$$H_n = \sigma_h(A X_n + B H_{n-1}), \qquad Y_n = \sigma_y(C H_n), \qquad (3)$$
where $A$, $B$ and $C$ are matrices and $\sigma_h$ and $\sigma_y$ are two sigmoidal activation functions. Note that these matrices do not depend on the time index $n$: the same weights are shared across all steps of the sequence. This specific structure enables the hidden layer to keep a memory of the past. As a consequence RNNs were successfully applied in speech recognition and, more generally, in treating dependent data indexed by time. Numerous variants of the RNN were proposed, many of them trying to make the network deeper, see Pascanu et al. (2014).

One of the most efficient variants of the RNN is the Long Short-Term Memory unit, which tries to overcome the relative inability of plain RNNs to capture long-term dependence. LSTMs were introduced in the late 90's in Hochreiter and Schmidhuber (1997). Several tutorials may be found on the internet about LSTM. We give a sketch of the way LSTMs run but we refer the reader to Greff et al. (2017) for a formal description. A key improvement of LSTMs over RNNs relies on the addition of a cell state to the hidden state $H_n$ that appears in (3).

Figure 1: Architecture of an LSTM block/unit.

Figure 1 shows the architecture of the single-block LSTM which was used in this work. The cell state for unit $n$ is a vector denoted $C_n$. The cell state and the hidden state influence each other through three channels, also called gates. Roughly speaking, $C_n$ updates $H_n$ within the LSTM block and keeps, along the different layers, the truly important information. The three gates may be described in a few words. A first "Forget" gate sweeps off the unimportant coordinates in the new input and in the current hidden state. Then the "Input" gate aims at updating the cell state from $C_n$ to $C_{n+1}$. It applies a filter similar to the Forget gate on the concatenated vector $(H_{n-1}, X_n)$. In parallel a tanh activation function is applied to the same vector, exactly like a single-layer neural network. Then a Hadamard product (coordinate-wise multiplication) merges the two preceding vectors. The by-product is added to the cell state posterior to the Forget gate. The last step is the "Output" gate, which first scales the current $C_n$ and then filters $(H_{n-1}, X_n)$ through a last sigmoid function. The resulting two vectors are Hadamard-multiplied, simultaneously generating the output and updating $H_{n-1}$ to $H_n$.

LSTMs gave rise to numerous variants. For instance some connections may be added between the three gates mentioned above and the current value of the cell state (the so-called "peephole" connections). Conversely, the LSTM architecture may be simplified, as in the Gated Recurrent Unit (Cho et al. (2014)).
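As an illustration of the recursion (3) and of the gating mechanism, here is a minimal NumPy sketch of one forward pass through a single LSTM cell. The weight shapes and names follow generic textbook conventions, not the exact parametrization used by Keras, and the function is assumed rather than taken from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step: (X_n, H_{n-1}, C_{n-1}) -> (H_n, C_n).

        W : dict of weight matrices of shape (hidden, hidden + input) for the
            forget (f), input (i), candidate (g) and output (o) transforms.
        b : dict of bias vectors of shape (hidden,) with the same keys.
        """
        z = np.concatenate([h_prev, x])        # concatenated (H_{n-1}, X_n)
        f = sigmoid(W["f"] @ z + b["f"])       # Forget gate
        i = sigmoid(W["i"] @ z + b["i"])       # Input gate
        g = np.tanh(W["g"] @ z + b["g"])       # candidate update (tanh branch)
        c = f * c_prev + i * g                 # new cell state (Hadamard products)
        o = sigmoid(W["o"] @ z + b["o"])       # Output gate
        h = o * np.tanh(c)                     # new hidden state / output
        return h, c

Iterating this step over $n$ and feeding $H_n$ to a dense output layer recovers the recurrent structure of (3), with the cell state $C_n$ added.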
Below we consider only the one-step (functional) predictor $\hat{\rho}_n(X_{n+1})$. Keep in mind that this predictor provides a forecast of the whole path of the functional datum over its observation period. In terms of univariate time series, this finally comes down to a multi-step prediction.

After testing different strategies to evaluate our numerical results, we adopted the following methodology. First we decided not to consider the raw Mean Squared Error since it depends on the data's range and does not allow a comparison between datasets and methods. The MSE may also be hard to interpret. The Mean Absolute Relative Error (MARE) is defined as follows in our framework (we are aware that concatenating Absolute and Relative in the same acronym is not especially elegant):
$$\mathrm{MARE} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \mathrm{Abs}\left( X_i, \hat{X}_i \right),$$
where $\hat{X}_i = \hat{\rho}_{n_{tr}}(X_i)$, $n_{te}$ is the size of the testing set and
$$\mathrm{Abs}\left( X_i, \hat{X}_i \right) = \sum_{j=1}^{T} \frac{\left| X_i(t_j) - \hat{X}_i(t_j) \right|}{\left| X_i(t_j) \right|}.$$
The integer $T$ is the (time-)grid size and the $t_j$'s are the discretization times. One of the problems with the above definition is that the denominator may be null or very small. In order to avoid this problem all datasets were normalized beforehand.

The simulations were carried out with the freqdom.fda package. We had the opportunity to process a rather large dataset this way. The data were generated by the fts.rar function. They follow a centered ARH(1) process with Gaussian white noise in a Fourier basis of dimension $2D + 1$. This means that each datum $X_i$ obeys (1) and is developed as a series made of a constant function and $D$ harmonics such as
$$X_i(t) = a_0^{(i)} + \sum_{k=1}^{D} \left\{ a_k^{(i)} \cos(2\pi k t) + b_k^{(i)} \sin(2\pi k t) \right\},$$
where the $a_k^{(i)}$ and $b_k^{(i)}$ are sequences of real random variables. In order to ensure stationarity, 50 burn-in iterations of the processes were conducted. The scenarii for the simulations depend on:
• three different values for the Fourier basis dimension $D$,
• two different autocorrelation operators $\rho$, described just below.
The default autocorrelation operator of fts.rar was used. It is a large dimensional matrix whose row $i$-column $j$ cell $\rho_{i,j}$ is proportional to $\exp(-|i - j|)$, hence rapidly decreasing away from the diagonal. We also investigated the situation where the cells decrease more slowly: $\rho_{i,j} \propto |i - j|^{-2}$. In both cases the Hilbert-Schmidt norm of $\rho$ was fixed to the same value. The sample sizes were $n_{tr} = 600$, $n_v = 200$ and $n_{te} = 200$ respectively for training, validation and test.

In the classical approach, the data matrix was processed as an fdata object by the far function of the package far by S. Guillas and J. Damon (Damon and Guillas (2015)). This package carries out the estimation of $\rho$ by the spectral-cut (PCA) methodology and the prediction. The cross-validation step correctly determines the optimal value of $k_n$, around $2D + 1$ in all situations.

The neural network part was conducted under Keras (see Chollet et al. (2015)) with TensorFlow 2.0 as backend. The LSTM unit is followed by a dense layer whose output size equals the grid size (here 500). The learning rate for gradient descent was fixed to 1e-4. The training step is stopped when the MARE does not decrease anymore over 5 consecutive epochs on the validation set. The best epoch is then used for the testing step. The LSTM was run by taking into account the data in a sliding window of varying size (denoted SWS below). Here, since the data are simulated according to an ARH(1), the optimal SWS is 1. In the case of an ARH(p) it would obviously be p.
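A minimal Keras/TensorFlow sketch of the network just described might look as follows. Only the LSTM-plus-dense structure, the output size of 500, the 1e-4 learning rate and the 5-epoch stopping rule come from the text; the hidden size, the Adam optimizer, the MSE training loss, the batch size and the data shapes are our own assumptions.

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    GRID = 500   # grid size of the functional data, as in the simulated example
    SWS = 1      # sliding window size: number of past curves fed to the LSTM

    def mare(y_true, y_pred):
        """Mean Absolute Relative Error, summed over the time grid as in the text."""
        return tf.reduce_mean(
            tf.reduce_sum(tf.abs(y_true - y_pred) / tf.abs(y_true), axis=-1))

    model = keras.Sequential([
        layers.LSTM(64, input_shape=(SWS, GRID)),  # single LSTM block (hidden size is our guess)
        layers.Dense(GRID),                        # dense output layer, one value per grid point
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse", metrics=[mare])

    # Stop when the validation MARE has not improved for 5 consecutive epochs
    # and keep the weights of the best epoch, as described in the text.
    stop = keras.callbacks.EarlyStopping(monitor="val_mare", mode="min",
                                         patience=5, restore_best_weights=True)

    # X_train, y_train, X_val, y_val are assumed to be arrays of shapes
    # (n, SWS, GRID) and (n, GRID): each target curve is predicted from the
    # SWS curves preceding it.
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=200, batch_size=16, callbacks=[stop])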
Table 1 displays the MARE values. We notice first that the structure of the autoregression operator (exponential or power 2) does not seem to have a clear impact. The MARE generally decreases when the latent dimension $D$ increases. A penalization term should certainly be added to balance this side effect. Recall however that our goal here is to compare two methodologies. The ARH model was always optimally calibrated and provides the best results, which is not surprising. We checked that the MARE decreases, logically, when the noise level in the model shrinks. Conversely, the LSTM cell was not specifically designed for these data. The gap is not wide and seems rather promising in view of applications to real data.
Figure 2: Sample of size 5 from the simulated dataset, with D = 21 and ρ of exponential type.

Table 1: MARE for the statistical predictor versus LSTM on the simulated datasets (columns: ρ type, effective dimension 2D + 1, statistical predictor MARE, SWS, LSTM MARE).

Figure 3: The Sea Surface Temperature in the El Nino dataset.
The El Nino dataset is one of the first studied in the framework of dependent functional data (see e.g. Besse et al. (2000)). Our version comes from the rainbow package in R. It provides the sea surface temperature from January 1950 to December 2018, observed monthly. The bunch of curves is plotted in Figure 3.

Out of 69 curves, 40 were used for training, 15 for validation and 14 for testing. The modest size of the dataset restricted our study to SWS of sizes 1 and 2 only. The summary of the MARE is given in Table 2.

Stat. pred. MARE   SWS   LSTM MARE
0.226              1     0.301
                   2     0.308

Table 2: El Nino dataset: MARE for the statistical predictor versus LSTM.

The LSTM is again outperformed by the statistical predictor, but the MARE range, above 20%, is not good. At this point we must mention that we were faced with two main numerical issues concerning this meteorological dataset.

First of all, even if we do not aim here at proving (again) global warming, it seems that this fact could be retrieved by comparing the first ten and the last ten curves, as plotted in Figure 4. The first ten are black solid, the last ten are red dashed. It is plain that the sea temperature over the first six months of observation tends to be higher in recent years. As a consequence the basic stationarity assumption on the data is not clearly fulfilled.

Figure 4: The first ten (black solid) and last ten (red dashed) curves in the El Nino dataset.

Second, we need to underline the problems encountered when applying the usual training/validation/testing strategy to such dependent functional data. As explained earlier, the training set is separated from the testing set by a validation interval containing 15 years of data. This strategy is clearly more sensitive to potential non-stationarity or to slight perturbations of the model than the situation where the sample is i.i.d., and it results in potential overfitting. Ideally, validation, training and test should be performed continuously along the sample, but the model is not adapted to such strategies. Even if these results are not given here, we noticed a substantial improvement of the MARE when using only a training and a testing set (without folding) plus a simple grid search on $k_n$.

It may be interesting to compare the popular El Nino file with another temperature dataset, retrieved freely from the web and covering a period starting in 1985.
The temperatures are recorded hourly in the city of Bale, Switzerland. We considered the daily aggregated data (the daily mean was used) in order to drastically reduce the ratio between the ambient dimension and the sample size. The data matrix has n = 35 rows (one curve per year), hence half of El Nino's sample size, but the time frequency is the day (against the month). A sample of curves is given in Figure 5.

Figure 5: A sample of 4 curves from the Bale temperature dataset. All data were picked in the testing set.

The learning strategy was similar to that of El Nino. The prediction error is provided in Table 3. Learning and calibration are performed on curves 1 to 30 and prediction on curves 31 to 35. The optimal dimension choice for the ARH predictor is $k_n = 3$. Conversely to El Nino, the high sampling frequency of the data allowed us to explore SWS from 1 to 5. It is noticeable that the MARE values lie between 10% and 15% and improve on El Nino. The statistical predictor is again slightly better than the LSTM. All this tends to prove that the ARH model is really competitive for these temperature datasets.

Stat. pred. MARE   SWS   LSTM MARE
0.116              1     0.245
                   2     0.133
                   3     0.126
                   4     0.140
                   5     0.125

Table 3: Bale temperature dataset: MARE for the statistical predictor versus LSTM.

Following the remark of a referee, we investigated a situation which is less favorable to the ARH predictor and simulated a basic nonlinear functional autoregressive process. Start from a basic ARH equation simulating $X_n = \rho(X_{n-1}) + \varepsilon_n$. Then construct the nonlinear process the following way:
$$X_{n.ln}(t) = 3 \cos\left( 10\pi \cdot X_n(t) \right) - \exp\left( -X_n(t) \right).$$
A sample of four successive curves is plotted in Figure 6. For a fair comparison with previous results, the dimension and sample size are the same as in Section 4.1, respectively 500 and 1000.

Figure 6: Plot of $X_{n.ln}$: overview of the nonlinear ARH process, simulated on a grid of 500 time points; the whole sample size is 1000.
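A minimal NumPy sketch of such a construction is given below: a linear ARH(1) recursion is simulated on a grid and the nonlinear map is then applied pointwise. The Gaussian-kernel choice for ρ, its norm, the noise level and the burn-in length are illustrative assumptions, not the exact simulation design behind Table 4.

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 500, 1000                      # grid size and sample size, as in Table 4
    t = np.linspace(0.0, 1.0, T)

    # A simple integral operator with a Gaussian kernel, rescaled so that its
    # operator norm stays below 1 (stationarity condition of the ARH model).
    K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.02) / T
    K *= 0.5 / np.linalg.norm(K, 2)

    # Linear ARH(1) recursion X_n = rho(X_{n-1}) + eps_n with smoothed noise,
    # preceded by a burn-in to approach stationarity.
    X = np.zeros(T)
    curves = []
    for i in range(50 + n):
        eps = np.convolve(rng.normal(scale=0.2, size=T), np.ones(10) / 10, mode="same")
        X = K @ X + eps
        if i >= 50:
            curves.append(X)
    X_lin = np.array(curves)              # shape (n, T)

    # Nonlinear transform applied pointwise, as in the displayed equation.
    X_nl = 3.0 * np.cos(10.0 * np.pi * X_lin) - np.exp(-X_lin)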
Stat. pred. MARE   SWS   LSTM MARE
1.235              1     0.647
                   2     0.661
                   5     0.655

Table 4: Nonlinear autoregressive process: MARE for the statistical predictor versus LSTM.

The highly nonlinear behaviour of $X_{n.ln}$ is confirmed by the results in Table 4. A "quick and dirty" search gives an optimal $k_n$ around 70. The MARE values are very high for this synthetic dataset, which is close to a white noise. Despite this, the LSTM performs almost twice as well as the statistical predictor.

This work attempts to compare the historical/statistical approach to prediction in ARH models with a neural network approach centered on LSTMs. Data and code are available at https://gitlab.com/arh-lstm/. Several facts should be underlined in order to show the limits of our results.

• We did not study here the impact of the sampling frequency, i.e. the size of the discretization grid of the functional data. We noticed however some improvement between the El Nino and the Bale datasets; on this basis alone nothing solid should be stated. We could also have focused on the effect of the sample size or of the norm of the operator $\rho$ on the accuracy of the results.

• The architecture used here is simplistic since it is based on a single LSTM block. Introducing some depth by adding several layers of LSTM should certainly improve the predictions on the simulated dataset. El Nino is certainly not suited to a sequence of cells.

• We used the discretized version of the functional data, which comes down to a large dimensional input vector (up to size 500 here). Clearly, feeding the network with the Fourier coefficients instead leads to a more compact input and paves the way to another approach.

Our framework was centered on the functional autoregressive process of order 1 and may be restrictive in some way. The design of LSTMs is general enough to foster a wider investigation: autoregressive processes of order p > 1, for instance.
References
Aue, A., Norinho, D. D., and Hörmann, S. (2015). On the prediction of stationary functional time series. Journal of the American Statistical Association, 110(509):378–392.

Bengio, S., Fessant, F., and Collobert, D. (1995). A connectionist system for medium-term horizon time series prediction. In Proc. Intl. Workshop Application Neural Networks to Telecoms, pages 308–315.

Besse, P. C., Cardot, H., and Stephenson, D. B. (2000). Autoregressive forecasting of some functional climatic variations. Scandinavian Journal of Statistics, 27(4):673–687.

Bosq, D. (1991). Modelization, nonparametric estimation and prediction for continuous time processes. In Nonparametric Functional Estimation and Related Topics, pages 509–529. Springer Netherlands.

Bosq, D. (2000). Linear Processes in Function Spaces. Springer New York.

Bosq, D. (2007). General linear processes in Hilbert spaces and prediction. Journal of Statistical Planning and Inference, 137(3):879–894.

Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics and Probability Letters, 45(1):11–22.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Moschitti, A., Pang, B., and Daelemans, W., editors, EMNLP, pages 1724–1734. ACL.

Chollet, F. et al. (2015). Keras. https://keras.io.

Damon, J. and Guillas, S. (2002). The inclusion of exogenous variables in functional autoregressive ozone forecasting. Environmetrics, 13(7):759–774.

Damon, J. and Guillas, S. (2015). far: Modelization for Functional AutoRegressive Processes. R package version 0.6-5.

Dauxois, J., Pousse, A., and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference. Journal of Multivariate Analysis, 12(1):136–154.

Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Hörmann, S. and Kokoszka, P. (2010). Weakly dependent functional data. Annals of Statistics, 38(3):1845–1884.

Kleffe, J. (1973). Principal components of random variables with values in a separable Hilbert space. Mathematische Operationsforschung und Statistik, 4(5):391–406.

Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y. (2018). Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. IEEE.

Mas, A. (2007). Weak convergence in the functional autoregressive model. Journal of Multivariate Analysis, 98(6):1231–1261.

Mas, A. and Pumo, B. (2009). Functional linear regression with derivatives. Journal of Nonparametric Statistics, 21(1):19–40.

Merlevède, F. (1996). Central limit theorem for linear processes with values in a Hilbert space. Stochastic Processes and their Applications, 65(1):103–114.

Mourid, T. (2002). Estimation and prediction of functional autoregressive processes. Statistics, 36(2):125–138.

Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).

Pumo, B. (1998). Prediction of continuous time processes by C[0,1]-valued autoregressive process.