Combining Multiple Time Series Models Through A Robust Weighted Mechanism
Ratnadip Adhikari
School of Computer and Systems Sciences Jawaharlal Nehru University New Delhi, India Email: [email protected]
R. K. Agrawal
School of Computer and Systems Sciences Jawaharlal Nehru University New Delhi, India Email: [email protected]
Abstract—Improving time series forecasting accuracy through combining multiple models is an important and dynamic area of research. As a result, various forecast combination methods have been developed in the literature. However, most of them are based on simple linear ensemble strategies and hence ignore the possible relationships between two or more participating models. In this paper, we propose a robust weighted nonlinear ensemble technique which, while combining, considers both the individual forecasts from different models and the correlations among them. The proposed ensemble is constructed using three well-known forecasting models and is tested on three real-world time series. A comparison is made between the proposed scheme and three other widely used linear combination methods, in terms of the obtained forecast errors. This comparison shows that our ensemble scheme provides significantly lower forecast errors than each individual model as well as each of the three linear combination methods.
Keywords-time series; forecasts combination; Box-Jenkins models; artificial neural networks; support vector machines
I. INTRODUCTION
Time series forecasting is a continuously growing research area of fundamental importance in many domains of business, finance, demography, science and engineering. Improving forecasting accuracy has received extensive attention from researchers during the past two decades [1, 2]. In the literature, it has been frequently observed that a suitable combination of multiple forecasts substantially improves the overall accuracy and outperforms each individual model [3, 4]. Combination strategies are intuitively the best alternatives to use when there is considerable uncertainty associated with selecting the optimal forecasting model. Moreover, combining multiple forecasts greatly reduces the errors arising from faulty assumptions, bias, or mistakes in the data. Starting with the benchmark work of Bates and Granger in 1969 [5], a number of forecast combination techniques have been developed in the literature [6–8]. The most popular among them are the weighted linear combinations, in which the weights assigned to the individual models are either equal or decided according to some rigorous mathematical rule. Some common linear forecast combination methods are the simple average, trimmed average, Winsorized average, median, error-based method, outperformance method, variance-based pooling, etc. [9, 10]. Although linear combination techniques are easy to understand and implement, they completely ignore the possible relationships among the participating models in the ensemble. This limitation has a negative effect on the forecasting accuracy of the ensemble, especially when the constituent forecasts are significantly correlated. Nonlinear forecast combination is an area in which the existing literature is quite limited and which therefore has a strong need for further development [10]. In this paper, we propose a weighted nonlinear mechanism for combining forecasts from multiple time series models.
Our proposed technique is partially motivated by the work of Freitas and Rodrigues [11]; it considers the individual forecasts as well as the correlations between pairs of forecasts while combining. Three forecasting models, viz. the Autoregressive Integrated Moving Average (ARIMA), the Artificial Neural Network (ANN) and the Support Vector Machine (SVM), are used to build up the proposed ensemble. The appropriate combination weights are determined from the performances of the individual models on the validation datasets. The effectiveness of our proposed technique is empirically tested on three real-world time series, in terms of three common error measures: the Mean Absolute Percentage Error (MAPE), the Mean Squared Error (MSE), and the Average Relative Variance (ARV). Also, the forecasting performances of the proposed ensemble scheme on all three datasets are compared with those of three popular linear combination methods. These are the simple average, the median, and a weighted linear combination in which the weights are determined from the validation MAPE values obtained by the individual forecasting models. The rest of the paper is organized as follows. Section II presents some well-known linear forecast combination techniques. Our nonlinear ensemble scheme is described in Section III. In Section IV, we discuss the three time series forecasting models which are used here to build up the proposed ensemble. Experimental results are reported in Section V and we conclude the paper in Section VI.
II. LINEAR FORECASTS COMBINATION TECHNIQUES
In a linear combination technique, the combined forecast for the associated time series is calculated through a linear function of the individual forecasts from the contributing models. Let $\mathbf{Y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}$ be the actual time series, which is to be forecasted using $n$ different models, and let $\hat{\mathbf{Y}}^{(i)} = [\hat{y}_1^{(i)}, \hat{y}_2^{(i)}, \ldots, \hat{y}_N^{(i)}]^{\mathrm{T}}$ be its forecast obtained from the $i$th model ($i = 1, 2, \ldots, n$). Then a linear combination of these $n$ forecasted series produces the combined forecast $\hat{\mathbf{Y}}^{(c)} = [\hat{y}_1^{(c)}, \hat{y}_2^{(c)}, \ldots, \hat{y}_N^{(c)}]^{\mathrm{T}}$, given by:

$$\hat{y}_k^{(c)} = f\left(\hat{y}_k^{(1)}, \hat{y}_k^{(2)}, \ldots, \hat{y}_k^{(n)}\right), \quad \forall k = 1, 2, \ldots, N,$$

where $f$ is some linear function of the individual forecasts $\hat{y}_k^{(i)}$ ($i = 1, 2, \ldots, n$; $k = 1, 2, \ldots, N$). Thus, we have:

$$\hat{y}_k^{(c)} = w_1 \hat{y}_k^{(1)} + w_2 \hat{y}_k^{(2)} + \cdots + w_n \hat{y}_k^{(n)} = \sum_{i=1}^{n} w_i \hat{y}_k^{(i)}, \quad \forall k = 1, 2, \ldots, N. \quad (1)$$

Here, $w_i$ is the weight assigned to the $i$th forecasting method. To ensure unbiasedness, it is often assumed that the weights add up to unity. Some widely used linear combination techniques are briefly described below:
• In the simple average, all models are assigned equal weights, i.e. $w_i = 1/n$ ($i = 1, 2, \ldots, n$) [9, 10].
• In the trimmed average, individual forecasts are combined by a simple arithmetic mean, excluding the worst performing $k\%$ of the models. A trimming of 10%–30% is usually recommended [9, 10].
• In the Winsorized average, the $i$ smallest and $i$ largest forecasts are respectively set equal to the $(i+1)$th smallest and $(i+1)$th largest forecasts before averaging [9].
• In median-based combining, the combination function $f$ is the median of the individual forecasts. The median is sometimes preferred over the simple average as it is less sensitive to extreme values [12, 13].
• In error-based combining, the weight assigned to each model is the inverse of a past forecast error (e.g. MSE, MAE, MAPE) of the corresponding model [3, 10].
• In the variance-based method, the optimal weights are determined through minimization of the total Sum of Squared Error (SSE) [7, 10].
III. THE PROPOSED ENSEMBLE TECHNIQUE
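For concreteness, the linear strategies of Section II can be sketched in a short Python example. This is a hedged illustration with hypothetical forecasts and our own function names (the paper's experiments were carried out in MATLAB):

```python
# Illustrative sketch (not the paper's code): linear forecast combination
# strategies from Section II, applied to hypothetical model forecasts.
import numpy as np

def combine_linear(forecasts, method="average", errors=None, trim_frac=0.2):
    """Combine an (n_models, N) array of forecasts into one combined series.

    method: 'average', 'trimmed', 'median', or 'error' (inverse-error weights).
    errors: past error of each model (e.g. validation MAPE), needed for 'error'.
    """
    F = np.asarray(forecasts, dtype=float)
    if method == "average":
        return F.mean(axis=0)
    if method == "median":
        return np.median(F, axis=0)
    if method == "trimmed":
        # drop the extreme forecasts at each time point, then average the rest
        k = int(round(trim_frac * F.shape[0] / 2))
        S = np.sort(F, axis=0)
        return S[k:F.shape[0] - k].mean(axis=0)
    if method == "error":
        w = 1.0 / np.asarray(errors, dtype=float)
        w /= w.sum()                     # weights add up to unity
        return w @ F
    raise ValueError(method)

# three hypothetical model forecasts of a 4-point series
F = [[10., 12., 11., 13.],
     [11., 11., 12., 12.],
     [30., 12., 11., 14.]]              # first value of model 3 is an outlier
median_combined = combine_linear(F, "median")   # robust to the outlier
```

The median combination illustrates the robustness argument made above: a single aberrant forecast does not contaminate the combined value, whereas it would shift the simple average.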
As mentioned earlier, a major disadvantage of a linear combination technique is that it considers only the contributions of the individual models but totally overlooks the possible relationships among them. As a result, the forecasting accuracy of a linear combination scheme is considerably reduced when two or more participating models in the ensemble are correlated. To overcome this limitation, our ensemble technique is developed as an extension of the usual linear combination which deals with the possible correlations between pairs of forecasts.
A. Mathematical Description
Here we describe our ensemble method for combining three forecasts, but it can be easily generalized. Let the actual test dataset of a time series be $\mathbf{Y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}$, with $\hat{\mathbf{Y}}^{(i)} = [\hat{y}_1^{(i)}, \hat{y}_2^{(i)}, \ldots, \hat{y}_N^{(i)}]^{\mathrm{T}}$ being its forecast obtained from the $i$th method ($i = 1, 2, 3$). Let $\mu^{(i)}$ and $\sigma^{(i)}$ be the mean and standard deviation of $\hat{\mathbf{Y}}^{(i)}$, respectively. Then the combined forecast $\hat{\mathbf{Y}}^{(c)} = [\hat{y}_1^{(c)}, \hat{y}_2^{(c)}, \ldots, \hat{y}_N^{(c)}]^{\mathrm{T}}$ of $\mathbf{Y}$ is defined as:

$$\hat{y}_k^{(c)} = w_0 + w_1 \hat{y}_k^{(1)} + w_2 \hat{y}_k^{(2)} + w_3 \hat{y}_k^{(3)} + \theta_1 v_k^{(1)} v_k^{(2)} + \theta_2 v_k^{(2)} v_k^{(3)} + \theta_3 v_k^{(3)} v_k^{(1)},$$
$$v_k^{(i)} = \frac{\hat{y}_k^{(i)} - \mu^{(i)}}{\sigma^{(i)}}, \quad i = 1, 2, 3; \; k = 1, 2, \ldots, N. \quad (2)$$

In (2), the nonlinear terms in calculating $\hat{y}_k^{(c)}$ are included to take into account the correlation effects between pairs of forecasts. The optimal weights are to be determined by minimizing the forecast SSE, given by:

$$\mathrm{SSE} = \sum_{k=1}^{N} \left( y_k - \hat{y}_k^{(c)} \right)^2. \quad (3)$$

Now, for optimization of the combination weights, we have:

$$\frac{\partial\,\mathrm{SSE}}{\partial w_i} = 0, \quad \frac{\partial\,\mathrm{SSE}}{\partial \theta_j} = 0, \qquad i = 0, 1, 2, 3; \; j = 1, 2, 3. \quad (4)$$

After computation of the associated partial derivatives in (4) and subsequent mathematical simplifications, we get the following system of linear equations:

$$\mathbf{V}\mathbf{w} + \mathbf{Z}\boldsymbol{\theta} = \mathbf{b}, \qquad \mathbf{Z}^{\mathrm{T}}\mathbf{w} + \mathbf{U}\boldsymbol{\theta} = \mathbf{d}, \quad (5)$$

where $\mathbf{w} = [w_0, w_1, w_2, w_3]^{\mathrm{T}}$, $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]^{\mathrm{T}}$, $\mathbf{V} = \mathbf{F}^{\mathrm{T}}\mathbf{F}$, $\mathbf{Z} = \mathbf{F}^{\mathrm{T}}\mathbf{G}$, $\mathbf{U} = \mathbf{G}^{\mathrm{T}}\mathbf{G}$, $\mathbf{b} = \mathbf{F}^{\mathrm{T}}\mathbf{Y}$, $\mathbf{d} = \mathbf{G}^{\mathrm{T}}\mathbf{Y}$, with

$$\mathbf{F} = \begin{bmatrix} 1 & \hat{y}_1^{(1)} & \hat{y}_1^{(2)} & \hat{y}_1^{(3)} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \hat{y}_N^{(1)} & \hat{y}_N^{(2)} & \hat{y}_N^{(3)} \end{bmatrix}_{N \times 4}, \qquad \mathbf{G} = \begin{bmatrix} v_1^{(1)} v_1^{(2)} & v_1^{(2)} v_1^{(3)} & v_1^{(3)} v_1^{(1)} \\ \vdots & \vdots & \vdots \\ v_N^{(1)} v_N^{(2)} & v_N^{(2)} v_N^{(3)} & v_N^{(3)} v_N^{(1)} \end{bmatrix}_{N \times 3}.$$
Now, after solving the matrix equations in (5), the required optimal weights are obtained as:

$$\boldsymbol{\theta}_{\mathrm{opt}} = \left( \mathbf{U} - \mathbf{Z}^{\mathrm{T}} \mathbf{V}^{-1} \mathbf{Z} \right)^{-1} \left( \mathbf{d} - \mathbf{Z}^{\mathrm{T}} \mathbf{V}^{-1} \mathbf{b} \right), \qquad \mathbf{w}_{\mathrm{opt}} = \mathbf{V}^{-1} \left( \mathbf{b} - \mathbf{Z} \boldsymbol{\theta}_{\mathrm{opt}} \right). \quad (6)$$

The optimal weights exist if and only if all the matrix inverses involved in (6) exist.
B. Determination of the Combination Weights
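As a sketch of this weight determination (an assumed Python implementation, not the authors' code): since minimizing the SSE in (3) over all seven coefficients is an ordinary least-squares problem, solving it directly is algebraically equivalent to the block solution in (6).

```python
# Sketch (assumed implementation): the weights of the nonlinear ensemble (2)
# obtained by least squares, which is equivalent to the block solution (6).
import numpy as np

def ensemble_weights(Y, Y1, Y2, Y3):
    """Return (w, theta) minimizing the SSE (3) on validation data Y."""
    V = [(Yi - Yi.mean()) / Yi.std() for Yi in (Y1, Y2, Y3)]  # v^(i) in (2)
    F = np.column_stack([np.ones_like(Y), Y1, Y2, Y3])        # linear part
    G = np.column_stack([V[0] * V[1], V[1] * V[2], V[2] * V[0]])  # pair terms
    coef, *_ = np.linalg.lstsq(np.hstack([F, G]), Y, rcond=None)
    return coef[:4], coef[4:]            # w = (w0..w3), theta = (theta1..theta3)

def ensemble_predict(w, theta, Y1, Y2, Y3, stats):
    """Combined forecast (2); `stats` holds the (mean, std) of each forecast."""
    V = [(Yi - m) / s for Yi, (m, s) in zip((Y1, Y2, Y3), stats)]
    return (w[0] + w[1] * Y1 + w[2] * Y2 + w[3] * Y3
            + theta[0] * V[0] * V[1] + theta[1] * V[1] * V[2]
            + theta[2] * V[2] * V[0])
```

On synthetic data generated by a purely linear combination, the fitted `theta` terms come out (numerically) zero, which is a quick sanity check of the formulation.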
To determine the optimal weights in our proposed ensemble technique, knowledge of the forecast SSE is required, but this is unknown in advance. So, we divide the available observations of the associated time series into a pair of training and validation subsets. The size of the validation set is chosen to be equal to the size of the out-of-sample testing dataset. The individual forecasting models are then trained on the training set and the optimal combination weights are calculated by minimizing the validation SSE. This approach to weight determination is especially suitable for time series showing regular patterns (e.g. stationary or seasonal series).
C. Our Ensemble Algorithm
Let $\mathbf{Y} = [y_1, y_2, \ldots, y_N]^{\mathrm{T}}$ be the available observations of a time series and $M_i$ ($i = 1, 2, \ldots, n$) be the $n$ forecasting models to be combined. Then the steps in our proposed ensemble algorithm are outlined below:
1) Divide the dataset $\mathbf{Y}$ into a pair of training and validation subsets, $\mathbf{Y}_{\mathrm{train}} = [y_1, y_2, \ldots, y_\alpha]^{\mathrm{T}}$ and $\mathbf{Y}_{\mathrm{validation}} = [y_{\alpha+1}, y_{\alpha+2}, \ldots, y_N]^{\mathrm{T}}$, where $\alpha = N - (\text{size of the validation set})$ and the size of the validation set equals the size of the testing set.
2) Formulate the equation for the combined forecast, as shown in (2).
3) Train each forecasting model $M_i$ ($i = 1, 2, \ldots, n$) on $\mathbf{Y}_{\mathrm{train}}$ and forecast the $\mathbf{Y}_{\mathrm{validation}}$ dataset.
4) From the minimization of the validation SSE, determine the optimal combination weight vectors $\mathbf{w}_{\mathrm{comb}}$ and $\boldsymbol{\theta}_{\mathrm{comb}}$ by using (6).
5) Use $\mathbf{w}_{\mathrm{comb}}$ and $\boldsymbol{\theta}_{\mathrm{comb}}$ to compute the combined forecasts for the testing set.
IV. THE THREE CONSTITUENT FORECASTING MODELS
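The data-splitting step of the algorithm above can be illustrated as follows (a minimal Python sketch; the function name is our own):

```python
# Sketch of the splitting step in the ensemble algorithm: the validation
# block is chosen to have the same size as the out-of-sample test set.
import numpy as np

def train_validation_split(y, test_size):
    """Split the observations so that len(validation) == test_size."""
    y = np.asarray(y)
    alpha = len(y) - test_size          # alpha = size of the training set
    return y[:alpha], y[alpha:]

y = np.arange(20)                       # hypothetical series of N = 20 points
y_train, y_val = train_validation_split(y, test_size=5)
```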
To build up our proposed nonlinear ensemble, three popular time series forecasting models are used in this paper: the Autoregressive Integrated Moving Average (ARIMA), the Support Vector Machine (SVM) and the Artificial Neural Network (ANN). Brief descriptions of these three models are presented below.
A. The ARIMA Model
The ARIMA models, developed by Box and Jenkins [2], are the most popular statistical methods for time series forecasting. An ARIMA($p$, $d$, $q$) model is given by:

$$\phi(L)(1 - L)^d y_t = \theta(L)\,\varepsilon_t, \quad (7)$$

where

$$\phi(L) = 1 - \sum_{i=1}^{p} \phi_i L^i, \qquad \theta(L) = 1 + \sum_{j=1}^{q} \theta_j L^j, \qquad L y_t = y_{t-1}.$$

Here, the model orders $p$, $d$, $q$ represent the autoregressive, degree of differencing and moving average processes, respectively; $y_t$ is the actual time series and $\varepsilon_t$ is a white noise process. In this model, a nonstationary time series is transformed to a stationary one by successively ($d$ times) differencing it [2, 4]. For real-world applications, a single differencing is often sufficient.
B. The SVM Model
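Before turning to SVMs, the autoregressive part of the model just described can be illustrated with a simplified least-squares fit. This is a hedged Python stand-in, not the full Box–Jenkins estimation procedure:

```python
# Illustrative sketch: fitting the AR part of an ARIMA(p, d, 0) model by
# ordinary least squares on a (possibly differenced) series. A simplified
# stand-in for the Box-Jenkins methodology, not the paper's code.
import numpy as np

def fit_ar(y, p, d=0):
    """Difference y d times, then fit y_t = c + sum_i phi_i * y_{t-i} + e_t."""
    y = np.asarray(y, dtype=float)
    for _ in range(d):
        y = np.diff(y)                  # successive differencing
    # lag matrix: column i holds y_{t-i}, aligned with targets y[p:]
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - i:len(y) - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef                          # [c, phi_1, ..., phi_p]

# hypothetical AR(1) series with phi = 0.8 and small Gaussian noise
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.1)
c, phi1 = fit_ar(y, p=1)                # phi1 should be close to 0.8
```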
During the past few years, SVMs have gained notable popularity in the forecasting domain. These are a class of robust statistical models, developed by Vapnik and co-workers in 1995, based on the Structural Risk Minimization (SRM) principle [14]. Time series forecasting with SVMs is a branch of Support Vector Regression (SVR), in which the principal aim is to construct an optimal hyperplane that approximates the real-valued outputs. Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, the SVM method attempts to approximate the unknown data-generating function in the form $f(\mathbf{x}) = \mathbf{w} \cdot \phi(\mathbf{x}) + b$, where $\mathbf{w}$ is the weight vector, $\phi$ is the nonlinear mapping to a higher dimensional feature space and $b$ is the bias term. The SVM regression is converted to a Quadratic Programming Problem (QPP), using Vapnik's $\varepsilon$-insensitive loss function [14, 15], in order to minimize the empirical risk. After solving the associated QPP, the optimal decision hyperplane is given by:

$$f(\mathbf{x}) = \sum_{i=1}^{N_s} \left( \alpha_i - \alpha_i^* \right) K(\mathbf{x}, \mathbf{x}_i) + b_{\mathrm{opt}}, \quad (8)$$

where $\alpha_i, \alpha_i^*$ are the Lagrange multipliers ($i = 1, 2, \ldots, N_s$), $K(\mathbf{x}, \mathbf{x}_i)$ is the kernel function, $N_s$ is the number of support vectors and $b_{\mathrm{opt}}$ is the optimal bias. In this paper, the Radial Basis Function (RBF) kernel $K(\mathbf{x}, \mathbf{y}) = \exp\left( -\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2) \right)$ is used and the proper SVM hyperparameters are selected through cross-validation.
C. The ANN Model
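Before describing ANNs, the decision function (8) can be illustrated with a small Python sketch. The support vectors and Lagrange multipliers below are hypothetical placeholders; in practice they come from solving the QPP:

```python
# Sketch of the SVM decision function (8) with an RBF kernel; multipliers
# here are hypothetical, not the result of an actual QPP solve.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def svr_predict(x, support_vectors, alpha, alpha_star, b, sigma=1.0):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x, x_i) + b, as in (8)."""
    return sum((a - a_s) * rbf_kernel(x, xi, sigma)
               for a, a_s, xi in zip(alpha, alpha_star, support_vectors)) + b

# two hypothetical support vectors in R^2
sv = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
y0 = svr_predict(np.array([0.0, 0.0]), sv, [1.0, 0.5], [0.2, 0.1], b=0.3)
```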
ANNs are a class of nonlinear, nonparametric, data-driven and self-adaptive models, originally inspired by the working mechanism of the human brain [4, 16]. Over the years, ANNs have been used as excellent alternatives to the common statistical models for time series forecasting. The most popular ANN architectures in the forecasting domain are the Multilayer Perceptrons (MLPs). They consist of a feedforward structure of three types of layers, viz. an input layer, one or more hidden layers and an output layer. The nodes in each layer are connected to those in the immediate next layer by acyclic links [16]. A single hidden layer is sufficient for most applications. It is customary to use the notation ($p$, $h$, $q$) to refer to an ANN with $p$ input, $h$ hidden and $q$ output nodes. A typical MLP architecture is shown in Fig. 1. The forecasting performance of an ANN model depends on a number of factors, e.g. the selection of a proper network architecture, training algorithm, activation functions, significant time lags, etc. [16, 17]. Unfortunately, no rigorous theoretical procedure is available in this regard and often these issues have to be resolved experimentally.
Figure 1. A typical MLP architecture
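The ($p$, $h$, $q$) feedforward computation described above can be sketched as follows. This is a Python illustration with random placeholder weights; in the paper the weights are obtained by training with resilient propagation:

```python
# Sketch of a (p, h, q) MLP forward pass with the activations used in the
# paper: logistic in the hidden layer, identity in the output layer.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_h, b_h, W_o, b_o):
    """Single-hidden-layer feedforward pass."""
    h = logistic(W_h @ x + b_h)         # hidden layer, logistic activation
    return W_o @ h + b_o                # output layer, identity activation

rng = np.random.default_rng(2)
p, h, q = 4, 3, 1                       # the (p, h, q) network notation
W_h, b_h = rng.normal(size=(h, p)), rng.normal(size=h)
W_o, b_o = rng.normal(size=(q, h)), rng.normal(size=q)
y = mlp_forward(rng.normal(size=p), W_h, b_h, W_o, b_o)
```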
In this paper, we use Resilient Propagation (RP) [18] as the network training algorithm, with the logistic and identity functions as the hidden and output layer activation functions, respectively. Cross-validation techniques are adopted, when necessary, for selecting the appropriate ANN structure for a time series.
V. EXPERIMENTAL RESULTS AND DISCUSSIONS
For empirical verification of the forecasting performance of our proposed ensemble technique, three real-world time series are used in this paper. These are the Canadian lynx, the Wolf's sunspots and the monthly international airline passengers series. All three series are available in the well-known Time Series Data Library (TSDL) [19]. The description of these three time series is presented in Table I and their corresponding time plots are shown in Fig. 2.
TABLE I. DESCRIPTION OF THE TIME SERIES DATASETS

Time Series | Description | Size
Lynx (a) | Number of lynx trapped per year in the Mackenzie River district of Northern Canada (1821–1934). | Total size: 114; Testing size: 14
Sunspots | The annual number of observed sunspots (1700–1987). | Total size: 288; Testing size: 67
Airline Passengers | Monthly number of international airline passengers (in thousands) (January 1949–December 1960). | Total size: 144; Testing size: 12

(a) The logarithms (to the base 10) of the observations are used.
The experiments in this paper are implemented in MATLAB; the neural network toolbox [20] is used for fitting the ANN models. The forecasting performances of all the models are evaluated in terms of three well-known error statistics, viz. the Mean Absolute Percentage Error (MAPE), the Mean Squared Error (MSE), and the Average Relative Variance (ARV). These are defined below:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100,$$
$$\mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} \left( y_t - \hat{y}_t \right)^2,$$
$$\mathrm{ARV} = \frac{\sum_{t=1}^{N} \left( y_t - \hat{y}_t \right)^2}{\sum_{t=1}^{N} \left( y_t - \mu \right)^2}.$$

Here, $y_t$ and $\hat{y}_t$ are the actual and forecasted observations, respectively; $N$ is the size and $\mu$ is the mean of the test set. For an efficient forecasting performance, the values of all these error measures are desired to be as small as possible.
Figure 2. Time plots: (a) lynx, (b) sunspots, (c) airline passengers
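The three error statistics can be expressed compactly in code (a straightforward Python transcription of the definitions above; `y` and `yhat` denote the actual and forecasted test-set values):

```python
# Sketch of the three error measures defined above.
import numpy as np

def mape(y, yhat):
    """Mean Absolute Percentage Error (in percent)."""
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def mse(y, yhat):
    """Mean Squared Error."""
    return np.mean((y - yhat) ** 2)

def arv(y, yhat):
    """Average Relative Variance: SSE relative to the test-set variance."""
    return np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
```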
The Canadian lynx and Wolf's sunspots are stationary time series, having approximate cycles of 10 and 11 years, respectively. Both series have been extensively studied in the literature. Following notable previous works [4, 22], in this paper we fit the ARIMA(12, 0, 0) (i.e. AR(12)) model and a (7, 5, 1) ANN to the lynx series, and the ARIMA(9, 0, 0) (i.e. AR(9)) model and a (4, 4, 1) ANN to the sunspots series. The airline time series shows a multiplicative, monthly seasonal pattern with an upward trend. The most suitable statistical model for this series is the Seasonal ARIMA (SARIMA) of order (0, 1, 1)×(0, 1, 1), a variation of the basic ARIMA model [2, 17, 21]; this model is used in this paper for the airline data. For neural network modeling, the Seasonal ANN (SANN) structure, developed by Hamzacebi in 2008 [21], is employed. For a seasonal time series with period $s$, the SANN considers an ($s$, $h$, $s$) ANN structure, $h$ being the number of hidden nodes. This model is quite simple to understand and apply, yet has been found to be very effective in modeling seasonal data [21]. In this paper, the (12, 1, 12) SANN is used for the airline time series. Our proposed ensemble mechanism is compared with three other popular linear combination techniques. These are the simple average, the median and an error-based weighted linear combination in which the weight of an individual model is the inverse of its validation MAPE value. The obtained forecasting results of all the fitted models for all three time series are depicted in Table II.
TABLE II. FORECASTING RESULTS FOR THE THREE TIME SERIES
a. Original MSE = Obtained MSE×10
The results presented in Table II show that our ensemble technique provides the lowest forecast errors among all fitted models for all three time series.
From these empirical findings, it is quite evident that a significant improvement in forecasting accuracy can be achieved by employing the proposed nonlinear ensemble technique. However, it should be noted that this technique is most suitable when the constituent forecasts are strongly correlated with each other. In this paper, the term Forecast Diagram is used to refer to a graph which shows the actual and forecasted values for a time series. The three forecast diagrams obtained through our ensemble method are presented in Fig. 3.
Figure 3.
Forecast diagrams: (a) lynx, (b) sunspots, (c) airline passengers
Model | Error Measure | Lynx | Sunspots | Airline (a)
ARIMA | MAPE | 3.277425 | 60.03853 | 3.709594
ARIMA | MSE | 0.012849 | 483.4907 | 0.041177
ARIMA | ARV | 0.070715 | 0.216361 | 0.076589
SVM | MAPE | 5.811812 | 40.43313 | 2.336608
SVM | MSE | 0.052676 | 792.9613 | 0.017689
SVM | ARV | 0.279036 | 0.715230 | 0.029238
ANN | MAPE | 2.912820 | 35.88591 | 2.577642
ANN | MSE | 0.013675 | 341.0395 | 0.019601
ANN | ARV | 0.098152 | 0.163557 | 0.033675
Simple Average | MAPE | 2.807244 | 32.74193 | 2.373179
Simple Average | MSE | 0.013812 | 379.6589 | 0.019099
Simple Average | ARV | 0.086445 | 0.251363 | 0.033634
Median | MAPE | 2.742802 | 33.42647 | 2.480301
Median | MSE | 0.013004 | 331.5190 | 0.020184
Median | ARV | 0.083436 | 0.206604 | 0.035741
Weighted Linear Comb. | MAPE | 2.714213 | 31.84739 | 2.369855
Weighted Linear Comb. | MSE | 0.013142 | 311.0158 | 0.018612
Weighted Linear Comb. | ARV | 0.082196 | 0.170267 | 0.032665
Proposed Ensemble | MAPE | 2.691642 | 30.01823 | 2.317835
Proposed Ensemble | MSE | 0.008523 | 275.7206 | 0.016017
Proposed Ensemble | ARV | 0.059014 | 0.149325 | 0.029592
VI. CONCLUSIONS
Combining forecasts from conceptually different models effectively reduces prediction errors and hence provides considerably increased accuracy. Over the years, many linear forecast combination techniques have been developed in the literature. Although these are simple to understand and implement, they are often criticized for ignoring the relationships among the contributing forecasts. The literature on nonlinear forecast combination is very limited, and so there is a need for more extensive work in this area. In this paper, we propose a robust weighted nonlinear technique for combining multiple time series models. The proposed method considers the individual forecasts as well as the correlations between forecast pairs for creating the ensemble. A rigorous algorithm is suggested to determine the appropriate combination weights. The proposed ensemble is constructed with three well-known forecasting models and is tested on three real-world time series, two stationary and one seasonal. Empirical findings demonstrate that the proposed technique outperforms each individual model as well as three other popular linear combination techniques, in terms of obtained forecast accuracies. It is to be noted that the proposed mechanism is most effective when the contributing forecasts are strongly correlated. In future, the effectiveness of the suggested method can be further explored, especially for nonstationary and chaotic time series datasets.
ACKNOWLEDGMENT
The first author gratefully thanks the Council of Scientific and Industrial Research (CSIR) for the financial assistance obtained to carry out this research work.
REFERENCES
[1] J. G. De Gooijer, R. J. Hyndman, "25 years of time series forecasting," International Journal of Forecasting, vol. 22, pp. 443–473, 2006.
[2] G. E. P. Box, G. M. Jenkins, Time Series Analysis: Forecasting and Control, California: Holden-Day, 1970.
[3] J. S. Armstrong, "Combining forecasts," in Principles of Forecasting: A Handbook for Researchers and Practitioners, J. S. Armstrong, Ed., Norwell, MA: Kluwer Academic Publishers, 2001.
[4] G. P. Zhang, "Time series forecasting using a hybrid ARIMA and neural network model," Neurocomputing, vol. 50, pp. 159–175, 2003.
[5] J. M. Bates, C. W. J. Granger, "Combination of forecasts," Operational Research Quarterly, vol. 20, pp. 451–468, 1969.
[6] R. T. Clemen, "Combining forecasts: A review and annotated bibliography," International Journal of Forecasting, vol. 5, pp. 559–583, 1989.
[7] C. Aksu, S. Gunter, "An empirical analysis of the accuracy of SA, OLS, ERLS and NRLS combination forecasts," International Journal of Forecasting, vol. 8, pp. 27–43, 1992.
[8] H. Zou, Y. Yang, "Combining time series models for forecasting," International Journal of Forecasting, vol. 20, pp. 69–84, 2004.
[9] V. R. R. Jose, R. L. Winkler, "Simple robust averages of forecasts: Some empirical results," International Journal of Forecasting, vol. 24, pp. 163–169, 2008.
[10] C. Lemke, B. Gabrys, "Meta-learning for time series forecasting and forecast combination," Neurocomputing, vol. 73, pp. 2006–2016, 2010.
[11] P. S. Freitas, A. J. Rodrigues, "Model combination in neural-based forecasting," European Journal of Operational Research, vol. 173, pp. 801–814, 2006.
[12] C. Agnew, "Bayesian consensus forecasts of macroeconomic variables," Journal of Forecasting, vol. 4, pp. 363–376, 1985.
[13] J. H. Stock, M. W. Watson, "Combination forecasts of output growth in a seven-country data set," Journal of Forecasting, vol. 23, pp. 405–430, 2004.
[14] V. Vapnik, Statistical Learning Theory, New York: Springer-Verlag, 1995.
[15] J. A. K. Suykens, J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, pp. 293–300, 1999.
[16] G. Zhang, B. E. Patuwo, M. Y. Hu, "Forecasting with artificial neural networks: The state of the art," International Journal of Forecasting, vol. 14, pp. 35–62, 1998.
[17] J. Faraway, C. Chatfield, "Time series forecasting with neural networks: a comparative study using the airline data," Applied Statistics, vol. 47, pp. 231–250, 1998.
[18] M. Riedmiller, H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. IEEE Int. Conference on Neural Networks (ICNN), San Francisco, 1993, pp. 586–591, doi: 10.1109/ICNN.1993.298623.
[19] R. J. Hyndman, Time Series Data Library (TSDL), URL: http://robjhyndman.com/TSDL/, Jan. 2010.
[20] H. Demuth, M. Beale, M. Hagan, Neural Network Toolbox User's Guide, The MathWorks: Natick, MA, 2010.
[21] C. Hamzacebi, "Improving artificial neural networks' performance in seasonal time series forecasting," Information Sciences, vol. 178, pp. 4550–4559, 2008.