Forecasting Commodity Prices Using Long Short-Term Memory Neural Networks
Racine Ly
AKADEMIYA2063, Kigali, Rwanda — [email protected]
Fousseini Traore
International Food Policy Research Institute, Dakar, Senegal — [email protected]
Khadim Dia
AKADEMIYA2063, Kigali, Rwanda — [email protected]
January 18, 2021

Abstract
This paper applies a recurrent neural network (RNN) approach to forecast cotton and oil prices. We show how these new tools from machine learning, particularly Long Short-Term Memory (LSTM) models, complement traditional methods. Our results show that machine learning methods fit the data reasonably well but do not systematically outperform classical methods such as Autoregressive Integrated Moving Average (ARIMA) models in terms of out-of-sample forecasts. However, averaging the forecasts from the two types of models provides better results than either method alone. Compared to the ARIMA and the LSTM, the Root Mean Squared Error (RMSE) of the average forecast was 0.21 and 21.49 percent lower, respectively, for cotton. For oil, forecast averaging does not provide improvements in terms of RMSE. We suggest using a forecast averaging method and extending our analysis to a wide range of commodity prices.
1 Introduction

Forecasting commodity prices is paramount for many economic actors. When building budgets, experts at the government level rely on growth projections, which are almost always based on underlying forecasts of the primary commodities exported by the country. Oil-dependent countries are a typical example of this kind of scenario. Many developing countries depend on a few raw materials (agricultural and mineral), the prices of which determine the growth rate. Being able to forecast commodity prices is vital for other public entities or parastatals too. In many developing countries, parastatals manage stabilization funds aimed at smoothing commodity price movements. These entities need world price forecasts in order to fix producers' prices for the current campaign. For researchers, knowing the best data generating process and the forecast errors is essential for various modeling purposes. For instance, in agricultural models involving expectations (the widely used Nerlove model falls in that category), the expected price is generally given by an ARMA model (Antonovitz and Green, 1990; De Janvry and Sadoulet, 1995). As we will show later, LSTM networks provide a good alternative to ARMA models.

Since the seventies, Box-Jenkins (1970) approaches have been popular in forecasting time series, although one can mention earlier work by Tinbergen (1939) and Klein (1950). These approaches have established ARMA models and their extensions as the cornerstone of forecasting tools. However, machine learning methods that can handle time-series data and perform forecasting have grown over the last three decades. Among these methods, those based on Recurrent Neural Networks (RNN) are particularly interesting, as they can carry over information (memory) from previous periods into the future. Early papers using RNNs to forecast time series include Kamijo and Tanigawa (1990), Chakraborty et al. (1992), and a comparison with ARIMA models by Kohzadi et al. (1996). However, one issue encountered with the first generation of RNNs is the so-called vanishing gradient problem for highly dependent (long-memory) data. Thus, in a seminal paper, Hochreiter and Schmidhuber (1997) proposed a new approach called Long Short-Term Memory (LSTM), with the capacity to filter out which information from the past should be processed and retained (see section 2 for a full description of LSTM networks).
This triggered a new literature on forecasting time series (Gheyas and Smith, 2011; Khandelwal et al., 2015; Kumar et al., 2018). Similarly, other classes of machine learning methods, such as Support Vector Machine (SVM) regressions, have been developed and applied to time-series forecasting, including hybrid approaches with ARIMA models (Pai and Lin, 2005).

Our objective in this paper is to focus on LSTM models, as they present a series of interesting characteristics. First, they are non-parametric, so they can suitably handle non-linear patterns. Second, they do not require the error term to follow a specific distribution. Third, LSTM models do not require the underlying data to follow a stationary process, so they are not affected by unit roots. These three features present an interesting advantage over regression-based methods, be they ARIMA or not. Using monthly data on cotton and oil prices, our results suggest that the LSTM model fits the data reasonably well but does not outperform the ARIMA models for out-of-sample forecasts. However, a combination of forecasts from the two models yields better results for the cotton dataset than either approach.

The remainder of this paper is organized as follows. We first present in section 2 the traditional Recurrent Neural Networks (RNN) and then the LSTM approach proposed to solve the vanishing gradient problem of the first generations of RNNs. In section 3, we present the dataset used. Section 4 presents the computational aspects, while the results of the LSTM approach applied to oil and cotton prices are highlighted in section 5. In section 6, we compare the LSTM approach with ARIMA models. The seventh section presents the forecast averaging approach, and our conclusions are discussed in section 8.

2 Artificial and Recurrent Neural Networks

An artificial neural network (ANN) is a supervised learning technique within the machine learning set of models. It is used for both regression and classification tasks, depending on the type of data used and the modeling purpose. An artificial neural network comprises an input layer that receives the different input features, at least one hidden layer that processes the dataset's hidden characteristics, and an output layer that yields the network's forecasts. Each layer has several units called "neurons," which receive information from preceding neurons and send processed information to the following ones. The input layer has a number of neurons equal to the number of the dataset's features, and each layer is augmented with a so-called "bias" neuron whose role is to facilitate the algorithmic formulation. The number of hidden layers, which is one of the so-called hyper-parameters, is set by the modeler through a fine-tuning process. The number of neurons in the output layer is equal to the number of expected outputs from the network.

Within an artificial neural network, each neuron communicates with all the neurons from the preceding layer and with those from the following layer (aside from the last layer) through weights attached to each connection.
The activations are computed by applying an activation function (usually a sigmoid or hyperbolic tangent function, among others used in the literature) to the product of a weight vector, which controls the mapping from a layer l to the following layer l + 1, and the activations of the previous layer.

The learning process of an artificial neural network is performed using the backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986): the optimization process through which the weights representing the connections between neurons are updated, using the chain rule to compute the gradient of the loss function with respect to the weights. A gradient descent rule, or a related version, is used to update their values. The weights are randomly initialized using a uniform distribution, the bias units are set to zero, and the initial activations of the first layer are equal to the dataset's features.

The backpropagation procedure is as follows. A forward pass is performed by computing the "cache" values used as arguments to evaluate the activations of the layers. For an L-layer neural network, the k-th layer uses the following constitutive equations to yield the cache vector $z^{[k]}$ and the activation vector $a^{[k]}$ for each neuron and layer:

$$z^{[k]} = \eta^{[k]} a^{[k-1]} + b^{[k]} \qquad (1)$$

$$a^{[k]} = g^{[k]}\left(z^{[k]}\right) \qquad (2)$$
In equations (1) and (2), $\eta^{[k]}$ is the vector of weights of all connections going from neurons at layer $k-1$ to neurons at layer $k$, and $g^{[k]}$ is the activation function applied to neurons at layer $k$. The network's last activation corresponds to the network's predicted value, which is compared with the actual value of the response variable provided by the training dataset. A backward pass is then applied to the artificial neural network to compute the gradients of the cost function with respect to the weights assigned to each connection. The residual $\delta^{[L]} = a^{[L]} - y$ at the last layer, between the network predictions $a^{[L]}$ and the actual values $y$, is backpropagated into the network, and the residuals associated with the preceding layers are computed as follows:

$$\delta^{[k]} = \left(\eta^{[k]}\right)^{T} \delta^{[k+1]} \circ g^{[k]\prime}\left(z^{[k]}\right) \qquad (3)$$

The gradient of the cost function $J$ for the weight between two nodes $i$ and $j$ at layer $k$, denoted $\eta^{[k]}_{ij}$, is equal to the product of the activation of node $j$ at layer $k$ and the residual of node $i$ at layer $k+1$ computed with equation (3):

$$\frac{\partial J(\eta)}{\partial \eta^{[k]}_{ij}} = a^{[k]}_{j} \delta^{[k+1]}_{i} \qquad (4)$$

For an L-layer artificial neural network with $m$ examples in the training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, an iterative process aggregates the gradients in an accumulator using equation (4). The procedure is as follows:
Algorithm 1: Backpropagation

for i = 1 to m do
    Set $a^{[1]} = x^{(i)}$
    Forward propagation to compute $a^{[k]}$ for $k = 2, \ldots, L$
    Compute $\delta^{[L]} = a^{[L]} - y^{(i)}$
    Compute $\delta^{[L-1]}, \ldots, \delta^{[2]}$ with equation (3)
    $\Delta^{[k]}_{ij} := \Delta^{[k]}_{ij} + a^{[k]}_{j} \delta^{[k+1]}_{i}$
end for
$D^{[k]}_{ij} = \frac{1}{m}\left(\Delta^{[k]}_{ij} + \lambda \eta^{[k]}_{ij}\right)$ if $j \neq 0$ (non-bias node)
$D^{[k]}_{ij} = \frac{1}{m}\Delta^{[k]}_{ij}$ if $j = 0$ (bias node)
$\frac{\partial J(\eta)}{\partial \eta^{[k]}_{ij}} = D^{[k]}_{ij}$
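To make the procedure concrete, the forward pass of equations (1)-(2) and the backward pass of equations (3)-(4) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the layer sizes, the sigmoid activation, and the toy data are illustrative assumptions, one training example is processed, and the regularization term $\lambda$ is omitted.

```python
# Minimal NumPy sketch of equations (1)-(4); sizes and data are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [3, 5, 1]                       # input, hidden, output neurons
# In this 0-indexed list, eta[k] maps layer k to layer k + 1;
# weights drawn from a uniform distribution, biases set to zero.
eta = [rng.uniform(-0.5, 0.5, (sizes[k + 1], sizes[k]))
       for k in range(len(sizes) - 1)]
b = [np.zeros((n, 1)) for n in sizes[1:]]

x = rng.normal(size=(3, 1))             # one training example
y = np.array([[1.0]])

# Forward pass: z = eta a + b, a = g(z)   (equations 1-2)
a, zs = [x], []
for W, bias in zip(eta, b):
    z = W @ a[-1] + bias
    zs.append(z)
    a.append(sigmoid(z))

# Backward pass: output residual, then equation (3) layer by layer.
delta = a[-1] - y                       # delta at the last layer
grads = [None] * len(eta)
for k in reversed(range(len(eta))):
    grads[k] = delta @ a[k].T           # equation (4) in matrix form
    if k > 0:
        delta = (eta[k].T @ delta) * sigmoid_prime(zs[k - 1])  # equation (3)

print([g.shape for g in grads])         # gradient shapes match the weights
```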
Artificial neural networks are suitable for numerical and qualitative datasets in regression and classification tasks. However, for sequential datasets, where inputs and outputs are sequences, ANNs are not adequate for two main reasons: (i) inputs and outputs need to have the same length, which may not be the case in sequential datasets; and (ii) ANNs cannot share features learned across different positions in a sequence. Recurrent Neural Networks (RNN) have been introduced to overcome these limitations.

The primary purpose of Recurrent Neural Networks is to deal with datasets whose inputs and outputs are sequences. RNNs are built on the same ideas and architecture as artificial neural networks. The main difference is that an RNN uses both historical and current sequences to predict the outcome at the current time step. This confers on RNNs a real advantage in predicting time-sensitive sequential data such as time series.

For a sequential labeled dataset $(X, Y)$, where $X \in \mathbb{R}^{T_x}$ and $Y \in \mathbb{R}^{T_y}$, and $T_x$ and $T_y$ are the numbers of sequences in the input and label data respectively, the forward pass within an RNN is governed by the following equations:

$$a^{\langle t \rangle} = g\left(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a\right)$$

$$\hat{y}^{\langle t \rangle} = g\left(W_{ya} a^{\langle t \rangle} + b_y\right)$$

where $a^{\langle t \rangle}$ is the activation (hidden state) at time step $t$, $x^{\langle t \rangle}$ the input at time step $t$, and $W_{aa}$, $W_{ax}$, $W_{ya}$, $b_a$, and $b_y$ are weight matrices and bias vectors shared across time steps.

For long sequences, however, the gradients propagated backward through many time steps can vanish, which prevents a standard RNN from learning long-range dependencies. The LSTM architecture addresses this problem by maintaining a memory cell $c^{\langle t \rangle}$ whose content is controlled by gates. At each time step, a candidate value $\tilde{c}^{\langle t \rangle}$ for the memory cell is computed as

$$\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right)$$

and the update, forget, and output gates are obtained from the same inputs through sigmoid activations:

$$\Gamma_u = \sigma\left(W_u\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right), \quad \Gamma_f = \sigma\left(W_f\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_f\right), \quad \Gamma_o = \sigma\left(W_o\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_o\right)$$

The memory cell and the activation are then updated as

$$c^{\langle t \rangle} = \Gamma_u \circ \tilde{c}^{\langle t \rangle} + \Gamma_f \circ c^{\langle t-1 \rangle}, \qquad a^{\langle t \rangle} = \Gamma_o \circ \tanh\left(c^{\langle t \rangle}\right)$$

The gates allow the network to decide, at each time step, which information from the past should be retained in memory and which should be discarded, which solves the vanishing gradient problem of the first generations of RNNs.
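As an illustration of these constitutive equations, one step of an LSTM cell can be sketched in NumPy as follows. The dimensions, random weights, and toy two-step sequence are our own illustrative choices, not settings from the paper.

```python
# Minimal NumPy sketch of a single LSTM step under the standard gate
# equations above; dimensions and inputs are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_a, n_x = 4, 1                          # hidden units, input features
rng = np.random.default_rng(0)
# One weight matrix per gate, acting on the stacked [a_prev; x_t] vector.
Wf, Wu, Wc, Wo = (rng.normal(scale=0.1, size=(n_a, n_a + n_x))
                  for _ in range(4))
bf = bu = bc = bo = np.zeros((n_a, 1))

def lstm_step(a_prev, c_prev, x_t):
    concat = np.vstack([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # new memory cell
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    a_t = gamma_o * np.tanh(c_t)                 # new activation
    return a_t, c_t

a, c = np.zeros((n_a, 1)), np.zeros((n_a, 1))
for x_t in [np.array([[0.3]]), np.array([[0.7]])]:   # toy length-2 sequence
    a, c = lstm_step(a, c, x_t)
print(a.ravel())
```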
3 Data

We use international prices for two commodities: one agricultural product (cotton) and one mineral (oil). All the data are taken from the World Bank commodity prices dataset. The cotton price used is the Cotton A Index, the most common index used to represent offering prices on the international cotton market. It corresponds to the average of the cheapest five quotations among eighteen varieties of upland cotton traded internationally. The base quality of the index is Middling 1-3/32". For crude oil, we use the average spot price of Brent, Dubai, and West Texas Intermediate, the three leading indices representing the market. All prices are observed monthly. The cotton and oil datasets have 708 data points each, covering January 1960 to December 2018.

4 Computational Aspects

Sequential modeling for time-series forecasting with the LSTM framework requires transforming the dataset into features and labels for a supervised learning process. To that end, the sliding window technique is used across the cotton and oil datasets. It consists of shifting a window with a fixed width, also called the lookback parameter, from left to right across the dataset. The window moves one unit at a time, and the data points within its range are taken as an example of the input features. The next value at the right edge of the window is considered the label at each step; in other words, it is the expected output produced by the data within the sliding window.

Figure 2: Illustration of the sliding window technique to create input features and labels from time-series datasets. The dataset is read from left to right, and each increment of the window movement is considered an example of the training process.

The initial dataset is divided into two parts: the training set, which is used to train the LSTM model, and the test set, used to assess the model's accuracy. For the latter, only the inputs are fed into the model to produce forecasts, which are compared with the actual values. The sliding window technique is applied to both the training and test sets.

A ratio of 70 to 30 percent is used to generate the training and test sets for the oil and cotton datasets. This choice is motivated by the amount of data available in the initial database. The split must avoid two main known issues: first, an overtrained model, which could yield overfitting and poor generalization to new inputs; second, a lack of enough testing data points to assess model accuracy fairly. The lookback parameter is defined during the fine-tuning process with a baseline model. For a time series with $m_{train}$ training data points and a lookback parameter $\mu$, the number of training examples is equal to $m_{train} - \mu$ for a one-unit-at-a-time ($\tau = 1$) sliding window. Besides, sequential modeling with LSTM requires a 3-D shape for the input data. Therefore, the training and test datasets are reshaped to $(m_{train} - \mu, \tau, \mu)$ and $(m_{test} - \mu, \tau, \mu)$ respectively.
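A minimal sketch of this windowing step, including the 70/30 split and the 3-D reshape, could look as follows in Python. The synthetic series, the lookback value, and the function name are illustrative assumptions.

```python
# Sliding-window transformation: lookback mu, stride tau = 1, and the
# (examples, tau, mu) shape expected by Keras LSTM layers.
import numpy as np

def sliding_window(series, mu, tau=1):
    """Turn a 1-D series into (features, labels) with a width-mu window."""
    X, y = [], []
    for start in range(0, len(series) - mu, tau):
        X.append(series[start:start + mu])    # window contents = features
        y.append(series[start + mu])          # next point = label
    X = np.array(X).reshape(len(X), tau, mu)  # 3-D shape for LSTM input
    return X, np.array(y)

prices = np.linspace(1.0, 10.0, 20)           # stand-in for a price series
split = int(0.7 * len(prices))                # 70/30 train-test split
X_train, y_train = sliding_window(prices[:split], mu=2)
X_test, y_test = sliding_window(prices[split:], mu=2)
print(X_train.shape, X_test.shape)            # (m_train - mu, 1, mu), ...
```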
Table 1: LSTM hyperparameters for the baseline model

    Hyper-parameter   Value
    Dropout           0.10
    Units             50
    Epochs            50
    Batch size        32
To answer the research question of which of LSTM and ARIMA yields the best results in time-series forecasting on the selected oil and cotton datasets, both were subjected to a fine-tuning process to select the best model in the model space. For the LSTM, a baseline model with one hidden layer of 170 units was built with arbitrary hyperparameters. A min-max normalization procedure, scaling the input data to the range between 0 and 1, was used to accelerate the optimization process. The Adaptive Moment Estimation (ADAM) optimizer (Kingma and Ba, 2015) was used with a learning rate of 0.001 and exponential decay rates for the first and second moment estimates, $\beta_1$ and $\beta_2$, of 0.9 and 0.999 respectively. The root mean squared error was used as the loss function, and the random generator seed was fixed to 3 for reproducibility. The baseline model was built with the Keras package in the Python Anaconda distribution. The model was executed on the Google Colaboratory Pro plan with Tensor Processing Units (TPU).

Figures 3 and 4 show the baseline model results for the oil and cotton datasets. Each figure is composed of three subplots: (i) the goodness of fit between the training-set predictions and the corresponding actual values, (ii) the loss values during the training process as a function of the number of epochs, and (iii) the comparison between predicted values on the test set and the corresponding actual values. The test-set predictions show that these hyper-parameters are not the optimal ones, which suggests the need for a fine-tuning process to identify a local minimum of the cost function and the corresponding hyper-parameter values.

Figure 3: LSTM baseline model results on the oil dataset. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.
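A baseline model along the lines described above could be sketched in Keras as follows. The unit and dropout values follow Table 1 and the optimizer settings follow the text; the lookback value and the toy data are illustrative assumptions, and real inputs would come from the sliding-window step after min-max normalization to [0, 1].

```python
# Sketch of a baseline LSTM in Keras: ADAM (lr = 0.001, beta_1 = 0.9,
# beta_2 = 0.999) and an RMSE loss, with Table 1 hyperparameters.
import numpy as np
import tensorflow as tf
from tensorflow import keras

def rmse(y_true, y_pred):
    # Root mean squared error used as the training loss.
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

mu, tau = 6, 1                                   # illustrative lookback/stride
model = keras.Sequential([
    keras.layers.Input(shape=(tau, mu)),
    keras.layers.LSTM(50),                       # units, Table 1
    keras.layers.Dropout(0.10),                  # dropout, Table 1
    keras.layers.Dense(1),                       # one-step-ahead forecast
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001,
                                    beta_1=0.9, beta_2=0.999),
    loss=rmse,
)

X = np.random.rand(100, tau, mu)                 # stand-in for windowed prices
y = np.random.rand(100, 1)
model.fit(X, y, epochs=50, batch_size=32, verbose=0)   # Table 1 values
```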
Figure 4: LSTM baseline model results on the cotton dataset. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.

Table 2: Hyperparameter values for the LSTM grid search. Note: lookback units are in months, as in the dataset.

    Hyper-parameter   Values
    Dropout           0.001, 0.01, 0.03, 0.1, 0.3
    LSTM units        10, 50, 90, 130, 170
    Epochs            20, 40, 60, 80, 100
    Lookback          2, 4, 6, 8, 10
For a fair comparison between the LSTM and ARIMA models, a fine-tuning process through a grid search was performed to identify a local minimum of the cost function and the corresponding hyperparameters. Such a process is essential to identify the hyperparameter values that yield the best predictions, measured in our case with the Root Mean Squared Error. Four hyperparameters were considered: lookback, dropout, LSTM units, and the number of epochs. Five values of each were tested and crossed via a loop to find the best combination among them (cf. Table 2). An increasing number of hidden layers, starting from one, was also tested and yielded no improvement in error reduction; the same holds for the batch size. As with the baseline model, the grid search was conducted on Google Colaboratory with a TPU configuration. The process took four hours on a MacBook Pro 2019, 1.4 GHz Quad-Core Intel Core i5, 8 GB 2133 MHz LPDDR3. The results are shown in Table 3.

Table 3: Hyperparameter values that produce a local minimum of the test-set RMSE for the oil and cotton datasets

    Commodity   Lookback   Dropout   LSTM units   Epochs   RMSE
    Oil         2          0.001     170          100      0.1974
    Cotton      2          0.3       170          100      0.1973
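The grid search itself amounts to a loop over the Cartesian product of the values in Table 2, keeping the combination with the lowest test-set RMSE. A schematic sketch is given below; the fit_and_score body is a placeholder for the windowing, model fitting, and forecasting steps sketched earlier, and the series is synthetic.

```python
# Sketch of a grid search over the four hyperparameters in Table 2.
import itertools
import numpy as np

grid = {
    "dropout": [0.001, 0.01, 0.03, 0.1, 0.3],
    "units": [10, 50, 90, 130, 170],
    "epochs": [20, 40, 60, 80, 100],
    "lookback": [2, 4, 6, 8, 10],
}

def fit_and_score(dropout, units, epochs, lookback, train, test):
    # Placeholder for: window the data, build and fit the Keras model,
    # forecast the test set, and return the test-set RMSE.
    preds = np.repeat(train[-1], len(test) - lookback)   # naive stand-in
    return np.sqrt(np.mean((preds - test[lookback:]) ** 2))

series = np.sin(np.linspace(0, 20, 200))
train, test = series[:140], series[140:]               # 70/30 split

best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda p: fit_and_score(train=train, test=test, **p),
)
print(best)                      # hyperparameters with the lowest test RMSE
```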
5 LSTM Results

The hyperparameters from Table 3 were used for the LSTM model on the two commodities' datasets. The results are shown in Figures 5 and 6. For both commodities, the grid-search hyperparameters yield a faster decrease in the training loss, and the test-set predictions follow the actual data more closely. The over- and underestimation of predictions and the large variability observed with the baseline model give way to a better correspondence. However, even if the grid search suggests 100 as the best number of epochs among the values provided, increasing this value does not seem to improve the learning process, given the plateau reached after about ten iterations. This is in line with observations made by Sima et al. (2018).

Figure 5: LSTM model results for oil with hyper-parameter values retrieved from the grid search. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.

Figure 6: LSTM model results for cotton with hyper-parameter values retrieved from the grid search. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.
6 Comparison with ARIMA Models

This section compares the LSTM model results to a traditional and widely used family of models, namely the ARIMA class. First, a time-series process $Y_t$ follows an ARMA(p, q) model if it can be represented as:

$$Y_t - \phi_1 Y_{t-1} - \cdots - \phi_p Y_{t-p} = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} \qquad (20)$$

where $\phi_i$ ($i = 1, \ldots, p$) and $\theta_i$ ($i = 1, \ldots, q$) are coefficients and $\varepsilon_t$ a white noise $(0, \sigma^2_{\varepsilon})$ process. More compactly, it can be expressed as:

$$\Phi(L) Y_t = \Theta(L) \varepsilon_t \qquad (21)$$

where $\Phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$, $\Theta(L) = 1 - \theta_1 L - \cdots - \theta_q L^q$, and $L$ is the lag operator. When the $Y_t$ process contains $d$ (unit) roots, the process becomes an ARIMA(p, d, q) and can be represented as:

$$\Phi(L)(1 - L)^d Y_t = \Theta(L) \varepsilon_t \qquad (22)$$

The first task is to perform unit root tests to select the series' order of integration $(d)$. Following Perron (1989) and Zivot and Andrews (1992), we perform a generalized breakpoint unit root test. Indeed, standard tests such as the Dickey-Fuller test have low power and are biased towards the null if the series exhibits a change in mean and/or in trend. A subsequent literature has emerged, proposing various unit root tests that remain valid in the presence of a break. The idea consists of treating each date as a potential breakpoint; the break date that minimizes the t-statistic associated with the Dickey-Fuller test is selected. More formally, following Vogelsang and Perron (1998), the following generalized Dickey-Fuller test is considered, with the appropriate restrictions made for each series:

$$P_t = \mu + \beta t + \theta DU_t(T_b) + \gamma DT_t(T_b) + \omega D_t(T_b) + \alpha P_{t-1} + \sum_{i=1}^{p} \tau_i \Delta P_{t-i} + u_t \qquad (23)$$

where $P_t$ represents the price of the commodity and $T_b$ the break date; $DU_t(T_b) = 1(t \geq T_b)$ is an intercept break variable, which takes the value 0 for periods before the break and 1 after; $DT_t(T_b) = 1(t \geq T_b) \cdot (t - T_b + 1)$ is a trend break variable, which takes the value 0 for periods before the break and a break-date-rebased trend for subsequent dates; and $D_t(T_b) = 1(t = T_b)$ is a one-time dummy variable, which takes the value 1 at the exact break date and 0 otherwise.

Special cases are obtained through restrictions on the parameters $\beta$, $\theta$, $\gamma$, and $\omega$. Thus, the following restrictions yield the specific models: $\beta = 0$ and $\gamma = 0$: non-trending data with intercept break; $\gamma = 0$: trending data with intercept break; $\theta = 0$ and $\omega = 0$: trending data with trend break. When no restriction is applied, we have a model with both trend and intercept breaks.

Table 4 below presents the results of the test allowing for breaks. For both prices, we reject the unit root hypothesis at the usual 5% level. We also tested for seasonality using a HEGY test, and no seasonal pattern was detected in the series. We can then estimate an ARMA model without differencing the series. To select the best model, we analyze the series' correlogram up to 30 lags for potential values of $p$ and $q$ and choose the model that minimizes the Schwarz information criterion. We end up with an ARMA(4,2) model for the cotton price and an ARMA(4,1) model for the oil price.
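For readers who wish to reproduce this pipeline, the breakpoint unit root test and the ARMA fit can be approximated with statsmodels, as sketched below. The Zivot-Andrews test implemented there is one variant of the breakpoint tests discussed above, not the exact Vogelsang-Perron specification of equation (23); the synthetic series and forecast horizon are illustrative, while the (4, 0, 2) cotton order follows the text.

```python
# Sketch of the unit-root-then-ARMA workflow using statsmodels.
import numpy as np
from statsmodels.tsa.stattools import zivot_andrews
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=300)) + 50    # stand-in for a price series

# Breakpoint unit root test (trend and intercept break); a small p-value
# rejects the unit root, so the series can be modeled in levels (d = 0).
stat, pvalue = zivot_andrews(prices, regression="ct")[:2]
d = 0 if pvalue < 0.05 else 1

# ARMA(4, 2) in levels, as retained for cotton in the text.
res = SARIMAX(prices, order=(4, d, 2), trend="c").fit(disp=False)
print(res.bic)                                   # Schwarz criterion
forecast = res.forecast(steps=12)                # out-of-sample forecasts
```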
The out-of-sample forecasts are presented in Figures 7 and 8, and the underlying statistics in Table 5. For both commodities, the ARIMA models perform well in predicting the prices. Table 5 shows that both the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE) are lower than those of the LSTM model. The MAPE is 0.44 percentage point lower for cotton and 0.52 point lower for oil. To formally test the performance of the LSTM forecasts vis-à-vis the ARIMA model, we perform the mean-squared-error-based test of Harvey, Leybourne, and Newbold (1997). We conclude at the 5% significance level that the ARIMA forecasts have lower prediction errors than the LSTM forecasts for both commodities.

Table 4: Unit root tests. Source: Authors' computation

    Commodity   Min t    p-value
    Cotton      -4.94    0.0391
    Oil         -4.85    0.0275

Figure 7: Cotton actual versus predicted prices. Source: Authors' computations.

Figure 8: Oil actual versus predicted prices. Source: Authors' computations.

Table 5: Comparison of LSTM and ARIMA results

    Commodity   Criterion   ARIMA   LSTM    HLN test (p-val)
    Cotton      RMSE        0.117   0.150   0.007***
                MAPE        3.779   4.21
    Oil         RMSE        5.133   5.950   0.001***
                MAPE        6.432   6.957
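The Harvey-Leybourne-Newbold test is a small-sample correction of the Diebold-Mariano statistic. A minimal sketch under squared-error loss and a one-step horizon is given below; the two forecast-error series are synthetic stand-ins, not the paper's forecasts.

```python
# Sketch of the Harvey-Leybourne-Newbold (1997) modified Diebold-Mariano
# test for equal forecast accuracy under squared-error loss.
import numpy as np
from scipy import stats

def hln_test(e1, e2, h=1):
    """HLN statistic and p-value for H0: equal forecast accuracy."""
    d = e1 ** 2 - e2 ** 2                 # loss differential
    n = len(d)
    dbar = d.mean()
    # Long-run variance of d; for h = 1 only the lag-0 autocovariance enters.
    gamma = [np.sum((d[k:] - dbar) * (d[:n - k] - dbar)) / n
             for k in range(h)]
    var_dbar = (gamma[0] + 2 * sum(gamma[1:])) / n
    dm = dbar / np.sqrt(var_dbar)
    # Small-sample correction of Harvey, Leybourne, and Newbold (1997),
    # compared against a Student-t distribution with n - 1 df.
    hln = dm * np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    pval = 2 * stats.t.sf(abs(hln), df=n - 1)
    return hln, pval

rng = np.random.default_rng(0)
e_arima = rng.normal(scale=1.0, size=200)     # stand-in ARIMA errors
e_lstm = rng.normal(scale=1.2, size=200)      # stand-in LSTM errors
print(hln_test(e_arima, e_lstm))
```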
7 Forecast Averaging

There has been a recent interest in forecast averaging techniques (Hendry and Clements, 2004), based on Bates and Granger (1969). The main idea behind forecast averaging (or combination of forecasts) is that individual models are misspecified and cannot capture all of a variable's particular patterns. Indeed, a combination of forecasts can yield lower prediction errors than either of the original forecasts (Bates and Granger, 1969), since each model makes different assumptions about the relationships between the variables.

One question that arises when combining forecasts is which weights to use for each model (Smith and Wallis, 2009). Different methods have been proposed in the literature, and we focus here on four of them (a sketch of these weighting schemes is given after Table 7):

• Simple mean: the simple arithmetic mean of the forecasts is used for each observation, so every forecast is given the same weight.
• Least squares: under this approach, the forecasts are regressed against the actual values, and the coefficients of this regression are used as weights.

• Mean square error weights: the mean squared error of each forecast is used to form the individual forecast weights.

• Mean square error ranks: the mean squared error of each forecast is used to rank the forecasts, and the inverse ranks are then used, so that each forecast's weight is the inverse of its rank divided by the sum of the inverse ranks.

Table 6 below presents the forecast combination results using the different weighting schemes, and Table 7 compares the best averaging model with both ARIMA and LSTM. The least-squares weights give the best combination in terms of RMSE, and the rank-based weights in terms of MAPE, for both cotton and oil. There is, however, a difference between the two commodities: for oil, the ARIMA specification still yields the best forecasts, while for cotton, the forecast averaging approach outperforms both the LSTM and the ARIMA models, whichever criterion is used (RMSE or MAPE). This highlights the fact that the individual forecasts suffer from misspecification and fail to fully capture the variable's behavior.

Table 6: Forecast combinations

                              Cotton               Oil
                              RMSE      MAPE       RMSE      MAPE
    Simple mean               0.149     3.505      5.013     6.059
    Least squares             0.137***  3.520      4.913***  6.008
    Mean square error         0.145     3.442      4.962     6.026
    Mean square error ranks   0.143     3.424***   4.915     6.000***

Table 7: Comparison of forecast averages with individual models

    Commodity   Criterion   ARIMA    LSTM     Average
    Cotton      RMSE        0.1377   0.1754   0.1374
                MAPE        3.492    3.935    3.424
    Oil         RMSE        4.823    5.493    4.915
                MAPE        5.937    6.366    6.000
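The sketch below illustrates the four weighting schemes on two synthetic forecast series. The least-squares weights come from regressing the actual values on the individual forecasts, and the rank weights use inverse ranks, as described in the list above; the data and variable names are our own illustrative choices.

```python
# Sketch of the four forecast-combination weighting schemes.
import numpy as np

rng = np.random.default_rng(0)
actual = np.cumsum(rng.normal(size=100)) + 50
f1 = actual + rng.normal(scale=1.0, size=100)   # stand-in ARIMA forecasts
f2 = actual + rng.normal(scale=1.5, size=100)   # stand-in LSTM forecasts
F = np.column_stack([f1, f2])

def rmse(pred):
    return np.sqrt(np.mean((pred - actual) ** 2))

mse = np.array([np.mean((f - actual) ** 2) for f in (f1, f2)])

# 1. Simple mean: equal weights.
combos = {"simple mean": F.mean(axis=1)}
# 2. Least squares: regress actual values on the forecasts.
w_ls, *_ = np.linalg.lstsq(F, actual, rcond=None)
combos["least squares"] = F @ w_ls
# 3. MSE weights: inversely proportional to each forecast's MSE.
w_mse = (1 / mse) / (1 / mse).sum()
combos["mse weights"] = F @ w_mse
# 4. MSE ranks: weights from the inverse ranks of the MSEs.
ranks = mse.argsort().argsort() + 1             # rank 1 = most accurate
w_rank = (1 / ranks) / (1 / ranks).sum()
combos["mse ranks"] = F @ w_rank

for name, pred in combos.items():
    print(f"{name}: RMSE = {rmse(pred):.3f}")
```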
8 Conclusion

Significant progress has been made over the last two decades in machine learning, with potential uses in many areas, including economics. In this paper, we highlighted the potential of these new tools as an alternative, or at least a complement, to traditional forecasting methods. Neural network models do not require a battery of pre-tests to study the underlying properties of a time series (such as unit root tests) and do not assume a parametric form. These are two desirable properties in time-series analysis.

While our two examples indicate that neural network models do not outperform traditional ARIMA models, the results show that combining the two approaches can yield better forecasts than either one of the models. We propose extending our analysis to a wide range of commodity prices, or performing a Monte Carlo analysis, to identify the situations under which one of the models outperforms the other and the circumstances favoring a combination approach.
References
Antonovitz, F. and R. Green (1990), "Alternative Estimates of Fed Beef Supply Response to Risk", American Journal of Agricultural Economics, 72, 475-488.

Bates, J. M. and C. W. J. Granger (1969), "The Combination of Forecasts", Operational Research Quarterly, 20 (4), 451-468.

Box, G. E. P. and G. M. Jenkins (1970), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.