Forecasting Commodity Prices Using Long Short-Term Memory Neural Networks
Racine Ly
AKADEMIYA2063, Kigali, Rwanda — [email protected]
Fousseini Traore
International Food Policy Research Institute, Dakar, Senegal — [email protected]
Khadim Dia
AKADEMIYA2063, Kigali, Rwanda — [email protected]
January 18, 2021

Abstract
This paper applies a recurrent neural network (RNN) approach to forecast cotton and oil prices. We show how these new tools from machine learning, particularly Long Short-Term Memory (LSTM) models, complement traditional methods. Our results show that machine learning methods fit the data reasonably well but do not systematically outperform classical methods such as Autoregressive Integrated Moving Average (ARIMA) models in terms of out-of-sample forecasts. However, averaging the forecasts from the two types of models provides better results than either method alone. Compared to the ARIMA and the LSTM, the Root Mean Squared Error (RMSE) of the average forecast was 0.21 and 21.49 percent lower, respectively, for cotton. For oil, forecast averaging does not provide improvements in terms of RMSE. We suggest using a forecast averaging method and extending our analysis to a wide range of commodity prices.
1 Introduction

Forecasting commodity prices is paramount for many economic actors. When building budgets, experts at the government level rely on growth projections, which are almost always based on underlying forecasts of the primary commodities exported by the country. Oil-dependent countries are a typical example of this kind of scenario. Many developing countries depend on a few raw materials (agricultural and mineral), the prices of which determine the growth rate. Being able to forecast commodity prices is vital for other public entities or parastatals too. In many developing countries, parastatals manage stabilization funds aimed at smoothing commodity price movements. These entities need world price forecasts in order to fix producers' prices for the current campaign. For researchers, knowing the best data generating process and the forecast errors is essential for various modeling purposes. For instance, in agricultural models involving expectations (the widely used Nerlove model falls in that category), the expected price is generally given by an ARMA model (Antonovitz and Green, 1990; De Janvry and Sadoulet, 1995). As we will show later, LSTM networks provide a good alternative to ARMA models.

Since the seventies, Box-Jenkins (1970) approaches have been popular in forecasting time series, although one can mention earlier work by Tinbergen (1939) and Klein (1950). These approaches have established ARMA models and their extensions as the cornerstone of forecasting tools. However, machine learning methods that can handle time-series data and perform forecasting have grown over the last three decades. Among these methods, those based on Recurrent Neural Networks (RNN) are particularly interesting, as they can carry over information (memory) from previous periods into the future. Early papers using RNNs to forecast time series include Kamijo and Tanigawa (1990), Chakraborty et al. (1992), and a comparison with ARIMA models by Kohzadi et al. (1996). However, one issue encountered with the first generation of RNNs is the so-called vanishing gradient problem for highly dependent (long-memory) data. Thus, in a seminal paper, Hochreiter and Schmidhuber (1997) proposed a new approach called Long Short-Term Memory (LSTM), with the capacity to filter out which information from the past should be processed and retained (see section 2 for a full description of LSTM networks).
This triggered a new literature on forecasting time series (Gheyas and Smith, 2011; Khandelwal et al., 2015; Kumar et al., 2018). Similarly, other classes of machine learning methods, such as Support Vector Machine (SVM) regressions, have been developed and applied to time-series forecasting, including hybrid approaches with ARIMA models (Pai and Lin, 2005).

Our objective in this paper is to focus on LSTM models, as they present a series of interesting characteristics. First, they are non-parametric, so they can suitably handle non-linear patterns. Second, they do not require the error term to follow a specific distribution. Third, LSTM models do not require the underlying data to follow a stationary process, so they are not affected by unit roots. These three features present an interesting advantage over regression-based methods, be they ARIMA or not. Using monthly data on cotton and oil prices, our results suggest that the LSTM model fits the data reasonably well but does not outperform the ARIMA models for out-of-sample forecasts. However, a combination of forecasts from the two models yields better results for the cotton dataset than either approach.

The remainder of this paper is organized as follows. We first present in section 2 the traditional Recurrent Neural Networks (RNN) and then the LSTM approach proposed to solve the vanishing gradient problem of the first generations of RNNs. In section 3, we present the dataset used. Section 4 presents the computational aspects, while the results of the LSTM approach applied to oil and cotton prices are highlighted in section 5. In section 6, we compare the LSTM approach with ARIMA models. The seventh section presents the forecast averaging approach, and our conclusions are discussed in section 8.

2 Artificial and Recurrent Neural Networks

An artificial neural network (ANN) is a supervised learning technique within the machine learning set of models. It is used for both regression and classification tasks, depending on the type of data used and the modeling purpose. An artificial neural network comprises an input layer that receives the different input features, at least one hidden layer that processes the dataset's hidden characteristics, and an output layer that yields the network's forecasts. Each layer has several units called "neurons," which receive information from preceding neurons and send processed information to the following ones. The input layer has a number of neurons equal to the number of the dataset's features, and each layer is augmented with a so-called "bias" neuron whose role is to facilitate the algorithmic formulation. The number of hidden layers, which is one of the so-called hyper-parameters, is set by the modeler through a fine-tuning process. The number of neurons in the output layer is equal to the number of expected outputs from the network.

Within an artificial neural network, each neuron communicates with all the neurons from the preceding layer and with those from the following layer (aside from the last layer) through weights attached to each connection.
The activations are computed by applying an activation function (usually a sigmoid or hyperbolic tangent function, among others used in the literature) to the product of a weight vector, which controls the mapping from a layer l to the following layer l + 1, and the activations of the previous layer.

The learning process of an artificial neural network is performed using the backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986): the optimization process through which the weights representing the connections between neurons are updated, using the chain rule to compute the gradient of the loss function with respect to the weights. A gradient descent rule, or a related version, is used to update their values. The weights are randomly initialized using a uniform distribution, the bias units are set to zero, and the initial activations of the first layer are equal to the dataset's features.

The backpropagation procedure is as follows. A forward pass is performed by computing the "cache" values used as arguments to evaluate the activations of the layers. For an L-layer neural network, the k-th layer uses the following constitutive equations to yield the cache vector $z^{[k]}$ and the activation vector $a^{[k]}$ for each neuron and layer:

$$z^{[k]} = \eta^{[k]} a^{[k-1]} + b^{[k]} \qquad (1)$$

$$a^{[k]} = g^{[k]}\left(z^{[k]}\right) \qquad (2)$$
In equations (1) and (2), $\eta^{[k]}$ is the vector of weights of all connections going from neurons at layer $k-1$ to neurons at layer $k$, and $g^{[k]}$ is the activation function applied to neurons at layer $k$. The network's last activation corresponds to the network's predicted value, which is compared with the actual value of the response variable provided by the training dataset. A backward pass is then applied to the artificial neural network to compute the gradients of the cost function with respect to the weights assigned to each connection. The residual $\delta^{[L]} = a^{[L]} - y$ at the last layer, between the network predictions $a^{[L]}$ and the actual values $y$, is backpropagated into the network, and the residuals associated with the preceding layers are computed as follows:

$$\delta^{[k]} = \left(\eta^{[k]}\right)^{T} \delta^{[k+1]} \circ g^{[k]\prime}\left(z^{[k]}\right) \qquad (3)$$

The gradient of the cost function $J$ for the weight between two nodes $i$ and $j$ at layer $k$, denoted $\eta^{[k]}_{ij}$, is equal to the product of the activation of node $j$ at layer $k$ and the residual of node $i$ at layer $k+1$ computed with equation (3):

$$\frac{\partial J(\eta)}{\partial \eta^{[k]}_{ij}} = a^{[k]}_{j} \delta^{[k+1]}_{i} \qquad (4)$$

For an L-layer artificial neural network with $m$ examples in the training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$, an iterative process aggregates the gradients in an accumulator using equation (4). The procedure is as follows:
Algorithm 1: Backpropagation

for i = 1 to m do
    Set $a^{[1]} = x^{(i)}$
    Forward propagation to compute $a^{[k]}$ for $k = 2, \ldots, L$
    Compute $\delta^{[L]} = a^{[L]} - y^{(i)}$
    Compute $\delta^{[L-1]}, \ldots, \delta^{[2]}$ with equation (3)
    $\Delta^{[k]}_{ij} := \Delta^{[k]}_{ij} + a^{[k]}_{j} \delta^{[k+1]}_{i}$
end for
$D^{[k]}_{ij} = \frac{1}{m}\left(\Delta^{[k]}_{ij} + \lambda \eta^{[k]}_{ij}\right)$ if $j \neq 0$ (non-bias node)
$D^{[k]}_{ij} = \frac{1}{m}\Delta^{[k]}_{ij}$ if $j = 0$ (bias node)
$\frac{\partial J(\eta)}{\partial \eta^{[k]}_{ij}} = D^{[k]}_{ij}$
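To make the procedure concrete, the forward pass of equations (1)-(2) and the backward pass of equations (3)-(4) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the layer sizes, the sigmoid activation, and the toy data are illustrative assumptions, one training example is processed, and the regularization term $\lambda$ is omitted.

```python
# Minimal NumPy sketch of equations (1)-(4); sizes and data are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [3, 5, 1]                       # input, hidden, output neurons
# In this 0-indexed list, eta[k] maps layer k to layer k + 1;
# weights drawn from a uniform distribution, biases set to zero.
eta = [rng.uniform(-0.5, 0.5, (sizes[k + 1], sizes[k]))
       for k in range(len(sizes) - 1)]
b = [np.zeros((n, 1)) for n in sizes[1:]]

x = rng.normal(size=(3, 1))             # one training example
y = np.array([[1.0]])

# Forward pass: z = eta a + b, a = g(z)   (equations 1-2)
a, zs = [x], []
for W, bias in zip(eta, b):
    z = W @ a[-1] + bias
    zs.append(z)
    a.append(sigmoid(z))

# Backward pass: output residual, then equation (3) layer by layer.
delta = a[-1] - y                       # delta at the last layer
grads = [None] * len(eta)
for k in reversed(range(len(eta))):
    grads[k] = delta @ a[k].T           # equation (4) in matrix form
    if k > 0:
        delta = (eta[k].T @ delta) * sigmoid_prime(zs[k - 1])  # equation (3)

print([g.shape for g in grads])         # gradient shapes match the weights
```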
Artificial neural networks are suitable for numerical and qualitative datasets in regression and classification tasks. However, for sequential datasets, where inputs and outputs are sequences, ANNs are not adequate for two main reasons: (i) inputs and outputs need to have the same length, which may not be the case in sequential datasets; and (ii) ANNs cannot share features learned across different positions in a sequence. Recurrent Neural Networks (RNN) have been introduced to overcome these limitations.

The primary purpose of Recurrent Neural Networks is to deal with datasets whose inputs and outputs are sequences. RNNs are built on the same ideas and architecture as artificial neural networks. The main difference is that an RNN uses both historical and current sequences to predict the outcome at the current time step. This confers on RNNs a real advantage in predicting time-sensitive sequential data such as time series.

For a sequential labeled dataset $(X, Y)$, where $X \in \mathbb{R}^{T_x}$ and $Y \in \mathbb{R}^{T_y}$, and $T_x$ and $T_y$ are the numbers of sequences in the input and label data respectively, the forward pass within an RNN is governed by the following equations:

$$a^{\langle t \rangle} = g\left(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a\right)$$

$$\hat{y}^{\langle t \rangle} = g\left(W_{ya} a^{\langle t \rangle} + b_y\right)$$

where $a^{\langle t \rangle}$ is the activation (hidden state) at time step $t$, $x^{\langle t \rangle}$ the input at time step $t$, and $W_{aa}$, $W_{ax}$, $W_{ya}$, $b_a$, and $b_y$ are weight matrices and bias vectors shared across time steps.

For long sequences, however, the gradients propagated backward through many time steps can vanish, which prevents a standard RNN from learning long-range dependencies. The LSTM architecture addresses this problem by maintaining a memory cell $c^{\langle t \rangle}$ whose content is controlled by gates. At each time step, a candidate value $\tilde{c}^{\langle t \rangle}$ for the memory cell is computed as

$$\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right)$$

and the update, forget, and output gates are obtained from the same inputs through sigmoid activations:

$$\Gamma_u = \sigma\left(W_u\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right), \quad \Gamma_f = \sigma\left(W_f\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_f\right), \quad \Gamma_o = \sigma\left(W_o\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_o\right)$$

The memory cell and the activation are then updated as

$$c^{\langle t \rangle} = \Gamma_u \circ \tilde{c}^{\langle t \rangle} + \Gamma_f \circ c^{\langle t-1 \rangle}, \qquad a^{\langle t \rangle} = \Gamma_o \circ \tanh\left(c^{\langle t \rangle}\right)$$

The gates allow the network to decide, at each time step, which information from the past should be retained in memory and which should be discarded, which solves the vanishing gradient problem of the first generations of RNNs.
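As an illustration of these constitutive equations, one step of an LSTM cell can be sketched in NumPy as follows. The dimensions, random weights, and toy two-step sequence are our own illustrative choices, not settings from the paper.

```python
# Minimal NumPy sketch of a single LSTM step under the standard gate
# equations above; dimensions and inputs are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_a, n_x = 4, 1                          # hidden units, input features
rng = np.random.default_rng(0)
# One weight matrix per gate, acting on the stacked [a_prev; x_t] vector.
Wf, Wu, Wc, Wo = (rng.normal(scale=0.1, size=(n_a, n_a + n_x))
                  for _ in range(4))
bf = bu = bc = bo = np.zeros((n_a, 1))

def lstm_step(a_prev, c_prev, x_t):
    concat = np.vstack([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # new memory cell
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    a_t = gamma_o * np.tanh(c_t)                 # new activation
    return a_t, c_t

a, c = np.zeros((n_a, 1)), np.zeros((n_a, 1))
for x_t in [np.array([[0.3]]), np.array([[0.7]])]:   # toy length-2 sequence
    a, c = lstm_step(a, c, x_t)
print(a.ravel())
```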
3 Data

We use international prices for two commodities: one agricultural product (cotton) and one mineral (oil). All the data are taken from the World Bank commodity prices dataset. The cotton price used is the Cotton A Index, the most common index used to represent offering prices on the international cotton market. It corresponds to the average of the cheapest five quotations among eighteen varieties of upland cotton traded internationally. The base quality of the index is Middling 1-3/32". For crude oil, we use the average spot price of Brent, Dubai, and West Texas Intermediate, the three leading indices representing the market. All prices are observed monthly. The cotton and oil datasets have 708 data points each, covering January 1960 to December 2018.

4 Computational Aspects

Sequential modeling for time-series forecasting with the LSTM framework requires transforming the dataset into features and labels for a supervised learning process. To that end, the sliding window technique is used across the cotton and oil datasets. It consists of shifting a window with a fixed width, also called the lookback parameter, from left to right across the dataset. The window moves one unit at a time, and the data points within its range are taken as an example of the input features. The next value at the right edge of the window is considered the label at each step; in other words, it is the expected output produced by the data within the sliding window.

Figure 2: Illustration of the sliding window technique to create input features and labels from time-series datasets. The dataset is read from left to right, and each increment of the window movement is considered an example of the training process.

The initial dataset is divided into two parts: the training set, which is used to train the LSTM model, and the test set, used to assess the model's accuracy. For the latter, only the inputs are fed into the model to produce forecasts, which are compared with the actual values. The sliding window technique is applied to both the training and test sets.

A ratio of 70 to 30 percent is used to generate the training and test sets for the oil and cotton datasets. This choice is motivated by the amount of data available in the initial database. The split must avoid two main known issues: first, an overtrained model, which could yield overfitting and poor generalization to new inputs; second, a lack of enough testing data points to assess model accuracy fairly. The lookback parameter is defined during the fine-tuning process with a baseline model. For a time series with $m_{train}$ training data points and a lookback parameter $\mu$, the number of training examples is equal to $m_{train} - \mu$ for a one-unit-at-a-time ($\tau = 1$) sliding window. Besides, sequential modeling with LSTM requires a 3-D shape for the input data. Therefore, the training and test datasets are reshaped to $(m_{train} - \mu, \tau, \mu)$ and $(m_{test} - \mu, \tau, \mu)$ respectively.
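A minimal sketch of this windowing step, including the 70/30 split and the 3-D reshape, could look as follows in Python. The synthetic series, the lookback value, and the function name are illustrative assumptions.

```python
# Sliding-window transformation: lookback mu, stride tau = 1, and the
# (examples, tau, mu) shape expected by Keras LSTM layers.
import numpy as np

def sliding_window(series, mu, tau=1):
    """Turn a 1-D series into (features, labels) with a width-mu window."""
    X, y = [], []
    for start in range(0, len(series) - mu, tau):
        X.append(series[start:start + mu])    # window contents = features
        y.append(series[start + mu])          # next point = label
    X = np.array(X).reshape(len(X), tau, mu)  # 3-D shape for LSTM input
    return X, np.array(y)

prices = np.linspace(1.0, 10.0, 20)           # stand-in for a price series
split = int(0.7 * len(prices))                # 70/30 train-test split
X_train, y_train = sliding_window(prices[:split], mu=2)
X_test, y_test = sliding_window(prices[split:], mu=2)
print(X_train.shape, X_test.shape)            # (m_train - mu, 1, mu), ...
```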
Table 1: LSTM hyperparameters for the baseline model

    Hyper-parameter   Value
    Dropout           0.10
    Units             50
    Epochs            50
    Batch size        32
To answer the research question of which of LSTM and ARIMA yields the best results in time-series forecasting on the selected oil and cotton datasets, both were subjected to a fine-tuning process to select the best model in the model space. For the LSTM, a baseline model with one hidden layer of 170 units was built with arbitrary hyperparameters. A min-max normalization procedure, scaling the input data to the range between 0 and 1, was used to accelerate the optimization process. The Adaptive Moment Estimation (ADAM) optimizer (Kingma and Ba, 2015) was used with a learning rate of 0.001 and exponential decay rates for the first and second moment estimates, $\beta_1$ and $\beta_2$, of 0.9 and 0.999 respectively. The root mean squared error was used as the loss function, and the random generator seed was fixed to 3 for reproducibility. The baseline model was built with the Keras package in the Python Anaconda distribution. The model was executed on the Google Colaboratory Pro plan with Tensor Processing Units (TPU).

Figures 3 and 4 show the baseline model results for the oil and cotton datasets. Each figure is composed of three subplots: (i) the goodness of fit between the training-set predictions and the corresponding actual values, (ii) the loss values during the training process as a function of the number of epochs, and (iii) the comparison between predicted values on the test set and the corresponding actual values. The test-set predictions show that these hyper-parameters are not the optimal ones, which suggests the need for a fine-tuning process to identify a local minimum of the cost function and the corresponding hyper-parameter values.

Figure 3: LSTM baseline model results on the oil dataset. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.
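A baseline model along the lines described above could be sketched in Keras as follows. The unit and dropout values follow Table 1 and the optimizer settings follow the text; the lookback value and the toy data are illustrative assumptions, and real inputs would come from the sliding-window step after min-max normalization to [0, 1].

```python
# Sketch of a baseline LSTM in Keras: ADAM (lr = 0.001, beta_1 = 0.9,
# beta_2 = 0.999) and an RMSE loss, with Table 1 hyperparameters.
import numpy as np
import tensorflow as tf
from tensorflow import keras

def rmse(y_true, y_pred):
    # Root mean squared error used as the training loss.
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

mu, tau = 6, 1                                   # illustrative lookback/stride
model = keras.Sequential([
    keras.layers.Input(shape=(tau, mu)),
    keras.layers.LSTM(50),                       # units, Table 1
    keras.layers.Dropout(0.10),                  # dropout, Table 1
    keras.layers.Dense(1),                       # one-step-ahead forecast
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001,
                                    beta_1=0.9, beta_2=0.999),
    loss=rmse,
)

X = np.random.rand(100, tau, mu)                 # stand-in for windowed prices
y = np.random.rand(100, 1)
model.fit(X, y, epochs=50, batch_size=32, verbose=0)   # Table 1 values
```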
Figure 4: LSTM baseline model results on the cotton dataset. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.

Table 2: Hyperparameter values for the LSTM grid search. Note: lookback units are in months, as in the dataset.

    Hyper-parameter   Values
    Dropout           0.001, 0.01, 0.03, 0.1, 0.3
    LSTM units        10, 50, 90, 130, 170
    Epochs            20, 40, 60, 80, 100
    Lookback          2, 4, 6, 8, 10
For a fair comparison between the LSTM and ARIMA models, a fine-tuning process through a grid search was performed to identify a local minimum of the cost function and the corresponding hyperparameters. Such a process is essential to identify the hyperparameter values that yield the best predictions, measured in our case with the Root Mean Squared Error. Four hyperparameters were considered: lookback, dropout, LSTM units, and the number of epochs. Five values of each were tested and crossed via a loop to find the best combination among them (cf. Table 2). An increasing number of hidden layers, starting from one, was also tested and yielded no improvement in error reduction; the same holds for the batch size. As with the baseline model, the grid search was conducted on Google Colaboratory with a TPU configuration. The process took four hours on a MacBook Pro 2019, 1.4 GHz Quad-Core Intel Core i5, 8 GB 2133 MHz LPDDR3. The results are shown in Table 3.

Table 3: Hyperparameter values that produce a local minimum of the test-set RMSE for the oil and cotton datasets

    Commodity   Lookback   Dropout   LSTM units   Epochs   RMSE
    Oil         2          0.001     170          100      0.1974
    Cotton      2          0.3       170          100      0.1973
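The grid search itself amounts to a loop over the Cartesian product of the values in Table 2, keeping the combination with the lowest test-set RMSE. A schematic sketch is given below; the fit_and_score body is a placeholder for the windowing, model fitting, and forecasting steps sketched earlier, and the series is synthetic.

```python
# Sketch of a grid search over the four hyperparameters in Table 2.
import itertools
import numpy as np

grid = {
    "dropout": [0.001, 0.01, 0.03, 0.1, 0.3],
    "units": [10, 50, 90, 130, 170],
    "epochs": [20, 40, 60, 80, 100],
    "lookback": [2, 4, 6, 8, 10],
}

def fit_and_score(dropout, units, epochs, lookback, train, test):
    # Placeholder for: window the data, build and fit the Keras model,
    # forecast the test set, and return the test-set RMSE.
    preds = np.repeat(train[-1], len(test) - lookback)   # naive stand-in
    return np.sqrt(np.mean((preds - test[lookback:]) ** 2))

series = np.sin(np.linspace(0, 20, 200))
train, test = series[:140], series[140:]               # 70/30 split

best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda p: fit_and_score(train=train, test=test, **p),
)
print(best)                      # hyperparameters with the lowest test RMSE
```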
5 LSTM Results

The hyperparameters from Table 3 were used for the LSTM model on the two commodities' datasets. The results are shown in Figures 5 and 6. For both commodities, the grid-search hyperparameters yield a faster decrease in the training loss, and the test-set predictions follow the actual data more closely. The over- and underestimation of predictions and the large variability observed with the baseline model give way to a better correspondence. However, even if the grid search suggests 100 as the best number of epochs among the values provided, increasing this value does not seem to improve the learning process, given the plateau reached after about ten iterations. This is in line with observations made by Sima et al. (2018).

Figure 5: LSTM model results for oil with hyper-parameter values retrieved from the grid search. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.

Figure 6: LSTM model results for cotton with hyper-parameter values retrieved from the grid search. (top left): goodness of fit on the training dataset; (top right): loss values during the training process; (bottom): comparison between model predictions and actual values on the test set.
6 Comparison with ARIMA Models

This section compares the LSTM model results to a traditional and widely used family of models, namely the ARIMA class. First, a time-series process $Y_t$ follows an ARMA(p, q) model if it can be represented as:

$$Y_t - \phi_1 Y_{t-1} - \cdots - \phi_p Y_{t-p} = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} \qquad (20)$$

where $\phi_i$ ($i = 1, \ldots, p$) and $\theta_i$ ($i = 1, \ldots, q$) are coefficients and $\varepsilon_t$ a white noise $(0, \sigma^2_{\varepsilon})$ process. More compactly, it can be expressed as:

$$\Phi(L) Y_t = \Theta(L) \varepsilon_t \qquad (21)$$

where $\Phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$, $\Theta(L) = 1 - \theta_1 L - \cdots - \theta_q L^q$, and $L$ is the lag operator. When the $Y_t$ process contains $d$ (unit) roots, the process becomes an ARIMA(p, d, q) and can be represented as:

$$\Phi(L)(1 - L)^d Y_t = \Theta(L) \varepsilon_t \qquad (22)$$

The first task is to perform unit root tests to select the series' order of integration $(d)$. Following Perron (1989) and Zivot and Andrews (1992), we perform a generalized breakpoint unit root test. Indeed, standard tests such as the Dickey-Fuller test have low power and are biased towards the null if the series exhibits a change in mean and/or in trend. A subsequent literature has emerged, proposing various unit root tests that remain valid in the presence of a break. The idea consists of treating each date as a potential breakpoint; the break date that minimizes the t-statistic associated with the Dickey-Fuller test is selected. More formally, following Vogelsang and Perron (1998), the following generalized Dickey-Fuller test is considered, with the appropriate restrictions made for each series:

$$P_t = \mu + \beta t + \theta DU_t(T_b) + \gamma DT_t(T_b) + \omega D_t(T_b) + \alpha P_{t-1} + \sum_{i=1}^{p} \tau_i \Delta P_{t-i} + u_t \qquad (23)$$

where $P_t$ represents the price of the commodity and $T_b$ the break date; $DU_t(T_b) = 1(t \geq T_b)$ is an intercept break variable, which takes the value 0 for periods before the break and 1 after; $DT_t(T_b) = 1(t \geq T_b) \cdot (t - T_b + 1)$ is a trend break variable, which takes the value 0 for periods before the break and a break-date-rebased trend for subsequent dates; and $D_t(T_b) = 1(t = T_b)$ is a one-time dummy variable, which takes the value 1 at the exact break date and 0 otherwise.

Special cases are obtained through restrictions on the parameters $\beta$, $\theta$, $\gamma$, and $\omega$. Thus, the following restrictions yield the specific models: $\beta = 0$ and $\gamma = 0$: non-trending data with intercept break; $\gamma = 0$: trending data with intercept break; $\theta = 0$ and $\omega = 0$: trending data with trend break. When no restriction is applied, we have a model with both trend and intercept breaks.

Table 4 below presents the results of the test allowing for breaks. For both prices, we reject the unit root hypothesis at the usual 5% level. We also tested for seasonality using a HEGY test, and no seasonal pattern was detected in the series. We can then estimate an ARMA model without differencing the series. To select the best model, we analyze the series' correlogram up to 30 lags for potential values of $p$ and $q$ and choose the model that minimizes the Schwarz information criterion. We end up with an ARMA(4,2) model for the cotton price and an ARMA(4,1) model for the oil price.
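For readers who wish to reproduce this pipeline, the breakpoint unit root test and the ARMA fit can be approximated with statsmodels, as sketched below. The Zivot-Andrews test implemented there is one variant of the breakpoint tests discussed above, not the exact Vogelsang-Perron specification of equation (23); the synthetic series and forecast horizon are illustrative, while the (4, 0, 2) cotton order follows the text.

```python
# Sketch of the unit-root-then-ARMA workflow using statsmodels.
import numpy as np
from statsmodels.tsa.stattools import zivot_andrews
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=300)) + 50    # stand-in for a price series

# Breakpoint unit root test (trend and intercept break); a small p-value
# rejects the unit root, so the series can be modeled in levels (d = 0).
stat, pvalue = zivot_andrews(prices, regression="ct")[:2]
d = 0 if pvalue < 0.05 else 1

# ARMA(4, 2) in levels, as retained for cotton in the text.
res = SARIMAX(prices, order=(4, d, 2), trend="c").fit(disp=False)
print(res.bic)                                   # Schwarz criterion
forecast = res.forecast(steps=12)                # out-of-sample forecasts
```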
The out-of-sample forecasts are presented in Figures 7 and 8, and the underlying statistics in Table 5. For both commodities, the ARIMA models perform well in predicting the prices. Table 5 shows that both the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE) are lower than those of the LSTM model. The MAPE is 0.44 percentage point lower for cotton and 0.52 point lower for oil. To formally test the performance of the LSTM forecasts vis-à-vis the ARIMA model, we perform the mean-squared-error-based test of Harvey, Leybourne, and Newbold (1997). We conclude at the 5% significance level that the ARIMA forecasts have lower prediction errors than the LSTM forecasts for both commodities.

Table 4: Unit root tests. Source: Authors' computation

    Commodity   Min t    p-value
    Cotton      -4.94    0.0391
    Oil         -4.85    0.0275

Figure 7: Cotton actual versus predicted prices. Source: Authors' computations.

Figure 8: Oil actual versus predicted prices. Source: Authors' computations.

Table 5: Comparison of LSTM and ARIMA results

    Commodity   Criterion   ARIMA   LSTM    HLN test (p-val)
    Cotton      RMSE        0.117   0.150   0.007***
                MAPE        3.779   4.21
    Oil         RMSE        5.133   5.950   0.001***
                MAPE        6.432   6.957
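The Harvey-Leybourne-Newbold test is a small-sample correction of the Diebold-Mariano statistic. A minimal sketch under squared-error loss and a one-step horizon is given below; the two forecast-error series are synthetic stand-ins, not the paper's forecasts.

```python
# Sketch of the Harvey-Leybourne-Newbold (1997) modified Diebold-Mariano
# test for equal forecast accuracy under squared-error loss.
import numpy as np
from scipy import stats

def hln_test(e1, e2, h=1):
    """HLN statistic and p-value for H0: equal forecast accuracy."""
    d = e1 ** 2 - e2 ** 2                 # loss differential
    n = len(d)
    dbar = d.mean()
    # Long-run variance of d; for h = 1 only the lag-0 autocovariance enters.
    gamma = [np.sum((d[k:] - dbar) * (d[:n - k] - dbar)) / n
             for k in range(h)]
    var_dbar = (gamma[0] + 2 * sum(gamma[1:])) / n
    dm = dbar / np.sqrt(var_dbar)
    # Small-sample correction of Harvey, Leybourne, and Newbold (1997),
    # compared against a Student-t distribution with n - 1 df.
    hln = dm * np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    pval = 2 * stats.t.sf(abs(hln), df=n - 1)
    return hln, pval

rng = np.random.default_rng(0)
e_arima = rng.normal(scale=1.0, size=200)     # stand-in ARIMA errors
e_lstm = rng.normal(scale=1.2, size=200)      # stand-in LSTM errors
print(hln_test(e_arima, e_lstm))
```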
7 Forecast Averaging

There has been a recent interest in forecast averaging techniques (Hendry and Clements, 2004), based on Bates and Granger (1969). The main idea behind forecast averaging (or combination of forecasts) is that individual models are misspecified and cannot capture all of a variable's particular patterns. Indeed, a combination of forecasts can yield lower prediction errors than either of the original forecasts (Bates and Granger, 1969), since each model makes different assumptions about the relationships between the variables.

One question that arises when combining forecasts is which weights to use for each model (Smith and Wallis, 2009). Different methods have been proposed in the literature, and we focus here on four of them (a sketch of these weighting schemes is given after Table 7):

• Simple mean: the simple arithmetic mean of the forecasts is used for each observation, so every forecast is given the same weight.
• Least squares: under this approach, the forecasts are regressed against the actual values, and the coefficients of this regression are used as weights.

• Mean square error weights: the mean squared error of each forecast is used to form the individual forecast weights.

• Mean square error ranks: the mean squared error of each forecast is used to rank the forecasts, and the inverse ranks are then used, so that each forecast's weight is the inverse of its rank divided by the sum of the inverse ranks.

Table 6 below presents the forecast combination results using the different weighting schemes, and Table 7 compares the best averaging model with both ARIMA and LSTM. The least-squares weights give the best combination in terms of RMSE, and the rank-based weights in terms of MAPE, for both cotton and oil. There is, however, a difference between the two commodities: for oil, the ARIMA specification still yields the best forecasts, while for cotton, the forecast averaging approach outperforms both the LSTM and the ARIMA models, whichever criterion is used (RMSE or MAPE). This highlights the fact that the individual forecasts suffer from misspecification and fail to fully capture the variable's behavior.

Table 6: Forecast combinations

                              Cotton               Oil
                              RMSE      MAPE       RMSE      MAPE
    Simple mean               0.149     3.505      5.013     6.059
    Least squares             0.137***  3.520      4.913***  6.008
    Mean square error         0.145     3.442      4.962     6.026
    Mean square error ranks   0.143     3.424***   4.915     6.000***

Table 7: Comparison of forecast averages with individual models

    Commodity   Criterion   ARIMA    LSTM     Average
    Cotton      RMSE        0.1377   0.1754   0.1374
                MAPE        3.492    3.935    3.424
    Oil         RMSE        4.823    5.493    4.915
                MAPE        5.937    6.366    6.000
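The sketch below illustrates the four weighting schemes on two synthetic forecast series. The least-squares weights come from regressing the actual values on the individual forecasts, and the rank weights use inverse ranks, as described in the list above; the data and variable names are our own illustrative choices.

```python
# Sketch of the four forecast-combination weighting schemes.
import numpy as np

rng = np.random.default_rng(0)
actual = np.cumsum(rng.normal(size=100)) + 50
f1 = actual + rng.normal(scale=1.0, size=100)   # stand-in ARIMA forecasts
f2 = actual + rng.normal(scale=1.5, size=100)   # stand-in LSTM forecasts
F = np.column_stack([f1, f2])

def rmse(pred):
    return np.sqrt(np.mean((pred - actual) ** 2))

mse = np.array([np.mean((f - actual) ** 2) for f in (f1, f2)])

# 1. Simple mean: equal weights.
combos = {"simple mean": F.mean(axis=1)}
# 2. Least squares: regress actual values on the forecasts.
w_ls, *_ = np.linalg.lstsq(F, actual, rcond=None)
combos["least squares"] = F @ w_ls
# 3. MSE weights: inversely proportional to each forecast's MSE.
w_mse = (1 / mse) / (1 / mse).sum()
combos["mse weights"] = F @ w_mse
# 4. MSE ranks: weights from the inverse ranks of the MSEs.
ranks = mse.argsort().argsort() + 1             # rank 1 = most accurate
w_rank = (1 / ranks) / (1 / ranks).sum()
combos["mse ranks"] = F @ w_rank

for name, pred in combos.items():
    print(f"{name}: RMSE = {rmse(pred):.3f}")
```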
8 Conclusion

Significant progress has been made over the last two decades in machine learning, with potential uses in many areas, including economics. In this paper, we highlighted the potential of these new tools as an alternative, or at least a complement, to traditional forecasting methods. Neural network models do not require a battery of pre-tests to study the underlying properties of a time series (such as unit root tests) and do not assume a parametric form. These are two desirable properties in time-series analysis.

While our two examples indicate that neural network models do not outperform traditional ARIMA models, the results show that combining the two approaches can yield better forecasts than either one of the models. We propose extending our analysis to a wide range of commodity prices, or performing a Monte Carlo analysis, to identify the situations under which one of the models outperforms the other and the circumstances favoring a combination approach.
References
Antonovitz, F. and R. Green (1990), "Alternative Estimates of Fed Beef Supply Response to Risk", American Journal of Agricultural Economics, 72, 475-488.

Bates, J. M. and C. W. J. Granger (1969), "The Combination of Forecasts", Operational Research Quarterly, 20 (4), 451-468.

Box, G. E. P. and G. M. Jenkins (1970), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.