Long short-term memory networks and laglasso for bond yield forecasting: Peeping inside the black box
Manuel Nunes a,∗, Enrico Gerding a, Frank McGroarty b, Mahesan Niranjan a

a School of Electronics and Computer Science, University of Southampton, University Road, Southampton SO17 1BJ, United Kingdom
b Southampton Business School, University of Southampton, University Road, Southampton SO17 1BJ, United Kingdom
Abstract
Modern decision-making in fixed income asset management benefits from intelligent systems, which involve the use of state-of-the-art machine learning models and appropriate methodologies. We conduct the first study of bond yield forecasting using long short-term memory (LSTM) networks, validating their potential and identifying their memory advantage. Specifically, we model the 10-year bond yield using univariate LSTMs with three input sequences and five forecasting horizons. We compare those with multilayer perceptrons (MLP), univariate and with the most relevant features. To demystify the notion of black box associated with LSTMs, we conduct the first internal study of the model. To this end, we calculate the LSTM signals through time, at selected locations in the memory cell, using sequence-to-sequence architectures, uni- and multivariate. We then proceed to explain the states' signals using exogenous information, for which we develop the LSTM-LagLasso methodology. The results show that the univariate LSTM model with additional memory is capable of achieving similar results to the multivariate MLP using macroeconomic and market information. Furthermore, shorter forecasting horizons require smaller input sequences, and vice-versa. The most remarkable property found consistently in the LSTM signals is the activation / deactivation of units through time, and the specialisation of units by yield range or feature. Those signals are complex, but can be explained by exogenous variables. Additionally, some of the relevant features identified via LSTM-LagLasso are not commonly used in forecasting models. In conclusion, our work validates the potential of LSTMs and of these methodologies for bonds, providing additional tools for financial practitioners.

Keywords: Finance, Long short-term memory network, LagLasso, Deep learning, Bond market
1. Introduction
The bond market, in particular the government sector, plays a fundamental role in the overall functioning of the economy and is of paramount importance for financial markets. This is the case both as an asset class by itself (with an overall size of USD 102.0 trillion as of 31-Dec-2016 (Bloomberg, 2017), which compares to a global equity market of USD 66.3 trillion), and because the valuation methods of other asset classes often depend on bond yields as input information, especially for equities and corporate bond yields. In addition, its importance derives from the fact that bonds, and fixed income securities in general, are a significant component of the portfolios
Declarations of interest: none. ∗ Corresponding author.
E-mail addresses: [email protected] (Manuel Nunes), [email protected] (Enrico Gerding), [email protected] (Frank McGroarty), [email protected] (Mahesan Niranjan)

of pension funds and insurance companies. Most commonly, the percentage of bonds varies from 25 to 40% of portfolio assets for pension funds, and around 40 to 70% in the case of life insurance companies (OECD, 2015).

Moreover, the bond market is at the early stages of both quantitative investing and electronic market-making, clearly lagging behind the equities and foreign exchange (forex) markets. This lag becomes evident in the scientific literature. In fact, considerable attention has been devoted to the use of machine learning and the development of techniques for equity markets (Booth et al., 2014; Eilers et al., 2014; Ballings et al., 2015; Dunis et al., 2016; Fischer and Krauss, 2018; Kraus and Feuerriegel, 2017; Qin et al., 2017; Sermpinis et al., 2019), and also for forex markets (Gradojevic and Yang, 2006; Huang et al., 2007; Choudhry et al., 2012; Sermpinis et al., 2012; Fletcher and Shawe-Taylor, 2013; Sermpinis et al., 2013), just to mention a few studies (more detail in Section 2). In contrast,
there is a significant gap in both the academic literature and the finance industry when it comes to the application of machine learning techniques in fixed income markets (Castellani and Santos, 2006; Dunis and Morrison, 2007; Kanevski et al., 2008; Kanevski and Timonin, 2010; Sambasivan and Das, 2017).

Furthermore, most of the applications of machine learning to financial assets tend to be limited to forecasting and the comparison of results versus benchmarks. Only very few publications can be found that try to extract additional information from the model, or study how the model works (see Section 2). Indeed, advances in machine learning enable enhanced decision-making by, e.g., using new types of data (Kraus and Feuerriegel, 2017) and reinforcement learning techniques (Eilers et al., 2014). However, for machine learning models to be useful in asset management decision-making, they need to be trustworthy. To achieve this, a better understanding of their functioning is crucial.

In this field of machine learning, more precisely in deep learning, one of the most successful models for sequence learning is the long short-term memory (LSTM) network. The architecture of this model includes a feedback loop mechanism that enables the model to "remember" past information. This model has been achieving top results in other scientific fields, but has not been used on a broad basis in financial applications. This will be further detailed in Section 2. More specifically, bonds have not been studied previously with LSTMs. This is an additional gap in the literature, despite both the importance of this asset class in financial markets and the potential of LSTMs for financial forecasting. The discussion of this potential will be the focus of Section 3.2.

Given the status quo of machine learning research in bonds, the main high-level objectives of this study are twofold: to assess the potential of LSTM networks for bond yield forecasting, testing their memory advantage versus memory-free models such as standard feedforward neural networks; and to demystify the preconceived notion of black box associated with the LSTM model. Together, these objectives go towards bridging the gaps identified in the literature and presented above. Besides, they contribute to improved knowledge and trustworthiness of LSTM networks, providing asset management practitioners with additional tools for better decision-making.

In more detail, our key contributions are as follows. First, we conduct an innovative application of a deep learning model (LSTM) to bonds. The results are compared to memory-free multilayer perceptrons (MLP). Our results validate the potential of LSTM networks for yield forecasting. This enables their use in intelligent systems for the asset management industry, in order to support the decision-making process associated with the activities of bond portfolio management and trading. Additionally, we identify the LSTM's memory advantage over standard feedforward neural networks, showing that the univariate LSTM model with additional memory is capable of achieving similar results to the multivariate MLP with additional information from markets and the economy.

Second, we go beyond the application of LSTMs, by conducting an in-depth study of the model itself (opening the black box), to understand the representations learned by its internal states.
Such explanations of what black-box models learn are a popular topic of interest for certification and litigation purposes. In more detail, we extract and analyse the signals in both states (hidden and cell) and at the gates inside the LSTM memory cell. This is the first contribution to demystify the notion of black box attached to LSTMs using a technique which is fundamentally different from the most relevant ones found in the literature. Other studies are applied to a different type of recurrent neural network (Giles et al., 2001) or perform an external analysis of the model (Fischer and Krauss, 2018).

Third and last, following the extraction of signals at those locations, we proceed to explain the information they contain with exogenous economic and market variables. For that purpose, we develop a new methodology, here identified as LSTM-LagLasso, based on both Lasso (Tibshirani, 1996) and LagLasso (Mahler, 2009). This methodology is capable of identifying both the relevant features and the corresponding lags.

The remainder of this paper is structured as follows. In Section 2 the literature review is presented. In Section 3 we introduce the theory behind the deep learning model used in our research, together with its main advantages, limitations and potential for yield forecasting. Section 4 covers the bond yield forecasting study using LSTMs versus MLPs. Section 5 focuses on the internal analysis of signals inside the LSTM model, while Section 6 details the explanation of those signals using exogenous variables, introducing the LSTM-LagLasso methodology developed for that purpose. Finally, in Section 7 the main conclusions are outlined, together with directions for future work.
2. Literature review
The literature review starts by looking at the main applications of recurrent neural networks in finance and other fields. Then, considering the subset of publications on forecasting financial assets, we analyse in detail the whole scope of the applications carried out, to compare and contrast with our research.
The recurrent neural network (RNN) family of models, and especially LSTM networks, has been shown to significantly outperform alternative approaches on sequence tasks. In fact, in one of the most recent and comprehensive books on deep learning, the authors categorically state that gated RNNs are the most effective sequence models used in practical applications, i.e. LSTM and gated recurrent unit (GRU) based networks (Goodfellow et al., 2016).

The applications are almost endless, and comprise a wide variety of activities and scientific fields, such as (see Lipton et al. (2015)): handwriting recognition, text generation, natural language processing (recognition, understanding and generation), time series prediction, video analysis, musical information retrieval, image captioning, music generation, and interactive types of problems such as controlling a robot. For natural language processing in particular, LSTMs are among the most widely used deep learning models to date.

Most of these activities have in common the fact that they involve sequential data. And this is what RNNs and LSTMs do best: they can process sequences as input, as output, or, in the most general case, on both sides (Karpathy, 2015). Furthermore, LSTMs are used to take advantage of their capability to learn long-term dependencies.

In the financial domain, no publication could be found in the current literature using this type of model in fixed income markets. A discussion of its potential in this area will be the focus of Section 3.2. Indeed, applications of RNN and LSTM models were found first and foremost for equities. See, for example, the works by Xiong et al. (2016), Persio and Honchar (2016b, 2017), Fischer and Krauss (2018), Kraus and Feuerriegel (2017), Munkhdalai et al. (2017) and Qin et al. (2017). In addition, substantial work can also be found on forex markets (Giles et al., 2001; Maknickienė and Maknickas, 2012; Persio and Honchar, 2016a). Other applications can be found in financial crisis prediction (Gilardoni, 2017) and in credit risk evaluation on P2P (peer-to-peer lending) platforms (Zhang et al., 2017).

Considering the applications to other financial assets presented in Section 2.1, what most of them have in common is that they are pure applications of the model. This usually includes the implementation of the model for the selected asset class (equity, forex, or other), the calculation of errors, and the comparison with other models or benchmarks. In some notable exceptions, the research goes beyond the pure application and analyses the information inside the model. The main studies in this context are presented below.

To begin with, Strobelt et al. (2018) have developed a tool that facilitates the visualisation of signals inside the LSTM memory cell, called "LSTMVis". This analysis can be used to identify hidden state dynamics in the LSTM model that would otherwise be lost information. There are some indications in the literature that this type of information extraction from the gates may be relevant, as will be seen below. This visualisation tool is especially orientated towards the manipulation of sequences of words. They may consist of English words for translation or sentiment detection, or other types of symbolic input, such as musical notes or code.

Based on the "LSTMVis" visualisation tool, Persio and Honchar (2017) provide examples of activation time series in RNNs. The authors mention that this type of analysis may be able to detect trends in time series and, in this case, the signals could be used as indicators.
However, it is not specified in their paper at what level in the model the authors captured those signals (states or gates, and their identification). Nevertheless, this study represents a first attempt at this subject, although further research is needed to demonstrate the mentioned capability of detecting trends in time series.

In an earlier study, Giles et al. (2001) refer to the possibility of extracting rules and knowledge from trained recurrent networks modelling noisy time series. To that end, they first convert the input time series into a symbolic representation using self-organizing maps. Then the problem becomes one of grammatical inference, and they use RNNs considering the sequence of symbols as inputs. More specifically, they use an Elman type of architecture for the recurrent neural network (Elman, 1990). In addition, the converted inputs facilitate the extraction of symbolic information from the trained RNNs in the form of deterministic finite state automata. The interpretation of that information resulted in the extraction of simple rules, such as trend following and mean reversion.

In contrast to Giles et al. (2001), in our research we use the LSTM model, a more recent type of recurrent neural network. Moreover, it should be emphasised that the conversion into symbols has a filtering effect on the data, and this may be undesirable. Hence, our data pre-processing does not include any type of filtering operation, given our view that the data complexity and its volatility do not conform with the concept of noise. This issue will be further detailed in Section 6.1 – LSTM-LagLasso.

Last but not least, Fischer and Krauss (2018) have used a different approach with the same objective of interpreting what happens in the model. We will briefly explain the methodology used, for context and to clarify the differences in relation to our approach. In their research, the authors carry out a notable and comprehensive market prediction study on the component stocks of the S&P 500. Using LSTMs they forecast the next time step, subsequently ranking the individual stocks based on the probability of outperforming the cross-sectional median. Then they group the k top and bottom stock performers, for several values of k. Considering the model's input sequence of returns for each of those groups, they calculate several descriptive statistics and identify characteristics of the stocks belonging to each of those two groups (top and bottom likely performers). Using this methodology, they found that stocks in the top and bottom groups exhibit the following characteristics: high volatility, below-mean momentum, and extreme returns in the last few days with a tendency to revert them in the short term. This is especially the case for groups with a smaller number of stocks (smaller k). Since these characteristics were found on direct outputs of the model, they are attributed to the functioning of the LSTM networks.

Based on the results obtained, Fischer and Krauss (2018) devised a trading strategy, which consists of selling the (recent past) winners and buying the (recent past) losers. This is a possible but simplified trading strategy. It is known in practice in financial markets as "contrarian", in the sense that it is counter-intuitive, and it is certainly far from consensual. See, for example, the work of Jegadeesh and Titman (1993) and Wang et al. (2019) supporting the opposite strategy, "buy the winners and sell the losers".
On the contrary, Khanal and Mishra (2014) found no clear evidence of a profitable "buy the winners and sell the losers" strategy, considering it on a buy-and-hold basis for 3 to 12-month periods, over the period studied between 1990 and 2012. And finally, there is the work of Antoniou et al. (2003) supporting the strategy "buy the losers and sell the winners", the one considered in the work under analysis. In fact, in the most dangerous situations for a stock, and the most penalising for a portfolio return, a sequence of negative returns may be just the beginning of a serious bear market for that particular stock, or of something even more serious affecting the company that will result in a prolonged correction.

To conclude, the authors (Fischer and Krauss, 2018) conducted an external analysis of the LSTM model, in the sense that they use the output of the model to infer its functioning. No internal analysis is carried out of the LSTM states or gates, and this is the main difference in relation to our research and one of our contributions to the present state-of-the-art.
3. Deep learning model
In this section, we present the deep learning model used for this study, namely the LSTM network. The other model considered is not described here; further information on standard feed-forward neural networks / MLPs can be found elsewhere (see, for example, Bishop (2006), Hastie et al. (2013) and Rumelhart et al. (1986), the latter for the training process using the back-propagation algorithm). We then discuss the potential of LSTMs for yield forecasting.
The LSTM architecture was first introduced by Hochreiter and Schmidhuber (1997), and subsequently adapted by other researchers (Gers et al., 1999; Gers and Schmidhuber, 2000; Graves and Schmidhuber, 2005). The LSTM model is a type of recurrent neural network, having a structure that includes a clever feedback loop mechanism delayed in time, and this structure can be "unrolled" in time. At each time step, the LSTM cell has a structure which is substantially more complex than a standard RNN, incorporating four complete neural networks in each of those cells (also called memory cells). In Figure 1 a simplified diagram of the unrolled chain-type structure is presented, identifying the main components of a memory cell. A more detailed representation of an LSTM cell is presented in Figure 2.

Figure 1: Long short-term memory cells unrolled in time (based on Olah (2015), with modifications made by the authors).

Figure 2: Long short-term memory detailed cell diagram (based on Prügel-Bennett (2017), with modifications made by the authors).

The corresponding equations that govern the modern LSTM model can be expressed in the following form, with Figure 2 showing where these operations occur in the LSTM cell (Hochreiter and Schmidhuber, 1997; Lipton et al., 2015; Goodfellow et al., 2016):

f(t) = σ(W_fx x(t) + W_fh h(t−1) + b_f)        (1)
i(t) = σ(W_ix x(t) + W_ih h(t−1) + b_i)        (2)
g(t) = tanh(W_gx x(t) + W_gh h(t−1) + b_g)     (3)
o(t) = σ(W_ox x(t) + W_oh h(t−1) + b_o)        (4)
c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ g(t)             (5)
h(t) = o(t) ⊗ tanh(c(t))                       (6)

where f(t) is the function for the forget gate; i(t) and g(t) are the functions for the input gate and for the input node, respectively; o(t) the function for the output gate; c(t) and c(t−1) are the cell state (also called internal state) at time steps t and t−1; h(t) and h(t−1) the hidden state at time steps t and t−1; x(t) is the input vector at time step t; W are the weight matrices, with W_fx, as an example, representing the weight matrix for the connection input-to-forget gate (indices indicating to-from connections); b_f, b_i, b_g, b_o are the bias vectors; σ is the logistic sigmoid activation function; tanh the hyperbolic tangent activation function; and ⊗ represents the Hadamard product (i.e. element-wise multiplication).

After describing the architecture of the LSTMs, we now identify the advantages, limitations and potential for yield forecasting. First, the main advantage of the LSTM model is related to the reason why it was developed in the first place. RNNs could not capture long-term dependencies due to the vanishing or exploding gradients problem, first identified by Hochreiter (1991) in a diploma thesis (Hochreiter et al., 2001; Schmidhuber, 2015), and in parallel research work by Bengio et al. (1993, 1994). In fact, this memory capability is one of the characteristics that clearly separates this type of model from standard neural networks, in particular the MLPs. Given the stateless condition of MLP models, they only learn fixed function approximations.

For modelling financial time series, it seems probable that long-term dependencies are important. Even though the last available value of the series is the one collecting all the available information in the market up to the present moment, inversion points tend to follow certain patterns, frequently exploited by technical analysts who look essentially at chart data.
A model with memory, capable of learning long-term dependencies, may be beneficial for this reason.

Second, in sequence prediction problems, the sequence imposes an order on the data that must be respected when training and forecasting, i.e. the order of the observations is important for the modelling process. This is the case with financial time series. However, in feed-forward neural networks / MLPs, the modelling of a time series' temporal structure is only done indirectly, through the consideration of multiple time steps as different input features. Although with this method previous values are included in the regression problem, the natural "sequence" or structure of the time series is not really present in the modelling process, and the model does not have any knowledge of it. LSTMs are the most effective models for sequence learning, modelling these time sequences directly. Additionally, they can input and output sequences time step by time step, enabling variable-length inputs and/or outputs. With this property, they overcome one of the main limitations of standard feedforward neural networks.

Third and last, a model for financial time series should be able to perform multi-asset forecasting, for the prediction of several targets simultaneously, and it would be desirable to perform multi-step forecasting, to consider several forecasting horizons into the future. The LSTM type of model is capable of dealing with multivariate problems and also with multi-step prediction using sequence-to-sequence architectures, thus fulfilling these requisites naturally.

On the limitation side, some time series forecasting problems are technically simpler, not requiring the characteristics of a recurrent type of model. This is the case in particular when the most relevant data for making the prediction is within a small window of recent historic values. Here, the capability to deal with long-term dependencies and the model "memory" are clearly not necessary. In this type of situation, MLPs and even linear models may outperform the pure-autoregressive univariate LSTM model, with lower complexity (Gers et al., 2002; Brownlee, 2018).

Overall, given the additional complexity of the model, it should only be used when the problem at hand is better modelled by this type of neural network architecture. This is reflected in two main conditions: sequential data, and long-term dependencies that may help the forecasting process. Nonetheless, LSTM networks' potential for yield forecasting seems evident. Despite being predominantly used for non-financial applications, their characteristics make them potentially suitable for financial time series prediction.
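To make Equations 1–6 concrete, the following minimal NumPy sketch implements a single LSTM cell step. It is purely illustrative: the experiments in this paper rely on standard deep learning libraries, and the concatenated weight layout, the random initialisation and the toy yield sequence below are our own assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step (Equations 1-6). For each gate, the weights for
    # the input x(t) and the previous hidden state h(t-1) are stored as a
    # single matrix of shape (n_units, n_inputs + n_units).
    z = np.concatenate([x_t, h_prev])      # joint input [x(t); h(t-1)]
    f = sigmoid(W['f'] @ z + b['f'])       # forget gate, Eq. (1)
    i = sigmoid(W['i'] @ z + b['i'])       # input gate,  Eq. (2)
    g = np.tanh(W['g'] @ z + b['g'])       # input node,  Eq. (3)
    o = sigmoid(W['o'] @ z + b['o'])       # output gate, Eq. (4)
    c = f * c_prev + i * g                 # cell state update,   Eq. (5)
    h = o * np.tanh(c)                     # hidden state update, Eq. (6)
    return h, c, {'f': f, 'ig': i * g, 'o': o}

# Toy usage: one input feature, three hidden units (as in Section 5).
rng = np.random.default_rng(0)
n_in, n_units = 1, 3
W = {k: rng.normal(size=(n_units, n_in + n_units)) for k in 'figo'}
b = {k: np.zeros(n_units) for k in 'figo'}
h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in [np.array([0.50]), np.array([0.52]), np.array([0.49])]:
    h, c, gates = lstm_step(x_t, h, c, W, b)   # unrolled through time

The returned gate dictionary anticipates the signal analysis of Section 5, where exactly these quantities are extracted at every time step.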
4. LSTM networks for bond yield forecasting
Now that we have discussed the LSTM model and its potential for yield forecasting, we move on to the empirical work carried out. In this section, we describe the choices made regarding the dataset, identifying the target, the features and the pre-modelling operations (including the generation of additional features for the MLP model, the train-test split and normalisation).
Given the interconnectedness and mutual influence of the various asset classes in the markets, we consider a large number of features from financial markets. These are selected from government bond markets and from related classes and indicators: credit (corporate bonds), equities, currencies, commodities and volatility. Additional features are added, which are calculated from the previously mentioned features, specifically bond spreads, the slope of the yield curve and simple technical analysis indicators. Furthermore, economic variables are also very important, as clearly exemplified by well-established yields-macro models such as the Dynamic Nelson–Siegel model (Diebold and Li, 2006). Hence, a vast range of economic indicators is also included, from different geographic locations. The complete list includes 159 features and, because it is so extensive, it is stored and made publicly available (Nunes, 2020). The target chosen for this study is the generic rate of the 10-year Euro government bond yield.

The dataset was obtained from the Bloomberg database (Bloomberg, 2017) and covers the period from January 1999 to April 2017, giving 4779 time points of data. The former is the starting date for most time series of the Euro benchmarks, and the total period covers several bull and bear markets. Regarding data frequency, we selected daily closing values, which are easily available for financial assets in general.

As mentioned previously in Section 3.2, in the case of feed-forward neural networks / MLPs, the modelling of a time series' temporal structure is done indirectly, through the consideration of multiple time steps as different input features. Contrary to the MLP work, there is no need to generate additional features for the LSTM network, since previous time steps are given directly to the model in the form of an input sequence. Hence, for the MLPs, new features are generated from the original ones, corresponding to lagged values of the respective time series. In our research, six time steps are considered, based on previous studies (Mahler, 2009). The selection of the most relevant features is carried out using Lasso (standing for Least Absolute Shrinkage and Selection Operator) regression, proposed by Tibshirani (1996).

As is common, we divide the data into two groups, for training and testing the models; in this case, a 70% / 30% split. In addition, the data are normalised, since the features have different scales in some cases. For example, the 10-year yield is quoted in percentage, usually staying below 6%, while several equity indices reach values well above 10000 (Dow Jones, Nikkei 225 and Hang Seng).

In this section, the empirical work carried out is identified and explained: first, we compare directly the univariate MLP and LSTM models; then, we further assess the LSTM potential for yield forecasting using different input sequences. Moreover, the models considered are specified, and additional aspects of the methodology adopted are described, in particular concerning moving windows, retraining of models and cross-validation. A summary of the models used and additional model information is presented in Table 1.

Table 1: Summary of the models used.
Model information
  Original features       159
  Generated features      795
  Target                  10-year yield
  Forecasting horizons    0 (next day), 5, 10, 15, 20 days
  Moving window size      3000 days
  Hidden units (MLP)      10
  Hidden units (LSTM)     100
  Time steps (MLP)        6 days
  Time steps (LSTM)       6, 21, 61 days

Model    Short name    Description
Direct comparison MLP vs. LSTM
  MLP    NN TgtOnly    MLP with target data only
  LSTM   LSTM06        LSTM using an input sequence of 6 time steps
LSTMs with different input sequences
  MLP    NN RelFeat    MLP with relevant features
  MLP    NN TgtOnly    MLP with target data only
  LSTM   LSTM06        LSTM using an input sequence of 6 time steps
  LSTM   LSTM21        LSTM using an input sequence of 21 time steps
  LSTM   LSTM61        LSTM using an input sequence of 61 time steps

For a streamlined approach to the model, we apply LSTM networks to a univariate type of problem, i.e. the model has only one feature, corresponding to past values of the target we want to predict. This is justified by the fact that we want to be able to assess the LSTM model potential, so we prefer to make the comparison with MLPs in its purest condition. In other words, we prefer to perform the comparison without introducing additional features in the LSTM model, which would introduce an extra level of complexity.

When forecasting for longer horizons beyond one step ahead, there are two methods that can be used: direct or iterative forecasting. On the one hand, in the direct forecast only current and past data are used to forecast directly the time step required, using a horizon-specific model. On the other hand, in the iterative forecast a one-step-ahead model is iterated forward until the target forecasting horizon is reached. All the models use direct forecasting of targets given that, with financial time series, the prediction errors tend to propagate fast if we were to make iterative predictions. As a result, new neural networks are trained for each forecasting horizon.

For the direct comparison between the univariate MLP and the univariate LSTM, an LSTM is selected that most closely simulates the conditions applied to the MLP neural network, i.e. both considering 6 time steps, based on previous studies (Mahler, 2009). The corresponding models are (Table 1): Model "NN TgtOnly", meaning a neural network using only the target variable as feature; and Model "LSTM06". To further assess the potential of LSTMs for yield forecasting, we extend the number of models on both sides. On the LSTM side, we consider three different input sequences in total: 6, 21 and 61 time steps, with the last two corresponding to approximately 1 and 3 calendar months. The selected numbers of time steps also follow the structure "next day" plus desired period, i.e. 1 + 20 and 1 + 60, which will be useful later on in this work when using sequence-to-sequence LSTM architectures. On the MLP side, we add to the univariate model the MLP using the most relevant features selected for the 10-year bond yield target, individually for each forecasting horizon (Model "NN RelFeat"). Besides, the comparisons are carried out for all forecasting horizons considered, i.e. next day and (next day plus) 5, 10, 15 and 20 days ahead.

Both MLP and LSTM models are trained using moving windows and retraining of the models at every time step. This technique is feasible in real time and is used to take full advantage of the models.

Throughout this study, the main metric used is the mean squared error (MSE), which is commonly used for this purpose. The results in Section 4.3 are presented in the normalised version, i.e. calculated directly from the real and predicted normalised yields (Section 4.1), since the non-normalised equivalents are scale dependent. Hence, the normalised metric is used to facilitate the comparison of models.

In terms of the number of hidden units for the LSTM, this is set to 100 units (Table 1). A higher number would require additional computational time, and this is a good compromise between speed and accuracy. This number is also compatible with the maximum number of relevant variables considered in the MLP study for modelling the longest forecasting horizons. Regarding the MLP number of hidden units, the main conclusion from the hyperparameter tuning is that 10 hidden units is a good compromise, with significant overfitting observed for neural networks with more than 100 units.

Finally, the optimiser chosen is Adam (Kingma and Ba, 2015). This is an algorithm for gradient-based optimisation of stochastic objective functions, and it is frequently used for this type of problem (Brownlee, 2018).
In this section, the main results are presented for both studies: the direct comparison MLP vs. LSTM, and the assessment of LSTM potential using different input sequences. This is followed by a discussion of their implications for the present research.

Starting with the first study, the direct comparison of models, the results obtained are summarised in Figure 3. As can be seen, the results obtained with the LSTM model are better for forecasting horizons of 5 and 10 days, achieving MSE reductions of 25% and 14%, respectively (median values), compared to the univariate MLP. When the horizon is too large (20 days) the advantage of the LSTM is lost. In all other cases, the results are not significantly different.

Figure 3: Direct comparison of models: univariate multilayer perceptron "NN TgtOnly" vs. long short-term memory network for an input sequence of 6 time steps "LSTM06".
Another aspect that can be observed from the results is that the standard deviation obtained with LSTMs is lower for all horizons analysed. This is also an important outcome, giving indications of higher stability of this type of model when compared with the more traditional MLP model.

Regarding the second study, LSTMs using different input sequences, the results are shown in Figure 4. When forecasting the next day (Figure 4a), the results obtained from all models are similar. This situation is equivalent to that reported in Nunes et al. (2019) for next-day forecasting with models including multivariate linear regression and a variety of MLP-type models. One striking advantage revealed by the LSTM is again the lower standard deviation in all LSTM models when compared to both MLPs.

However, it is when we forecast more distant time steps that the benefits of LSTMs become more evident. Hence, considering the forecasting horizon of (next day plus) 5 days, as shown in Figure 4b, the LSTMs with input sequence lengths of 6 and 21 time steps (Models "LSTM06" and "LSTM21", respectively) produce results that significantly outperform both MLP models, with lower errors and much lower standard deviations. When compared to the univariate MLP, these models achieve MSE reductions of 25% and 34%, respectively (median values). On the other hand, the LSTM with an input sequence of 61 days (Model "LSTM61") does not produce a reduction in error, generating similar median results, also with significantly lower standard deviation. Therefore, it appears that forecasting 5 days ahead does not require such a long input sequence of 60 days.

When we consider forecasting horizons of 10 and 15 days (Figures 4c and 4d), all LSTMs tend to perform better than the univariate MLP model (Model "NN TgtOnly"), with lower standard deviations. The MSE reductions achieved range from 2% to 47%. Another important observation is that the LSTMs with longer input sequences are able to reach similar levels of forecasting error to the MLP with the most relevant features (Model "NN RelFeat"), again with lower standard deviation. In fact, the LSTM appears capable of compensating, at least partially, for the lack of additional information from markets with additional memory via longer input sequences. This is a promising result for future work.

Finally, the differences are not so clear when we consider a forecasting horizon equal to (next day plus) 20 days (Figure 4e). In this case, the LSTMs with longer input sequences (Models "LSTM21" and "LSTM61") perform better than the univariate MLP (MSE reductions of 17% and 19%, respectively), with slightly lower standard deviation. However, the shorter-sequence LSTM (Model "LSTM06") produces a slightly worse median value (an MSE increase of 11%), although with lower standard deviation. A possible explanation for this may be that an input sequence length of 6 time steps is already insufficient for forecasting 20 days ahead. Thus, either additional features or longer input sequences are required.

These results suggest that the LSTM architecture needs to take into account the specific problem we are trying to solve and the type of forecasting horizon we aim to predict. Additionally, the results of these comparisons, carried out over a wide range of input and forecasting horizons, suggest that, under some conditions, the structure in the data is better captured by having models with time delays in them (i.e. LSTMs), which can strike a balance between the use of immediate and distant past data.
However, the advantage of one class of models over others is not universal, as indicated by the no free lunch theorem (Wolpert and Macready, 1997). In the next section, we probe the LSTM further, opening the black box, to see if the representations learned in its internal states are interpretable in any way. Such explanations of what black-box models learn are a popular topic of interest for certification and litigation purposes.
5. Opening the LSTM black box: signals analysis
Neural network methods in general, and LSTMs in particular, are considered black-box types of models. They establish functional relationships between inputs and outputs, but one cannot extract interpretable information from the model itself. In this section we present our contribution to demystifying this complex deep learning model, by analysing the signals inside the LSTM memory cell and extracting relevant information. In turn, this is subsequently used to identify relevant explanatory features for this problem (Section 6). This section also includes a brief explanation of the LSTM memory cell, its states and gates, since they are indispensable to understand the methodology used.
We will now explain how an LSTM works, the main components of the cell (Figure 2) and the complete algorithm (Equations 1 to 6). In each LSTM cell, the flow of information is controlled by three gates, namely the forget, input and output gates. The operations performed at cell level are schematically presented in Figure 5. All calculations at each gate depend on the current inputs at the same time step and the previous hidden state (at time step t−1). The output from the cell depends on those two variables plus the cell state at time step t.
Forget gate
The forget gate defines which information to remove or "forget" from the cell state. For this purpose, the forget gate has a neural network with a logistic sigmoid activation function ranging from 0 to 1 (Figure 5a). The extremes of that interval correspond to: keep this information (1) or completely remove this information (0). The mathematical operations performed at this gate are represented by Equation 1.

Figure 4: Comparison of models: two types of multilayer perceptrons (the MLP using the relevant features determined for the 10-year yield target and for each forecasting horizon, "NN RelFeat", and the univariate MLP "NN TgtOnly") vs. long short-term memory networks (LSTM) for input sequences of 6, 21 and 61 time steps ("LSTM06", "LSTM21" and "LSTM61", respectively). Panels (a) to (e) correspond to forecasting horizons of next day, 5, 10, 15 and 20 days.

Figure 5: Long short-term memory states and gates: (a) forget gate; (b) input gate and input node; (c) cell state update; (d) output gate and hidden state update.
Input gate and input node
The input gate specifies which information to add to the previous cell state. This part of the cell comprises two elements, as shown in Figure 5b. The first component is the input gate, where the inputs (the hidden state at time step t−1 and the inputs at time step t) go through a neural network with a logistic sigmoid activation function (corresponding to node i and function i(t)). The second component is the input node (to differentiate it from a "gate") and represents the new "candidates" that could be added to the cell state. These are generated through a neural network with a hyperbolic tangent activation function (corresponding to node g and function g(t)). The corresponding operations carried out in this section of the cell are represented by Equations 2 and 3.

Cell state update
The cell state update is performed using the results of both the forget and input gates. The operations implemented for this purpose are presented schematically in Figure 5c, and in mathematical terms by Equation 5.
Output gate and hidden state update
The output gate defines the information from the cell state that will be used as the output of the memory cell for the present time step. This gate has the fourth neural network of the LSTM cell, with a logistic sigmoid activation function (Figure 5d). The operations applied in this gate are represented by Equation 4. Finally, the actual output from the cell results from the hidden state update, which is computed using Equation 6, taking into account the results of the output gate and the present cell state, pushed through a hyperbolic tangent function (other activation functions may be used).
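As a small worked example of these two updates, consider a single unit with illustrative (made-up) gate outputs f(t) = 0.9, i(t) = 0.3, g(t) = 0.8 and o(t) = 0.6, and a previous cell state c(t−1) = 0.5. Equations 5 and 6 then give:

c(t) = 0.9 × 0.5 + 0.3 × 0.8 = 0.69
h(t) = 0.6 × tanh(0.69) ≈ 0.6 × 0.598 ≈ 0.36

In words: about 90% of the old memory is retained, a modest amount of new candidate information is added, and only part of the updated memory is exposed as the cell's output.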
Cell and hidden states
The information flows through the gates described above, and they are important to justify what happens in the states. However, all the information is transmitted to the following time step through the states. In this sense, the signals from the states summarise all the information. The main concept behind the cell state is that it represents the long-term memory of the model, while the hidden state corresponds to the short-term memory.
The analysis of signals in the LSTM states and gates is conducted at different levels of the memory cell, specifically: the forget gate (Equation 1); the product of the outputs from the input gate and input node (Equations 2 and 3); the output gate (Equation 4); the cell state (Equation 5); and the hidden state (Equation 6). The product of the outputs from the input gate and input node is chosen instead of the individual outputs. The main reason is that the product is what is added to update the previous cell state, thus containing the most relevant and interpretable information. Using the equations referred to above, the signals are calculated in each of those locations, at every time step, and for each individual unit of the LSTM memory cell.

Regarding features, three different cases are analysed, to examine whether the behaviour found is consistent under different conditions. In this case we use univariate (feature set 1) and multivariate LSTM models (feature sets 2 and 3). The 10-year bond yield is a feature in all sets considered. In the second set we add a technical momentum indicator developed by Merrill Lynch, now Bank of America Merrill Lynch (Garman, 2001). This indicator is based on the concept of "reversion to the mean", assuming a normal distribution of the deviations from a short-term average of 30 working days. The third feature set includes the 10-year yield and the two closest benchmarks in the yield curve, the 5-year and 30-year yields. Yields with maturities adjacent to the one we want to forecast were found to be relevant features in previous work (Nunes et al., 2019). A summary of the LSTM model used in this study is presented in Table 2.

Additional options for the signals analysis are justified below. First, for this study we select a single forecasting horizon beyond the next day, one for which there was clear differentiation between MLP and LSTM models. Second, the number of hidden units in the LSTM is reduced to three.

Table 2: Summary of the LSTM model used for signal analysis.

LSTM architecture
  Features              set 1: 10-year yield
                        set 2: 10-year yield, momentum indicator
                        set 3: 10-year, 5-year, 30-year yields
  Target                10-year yield
  Forecasting horizon   next day + h (see text)
  Hidden units          3

The main results for the different feature sets are presented in this section, starting with the univariate model (feature set 1). The signals for both hidden and cell states are presented in Figure 6. These signals are plotted against the 10-year yield, for reference and to facilitate the interpretation process.

From the results, we can observe some similarity between the signals of the hidden and cell states. It is worth emphasising that the hidden state at time step t is calculated using the cell state at the same time step and the result of the output gate, using Equation 6. Consequently, some similarity between those signals may be expected.

A more remarkable property shown in the signals is that unit 1 becomes almost inactive, both in terms of volatility and weight, during two different periods, identified as periods 1 and 2 in Figure 6. During those, while unit 1 tends to a zero weight, units 2 and 3 take over, becoming more active and following more closely the volatility of the 10-year bond yield. The periods mentioned are naturally different, both in duration and occurrence in time. However, they both correspond to periods in which the 10-year yield assumes downward extreme values, more specifically yields of 3.6-3.7% and below. Taking this into account, there is some evidence that some units specialise in different yield ranges.

Figure 6: Long short-term memory network signals: (a) hidden state signals; (b) cell state signals.

As mentioned before, the hidden and cell states are the most important, since they summarise all the information going through all the other gates. The signals at the cell gates are helpful to understand and confirm what happens at the states' level. In this vein, and as an example of this type of confirmation in relation to the cell state, we present the signals at the following two locations in the cell: input gate ⊗ input node, and forget gate (Figure 7). We observed that unit 1 becomes almost inactive, tending to a zero weight during periods 1 and 2 (Figure 6). This behaviour is better understood and justified at gate level. Indeed, at the input gate ⊗ input node location, the resulting signal is adding approximately zero of unit 1 to the previous cell state (Figure 7a). At the same time, the forget gate is increasing the amount to forget of unit 1 (via lower weights) during the same periods (Figure 7b), while decreasing it for the other two units (via higher weights). As to the hidden state, the above conclusions are combined with the result from the output gate, which shows a marked decline in the weights for unit 1, with opposite movements for units 2 and 3 (Figure 8).

Moving to feature sets 2 and 3, although the behaviour is distinct in each case, the same type of activation / deactivation of units can be identified. For conciseness, an exemplification of the results obtained for the hidden state for both feature sets is provided in Figure 9. For set 2 (Figure 9a), despite the high volatility of the second feature (the momentum indicator, Table 2), the same two periods can clearly be observed as described for feature set 1 (Figure 6a). Note that, in this case, it is unit 3 that assumes the role of the previous unit 1. This switch has no relevance, since all units are equal at the start of the learning process.
Another interesting observation (not presented here for succinctness) is the fact that, given the high volatility of the momentum indicator, two of the units tend to "specialise" in this feature and only one unit in the 10-year yield.

For feature set 3 (Figure 9b), the pattern of the signals in relation to the 10-year yield data is much looser. One of the reasons that may justify this behaviour is the higher correlation among all the yields considered as features (the 5, 10 and 30-year yields). As a result, there is a lower level of dependency on only one of them. Similarly to what was found in the previous feature sets, we can identify some form of yield range specialisation of the units. In this case, unit 3 seems to cover a yield range above approximately 4.5%, becoming much less active subsequently (Figure 9b).

Overall, the most remarkable property found consistently in the LSTM signals, for all feature sets, is the activation / deactivation of units during learning. It can be characterised by an alternation of periods of weight close to zero or low variability, with no significant change in the states, with periods where the units become highly active, giving higher contributions to the forecasting process. Furthermore, we found evidence that the LSTM units may specialise in the different yield ranges or features considered.
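As a concrete illustration of the extraction procedure described above, the sketch below records the analysed signals at every time step and for every unit, reusing the lstm_step function from the sketch in Section 3. It is a schematic reconstruction under those same assumptions, not the exact code behind Figures 6–9.

import numpy as np

def extract_signals(x_seq, W, b, n_units=3):
    # Record, per time step and per unit, the five signals analysed in
    # Section 5.2: forget gate, input gate (x) input node, output gate,
    # cell state and hidden state (Equations 1-6).
    h, c = np.zeros(n_units), np.zeros(n_units)
    log = {k: [] for k in ('f', 'ig', 'o', 'c', 'h')}
    for x_t in x_seq:
        h, c, gates = lstm_step(x_t, h, c, W, b)   # sketch from Section 3
        for k in ('f', 'ig', 'o'):
            log[k].append(gates[k])
        log['c'].append(c.copy())
        log['h'].append(h.copy())
    # Each entry becomes an array of shape (T, n_units), ready to be
    # plotted against the 10-year yield, as in Figure 6.
    return {k: np.asarray(v) for k, v in log.items()}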
6. New LSTM-LagLasso method
Our intention here is to interpret the representations learned by the LSTM model, whose estimation is driven by purely statistical considerations of the error being minimised, in terms of external effects that might have an influence on the bond yields. Should the model extract meaningful features during the learning process, the representations would correlate with exogenous information that was not available to the learning algorithm. In other words, it is precisely because the representations may match such exogenous information that the predictive ability of the black-box models is good. In this section, we briefly present the Lasso and the Kalman-LagLasso, and then introduce our methodology, the LSTM-LagLasso.
Lasso
The Lasso regression we formulate to explain the signals within the LSTM (features extracted by the LSTM) in terms of exogenous variables has the form (Tibshirani, 1996):

min_w { ||X w − s_lstm||_2^2 + γ ||w||_1 }        (7)

where X is the matrix of features and respective lags; w is the vector of unknown parameters; s_lstm are the LSTM cell and hidden state signals (target vectors); γ is the regularisation parameter; ||·||_2 denotes the L2-norm; and ||·||_1 the L1-norm.

Figure 7: Example of signals at gates level: (a) input gate ⊗ input node; (b) forget gate.

Figure 8: Gate signals: output gate.

The Lasso regression determines the parameters of the model by minimising the sum of squared residuals, using an L1-norm penalty for the weights. Due to the type of constraint, it tends to lead to sparse solutions, i.e. some coefficients are exactly zero and, as a result, the corresponding features are discarded. This is particularly important, since it enables a continuous type of feature selection through the tuning of the regularisation parameter γ, and the identification of the most relevant features for the model.
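A minimal scikit-learn sketch of Equation 7 is given below. One caveat we flag as an assumption: sklearn's Lasso minimises (1/(2n))·||y − Xw||_2^2 + α·||w||_1, so its α corresponds to our γ only up to a scaling by the sample size, and it is a stand-in for whatever solver is used in practice.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_select(X, s_lstm, gamma=1.0):
    # X: standardised exogenous features (and their lags);
    # s_lstm: one LSTM cell- or hidden-state signal (the target).
    model = Lasso(alpha=gamma, fit_intercept=False)
    model.fit(X, s_lstm)
    relevant = np.flatnonzero(model.coef_)   # L1 penalty zeroes many weights
    return relevant, model.coef_[relevant]

The indices returned identify the surviving feature/lag columns of X; sweeping gamma traces the continuous feature-selection path mentioned above.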
Kalman-LagLasso

Mahler (2009) introduced the Kalman-LagLasso method in a study to predict the monthly changes of the S&P 500 index, using macroeconomic and financial variables. The overall procedure included two phases. In the first, Kalman filters (Kalman, 1960; Niranjan, 1996) were used to denoise the explanatory variables and predict the residuals (the part not explained by the model). Then the LagLasso phase is implemented, to determine the most relevant features and respective lags, using the filtered / denoised variables to explain the prediction residuals. The LagLasso method is based on Lasso (Tibshirani, 1996), implemented via the modified Least Angle Regression (LARS) algorithm (Efron et al., 2004). The algorithm was modified by Mahler (2009) so that only one lag per feature is selected. Once selected, the other lags of the same feature are eliminated from the active set of features, so that they cannot be selected again. The Kalman-LagLasso is an elegant method of combining a forecasting method with error analysis, seeking to explain the part not explained by the model, i.e. the residuals, with external financial variables.

More recently (Montesdeoca and Niranjan, 2020), and using a similar approach, the Kalman-LagLasso method has been used to compare the type of information influencing US stock indices (the S&P 500 and Dow Jones Industrial Average) in contrast with that affecting cryptocurrencies (Bitcoin and Ethereum).
LSTM-LagLasso

The LSTM-LagLasso method is the new methodology we developed to explain the signals extracted from the LSTM states. It is inspired by the Kalman-LagLasso method, but with significant modifications in terms of the model used, the target variable to which the LagLasso method is applied, and the methodology to determine the relevant features and respective lags.

When compared to the Kalman-LagLasso method, our objective is different: we aim to analyse what information is contained in the signals and, in particular, whether they can be explained by external variables. As a result of this objective, instead of using a Kalman filter implementation of a linear autoregressive model, we use non-linear LSTMs as the main model.

Our methodology also diverges in the target used for the LagLasso procedure. While in the Kalman-LagLasso the targets are the prediction residuals, here the targets are the signals extracted from the LSTM cell and hidden states.

Figure 9: Hidden state signals for feature sets 2 and 3: (a) feature set 2; (b) feature set 3.

In addition, we differ from the original paper (Mahler, 2009) in several aspects. First, we do not denoise the features, and use Lasso directly with the variables and respective lags. To explain this option, we need to mention that Kalman filters are often applied to sensor data (Park et al., 2019), and it is known that this type of data does not provide exact readings / measurements. This is because sensors introduce their own distortions and are always corrupted by noise (Maybeck, 1979). In this context, the use of a Kalman filter to remove the noise of sensor data is a natural application. However, in financial markets the concept of noise cannot realistically be applied, in our opinion. Instead, it is the complexity of market dynamics that is responsible for price formation. Indeed, it is the multiplicity of factors influencing the markets that makes the problem extremely complex. Part of those factors are already incorporated in the historic values of the time series, but another important part comes from new information arriving to the markets in real time, together with the new expectations and reactions of market participants to that information. Ultimately, from the interaction of all these factors results a new equilibrium and a new real-time asset valuation. This concept, applied most directly to market variables, can also be extended to macroeconomic indicators. Our approach seems to be more appropriate when dealing with financial time series.

The second main difference in our methodology is that we consider all lags selected as relevant in the LSTM-LagLasso algorithm, and not only one for each feature, as proposed in the Kalman-LagLasso procedure (Mahler, 2009). We found no reason to limit the number of lags that may participate in the forecasting process to only one. The LSTM-LagLasso methodology is outlined in Algorithm 1.

Algorithm 1: LSTM-LagLasso algorithm
input : Macroeconomic and market features
target: LSTM cell and hidden state signals
1  Extract LSTM cell and hidden state signals and set them as targets in independent models (one per state).
2  Select the number of lags to consider {k ∈ [1, 6]}.
3  Build matrix M containing macroeconomic and market features.
4  Transform matrix M into matrix X by incorporating k lags.
5  Standardize all variables (features and target) to have zero mean and unit standard deviation.
6  for γ ← γ_min to γ_max do
7      Perform Lasso regression (Equation 7).
8      Identify non-zero weight values in vector w.
9  end for
10 Select γ (look for a stabilising trend in the number of features against γ, and the forecasting error).
11 Perform Lasso regression with the selected γ.
12 Identify the final most relevant variables and lags (non-zero weight values in vector w).

The LSTM-LagLasso method is applied individually to both hidden and cell states, and for each of the three LSTM units. Additional clarifications of the options considered are presented below. First, the number of lags is equal to the length of the sequences (input and output) used in the LSTM model in Section 5.2, i.e. six lags per feature (Algorithm 1, Line 2). Second, for the external variables to explain the hidden and cell state signals, we use the same large set of features considered for the MLP model (Section 4.1), a list of 159 macroeconomic and market variables not available to the model during the learning process (Algorithm 1, Line 3). Third, for the selection of the regularisation parameter, we considered both the trend in the number of features against γ and the forecasting error (Algorithm 1, Line 10). After an initial rapid drop in the number of features, the trend changes significantly, but a clear period of stabilisation cannot be observed. Additionally, for γ above 1.0, the error starts increasing more rapidly and the quality of fit deteriorating. For this reason we selected γ = 1.0. Fourth and last, we exclude from the original features the 10-year bond yield, since this information is known to the model (both the last available value and the respective lags), as the only feature considered in the univariate LSTM model used in Section 4.

The application of LSTM-LagLasso to the hidden state is presented in Figures 11–13, respectively for hidden units 1 to 3. The figures show the most relevant variables identified by this methodology. Note that we plot the absolute value of the weights, and not the weights directly, for easier visualisation. Since we are using market variables to explain the LSTM signals, and not a financial asset target directly, the visualisation of whether the weights are positive or negative is not as relevant as their magnitude, i.e. the relevance of the feature as an explanatory variable.

Figure 10: Example of LagLasso prediction of LSTM signals: actual versus predicted, for the hidden state, unit 3.

The corresponding results regarding the application of LSTM-LagLasso to the cell state are not presented here, for brevity, but also because the main points that can be extracted from them only reinforce the conclusions for the hidden state. This can also be observed in the Venn diagram shown in Figure 14, where almost 80% of the relevant features identified are common to both hidden and cell states. An additional point worth noticing is that the cell state needs slightly more explanatory variables (three more).

A similar type of comparison is conducted among the different hidden units, to illustrate the common relevant features identified. The diagram is presented in Figure 15. Here we can see that not all top relevant features are common to all hidden units, although 24.7% of them are common to all three units in the hidden state, and 21.4% in the cell state.

An additional summary of results for the most influential relevant variables is presented in Table 3. For this purpose, only those with an absolute weight greater than or equal to 0.15 are selected.

From the results, a number of conclusions can be drawn (Figures 11–13 and Table 3). First, the results confirm that the signals can be explained by external sources of information, not available to the models either during training or forecasting.
Second, the LSTM signals are complex and require a significant number of explanatory variables. Third, we observe that for many features there are several lags that are relevant for the prediction process. The most important lags are the last two values (t, t−1) and the lag one week before (t−5). The importance of the latter lag is interesting, pointing to some type of weekly seasonality or influence. Given this conclusion that lags are important, selecting only one lag per feature, as proposed in the Kalman-LagLasso method (Mahler, 2009), would eliminate this additional information, limiting the forecasting ability.

Table 3: Summary of the most influential relevant features for both states. The listed features have an absolute weight greater than or equal to 0.15 in at least one of the LSTM units; for each feature, the non-zero weights across units 1–3 are listed (zero weights are omitted).

Hidden state
Feature name            Non-zero weights (units 1–3)
ECB Refi Rate           0.150
3M Euribor Fut 4th      0.177, 0.248
GBR 30Y                 0.161, 0.053
DEU Bond Fut 30Y        0.674, 0.287, 0.410
EUR Swaps 5Y            0.398
EUR Swaps 30Y           0.192, 0.199, 0.315
FTSE 100                0.056, 0.160, 0.107
Gold Futures            0.193, 0.012, 0.004
US 2Y-10Y Spread        0.068, 0.087, 0.198
EUR 10Y-30Y Spread      0.264, 0.214
EUR 10Y MA 5 days       0.247, 0.082
EUR 30Y MA 200 days     0.298
EUR 5Y                  0.221, 0.202
EUR 30Y                 0.450, 0.168

Cell state
Feature name            Non-zero weights (units 1–3)
3M Euribor Fut 4th      0.212
90day Euro$ Fut 5th     0.203
GBR 30Y                 0.151, 0.038
DEU Bond Fut 30Y        0.723, 0.921, 0.533
EUR Swaps 5Y            0.258
EUR Swaps 30Y           0.220, 0.186
Gold Futures            0.168, 0.109, 0.051
EUR 10Y-30Y Spread      0.271, 0.270
EUR 10Y MA 5 days       0.525
EUR 30Y MA 200 days     0.273
EUR 5Y                  0.003, 0.172
EUR 30Y                 0.240

Fourth and last, using the LSTM-LagLasso, some of the most relevant features selected are conventional market / macro variables, but others are less common, non-conventional ones. We use "conventional" here in the sense that such variables have been used more frequently for modelling financial assets in the past (Nelson and Siegel, 1987; Dunis and Morrison, 2007; Arrieta-Ibarra and Lobato, 2015), or are otherwise obvious candidates for that purpose.

Figure 11: LSTM-LagLasso relevant features for the hidden state, unit 1, considering a regularisation parameter γ equal to 1.0.
Figure 12: LSTM-LagLasso relevant features for the hidden state, unit 2, considering a regularisation parameter γ equal to 1.0.
Figure 13: LSTM-LagLasso relevant features for the hidden state, unit 3, considering a regularisation parameter γ equal to 1.0.
Figure 14: Comparative Venn diagram of the most relevant features for both hidden and cell states.
Figure 15: Venn diagram of the most relevant features for the hidden state and all LSTM hidden units.

In the conventional group of relevant features, we can mention those related to central bank reference rates (the ECB refinancing rate), macroeconomic indicators of inflation (US Consumer Price Index less food and energy, Eurozone Core Monetary Union Index of Consumer Prices), economic growth and growth expectations (Institute for Supply Management Manufacturing, US Industrial Production, US Capacity Utilisation, ZEW Eurozone Expectation of Economic Growth), and the labour market (Eurozone Unemployment).

But the explanatory variables go well beyond that group, with a wide range of relevant features in the non-conventional group. Some of them are specific to the bond market, and all of them add significant information to the previous group. The top relevant feature by weight is the long German government bond future (DEU Bond Fut 30Y), reaching the value of 0.921 for unit 2 of the cell state (Table 3). Note that, contrary to what happens at the gates (the output of a sigmoid function, between 0 and 1), the weights in the LSTM states are not limited to 1. Also within the futures asset class, we find the 3M Euribor and 90-day Euro$ futures (4th and 5th contracts). These contracts have a horizon of approximately one year ahead, thus incorporating investors' expectations on the evolution of short-term rates.

In addition, financial instruments with maturities adjacent to the one we want to predict are also included in this group of relevant features, namely: the 5 and 30-year Euro government bond yields (note that we excluded the 10-year yield from the LSTM-LagLasso set of features, as mentioned in Section 6.1); the 2, 10 and 30-year UK government bond yields; and the 5, 10 and 30-year EUR swap rates.
Directly related to yields, we can identify as relevant features intra-curve spreads, such as the US 2–10-year spread and the EUR 10–30-year spread, as well as inter-curve spreads, specifically the EUR-GBR 10-year spread and the EUR-JPN 10-year spread.

Furthermore, the most relevant features determined via LSTM-LagLasso also include the following asset classes and macroeconomic variables: commodities (Gold Futures and Brent Crude Futures); equity indices (Euro Stoxx 50, FTSE 100, and S&P 500); foreign exchange rates (EUR-USD X-Rate and EUR-JPY X-Rate); the ECB Balance Sheet Long-Term Refinancing Operations; the OECD Leading Indicators for the US, the European Area, and Japan; and, finally, technical analysis indicators (5-year, 10-year and 30-year moving averages of 5, 50 and 200 days).

It is important to emphasise some aspects of this latter group of non-conventional relevant features identified using the LSTM-LagLasso. First, the 5-year and 30-year are maturities adjacent to the 10-year yield we are studying, and tend to lead flattening and steepening movements of the yield curve around the 10-year maturity. In particular, the long German government bond future is a leveraged instrument with very long maturity and duration; consequently, it is highly price-sensitive and reacts very quickly to market movements, which justifies it being a top relevant feature. The second aspect worth highlighting is that, contrary to what could be expected, most of the non-conventional features have higher weights than the conventional ones. Moreover, the 5-year and 30-year maturities turn out to be more important than the 10-year maturity itself (Table 3). This may be explained by the fact that the 10-year yield is already known to the model. Third, also of note is the inclusion among the relevant features of indicators related to the ECB balance sheet (ECB Balance Sheet Long-Term Refinancing Operations), at a time when central banks have been involved in large-scale asset purchases or quantitative easing, which clearly has an impact on overall yield levels in the market. Fourth and last, another example is the OECD Leading Indicators for different geographic areas. These indicators are designed with the objective of providing early signals of turning points in economic cycles, so it is interesting to see them identified as relevant features by this methodology.

In summary, the LSTM model captures important data and incorporates the information into its long and short-term memories. Ultimately, the LSTM-LagLasso methodology can also be used for feature selection, given the richness of the information contained in the hidden and cell states.

In this section, we evaluate the strength of the results by assessing whether they could have been obtained by chance. The hypothesis we want to test is whether the results obtained with real features and with Gaussian random variables could be part of the same distribution. For that purpose, we apply the LSTM-LagLasso method replacing the macroeconomic and market features by the same number of Gaussian random variables. The corresponding mean squared errors are then calculated for each experiment. The simulation is run one hundred times in order to determine the corresponding probability density function, and the results are presented in Figure 16. From these we can safely conclude, with statistical significance, that the results with real features are not obtained by chance.

Figure 16: Forecasting error of LSTM-LagLasso using macroeconomic / market features and using Gaussian random features.
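A minimal sketch of this randomisation check is given below, reusing the build_lagged_matrix helper from the earlier sketch. The number of runs, the 159 features, and γ = 1.0 follow the text; the fixed seed, function names and the use of the in-sample mean squared error are our illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso
# build_lagged_matrix is the helper defined in the sketch of Algorithm 1.

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility of the sketch

def random_feature_errors(signal, n_features=159, n_runs=100, k=6, gamma=1.0):
    # Standardised target, as in Algorithm 1 (Line 5).
    y = signal[k:]
    y = (y - y.mean()) / y.std()
    errors = []
    for _ in range(n_runs):
        # Replace the real features by Gaussian noise of the same shape.
        Z = rng.standard_normal((len(signal), n_features))
        X = build_lagged_matrix(Z, k)          # columns are already ~ N(0, 1)
        model = Lasso(alpha=gamma, max_iter=10_000).fit(X, y)
        errors.append(np.mean((model.predict(X) - y) ** 2))
    return np.array(errors)  # estimate the density; compare with the real-feature MSE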
7. Conclusions and future work
This work has three main components. First, we conduct an application of LSTM networks to the bond market, specifically for forecasting the 10-year Euro government bond yield, and compare the results to memory-free standard feedforward neural networks, in particular MLPs. This is the first study of its kind, as can be confirmed by the lack of published literature in this area. To this end, we model the 10-year bond yield using univariate LSTMs with different input sequences (6, 21 and 61 time steps), considering five forecasting horizons, from the next day up to the next day plus 20 days. Our objective is to compare those LSTM models with univariate MLPs, as well as with MLPs using the most relevant features, determined using Lasso regression for each forecasting horizon. We closely follow the same data and methodology for this comparison. In addition, the use of training moving windows, incorporating the most recent information as it becomes available, has the advantage of increased flexibility to changing market conditions.
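For illustration only, the listing below sketches one such univariate configuration in Keras (a six-step input window and a small network); the synthetic yield series, the horizon convention (days after the last input observation), the network size and the training settings are our assumptions, not the exact configuration of this study.

import numpy as np
import tensorflow as tf

# `y` stands for the daily 10-year yield series; a random walk placeholder here.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.0, 0.02, size=2000)) + 4.0

def make_windows(y, seq_len=6, horizon=1):
    # Input: the last `seq_len` observations; target: the yield `horizon`
    # days after the end of the window (horizon=1 is the next day).
    X, t = [], []
    for i in range(len(y) - seq_len - horizon + 1):
        X.append(y[i : i + seq_len])
        t.append(y[i + seq_len + horizon - 1])
    return np.asarray(X)[..., None], np.asarray(t)   # shape (samples, steps, 1)

X, t = make_windows(y, seq_len=6, horizon=6)          # e.g. next day plus 5 days
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(3, input_shape=(6, 1)),      # small network, illustrative
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, t, epochs=100, verbose=0)                # refit on each moving window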
The direct comparison of models in identical conditions shows that, with the LSTM, we can obtain results that are similar or better, and with lower standard deviations. In the comparison among the LSTMs using different input sequences, especially for forecasting horizons equal to 10 and 15 days, we observe that the LSTMs with longer input sequences achieve levels of forecasting accuracy similar to the MLP with the most relevant features, with lower standard deviation. In other words, the univariate LSTM model with additional memory is capable of achieving results similar to those of the multivariate MLP with additional information from markets and the economy. This is a remarkable achievement and a promising result for future work. Furthermore, the results for the univariate LSTM show that shorter forecasting horizons require smaller input sequences, and vice-versa. Therefore, there is a need to adjust the LSTM architecture to the forecasting horizon and, in general terms, to the conditions of the problem.

In summary, the results obtained in the empirical work validate the potential of LSTMs for yield forecasting and identify their memory advantage when compared to memory-free models. This enables the incorporation of LSTMs in autonomous systems for the asset management industry, with special relevance to pension funds, insurance companies and investment funds.

Second, with the objective of analysing the internal functioning of the LSTM model and mitigating the preconceived notion of black box normally associated with this type of model, we conduct an in-depth internal analysis of the information in the memory cell through time. This is the first contribution with that objective: alternative works are either applied to a different type of model, or conduct an external analysis of the LSTMs (Section 2.2). To achieve this goal, we select several locations within the memory cell to directly calculate and extract the signals (weights) at each time step and hidden unit. Specifically, the locations are as follows: the forget gate, the product of the outputs from the input gate and input node, the output gate, the cell state, and the hidden state.
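As an illustration of what this extraction involves, the sketch below replays the standard LSTM recurrences outside the library for a tf.keras LSTM layer, recording the signal at each of the listed locations. It assumes TF2 Keras defaults (sigmoid gates, tanh activations) and the Keras packed weight layout with gate order i, f, g, o; the function and variable names are ours, and in practice the trained layer from the fitted model would be passed in.

import numpy as np
import tensorflow as tf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_signals(lstm, x):
    # Replay the recurrences for one input sequence `x` of shape
    # (timesteps, features) and record the internal signals per time step.
    W, U, b = (v.numpy() for v in lstm.weights)  # kernel, recurrent kernel, bias
    n = U.shape[0]                               # number of hidden units
    h, c, signals = np.zeros(n), np.zeros(n), []
    for x_t in x:
        z = x_t @ W + h @ U + b        # packed pre-activations, gate order i|f|g|o
        i = sigmoid(z[:n])             # input gate
        f = sigmoid(z[n:2 * n])        # forget gate
        g = np.tanh(z[2 * n:3 * n])    # input node (candidate values)
        o = sigmoid(z[3 * n:])         # output gate
        c = f * c + i * g              # cell state: the long-term memory
        h = o * np.tanh(c)             # hidden state: the short-term memory
        signals.append({"forget": f, "input*node": i * g, "output": o,
                        "cell": c, "hidden": h})
    return signals

# Example with an untrained 3-unit layer and a 6-step univariate sequence.
layer = tf.keras.layers.LSTM(3)
_ = layer(tf.zeros((1, 6, 1)))         # builds the weights; use the trained layer instead
traces = lstm_signals(layer, np.random.default_rng(0).normal(size=(6, 1)))

Each dictionary in the returned list corresponds to one time step and holds, for every hidden unit, the value at each of the five locations above.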
This analysis is carried out using sequence-to-sequence (6 days) LSTM architectures, with univariate and multivariate feature sets (10-year yield; 10-year yield plus a momentum indicator; and 10-year yield plus the 5 and 30-year yields), with a reduced number of hidden units (3 units) for interpretability purposes, and for a forecasting horizon of the next day plus 5 days.

Overall, considering all feature sets, the most remarkable property found consistently in the LSTM signals is the activation / deactivation of units through time, thus contributing or not (respectively) to the forecasting process. Moreover, we found evidence that the LSTM units tend to specialise in different yield ranges or features considered in the model.

In the third study and contribution, we investigate the information contained in the signals extracted from the LSTM hidden and cell states, to examine whether the corresponding time series can be explained by external sources of information. To this effect, we introduce a new methodology, identified here as LSTM-LagLasso, based on both Lasso and Kalman-LagLasso. This methodology is capable of identifying both relevant features and corresponding lags, as the Kalman-LagLasso is, but with significant modifications (Section 6.1 – LSTM-LagLasso).

The findings show that the information contained in the LSTM states is complex, but may be explained by exogenous macroeconomic and market variables not known to the model during the learning process. Thus, it is worth exploring this information using the developed LSTM-LagLasso methodology, which may also be used as an alternative feature selection method. The relevant features selected with the LSTM-LagLasso method include conventional as well as non-conventional market / macro indicators (Section 6.2) that contribute to the prediction process but are not commonly used in forecasting models. In addition, the LSTM-LagLasso identifies lags as important, in particular t, t−1 and t−5.
Above all, LSTM networks can capture this information and maintain it in the long and short-term memories, i.e. the cell and hidden states.

With respect to future work, our present research focuses on financial asset forecasting, the development of methodologies, and the internal analysis of the LSTM model. However, the ultimate purpose in the industry is portfolio management and trading, which, in relation to asset forecasting, is a different type of problem: obtaining correct predictions does not necessarily translate into profitable strategies. Thus, the next step is to implement this type of model in autonomous systems, to assess its potential for trading and portfolio management in fixed income markets. Finally, we want to emphasise that the work described in this paper is a fundamental component necessary for the implementation of those intelligent systems.

Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC Award No. 1921702). All the information required to download the full dataset used in this research (in particular the identification of the features) from a Bloomberg Professional terminal is made publicly available (Nunes, 2020). The authors would like to thank Luis Montesdeoca for helpful discussions during the course of this study.
References

Antoniou, A., Galariotis, E. C., and Spyrou, S. I. (2003). Profits from buying losers and selling winners in the London Stock Exchange. Journal of Business & Economics Research, 1(11). https://doi.org/10.19030/jber.v1i11.3069.
Arrieta-Ibarra, I. and Lobato, I. N. (2015). Testing for predictability in financial returns using statistical learning procedures. Journal of Time Series Analysis, 36(5):672–686.
Ballings, M., Van den Poel, D., Hespeels, N., and Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42(20):7046–7056. https://doi.org/10.1016/j.eswa.2015.05.013.
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In Proceedings of the International Conference on Neural Networks, pages 1183–1188. IEEE.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York.
Bloomberg (2017). Bloomberg professional database | Subscription service.
Booth, A., Gerding, E., and McGroarty, F. (2014). Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications, 41(8):3651–3661. https://doi.org/10.1016/j.eswa.2013.12.009.
Brownlee, J. (2018). Long Short-Term Memory Networks with Python: Develop Sequence Prediction Models with Deep Learning. Machine Learning Mastery.
Castellani, M. and Santos, E. A. d. (2006). Forecasting long-term government bond yields: An application of statistical and AI models. ISEG, Departamento de Economia, pages 1–34.
Choudhry, T., McGroarty, F., Peng, K., and Wang, S. (2012). High-frequency exchange-rate prediction with an artificial neural network. Intelligent Systems in Accounting, Finance and Management, 19(3):170–178.
Diebold, F. X. and Li, C. (2006). Forecasting the term structure of government bond yields. Journal of Econometrics, 130(2):337–364. https://doi.org/10.1016/j.jeconom.2005.03.005.
Dunis, C. L., Middleton, P. W., Karathanasopolous, A., and Theofilatos, K. (2016). Artificial Intelligence in Financial Markets: Cutting Edge Applications for Risk Management, Portfolio Optimization and Economics. New Developments in Quantitative Trading and Investment. Palgrave Macmillan UK.
Dunis, C. L. and Morrison, V. (2007). The economic value of advanced time series methods for modelling and trading 10-year government bonds. European Journal of Finance, 13(4):333–352.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
Eilers, D., Dunis, C. L., Mettenheim, H.-J. v., and Breitner, M. H. (2014). Intelligent trading of seasonal effects: A decision support algorithm based on reinforcement learning. Decision Support Systems, 64:100–108. https://doi.org/10.1016/j.dss.2014.04.011.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.
Fischer, T. and Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669. https://doi.org/10.1016/j.ejor.2017.11.054.
Fletcher, T. and Shawe-Taylor, J. (2013). Multiple kernel learning with Fisher kernels for high frequency currency prediction. Computational Economics, 42(2):217–240.
Garman, M. C. (2001). High yield allocations in the short run: Market trends in the allocation decision. Merrill Lynch, Global Securities Research & Economics Group, pages 1–8.
Gers, F. A., Eck, D., and Schmidhuber, J. (2002). Applying LSTM to time series predictable through time-window approaches. In Tagliaferri, R. and Marinaro, M., editors, Proceedings of the Italian Workshop on Neural Nets, WIRN Vietri-01, Perspectives in Neural Computing, pages 193–200. Springer.
Gers, F. A. and Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN, volume 3, pages 189–194. IEEE.
Gers, F. A., Schmidhuber, J., and Cummins, F. (1999). Learning to forget: Continual prediction with LSTM. In Proceedings of the International Conference on Artificial Neural Networks, ICANN, volume 2, pages 850–855. Institution of Engineering and Technology.
Gilardoni, G. (2017). Recurrent neural network models for financial distress prediction. Master's thesis, Politecnico di Milano.
Giles, C. L., Lawrence, S., and Tsoi, A. C. (2001). Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1–2):161–183. https://doi.org/10.1023/A:1010884214864.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Gradojevic, N. and Yang, J. (2006). Non-linear, non-parametric, non-fundamental exchange rate forecasting. Journal of Forecasting, 25(4):227–245.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5–6):602–610. https://doi.org/10.1016/j.neunet.2005.06.042.
Hastie, T., Tibshirani, R., and Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, second edition.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F., editors, A Field Guide to Dynamical Recurrent Networks, pages 1–15. Wiley-IEEE Press.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Huang, W., Lai, K. K., Nakamori, Y., Wang, S., and Yu, L. (2007). Neural networks in finance and economics forecasting. International Journal of Information Technology & Decision Making, 6(01):113–140.
Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1):65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45.
Kanevski, M. and Timonin, V. (2010). Machine learning analysis and modeling of interest rate curves. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN, pages 47–52.
Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (Accessed on 19-Jan-2018).
Khanal, A. R. and Mishra, A. K. (2014). Is the 'buying winners and selling losers' trading strategy profitable in the new economy? Applied Economics Letters, 21(15):1090–1093.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, ICLR, pages 1–15. arXiv preprint arXiv:1412.6980v9.
Kraus, M. and Feuerriegel, S. (2017). Decision support from financial disclosures with deep neural networks and transfer learning. Decision Support Systems, 104:38–48.
Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019v4, pages 1–38.
Mahler, N. (2009). Modeling the S&P 500 index using the Kalman filter and the LagLasso. In Proceedings of the International Workshop on Machine Learning for Signal Processing, MLSP, pages 1–6. IEEE. https://doi.org/10.1109/MLSP.2009.5306195.
Maknickienė, N. and Maknickas, A. (2012). Application of neural network for forecasting of exchange rates and forex trading. In Proceedings of the International Scientific Conference Business and Management, pages 122–127.
Maybeck, P. S. (1979). Stochastic Models, Estimation, and Control. Mathematics in Science and Engineering. Academic Press.
Montesdeoca, L. and Niranjan, M. (2020). On comparing the influences of exogenous information on bitcoin prices and stock index values. In Pardalos, P., Kotsireas, I., Guo, Y., and Knottenbelt, W., editors, Mathematical Research for Blockchain Economy, MARBLE, pages 93–100. Springer.
Nelson, C. R. and Siegel, A. F. (1987). Parsimonious modeling of yield curves. Journal of Business, pages 473–489.
Niranjan, M. (1996). Sequential tracking in pricing financial options using model based and neural network approaches. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems, NIPS, pages 960–966. MIT Press.
Nunes, M. (2020). Dataset information for article "Long short-term memory networks and laglasso for bond yield forecasting: Peeping inside the black box". University of Southampton, doi allocated and to be registered by UoS [Dataset].
Nunes, M., Gerding, E., McGroarty, F., and Niranjan, M. (2019). A comparison of multitask and single task learning with artificial neural networks for yield curve forecasting. Expert Systems with Applications, 119:362–375. https://doi.org/10.1016/j.eswa.2018.11.012.
OECD (2015). Business and Finance Outlook. OECD Publishing, Paris.
Olah, C. (2015). Understanding LSTM networks. URL https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (Accessed on 19-Jan-2018).
Park, S., Gil, M.-S., Im, H., and Moon, Y.-S. (2019). Measurement noise recommendation for efficient Kalman filtering over a large amount of sensor data. Sensors, 19(5):1–19. https://doi.org/10.3390/s19051168.
Persio, L. D. and Honchar, O. (2016a). Artificial neural networks approach to the forecast of stock market price movements. International Journal of Economics and Management Systems, 1:158–162.
Persio, L. D. and Honchar, O. (2016b). Artificial neural networks architectures for stock price prediction: Comparisons and applications. International Journal of Circuits, Systems and Signal Processing, 10:403–413.
Persio, L. D. and Honchar, O. (2017). Recurrent neural networks approach to the financial forecast of Google assets. International Journal of Mathematics and Computers in Simulation, 11:7–13.
Prügel-Bennett, A. (2017). Advanced machine learning. University of Southampton, School of Electronics and Computer Science.
Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., and Cottrell, G. W. (2017). A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 2627–2633. IJCAI Organization.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press.
Sambasivan, R. and Das, S. (2017). A statistical machine learning approach to yield curve forecasting. In Proceedings of the International Conference on Computational Intelligence in Data Science, ICCIDS, pages 1–6. IEEE.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117. https://doi.org/10.1016/j.neunet.2014.09.003.
Sermpinis, G., Karathanasopoulos, A., Rosillo, R., and de la Fuente, D. (2019). Neural networks in financial trading. Annals of Operations Research. https://doi.org/10.1007/s10479-019-03144-y.
Sermpinis, G., Laws, J., Karathanasopoulos, A., and Dunis, C. L. (2012). Forecasting and trading the EUR/USD exchange rate with Gene Expression and Psi Sigma Neural Networks. Expert Systems with Applications, 39(10):8865–8877. https://doi.org/10.1016/j.eswa.2012.02.022.
Sermpinis, G., Theofilatos, K., Karathanasopoulos, A., Georgopoulos, E. F., and Dunis, C. (2013). Forecasting foreign exchange rates with adaptive neural networks using radial-basis functions and particle swarm optimization. European Journal of Operational Research, 225(3):528–540. https://doi.org/10.1016/j.ejor.2012.10.020.
Strobelt, H., Gehrmann, S., Pfister, H., and Rush, A. M. (2018). LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288.
Wang, J., Zhang, Y., Tang, K., Wu, J., and Xiong, Z. (2019). AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the International Conference on Knowledge Discovery & Data Mining, SIGKDD, pages 1900–1908. ACM.
Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82.
Xiong, R., Nichols, E. P., and Shen, Y. (2015). Deep learning stock volatility with Google domestic trends. arXiv preprint arXiv:1512.04916v3, pages 1–6.
Zhang, Y., Wang, D., Chen, Y., Shang, H., and Tian, Q. (2017). Credit risk assessment based on long short-term memory model. In Proceedings of the International Conference on Intelligent Computing, ICIC, pages 700–712. Springer.