Dynamic Prediction Length for Time Series with Sequence to Sequence Networks
Mark Harmon, Diego Klabjan
Department of Engineering Sciences and Applied Mathematics
Department of Industrial Engineering and Management Sciences
Northwestern University
Abstract
Recurrent neural networks and sequence to sequence models require a predetermined prediction output length. Our model addresses this by allowing the network to predict a variable length output in inference. A new loss function with a tailored gradient computation is developed that trades off prediction accuracy and output length. The model utilizes a function to determine, against a predetermined threshold, whether a particular output at a given time step should be evaluated. We evaluate the model on the problem of predicting the prices of securities. We find that the model makes longer predictions for more stable securities and that it naturally balances prediction accuracy and length.
4.1 Introduction

Recurrent neural networks are very popular and effective at solving difficult sequence problems such as language translation, creation of artificial music, and video prediction. New architectures, such as Sequence to Sequence networks by Sutskever, Vinyals, and Le (2014) and Memory Networks by Sukhbaatar, Weston, Fergus et al. (2015), are used to solve problems in language translation and to answer questions using a large memory bank. However, these problems generally have training data with given sequence outputs (for example, a model translating a sentence from English to Spanish). Because input and output sequences are known a priori for these problems, it is possible to solve them with a fixed model architecture.

A fixed model architecture is effective for such sequences, but there are a number of problems involving multiple time series datasets that do not have a natural sequence size. For example, a company may wish to predict the number of products to be shipped out for sale based upon customer demand. Each product has a different amount of demand volatility, which can make an enormous difference in how far in advance the company is willing to predict the demand of a product. In this case, it would be extremely useful to have a model that can balance F1 score and the number of future predictions based upon a product's base demand. Another example, which we explore in this work, is financial security price prediction. Some securities are extremely volatile, which makes prediction for longer times highly inaccurate. On the other hand, low volatility securities are easier to predict further into the future.

The biggest problem in multiple time series prediction when it comes to dynamic prediction length is that the training data does not exhibit output sequences of various lengths. For this reason, a different model is required. In multiple time series, input sequences can be naturally created, for example, by a fixed-size sliding window. However, the length of the output sequences can be dynamic since typically there is flexibility in how far to predict into the future. In inference, we allow our prediction model to generate a different number of predictions depending on the current input sequence, as well as a different number of predictions per time series. The number of predictions the model generates depends on a thresholding function that determines the model's confidence in a particular output. If the confidence is too low, we no longer consider the predictions our model generates for that particular sample.

The main objective of our study is to create a prediction model that balances F1 score and predicting into the future. The main challenge is the fact that samples taken from infinite time series do not naturally contain dynamic length predictions. This aspect requires a different loss function that includes the notion of confidence and a tailored computation of gradients on different length output sequences. We create a model architecture relying on a novel loss function that allows for a dynamic number of output predictions. We explain the ideas and concepts by utilizing predictions of several correlated financial securities. In this case, rather than having to adjust the predictive length manually depending on market volatility, the model learns how far in advance it can confidently predict during training. For security prediction, a model that is not limited to a fixed number of output predictions can provide much more robust price predictions. For example, we expect that some security j with high volatility during the training phase should result in fewer predictions. On the other hand, if the same security j is trained during a low volatility period, we expect the model to generate more predictions. Clearly, a dynamic model that can easily adjust to the current training environment of a security can provide huge benefits. In inference, the model provides a natural way to stop generating output predictions.
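To make the stopping rule concrete, the following is a minimal NumPy sketch of inference with a dynamic output length. It is an illustration under our own assumptions, not the paper's implementation: `step_fn` stands in for the trained Seq2Seq decoder step, `confidence_fn` for one of the confidence functions of Section 4.3, and rolling the predicted label back into the window is a simplification of how a decoder consumes its previous output.

```python
import numpy as np

def sliding_windows(series, size):
    # Fixed-size sliding windows over a long time series (the input sequences).
    return np.stack([series[i:i + size] for i in range(len(series) - size)])

def predict_dynamic(step_fn, window, confidence_fn, tau, max_len=10):
    # Generate outputs until the confidence function falls below the threshold tau.
    history, outputs = [], []
    for _ in range(max_len):
        probs = step_fn(window)            # next-step class distribution
        history.append(probs)
        if confidence_fn(history) < tau:   # confidence too low: stop predicting
            break
        outputs.append(int(np.argmax(probs)))
        # Roll the window forward with the predicted label (a simplification;
        # the actual decoder feeds its previous output and hidden state).
        window = np.append(window[1:], outputs[-1])
    return outputs
```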
Our work contains two main contributions. First, we create a way to measure the confidence of a model's prediction without having to rely on Bayesian statistics. Second, our model is the first of its kind to allow for dynamic prediction length with sequence to sequence networks. Along the way, we have to tailor the gradient computation.

In our study, we use two financial security datasets which consist of several years of tick prices. One contains five distinct securities and the other contains twenty-two different securities. We find that our new architecture successfully balances prediction F1 score and the number of future predictions. In addition, our architecture uses different prediction lengths at different times for each security due to stochastic drift between the training and test sets. The best dynamic output prediction length model is a sequence to sequence network which earns an F1 score of 0.503, in contrast to a traditional LSTM structure that only achieves an F1 score of 0.209 for a single prediction.

In Section 4.2, we review two main subjects related to our work. First, we inspect studies within the realm of deep learning related to our new model architecture. Second, we analyze other work on predicting financial securities with a focus on machine learning and deep learning methods. In Section 4.3, we present the dynamic prediction length model, while in Section 4.4 we present a computational study based on securities.

4.2 Related Work

Similar to our concept of dynamic output prediction, Pointer Networks by Vinyals, Fortunato, and Jaitly (2015) are used for problems such as sorting variable sized sequences. They use an attention mechanism that points to a particular part of the input sequence that is used as the next output. Although this architecture can allow for variable input sizes, the output size is constrained to be the same size as the input. Our model allows for any size output (unrelated to the input size) up to some arbitrary maximum size. Pointer networks are also not applicable to our case since the output is not a specific part of the input.

Graves (2016) introduces adaptive computation time (ACT) for recurrent networks. The author creates an additional metric that allows the network to continue "pondering" the input through additional computation. We can think of ACT as a model that at each time step has a dynamic number of stacked LSTM or GRU cells. We considered using ACT along with our architecture, but it increases the computational complexity by recalculating the feed-forward step a number of times. Basing the stopping decision on whether the output at a given time requires substantial computational effort does not work when a single layer suffices most of the time. In time series with a walk forward strategy, smaller, less complex models are more effective. This renders ACT inappropriate.
In addition, online algorithms need to be agile, and the added computational time would be a detriment to such a model. Therefore, we choose to use traditional LSTM architectures that train much faster.

To achieve better training and dynamic output sizes, we add an additional term to the loss function and modify how predictions are scored. Although there have been many new architectures, such as Residual Networks by He, Zhang, Ren, and Sun (2016), Memory Networks by Sukhbaatar et al. (2015), and Neural Turing Machines by Graves, Wayne, and Danihelka (2014), none of these incorporate a new loss function with their architecture. This tendency of not creating new loss functions is noted by Janocha and Czarnecki (2017). The authors explain that although there is a lot of work on neural network activations, optimization, and architectures, the loss function used for nearly all neural networks is a combination of log loss and L1/L2 norms. Janocha and Czarnecki (2017) show that loss functions previously deemed inferior in deep learning can be more robust than log loss for classification problems. Therefore, it is important for studies to continue exploring loss functions to increase network performance and to create a variety of new models.

There are some studies that significantly change the loss function in deep learning. For example, GANs by Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio (2014) are immensely popular and design a loss function for their specific problem and architecture to balance generation and classification. ACT by Graves (2016) also uses a unique addition to the loss function so that the network does not "ponder" the input for too long. There is growing momentum for using the Wasserstein metric as the loss function, as seen in the work by Frogner, Zhang, Mobahi, Araya, and Poggio (2015). The Wasserstein metric is even successfully used in GANs in the work of Arjovsky, Chintala, and Bottou (2017). We expand the volume of work in this area by developing a loss function that encourages a model to have a dynamic output length at prediction. In contrast to loss functions specific to a problem type, our loss function can be used with any recurrent neural network architecture.

Next, we focus on works predicting financial securities with deep networks. Most work utilizing deep neural networks focuses on applying other forms of data for prediction, such as news about financial markets or specific companies. Chong, Han, and Park (2017) review many of the prediction methods commonly used for security prediction and the predicted outcomes. Niaki and Hoseinzade (2013) focus on predicting upward and downward movement of the S&P 500 with a deep feed-forward network. Ding, Zhang, Liu, and Duan (2015) use historical pricing data in combination with financial news data with a deep feed-forward network. Krauss, Do, and Huck (2017) utilize both random forests and deep feed-forward networks to find statistical arbitrage on the S&P 500. In contrast to the aforementioned models, we predict significant movement in prices, utilize temporal models, and predict multiple securities with a single model.

Sirignano (2016) uses deep learning directly on financial security price data as well. He utilizes limit order book data with multiple ask/bid prices for each security to predict the change in the spread. In addition, he uses a deep feed-forward network and a separate model for the prediction of each security.
In contrast, we predict all securities with one model and apply recurrent neural networks in addition to feed-forward networks. Another similar work, Dixon, Klabjan, and Bang (2016), uses deep feed-forward networks for prediction directly on security prices. In contrast, our study applies recurrent and convolutional recurrent models in addition to the baseline feed-forward network.

There is also work on security prediction beyond the standard feed-forward network. Borovykh, Bohte, and Oosterlee (2017) use convolutional neural networks over the security time series to make predictions. Others, such as Bao, Yue, and Rao (2017), use stacked auto-encoders and wavelet transformations to form an embedding of financial data, and feed this into an RNN model for prediction. Chen, Zhou, and Dai (2015) use an RNN model on opening and closing security prices on the Chinese market with seven classification categories. Akita, Yoshihara, Matsubara, and Uehara (2016) use both textual and price information to predict a security price. In contrast to these works, our model uses only security price data. Also, we apply sequence to sequence and convolutional LSTM (Xingjian, Chen, Wang, Yeung, Wong, and Woo, 2015) models in addition to having a dynamic output length for security predictions.
4.3 Dynamic Prediction Length Model

In this section, we explain the architecture of our model that predicts a dynamic number of outputs. The model uses a sequence to sequence (Seq2Seq) network in combination with our newly proposed loss function. For details on Seq2Seq networks, please see the original paper by Sutskever et al. (2014). We also use attention in our model, which is explained in the paper by Bahdanau, Cho, and Bengio (2014). In addition, we use teacher forcing. We found that the first input to the decoder has a great impact on accuracy. By far the best performing first decoder input is the ground truth associated with the last encoder input.

For a general Seq2Seq network, let θ describe the trainable weights, X be an input sample, and f^q_t(X; θ; f_{i<t}) be the predicted class distribution for security q at time step t, which may depend on the earlier predictions f_{i<t}. To create a model that predicts a dynamic output length for each time series, we introduce a function g(f_{i≤t}) that captures the confidence of the prediction at time t. Note that the confidence function may depend on the previous values f_1, f_2, ..., f_t. To determine the output length, we measure the confidence function against a threshold value τ (which is a hyper-parameter), resulting in the objective

min_θ E_{(X,Y)} Σ_{q=1}^{Q} Σ_t [ M(G^q_t, τ) · KL(Y^q_t ‖ f^q_t(X; θ; f_{i<t})) + λ max(τ − G^q_t, 0) ],

where G^q_t = g(f^q_{i≤t}) is the confidence, Y^q_t is the label distribution of security q at step t, and M is a masking function that is either the indicator of G^q_t ≥ τ, giving loss (4.2), or a sigmoid applied to G^q_t − τ, giving loss (4.3).
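A minimal NumPy sketch of this loss for a single security follows, assuming the confidence values G_t are supplied separately by one of the confidence functions discussed below. The sigmoid sharpness constant `a` and all names are our own illustrative choices; the actual model optimizes this objective end to end in TensorFlow with the tailored gradient computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_length_loss(preds, labels, conf, tau, lam, mask="sigmoid", a=10.0):
    """Loss trading off prediction accuracy against output length.

    preds:  (T, C) predicted class distributions for one security.
    labels: (T, C) one-hot ground-truth distributions.
    conf:   (T,)  confidence values G_t from a confidence function.
    The KL term is masked so low-confidence steps are not scored; the
    hinge term lam * max(tau - G_t, 0) penalizes low confidence and
    pushes the model toward longer outputs.
    """
    eps = 1e-9
    kl = np.sum(labels * (np.log(labels + eps) - np.log(preds + eps)), axis=1)
    if mask == "sigmoid":                  # smooth mask, loss (4.3)
        m = sigmoid(a * (conf - tau))      # a: sharpness, our own choice
    else:                                  # hard mask, loss (4.2)
        m = (conf >= tau).astype(float)
    penalty = lam * np.maximum(tau - conf, 0.0)
    return np.sum(m * kl + penalty)
```

Averaging this quantity over securities and training samples gives the expectation in the objective above.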
4.4 Computational Study

The value of β is chosen separately for the twenty-two security dataset. For the other dataset, a few of the securities contain large data imbalances at the large upward and downward movements; since this is the case, we pick β = 0.

The following results are based upon the best sequence sizes, neurons per layer, optimization method, and number of layers found through a hyperparameter search for each network. The models were trained on Titan X and Nvidia 1080 GPUs and implemented in TensorFlow. For each dataset, we have one model to classify all twenty-two ETFs from one dataset and one model to classify all five commodities in the other dataset. We test the ETF dataset on all baseline and final model architectures. Due to computational time constraints, we only calculate the results of the commodity dataset using the best model from the ETF dataset.

For our first experiments, we use a FFN, an LSTM network, an LSTM Seq2Seq network, and a ConvLSTM Seq2Seq network in the setting of a fixed prediction length of 1 or 10. The FFN consists of two layers with sixty-four neurons and ten returns for each security. We train the FFN with the stochastic gradient optimization method. The basic recurrent LSTM network consists of two LSTM layers, each with sixty-four neurons and an input sequence size of twenty. For all recurrent networks, we use the ADAM optimization method by Kingma and Ba (2014), which is known to perform well for recurrent networks. We train for a number of epochs until the F1 score no longer increases on the validation dataset. Last, the sequence to sequence network consists of one encoder and one decoder layer, each with sixty-four neurons and either LSTM or ConvLSTM layers. We tried using deeper networks, but those resulted in much lower F1 scores. We utilize the same model architecture for both the ETF and commodity datasets.

We are using financial security datasets, which typically contain imbalanced classes. With imbalanced classes, the traditional accuracy measure does not display the true effectiveness of a model. Therefore, we utilize the F1 score, which accounts for imbalanced classes. The F1 score is calculated via a one-against-all structure of precision and recall for each of the five classes. The values we report are an average of the F1 scores over classes and securities.

Table 1: F1 Scores of Baseline Models

Architecture               ETFs    Commodities
FFN (One Pred)             0.176   0.159
LSTM (One Pred)            0.209   0.286
ConvLSTM (One Pred)        0.513   0.410
LSTM Seq2Seq (One Pred)    0.598   0.513
LSTM Seq2Seq (Ten Pred)    0.509   0.474

The FFN, LSTM, LSTM Seq2Seq, and ConvLSTM Seq2Seq results are presented in Table 1. To produce these results, we give each model a warm start by training on 5 windows of data. We then train the models for 25 more windows on the 22 ETF and 5 commodity datasets. The F1 scores in Table 1 are the average test scores over this 25 window period.

We use our baseline results in Table 1 to determine the best architecture for our dynamic output length model and to examine the differences between the various Seq2Seq networks. We observe that each single prediction model generally increases in performance when using more robust models (such as Seq2Seq). As expected, the Seq2Seq models and recurrent LSTM models far outperform the FFN, but we did not expect such a large performance increase when utilizing a Seq2Seq model. Even when making 10 predictions forward in time, the Seq2Seq model more than doubles the F1 score compared to the traditional LSTM network that makes a single prediction. With ConvLSTM layers, we found that the best formulation utilizes 1D convolutions with each security representing a channel in the image. Ultimately, we find that the best ConvLSTM Seq2Seq model does not perform as well as the LSTM Seq2Seq model. Since the LSTM Seq2Seq model is by far the most accurate, we choose it for our dynamic output length model.
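As an illustration of the evaluation protocol just described, here is a sketch, under our own assumptions about the data layout (each window carries its own train and test split), of the warm start plus walk-forward loop with the averaged one-against-all F1 score. `train_fn` and `predict_fn` are hypothetical stand-ins for the actual models.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=5):
    # One-against-all F1 per class, averaged over classes; the paper
    # additionally averages over securities.
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

def walk_forward(windows, train_fn, predict_fn, warm=5):
    # Warm start on the first `warm` windows, then train and test one
    # window at a time, averaging the test F1 over the remaining windows.
    for w in windows[:warm]:
        train_fn(w["train"])
    f1s = []
    for w in windows[warm:]:
        train_fn(w["train"])
        f1s.append(macro_f1(w["test_labels"], predict_fn(w["test_inputs"])))
    return float(np.mean(f1s))
```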
We begin with the two confidence functions that depend only upon the current prediction: the maximum and CD ("Confidence Distance"). We study the sensitivity of τ and λ, and compare the indicator (Ind) and sigmoid (Sig) functions, i.e., loss functions (4.2) and (4.3) respectively. In the following sections, we refer to the indicator and sigmoid functions as masking functions because they mask the output from the Kullback-Leibler divergence.

A large gradient for λ max(τ − G^q_t, 0) means that we encourage the model to make more predictions. We can control the output length by adjusting the hyperparameters λ and τ. In Figure 1 we present a range of τ and λ values and the respective F1 scores with G^q_t being the maximum. In addition, we annotate in bold the best pair (τ, λ), i.e., the one that results in the largest distance above the static curve. We add this annotation to all figures of this type.

In Figure 1, we first create a line that gives the F1 scores from a static Seq2Seq model for 5 training windows after a 5 window warm start. We compute F1 for four prediction lengths between 1 and 10 by the static model and interpolate the remaining values. It is important to point out that the F1 scores here are different than in Table 1 since we are measuring 5 windows rather than 25. We expect the dynamic model to be above this curve when using appropriate values for (τ, λ). Each point in Figure 1 is the average output length and F1 score on the 22 ETF dataset for a pair (τ, λ). On the left is the indicator masking function and on the right is the sigmoid masking function. For both figures, we use the same values for τ and λ. We find that with τ = 0.5, our model has high accuracy with a relatively large average output length.

Figure 1: F1 and average output length for pairs of τ and λ. We use the maximum confidence function with the indicator function on the left and sigmoid on the right.

The red points are pairs (τ, λ) that perform worse than the static model and the green points are those that perform better. We expect to find many pairs of hyperparameters that lead to superior F1 scores since our model uses a dynamic prediction length for each sample. There are a few outliers that fall well below the estimated performance in the figure on the left, which uses the indicator masking function. The sigmoid function on the right has many more points above the static curve and has no points below it except at large λ's. When λ is too big, the model stops training for prediction accuracy and only focuses on creating more predictions. Note that not all large λ's lead to strictly poor performance, since the best sigmoid pair has λ = 5.0.

In Figure 2, we present the results for pairs (τ, λ) with CD, with the indicator masking function on the left and the sigmoid masking function on the right. Note that with CD, the number of points below the static line is smaller compared to the maximum function. When λ is too large, the indicator version has three points that fall far below our estimation. It is important to point out that there is a large distance between the green points and the blue curve that is rarely seen with the other confidence functions. Based on these results, we recommend using CD over the maximum function. It requires very little extra computation and leads to better F1 with a similar average output length.

Figure 2: F1 score and average output length for pairs of τ and λ. We use CD with the indicator function on the left and sigmoid function on the right.

In Figures 1 and 2, we show that large λ's usually lead to inferior models. Therefore, we explore τ to see if we can find a relationship between this hyperparameter and the average output length. We present two images in Figure 3 of the sensitivity of the indicator and sigmoid masking functions with the maximum and CD functions. We create these figures by choosing λ = 0.1, which produces the best performance across all cases of τ, and plot the relationship between the average number of predictions and τ. Note that we use the same approach of utilizing a 5 window warm start and measure the average output length based upon the next 5 windows.

Figure 3: The output length and τ values for both CD and maximum functions with λ = 0.1. The trend lines are made via linear regression and show a statistically significant relationship between τ and output length.

On the left, we observe that the slope for the sigmoid masking function is steeper than for the indicator function with both confidence functions. For these two functions, the sensitivity is not nearly as large as for the other two confidence functions that we explore in the next section. For CD, the slope is much smaller (in terms of absolute value), which provides additional evidence to recommend it over the maximum confidence function. Given these two confidence and masking functions, we recommend using CD with the sigmoid masking function. Although the sigmoid masking function is more sensitive to τ, we find that sigmoid does not have as many points falling under the static curve.

Next, we examine the two confidence functions that depend on the current and previous predictions. These are total variation (TV) and the first Wasserstein distance (EMD). We design the first two functions, maximum and CD, to find the confidence of the current prediction using only the current prediction and input sequence. On the other hand, we use TV and EMD to determine the volatility between the current and previous prediction. If the volatility is low, we expect the model to be confident enough to make several predictions.
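Below are minimal NumPy sketches of the four confidence measures as we read them. The paper's exact definition of CD does not survive in this extraction, so taking it as the gap between the two largest predicted probabilities is our assumption; for TV and EMD, small distances mean high confidence, so the threshold comparison is reversed, as noted with Figure 6. Treating the five class labels as ordered lets the first Wasserstein distance reduce to a sum of absolute CDF differences.

```python
import numpy as np

def conf_max(p):
    # Confidence = largest predicted class probability.
    return float(np.max(p))

def conf_cd(p):
    # "Confidence Distance": gap between the two largest probabilities
    # (our reading of CD; a larger gap means a more confident model).
    top2 = np.sort(p)[-2:]
    return float(top2[1] - top2[0])

def tv_distance(p, q):
    # Total variation between consecutive predictions; the threshold test
    # is reversed for TV and EMD (small distance = high confidence).
    return 0.5 * float(np.sum(np.abs(p - q)))

def emd_1d(p, q):
    # First Wasserstein distance between distributions over the ordered
    # class labels: sum of absolute CDF differences.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))
```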
We first examine the results from TV in Figure 4. TV does not garner high performance with either masking function. There are some points with the sigmoid masking function that are slightly better than the expected F1 scores, but TV is clearly not a successful measure. There are likely cases where the TV is large between two predictions while the model is still confident. For example, if the prediction changes from one label to another at the next prediction step, the TV between these two predictions may be large. However, this does not necessarily imply that the model is not confident about its next prediction. Also note that there are two rows of points above and below the static curve in the sigmoid version. The difference between these two is that the value for λ is larger for the points below the estimated curve, which is similar to the results we obtain from the maximum and CD functions.

Figure 4: The F1 score and respective output length for various (τ, λ) pairs with TV. The indicator function is on the left while the sigmoid function is on the right.

In Figure 5, we observe that EMD exhibits a similar pattern to TV. As with TV, the indicator masking function with EMD performs poorly on the left compared to the better F1 scores of the sigmoid masking function on the right. EMD performs slightly better than TV, with a few more points above the estimated curve. This likely occurs because EMD is a more robust probability measure. Instead of finding the maximum distance between two probability measures P and Q (TV), EMD finds the cost of transforming the entirety of one distribution P into another distribution Q. EMD is more robust, but a label change from one time step to the next may correlate with a large EMD. However, this does not always imply lower prediction confidence.

Figure 5: The F1 score and respective output length for various (τ, λ) pairs with EMD. The indicator function is on the left while the sigmoid function is on the right.

To conclude our examination of TV and EMD, we observe the relationship between the number of predictions and τ in Figure 6. Note that the relationship is the opposite of that of the maximum and CD functions since we reverse the comparison with respect to τ, as clarified in Section 4.3.1. The slopes for the sigmoid versions of both metrics are larger than their indicator function counterparts. We also observe that the slopes are large overall when compared to both the maximum and CD confidence functions. Slight changes in τ lead to large changes in output lengths for TV and EMD. In general, we found that the values of these two metrics tend to be small, which is what leads to their sensitivity to τ. Overall, for all metrics, the sigmoid masking function slope is larger in magnitude than the same version with the indicator masking function. Also, TV and EMD show that confidence in predictions depends upon more than volatility.

Figure 6: The output length and τ values for both TV and EMD with λ = 0.1. The trend lines are made via linear regression and show a statistically significant relationship between τ and output length.

Next, we test our model on the five commodity dataset. Since we observe that CD with the sigmoid masking function provides the best performing model on the ETF dataset, we use the identical functions on the commodity dataset. In addition, we create four static models, predicting between 1 and 10 time steps each, to create the blue curve as in the previous ETF figures.
In Figure 7, we present the dynamic model results on the commodity dataset. As with the previous figures of the same type, we run the model for a total of 10 walk-forward steps and average the final 5 F1 scores, which are reported in Figure 7.

Figure 7: The F1 score and respective output length for various (τ, λ) pairs with CD and the sigmoid masking function on the five commodity dataset.

The first thing we observe is that this is the first time that every single pair (τ, λ) is above the static predictions. This probably occurs because of volatility in the commodity dataset. Some commodities are extremely volatile, and the price ranges from big highs to lows for a high percentage of the labels. In addition, other commodities in the dataset have very low volatility for a long period of time. Because of the skewed nature of the commodity dataset, it is possible for our model to choose to predict only at times when it can make many correct predictions. To observe the dynamic model over time, in Figure 8 we plot the F1 scores of our best dynamic commodity model and two static models over 10 walk forward periods.

The F1 score of our dynamic model is superior at nearly every walk forward time period in comparison to a static single prediction model, except for walk forward period 5. As expected, the single prediction static model has a better F1 score on average than the static model making 7 predictions. However, the difference between 1 prediction and 7 predictions is small.

Figure 8: The F1 score of our best dynamic commodity model and two static models over the 10 walk forward periods.

In Table 2, we present a summary of the best results for our dynamic model. To measure which pair (τ, λ) is best, we measure the relative improvement between the F1 score of the dynamic model and the F1 score of the static model at the equivalent prediction length (from the blue curves shown in the previous figures). The dynamic model with CD and sigmoid does extremely well on the commodity dataset. The same functions also give the best F1 score on the ETF dataset.

Table 2: Dynamic Model Summary (All Sigmoid)

Architecture              F1 Gap %   (τ, λ)      Prediction Length
Dynamic Max (ETF)         5.44       (0. , .0)   7.68
Dynamic CD (ETF)          6.45       (0. , .0)   6.76
Dynamic TV (ETF)          4.49       (0. , .1)   6.16
Dynamic EMD (ETF)         3.72       (0. , .1)   6.00
Dynamic CD (Commodity)    21.2       (0. , .5)   6.89
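Under our reading, the "F1 Gap %" column is the relative improvement over the static curve interpolated at the dynamic model's average prediction length; the helper below is a hypothetical illustration of that computation.

```python
import numpy as np

def f1_gap_percent(dynamic_f1, avg_len, static_lens, static_f1s):
    # Interpolate the static model's F1 at the dynamic model's average
    # output length (the blue curve in the figures); static_lens must be
    # increasing. Returns the relative improvement in percent.
    baseline = np.interp(avg_len, static_lens, static_f1s)
    return 100.0 * (dynamic_f1 - baseline) / baseline
```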
4.5 Conclusion

In this work, we create a new loss function with the Seq2Seq network that makes a dynamic number of predictions for each input sequence. In addition, we construct a new metric called "Confidence Distance" (CD) to measure the confidence the model has when making a prediction. When testing the new model, we find that a dynamic prediction length model can outperform a similar static Seq2Seq network. In addition, we find that of our four confidence functions, CD gives the most accurate predictions over the maximum, TV, and EMD. We examine two versions of our model with each confidence function: masking the Kullback-Leibler divergence with the indicator function and masking with the sigmoid function. We find that the sigmoid function leads to better and more reliable prediction F1 even though it is more sensitive to the hyperparameter τ.

When using the dynamic model, we recommend keeping the value of λ low and varying τ to get the desired output length to F1 ratio. Seq2Seq models perform well with financial security data, and we recommend their use, with the first decoder input set to the label associated with the final encoder input. For the best dynamic output length performance, use τ values up to 0.5 and a small λ.

References

Akita, R., A. Yoshihara, T. Matsubara, and K. Uehara (2016): "Deep learning for stock prediction using numerical and textual information," in IEEE/ACIS International Conference on Computer and Information Science (ICIS), 1–6.

Arjovsky, M., S. Chintala, and L. Bottou (2017): "Wasserstein GAN," arXiv preprint arXiv:1701.07875.

Bahdanau, D., K. Cho, and Y. Bengio (2014): "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473.

Bao, W., J. Yue, and Y. Rao (2017): "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLOS ONE.

Borovykh, A., S. Bohte, and C. W. Oosterlee (2017): "Conditional time series forecasting with convolutional neural networks," arXiv preprint arXiv:1703.04691.

Chen, K., Y. Zhou, and F. Dai (2015): "A LSTM-based method for stock returns prediction: A case study of China stock market," in IEEE International Conference on Big Data, 2823–2824.

Chong, E., C. Han, and F. C. Park (2017): "Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies," Expert Systems with Applications, 187–205.

Ding, X., Y. Zhang, T. Liu, and J. Duan (2015): "Deep learning for event-driven stock prediction," in Twenty-Fourth International Joint Conference on Artificial Intelligence.

Dixon, M., D. Klabjan, and J. H. Bang (2016): "Classification-based financial markets prediction using deep neural networks," Algorithmic Finance, 1–11.

Frogner, C., C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio (2015): "Learning with a Wasserstein loss," in Advances in Neural Information Processing Systems, 2053–2061.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014): "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2672–2680.

Graves, A. (2016): "Adaptive computation time for recurrent neural networks," arXiv preprint arXiv:1603.08983.

Graves, A., G. Wayne, and I. Danihelka (2014): "Neural Turing machines," arXiv preprint arXiv:1410.5401.

He, K., X. Zhang, S. Ren, and J. Sun (2016): "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Janocha, K. and W. M. Czarnecki (2017): "On loss functions for deep neural networks in classification," Schedae Informaticae, 49–59.

Kingma, D. P. and J. Ba (2014): "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.

Krauss, C., X. A. Do, and N. Huck (2017): "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500," European Journal of Operational Research, 689–702.

Niaki, S. T. A. and S. Hoseinzade (2013): "Forecasting S&P 500 index using artificial neural networks and design of experiments," Journal of Industrial Engineering International, 1.

Sirignano, J. A. (2016): "Deep learning for limit order books," arXiv preprint arXiv:1601.01987.

Sukhbaatar, S., J. Weston, R. Fergus, et al. (2015): "End-to-end memory networks," in Advances in Neural Information Processing Systems, 2440–2448.

Sutskever, I., O. Vinyals, and Q. V. Le (2014): "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 3104–3112.
Vinyals, O., M. Fortunato, and N. Jaitly (2015): "Pointer networks," in Advances in Neural Information Processing Systems, 2692–2700.

Xingjian, S., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo (2015): "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 802–810.