Dynamic Prediction Length for Time Series with Sequence to Sequence Networks
Mark Harmon, Diego Klabjan
Department of Engineering Sciences and Applied Mathematics
Department of Industrial Engineering and Management Sciences
Northwestern University
Abstract
Recurrent neural networks and sequence to sequence models require a predetermined prediction output length. Our model addresses this by allowing the network to predict a variable length output in inference. A new loss function with a tailored gradient computation is developed that trades off prediction accuracy and output length. The model utilizes a function to determine, against a predetermined threshold, whether a particular output at a given time step should be evaluated. We evaluate the model on the problem of predicting the prices of securities. We find that the model makes longer predictions for more stable securities and that it naturally balances prediction accuracy and length.
4.1 Introduction

Recurrent neural networks are very popular and effective at solving difficult sequence problems such as language translation, creation of artificial music, and video prediction. New architectures, such as Sequence to Sequence networks by Sutskever, Vinyals, and Le (2014) and Memory Networks by Sukhbaatar, Weston, Fergus et al. (2015), are used to solve problems in language translation and to answer questions using a large memory bank. However, these problems generally have training data with given sequence outputs (for example, a model translating a sentence from English to Spanish). Because input and output sequences are known a priori for these problems, it is possible to solve them with a fixed model architecture.

A fixed model architecture is effective for such sequences, but there are a number of problems involving multiple time series datasets that do not have a natural sequence size. For example, a company may wish to predict the number of products to be shipped out for sale based upon customer demand. Each product has a different amount of demand volatility, which can make an enormous difference in how far in advance the company is willing to predict the demand of a product. In this case, it would be extremely useful to have a model that can balance F1 score and the number of future predictions based upon a product's base demand. Another example, which we explore in this work, is financial security price prediction. Some securities are extremely volatile, which makes prediction for longer times highly inaccurate. On the other hand, low volatility securities are easier to predict further into the future.

The biggest problem in multiple time series prediction when it comes to dynamic prediction length is that the training data does not exhibit output sequences of various lengths. For this reason, a different model is required. In multiple time series, input sequences can be naturally created, for example, by a fixed-size sliding window. However, the length of the output sequences can be dynamic since typically there is flexibility in how far to predict into the future. In inference, we allow our prediction model to generate a different number of predictions depending on the current input sequence, as well as a different number of predictions per time series. The number of predictions the model generates depends on a thresholding function that determines the model's confidence in a particular output. If the confidence is too low, we no longer consider the predictions our model generates for that particular sample.

The main objective of our study is to create a prediction model that balances F1 score and predicting into the future. The main challenge is the fact that samples taken from infinite time series do not naturally contain dynamic length predictions. This aspect requires a different loss function that includes the notion of confidence and a tailored computation of gradients on different length output sequences. We create a model architecture relying on a novel loss function that allows for a dynamic number of output predictions. We explain the ideas and concepts by utilizing predictions of several correlated financial securities. In this case, rather than having to adjust the predictive length manually depending on market volatility, the model learns how far in advance it can confidently predict during training. For security prediction, a model that is not limited to a fixed number of output predictions can provide much more robust price predictions. For example, we expect that some security j with high volatility during the training phase should result in fewer predictions. On the other hand, if the same security j is trained during a low volatility period, we expect the model to generate more predictions. Clearly, a dynamic model that can easily adjust to the current training environment of a security can provide huge benefits. In inference, the model provides a natural way to stop generating output predictions.
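To make the stopping rule concrete, the following is a minimal NumPy sketch of inference with a dynamic output length. It is an illustration under our own assumptions, not the paper's implementation: `step_fn` stands in for the trained Seq2Seq decoder step, `confidence_fn` for one of the confidence functions of Section 4.3, and rolling the predicted label back into the window is a simplification of how a decoder consumes its previous output.

```python
import numpy as np

def sliding_windows(series, size):
    # Fixed-size sliding windows over a long time series (the input sequences).
    return np.stack([series[i:i + size] for i in range(len(series) - size)])

def predict_dynamic(step_fn, window, confidence_fn, tau, max_len=10):
    # Generate outputs until the confidence function falls below the threshold tau.
    history, outputs = [], []
    for _ in range(max_len):
        probs = step_fn(window)            # next-step class distribution
        history.append(probs)
        if confidence_fn(history) < tau:   # confidence too low: stop predicting
            break
        outputs.append(int(np.argmax(probs)))
        # Roll the window forward with the predicted label (a simplification;
        # the actual decoder feeds its previous output and hidden state).
        window = np.append(window[1:], outputs[-1])
    return outputs
```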
Our work contains two main contributions. First, we create a way to measure the confidence of a model's prediction without having to rely on Bayesian statistics. Second, our model is the first of its kind to allow for dynamic prediction length with sequence to sequence networks. Along the way, we have to tailor the gradient computation.

In our study, we use two financial security datasets which consist of several years of tick prices. One contains five distinct securities and the other contains twenty-two different securities. We find that our new architecture successfully balances prediction F1 score and the number of future predictions. In addition, our architecture uses different prediction lengths at different times for each security due to stochastic drift between the training and test sets. The best dynamic output prediction length model is a sequence to sequence network which earns an F1 score of 0.503, in contrast to a traditional LSTM structure that only achieves an F1 score of 0.209 for a single prediction.

In Section 4.2, we review two main subjects related to our work. First, we inspect studies within the realm of deep learning related to our new model architecture. Second, we analyze other work on predicting financial securities with a focus on machine learning and deep learning methods. In Section 4.3, we present the dynamic prediction length model, while in Section 4.4 we present a computational study based on securities.

4.2 Related Work

Similar to our concept of dynamic output prediction, Pointer Networks by Vinyals, Fortunato, and Jaitly (2015) are used for problems such as sorting variable sized sequences. They use an attention mechanism that points to a particular part of the input sequence that is used as the next output. Although this architecture can allow for variable input sizes, the output size is constrained to be the same size as the input. Our model allows for any size output (unrelated to the input size) up to some arbitrary maximum size. Pointer networks are also not applicable to our case since the output is not a specific part of the input.

Graves (2016) introduces adaptive computation time (ACT) for recurrent networks. The author creates an additional metric that allows the network to continue "pondering" the input through additional computation. We can think of ACT as a model that at each time step has a dynamic number of stacked LSTM or GRU cells. We considered using ACT along with our architecture, but it increases the computational complexity by recalculating the feed-forward step a number of times. Basing the stopping decision on whether the output at a given time requires substantial computational effort does not work when a single layer suffices most of the time. In time series with a walk forward strategy, smaller, less complex models are more effective. This renders ACT inappropriate.
In addition, online algorithms need to be agile, and the added computational time would be a detriment to such a model. Therefore, we choose to use traditional LSTM architectures that train much faster.

To achieve better training and dynamic output sizes, we add an additional term to the loss function and modify how predictions are scored. Although there have been many new architectures, such as Residual Networks by He, Zhang, Ren, and Sun (2016), Memory Networks by Sukhbaatar et al. (2015), and Neural Turing Machines by Graves, Wayne, and Danihelka (2014), none of these incorporate a new loss function with their architecture. This tendency of not creating new loss functions is noted by Janocha and Czarnecki (2017). The authors explain that although there is a lot of work on neural network activations, optimization, and architectures, the loss function used for nearly all neural networks is a combination of log loss and L1/L2 norms. Janocha and Czarnecki (2017) show that loss functions previously deemed inferior in deep learning can be more robust than log loss for classification problems. Therefore, it is important for studies to continue exploring loss functions to increase network performance and to create a variety of new models.

There are some studies that significantly change the loss function in deep learning. For example, GANs by Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio (2014) are immensely popular and design a loss function for their specific problem and architecture to balance generation and classification. ACT by Graves (2016) also uses a unique addition to the loss function so that the network does not "ponder" the input for too long. There is growing momentum for using the Wasserstein metric as the loss function, as seen in the work by Frogner, Zhang, Mobahi, Araya, and Poggio (2015). The Wasserstein metric is even successfully used in GANs in the work of Arjovsky, Chintala, and Bottou (2017). We expand the volume of work in this area by developing a loss function that encourages a model to have a dynamic output length at prediction. In contrast to loss functions specific to a problem type, our loss function can be used with any recurrent neural network architecture.

Next, we focus on works predicting financial securities with deep networks. Most work utilizing deep neural networks focuses on applying other forms of data for prediction, such as news about financial markets or specific companies. Chong, Han, and Park (2017) review many of the prediction methods commonly used for security prediction and the predicted outcomes. Niaki and Hoseinzade (2013) focus on predicting upward and downward movement of the S&P 500 with a deep feed-forward network. Ding, Zhang, Liu, and Duan (2015) use historical pricing data in combination with financial news data with a deep feed-forward network. Krauss, Do, and Huck (2017) utilize both random forests and deep feed-forward networks to find statistical arbitrage on the S&P 500. In contrast to the aforementioned models, we predict significant movement in prices, utilize temporal models, and predict multiple securities with a single model.

Sirignano (2016) uses deep learning directly on financial security price data as well. He utilizes limit order book data with multiple ask/bid prices for each security to predict the change in the spread. In addition, he uses a deep feed-forward network and a separate model for the prediction of each security.
In contrast, we predict all securities with one model and apply recurrent neural networks in addition to feed-forward networks. Another similar work, Dixon, Klabjan, and Bang (2016), uses deep feed-forward networks for prediction directly on security prices. In contrast, our study applies recurrent and convolutional recurrent models in addition to the baseline feed-forward network.

There is also work on security prediction beyond the standard feed-forward network. Borovykh, Bohte, and Oosterlee (2017) use convolutional neural networks over the security time series to make predictions. Others, such as Bao, Yue, and Rao (2017), use stacked auto-encoders and wavelet transformations to form an embedding of financial data, and feed this into an RNN model for prediction. Chen, Zhou, and Dai (2015) use an RNN model on opening and closing security prices on the Chinese market with seven classification categories. Akita, Yoshihara, Matsubara, and Uehara (2016) use both textual and price information to predict a security price. In contrast to these works, our model uses only security price data. Also, we apply sequence to sequence and convolutional LSTM (Xingjian, Chen, Wang, Yeung, Wong, and Woo, 2015) models in addition to having a dynamic output length for security predictions.
4.3 Dynamic Prediction Length Model

In this section, we explain the architecture of our model that predicts a dynamic number of outputs. The model uses a sequence to sequence (Seq2Seq) network in combination with our newly proposed loss function. For details on Seq2Seq networks, please see the original paper by Sutskever et al. (2014). We also use attention in our model, which is explained in the paper by Bahdanau, Cho, and Bengio (2014). In addition, we use teacher forcing. We found that the first input to the decoder has a great impact on accuracy. By far the best performing first decoder input is the ground truth associated with the last encoder input.

For a general Seq2Seq network, let θ describe the trainable weights, X be an input sample, and f^q_t(X; θ; f_{i<t}) be the predicted class distribution for security q at time step t, which may depend on the earlier predictions f_{i<t}. To create a model that predicts a dynamic output length for each time series, we introduce a function g(f_{i≤t}) that captures the confidence of the prediction at time t. Note that the confidence function may depend on the previous values f_1, f_2, ..., f_t. To determine the output length, we measure the confidence function against a threshold value τ (which is a hyper-parameter), resulting in the objective

min_θ E_{(X,Y)} Σ_{q=1}^{Q} Σ_t [ M(G^q_t, τ) · KL(Y^q_t ‖ f^q_t(X; θ; f_{i<t})) + λ max(τ − G^q_t, 0) ],

where G^q_t = g(f^q_{i≤t}) is the confidence, Y^q_t is the label distribution of security q at step t, and M is a masking function that is either the indicator of G^q_t ≥ τ, giving loss (4.2), or a sigmoid applied to G^q_t − τ, giving loss (4.3).
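A minimal NumPy sketch of this loss for a single security follows, assuming the confidence values G_t are supplied separately by one of the confidence functions discussed below. The sigmoid sharpness constant `a` and all names are our own illustrative choices; the actual model optimizes this objective end to end in TensorFlow with the tailored gradient computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_length_loss(preds, labels, conf, tau, lam, mask="sigmoid", a=10.0):
    """Loss trading off prediction accuracy against output length.

    preds:  (T, C) predicted class distributions for one security.
    labels: (T, C) one-hot ground-truth distributions.
    conf:   (T,)  confidence values G_t from a confidence function.
    The KL term is masked so low-confidence steps are not scored; the
    hinge term lam * max(tau - G_t, 0) penalizes low confidence and
    pushes the model toward longer outputs.
    """
    eps = 1e-9
    kl = np.sum(labels * (np.log(labels + eps) - np.log(preds + eps)), axis=1)
    if mask == "sigmoid":                  # smooth mask, loss (4.3)
        m = sigmoid(a * (conf - tau))      # a: sharpness, our own choice
    else:                                  # hard mask, loss (4.2)
        m = (conf >= tau).astype(float)
    penalty = lam * np.maximum(tau - conf, 0.0)
    return np.sum(m * kl + penalty)
```

Averaging this quantity over securities and training samples gives the expectation in the objective above.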
4.4 Computational Study

The value of β is chosen separately for the twenty-two security dataset. For the other dataset, a few of the securities contain large data imbalances at the large upward and downward movements; since this is the case, we pick β = 0.

The following results are based upon the best sequence sizes, neurons per layer, optimization method, and number of layers found through a hyperparameter search for each network. The models were trained on Titan X and Nvidia 1080 GPUs and implemented in TensorFlow. For each dataset, we have one model to classify all twenty-two ETFs from one dataset and one model to classify all five commodities in the other dataset. We test the ETF dataset on all baseline and final model architectures. Due to computational time constraints, we only calculate the results of the commodity dataset using the best model from the ETF dataset.

For our first experiments, we use a FFN, an LSTM network, an LSTM Seq2Seq network, and a ConvLSTM Seq2Seq network in the setting of a fixed prediction length of 1 or 10. The FFN consists of two layers with sixty-four neurons and ten returns for each security. We train the FFN with the stochastic gradient optimization method. The basic recurrent LSTM network consists of two LSTM layers, each with sixty-four neurons and an input sequence size of twenty. For all recurrent networks, we use the ADAM optimization method by Kingma and Ba (2014), which is known to perform well for recurrent networks. We train for a number of epochs until the F1 score no longer increases on the validation dataset. Last, the sequence to sequence network consists of one encoder and one decoder layer, each with sixty-four neurons and either LSTM or ConvLSTM layers. We tried using deeper networks, but those resulted in much lower F1 scores. We utilize the same model architecture for both the ETF and commodity datasets.

We are using financial security datasets, which typically contain imbalanced classes. With imbalanced classes, the traditional accuracy measure does not display the true effectiveness of a model. Therefore, we utilize the F1 score, which accounts for imbalanced classes. The F1 score is calculated via a one-against-all structure of precision and recall for each of the five classes. The values we report are an average of the F1 scores over classes and securities.

Table 1: F1 Scores of Baseline Models

Architecture               ETFs    Commodities
FFN (One Pred)             0.176   0.159
LSTM (One Pred)            0.209   0.286
ConvLSTM (One Pred)        0.513   0.410
LSTM Seq2Seq (One Pred)    0.598   0.513
LSTM Seq2Seq (Ten Pred)    0.509   0.474

The FFN, LSTM, LSTM Seq2Seq, and ConvLSTM Seq2Seq results are presented in Table 1. To produce these results, we give each model a warm start by training on 5 windows of data. We then train the models for 25 more windows on the 22 ETF and 5 commodity datasets. The F1 scores in Table 1 are the average test scores over this 25 window period.

We use our baseline results in Table 1 to determine the best architecture for our dynamic output length model and to examine the differences between the various Seq2Seq networks. We observe that each single prediction model generally increases in performance when using more robust models (such as Seq2Seq). As expected, the Seq2Seq models and recurrent LSTM models far outperform the FFN, but we did not expect such a large performance increase when utilizing a Seq2Seq model. Even when making 10 predictions forward in time, the Seq2Seq model more than doubles the F1 score compared to the traditional LSTM network that makes a single prediction. With ConvLSTM layers, we found that the best formulation utilizes 1D convolutions with each security representing a channel in the image. Ultimately, we find that the best ConvLSTM Seq2Seq model does not perform as well as the LSTM Seq2Seq model. Since the LSTM Seq2Seq model is by far the most accurate, we choose it for our dynamic output length model.
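As an illustration of the evaluation protocol just described, here is a sketch, under our own assumptions about the data layout (each window carries its own train and test split), of the warm start plus walk-forward loop with the averaged one-against-all F1 score. `train_fn` and `predict_fn` are hypothetical stand-ins for the actual models.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=5):
    # One-against-all F1 per class, averaged over classes; the paper
    # additionally averages over securities.
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

def walk_forward(windows, train_fn, predict_fn, warm=5):
    # Warm start on the first `warm` windows, then train and test one
    # window at a time, averaging the test F1 over the remaining windows.
    for w in windows[:warm]:
        train_fn(w["train"])
    f1s = []
    for w in windows[warm:]:
        train_fn(w["train"])
        f1s.append(macro_f1(w["test_labels"], predict_fn(w["test_inputs"])))
    return float(np.mean(f1s))
```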
We begin with the two confidence functions that depend only upon the current prediction: the maximum and CD ("Confidence Distance"). We study the sensitivity of τ and λ, and compare the indicator (Ind) and sigmoid (Sig) functions, i.e., loss functions (4.2) and (4.3) respectively. In the following sections, we refer to the indicator and sigmoid functions as masking functions because they mask the output from the Kullback-Leibler divergence.

A large gradient for λ max(τ − G^q_t, 0) means that we encourage the model to make more predictions. We can control the output length by adjusting the hyperparameters λ and τ. In Figure 1 we present a range of τ and λ values and the respective F1 scores with G^q_t being the maximum. In addition, we annotate in bold the best pair (τ, λ), i.e., the one that results in the largest distance above the static curve. We add this annotation to all figures of this type.

In Figure 1, we first create a line that gives the F1 scores from a static Seq2Seq model for 5 training windows after a 5 window warm start. We compute F1 for four prediction lengths between 1 and 10 by the static model and interpolate the remaining values. It is important to point out that the F1 scores here are different than in Table 1 since we are measuring 5 windows rather than 25. We expect the dynamic model to be above this curve when using appropriate values for (τ, λ). Each point in Figure 1 is the average output length and F1 score on the 22 ETF dataset for a pair (τ, λ). On the left is the indicator masking function and on the right is the sigmoid masking function. For both figures, we use the same values for τ and λ. We find that with τ = 0.5, our model has high accuracy with a relatively large average output length.

Figure 1: F1 and average output length for pairs of τ and λ. We use the maximum confidence function with the indicator function on the left and sigmoid on the right.

The red points are pairs (τ, λ) that perform worse than the static model and the green points are those that perform better. We expect to find many pairs of hyperparameters that lead to superior F1 scores since our model uses a dynamic prediction length for each sample. There are a few outliers that fall well below the estimated performance in the figure on the left, which uses the indicator masking function. The sigmoid function on the right has many more points above the static curve and has no points below it except at large λ's. When λ is too big, the model stops training for prediction accuracy and only focuses on creating more predictions. Note that not all large λ's lead to strictly poor performance, since the best sigmoid pair has λ = 5.0.

In Figure 2, we present the results for pairs (τ, λ) with CD, with the indicator masking function on the left and the sigmoid masking function on the right. Note that with CD, the number of points below the static line is smaller compared to the maximum function. When λ is too large, the indicator version has three points that fall far below our estimation. It is important to point out that there is a large distance between the green points and the blue curve that is rarely seen with the other confidence functions. Based on these results, we recommend using CD over the maximum function. It requires very little extra computation and leads to better F1 with a similar average output length.

Figure 2: F1 score and average output length for pairs of τ and λ. We use CD with the indicator function on the left and sigmoid function on the right.

In Figures 1 and 2, we show that large λ's usually lead to inferior models. Therefore, we explore τ to see if we can find a relationship between this hyperparameter and the average output length. We present two images in Figure 3 of the sensitivity of the indicator and sigmoid masking functions with the maximum and CD functions. We create these figures by choosing λ = 0.1, which produces the best performance across all cases of τ, and plot the relationship between the average number of predictions and τ. Note that we use the same approach of utilizing a 5 window warm start and measure the average output length based upon the next 5 windows.

Figure 3: The output length and τ values for both CD and maximum functions with λ = 0.1. The trend lines are made via linear regression and show a statistically significant relationship between τ and output length.

On the left, we observe that the slope for the sigmoid masking function is steeper than for the indicator function with both confidence functions. For these two functions, the sensitivity is not nearly as large as for the other two confidence functions that we explore in the next section. For CD, the slope is much smaller (in terms of absolute value), which provides additional evidence to recommend it over the maximum confidence function. Given these two confidence and masking functions, we recommend using CD with the sigmoid masking function. Although the sigmoid masking function is more sensitive to τ, we find that sigmoid does not have as many points falling under the static curve.

Next, we examine the two confidence functions that depend on the current and previous predictions. These are total variation (TV) and the first Wasserstein distance (EMD). We design the first two functions, maximum and CD, to find the confidence of the current prediction using only the current prediction and input sequence. On the other hand, we use TV and EMD to determine the volatility between the current and previous prediction. If the volatility is low, we expect the model to be confident enough to make several predictions.
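Below are minimal NumPy sketches of the four confidence measures as we read them. The paper's exact definition of CD does not survive in this extraction, so taking it as the gap between the two largest predicted probabilities is our assumption; for TV and EMD, small distances mean high confidence, so the threshold comparison is reversed, as noted with Figure 6. Treating the five class labels as ordered lets the first Wasserstein distance reduce to a sum of absolute CDF differences.

```python
import numpy as np

def conf_max(p):
    # Confidence = largest predicted class probability.
    return float(np.max(p))

def conf_cd(p):
    # "Confidence Distance": gap between the two largest probabilities
    # (our reading of CD; a larger gap means a more confident model).
    top2 = np.sort(p)[-2:]
    return float(top2[1] - top2[0])

def tv_distance(p, q):
    # Total variation between consecutive predictions; the threshold test
    # is reversed for TV and EMD (small distance = high confidence).
    return 0.5 * float(np.sum(np.abs(p - q)))

def emd_1d(p, q):
    # First Wasserstein distance between distributions over the ordered
    # class labels: sum of absolute CDF differences.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))
```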
We first examine the results from TV in Figure 4. TV does not garner high performance with either masking function. There are some points with the sigmoid masking function that are slightly better than the expected F1 scores, but TV is clearly not a successful measure. There are likely cases where the TV is large between two predictions while the model is still confident. For example, if the prediction changes from one label to another at the next prediction step, the TV between these two predictions may be large. However, this does not necessarily imply that the model is not confident about its next prediction. Also note that there are two rows of points above and below the static curve in the sigmoid version. The difference between these two is that the value for λ is larger for the points below the estimated curve, which is similar to the results we obtain from the maximum and CD functions.

Figure 4: The F1 score and respective output length for various (τ, λ) pairs with TV. The indicator function is on the left while the sigmoid function is on the right.

In Figure 5, we observe that EMD exhibits a similar pattern to TV. As with TV, the indicator masking function with EMD performs poorly on the left compared to the better F1 scores of the sigmoid masking function on the right. EMD performs slightly better than TV, with a few more points above the estimated curve. This likely occurs because EMD is a more robust probability measure. Instead of finding the maximum distance between two probability measures P and Q (TV), EMD finds the cost of transforming the entirety of one distribution P into another distribution Q. EMD is more robust, but a label change from one time step to the next may correlate with a large EMD. However, this does not always imply lower prediction confidence.

Figure 5: The F1 score and respective output length for various (τ, λ) pairs with EMD. The indicator function is on the left while the sigmoid function is on the right.

To conclude our examination of TV and EMD, we observe the relationship between the number of predictions and τ in Figure 6. Note that the relationship is the opposite of that of the maximum and CD functions since we reverse the comparison with respect to τ, as clarified in Section 4.3.1. The slopes for the sigmoid versions of both metrics are larger than their indicator function counterparts. We also observe that the slopes are large overall when compared to both the maximum and CD confidence functions. Slight changes in τ lead to large changes in output lengths for TV and EMD. In general, we found that the values of these two metrics tend to be small, which is what leads to their sensitivity to τ. Overall, for all metrics, the sigmoid masking function slope is larger in magnitude than the same version with the indicator masking function. Also, TV and EMD show that confidence in predictions depends upon more than volatility.

Figure 6: The output length and τ values for both TV and EMD with λ = 0.1. The trend lines are made via linear regression and show a statistically significant relationship between τ and output length.

Next, we test our model on the five commodity dataset. Since we observe that CD with the sigmoid masking function provides the best performing model on the ETF dataset, we use the identical functions on the commodity dataset. In addition, we create four static models, predicting between 1 and 10 time steps each, to create the blue curve as in the previous ETF figures.
In Figure 7, we present the dynamic model results on the commodity dataset. As with the previous figures of the same type, we run the model for a total of 10 walk-forward steps and average the final 5 F1 scores, which are reported in Figure 7.

Figure 7: The F1 score and respective output length for various (τ, λ) pairs with CD and the sigmoid masking function on the five commodity dataset.

The first thing we observe is that this is the first time that every single pair (τ, λ) is above the static predictions. This probably occurs because of volatility in the commodity dataset. Some commodities are extremely volatile, and the price ranges from big highs to lows for a high percentage of the labels. In addition, other commodities in the dataset have very low volatility for a long period of time. Because of the skewed nature of the commodity dataset, it is possible for our model to choose to predict only at times when it can make many correct predictions. To observe the dynamic model over time, in Figure 8 we plot the F1 scores of our best dynamic commodity model and two static models over 10 walk forward periods.

The F1 score of our dynamic model is superior at nearly every walk forward time period in comparison to a static single prediction model, except for walk forward period 5. As expected, the single prediction static model has a better F1 score on average than the static model making 7 predictions. However, the difference between 1 prediction and 7 predictions is small.

Figure 8: The F1 score of our best dynamic commodity model and two static models over the 10 walk forward periods.

In Table 2, we present a summary of the best results for our dynamic model. To measure which pair (τ, λ) is best, we measure the relative improvement between the F1 score of the dynamic model and the F1 score of the static model at the equivalent prediction length (from the blue curves shown in the previous figures). The dynamic model with CD and sigmoid does extremely well on the commodity dataset. The same functions also give the best F1 score on the ETF dataset.

Table 2: Dynamic Model Summary (All Sigmoid)

Architecture              F1 Gap %   (τ, λ)      Prediction Length
Dynamic Max (ETF)         5.44       (0. , .0)   7.68
Dynamic CD (ETF)          6.45       (0. , .0)   6.76
Dynamic TV (ETF)          4.49       (0. , .1)   6.16
Dynamic EMD (ETF)         3.72       (0. , .1)   6.00
Dynamic CD (Commodity)    21.2       (0. , .5)   6.89
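Under our reading, the "F1 Gap %" column is the relative improvement over the static curve interpolated at the dynamic model's average prediction length; the helper below is a hypothetical illustration of that computation.

```python
import numpy as np

def f1_gap_percent(dynamic_f1, avg_len, static_lens, static_f1s):
    # Interpolate the static model's F1 at the dynamic model's average
    # output length (the blue curve in the figures); static_lens must be
    # increasing. Returns the relative improvement in percent.
    baseline = np.interp(avg_len, static_lens, static_f1s)
    return 100.0 * (dynamic_f1 - baseline) / baseline
```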
4.5 Conclusion

In this work, we create a new loss function with the Seq2Seq network that makes a dynamic number of predictions for each input sequence. In addition, we construct a new metric called "Confidence Distance" (CD) to measure the confidence the model has when making a prediction. When testing the new model, we find that a dynamic prediction length model can outperform a similar static Seq2Seq network. In addition, we find that of our four confidence functions, CD gives the most accurate predictions over the maximum, TV, and EMD. We examine two versions of our model with each confidence function: masking the Kullback-Leibler divergence with the indicator function and masking with the sigmoid function. We find that the sigmoid function leads to better and more reliable prediction F1 even though it is more sensitive to the hyperparameter τ.

When using the dynamic model, we recommend keeping the value of λ low and varying τ to get the desired output length to F1 ratio. Seq2Seq models perform well with financial security data, and we recommend their use, with the first decoder input set to the label associated with the final encoder input. For the best dynamic output length performance, use τ values up to 0.5 and a small λ.

References

Akita, R., A. Yoshihara, T. Matsubara, and K. Uehara (2016): "Deep learning for stock prediction using numerical and textual information," in IEEE/ACIS International Conference on Computer and Information Science (ICIS), 1–6.

Arjovsky, M., S. Chintala, and L. Bottou (2017): "Wasserstein GAN," arXiv preprint arXiv:1701.07875.

Bahdanau, D., K. Cho, and Y. Bengio (2014): "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473.

Bao, W., J. Yue, and Y. Rao (2017): "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLOS ONE.

Borovykh, A., S. Bohte, and C. W. Oosterlee (2017): "Conditional time series forecasting with convolutional neural networks," arXiv preprint arXiv:1703.04691.

Chen, K., Y. Zhou, and F. Dai (2015): "A LSTM-based method for stock returns prediction: A case study of China stock market," in IEEE International Conference on Big Data, 2823–2824.

Chong, E., C. Han, and F. C. Park (2017): "Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies," Expert Systems with Applications, 187–205.

Ding, X., Y. Zhang, T. Liu, and J. Duan (2015): "Deep learning for event-driven stock prediction," in Twenty-Fourth International Joint Conference on Artificial Intelligence.

Dixon, M., D. Klabjan, and J. H. Bang (2016): "Classification-based financial markets prediction using deep neural networks," Algorithmic Finance, 1–11.

Frogner, C., C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio (2015): "Learning with a Wasserstein loss," in Advances in Neural Information Processing Systems, 2053–2061.

Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014): "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2672–2680.

Graves, A. (2016): "Adaptive computation time for recurrent neural networks," arXiv preprint arXiv:1603.08983.

Graves, A., G. Wayne, and I. Danihelka (2014): "Neural Turing machines," arXiv preprint arXiv:1410.5401.

He, K., X. Zhang, S. Ren, and J. Sun (2016): "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Janocha, K. and W. M. Czarnecki (2017): "On loss functions for deep neural networks in classification," Schedae Informaticae, 49–59.

Kingma, D. P. and J. Ba (2014): "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.

Krauss, C., X. A. Do, and N. Huck (2017): "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500," European Journal of Operational Research, 689–702.

Niaki, S. T. A. and S. Hoseinzade (2013): "Forecasting S&P 500 index using artificial neural networks and design of experiments," Journal of Industrial Engineering International, 1.

Sirignano, J. A. (2016): "Deep learning for limit order books," arXiv preprint arXiv:1601.01987.

Sukhbaatar, S., J. Weston, R. Fergus, et al. (2015): "End-to-end memory networks," in Advances in Neural Information Processing Systems, 2440–2448.

Sutskever, I., O. Vinyals, and Q. V. Le (2014): "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 3104–3112.
Vinyals, O., M. Fortunato, and N. Jaitly (2015): "Pointer networks," in Advances in Neural Information Processing Systems, 2692–2700.

Xingjian, S., Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo (2015): "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 802–810.