[PDF] Stock2Vec: A Hybrid Deep Learning Framework for Stock Market Prediction with Representation Learning and Temporal Convolutional Network

Abstract

We have proposed to develop a global hybrid deep learning framework to predict the daily prices in the stock market. With representation learning, we derived an embedding called Stock2Vec, which gives us insight for the relationship among different stocks, while the temporal convolutional layers are used for automatically capturing effective temporal patterns both within and across series. Evaluated on S&P 500, our hybrid framework integrates both advantages and achieves better performance on the stock price prediction task than several popular benchmarked models.

Full PDF

SStock2Vec: A Hybrid Deep Learning Framework forStock Market Prediction with Representation Learningand Temporal Convolutional Network

Xing Wang a , Yijun Wang b , Bin Weng c , Aleksandr Vinel a a Department of Industrial Engineering, Auburn University, AL, 36849, USA b Verizon Media Group (Yahoo!), Champaign, IL, 61820, USA c Amazon.com Inc., Seattle, WA, 98108, USA

Abstract

We have proposed to develop a global hybrid deep learning framework to predict the dailyprices in the stock market. With representation learning, we derived an embedding calledStock2Vec, which gives us insight for the relationship among diﬀerent stocks, while the tem-poral convolutional layers are used for automatically capturing eﬀective temporal patternsboth within and across series. Evaluated on S&P 500, our hybrid framework integrates bothadvantages and achieves better performance on the stock price prediction task than severalpopular benchmarked models.

Keywords:

Stock prediction, Stock2Vec, Embedding, Distributional representation, Deeplearning, Time series forecasting, Temporal convolutional network

1. Introduction

In ﬁnance, the classic strong eﬃcient market hypothesis (EMH) posits that the stockprices follow random walk and cannot be predicted [1]. Consequently, the well-known capitalassets pricing model (CAPM) [2, 3, 4] serves as the foundation for portfolio management,asset pricing, among many applications in ﬁnancial engineering. The CAPM assumes alinear relationship between the expected return of an asset (e.g., a portfolio, an index, or asingle stock) and its covariance with the market return, i.e., for a single stock, CAPM simplypredicts its return r i within a certain market with the linear equation r i ( t ) = α i + β i r m ( t ) , a r X i v : . [ q -f i n . S T ] S e p here the Alpha ( α i ) describes the stock’s ability to beat the market, also refers to as its“excess return” or “edge”, and the Beta ( β i ) is the sensitivity of the expected returns ofthe stock to the expected market returns ( r m ). Both Alpha and Beta are often ﬁtted usingsimple linear regression based on the historical data of returns. With the eﬃcient markethypothesis (EMH), the Alphas are entirely random with expected value of zero, and can notbe predicted.In practice, however, ﬁnancial markets are more complicated than the idealized andsimpliﬁed strong EMH and CAPM. Active traders and empirical studies suggest that theﬁnancial market is never perfectly eﬃcient and thus the stock prices as well as the Alphas canbe predicted, at least to some extent. Based on this belief, stock prediction has long playeda key role in numerous data-driven decision-making scenarios in ﬁnancial market, such asderiving trading strategies, etc. Among various methods for stock market prediction, theclassical Box-Jenkins models [5], exponential smoothing techniques, and state space models[6] for time series analysis are most widely adopted, in which the factors of autoregressivestructure, trend, seasonality, etc. are independently estimated from the historical observa-tions of each single series. In recent years, researchers as well as the industry have deployedvarious machine learning models to forecast the stock market, such as k-nearest neighbors(kNN) [7, 8], hidden Markove model (HMM) [9, 10], support vector machine (SVM) [11, 12],artiﬁcial neural network (ANN) [13, 14, 15, 16, 17], and various hybrid and ensemble meth-ods [18, 19, 20, 18, 21], among many others. The literature has demonstrated that machinelearning models typically outperform traditional statistical time series models, which mightbe mainly due to the following reasons: 1) less strict assumption for the data distributionrequirement, 2) various model architecture can eﬀectively learn complex linear and non-linerfrom data, 3) sophisticated regularization techniques and feature selection procedures pro-vide ﬂexibility and strength in handling correlated input features and control of overﬁtting,so that more features can be thrown in the machine learning models. As the ﬂuctuation ofthe stock market indeed depends on a variety of related factors, in addition to utilizing thehistorical information of stock prices and volumes as in traditional technical analysis [22],recent research of stock market forecasting has been focusing on informative external sourceof data, for instance, the accounting performance of the company [23], macroeconomic eﬀects224, 21], government intervention and political events [25], etc. With the increased popularityof web technologies and their continued evolution, the opinions of public from relevant news[26] and social media texts [27, 28] have an increasing eﬀect on the stock movement, variousstudies have conﬁrmed that combining the extensive crowd-sourcing and/or ﬁnancial newsdata facilitates more accurate prediction [29].During the last decade, with the emergence of deep learning, various neural networkmodels have been developed and achieved success in a broad range of domains, such ascomputer vision [30, 31, 32, 33] and natural language processing [34, 35, 36]. For stock pre-diction speciﬁcally, recurrent neural networks (RNNs) are the most preferred deep learningmodels to be implemented [37, 38]. Convolutional neural networks (CNNs) have also beenutilized, however, most of the work transformed the ﬁnancial data into images to apply 2Dconvolutions as in standard computer vision applications. For example, the authors of [39]converted the technical indicators data to 2D images and classiﬁed the images with CNNto predict the trading signals. Alternatively, [40] directly used the candlestick chart graphsas inputs to determine the Buy, Hold and Sell behavior as a classiﬁcation task, while in[41], the bar chart images were fed into CNN. The authors of [42] uses a 3D CNN-basedframework to extract various sources of data including diﬀerent markets for predicting thenext day’s direction of movement of ﬁve major stock indices, which showed a signiﬁcantimproved prediction performance compared to the baseline algorithms. There also existsresearch combining RNN and CNN together, in which the temporal patterns were learnedby RNNs, while CNNs were only used for either capturing the correlation between nearbyseries (in which the order matters if there are more than 2 series) or learning from images,see [43, 44]. Deployment of CNN in all these studies diﬀers signiﬁcantly from ours, since weaim at capturing the temporal patterns without relying on two-dimensional convolutions. In[45], 1D causal CNN was used for making predictions based on the history of closing pricesonly, while no other features were considered.Note that all of the aforementioned work has put their eﬀort into learning more accurateAlphas, and most of the existing research focuses on deriving separate models for each of thestock, while only few authors consider the correlation among diﬀerent stocks over the entiremarkets as a possible source of information. In other words, the Betas are often ignored.3t the same time, since it is natural to assume that markets can have nontrivial correlationstructure, it should be possible to extract useful information from group behavior of assets.Moreover, rather than the simpliﬁed linearity assumed in CAPM, the true Betas may exhibitmore complicated nonlinear relationships between the stock and the market.In this paper, we propose a new deep learning framework that leverages both the under-lying Alphas and (nonlinear) Betas. In particular, our approach innovates in the followingaspects:1) from model architecture perspective, we build a hybrid model that combines the advan-tages of both representation learning and deep networks. With representation learn-ing, speciﬁcally, we use embedding in the deep learning model to derive implicit Betas,which we refer to as Stock2Vec, that not only gives us insight into the correlation struc-ture among stocks, but also helps the model more eﬀectively learn from the featuresthus improving prediction performance. In addition, with recent advances on deeplearning architecture, in particular the temporal convolutional network, we further re-ﬁne Alphas by letting the model automatically extract temporal information from rawhistorical series.2) and from data source perspective, unlike many time series forecasting work that di-rectly learn from raw series, we generate technical indicators features supplementedwith external sources of information such as online news. Our approach diﬀers frommost research built on machine learning models, since in addition to explicit hand-engineered temporal features, we use the raw series as augmented data input. Moreimportantly, instead of training separate models on each single asset as in most stockmarket prediction research, we learn a global model on the available data over theentire market, so that the relationship among diﬀerent stocks can be revealed.The rest of this paper is organized as follows. Section 2 lists several recent advances thatare related to our method, in particular deep learning and its applications in forecasting aswell as the representation learning. Section 3 illustrates the building blocks and details ofour proposed framework, speciﬁcally, Stock2Vec embedding and the temporal convolutionalnetwork, as well as how our hybrid models are built. Our models are evaluated on the S&P400 stock price data and benchmarked with several others, the evaluation results as well asthe interpretation of Stock2Vec are shown in Section 5. Finally, we conclude our ﬁndingsand discuss the meaningful future work directions in Section 6.

2. Related Work

Recurrent neural network (RNN) and its variants of sequence to sequence (Seq2Seq)framework [46] have achieved great success in many sequential modeling tasks, such as ma-chine translation [47], speech recognition [48], natural language processing [49], and exten-sions to autoregressive time series forecasting [50, 51] in recent years. However, RNNs cansuﬀer from several major challenges. Due to its inherent temporal nature (i.e., the hiddenstate is propagated through time), the training cannot be parallelized. Moreover, trainedwith backpropagation through time (BPTT) [52], RNNs severely suﬀer from the problemof gradient vanishing thus often cannot capture long time dependency [53]. More elaboratearchitectures of RNNs use gating mechanisms to alleviate the gradient vanishing problem,with the long short-term memory (LSTM) [54] and its simpliﬁed variant, the gated recurrentunit (GRU) [55] being the two canonical architectures commonly used in practice.Another approach, convolutional neural networks (CNNs) [56], can be easily parallelized,and recent advances eﬀectively eliminate the vanishing gradient issue and hence help buildingvery deep CNNs. These works include the residual network (ResNet) [57] and its variantssuch as highway network [58], DenseNet [59], etc. In the area of sequential modeling, 1Dconvolutional networks oﬀered an alternative to RNNs for decades [60]. In recent years,[61] proposed WaveNet, a dilated causal convolutional network as an autoregressive genera-tive model. Ever since, multiple research eﬀorts have shown that with a few modiﬁcations,certain convolutional architectures achieve state-of-the-art performance in the ﬁelds of au-dio synthesis [61], language modeling [62], machine translation [63], action detection [64],and time series forecasting [65, 66]. In particular, [67] abandoned the gating mechnism inWaveNet and proposed temporal convolutional network (TCN). The authors benchmarkedTCN with LSTM and GRU on several sequence modeling problems, and demonstrated thatTCN exhibits substantially longer memory and achieves better performance.Learning of the distributed representation has also been extensively studied [68, 69, 70]5ith arguably the most well-known application being word embedding [49, 34, 35] in languagemodeling. Word embedding maps words and phrases into distributed vectors in a semanticspace in which words with similar meaning are closer, and some interesting relations amongwords can be revealed, such as King − Man ≈ Queen − WomanParis − France ≈ Rome − Italyas shown in [34]. Motivated by Word2Vec, the neural embedding methods have been ex-tended to other domains in recent years. The authors of [71] obtained item embeddingfor recommendation systems through a collaborative ﬁltering neural model, and‘ called itItem2Vec which is capable of inferring relations between items even when user informationis not available. Similarly, [72] proposed Med2Vec that learns the medical concepts with thesequential order and co-occurrence of the concept codes within patients’ visit, and showedhigher prediction accuracy in clinical applications. In [73], the authors mapped every categor-ical features into “entity embedding” space for structured data and applied it successfully ina Kaggle competition, they also showcased the learned geometric embedding coincides withthe real map surprisingly well when projected to 2D space.In the ﬁeld of stock prediction, the term “Stock2Vec” has already been used before.Speciﬁcally, [74] trained word embedding that specializes in sentiment analysis over theoriginal Glove and Word2Vec language models, and using such a “Stock2Vec” embeddingand a two-stream GRU model to generate the input data from ﬁnancial news and stock prices,the authors predicted the price direction of S&P500 index. The authors of [75] proposedanother “Stock2Vec” which also can be seen as a specialized Word2Vec, trained using the co-occurences matrix with the number of the news articles that mention both stocks as entries.Stock2Vec model proposed here diﬀers from these homonymic approaches and has its distinctcharacteristics. First, our Stock2Vec is an entity embedding that represent the stock entitiesrather than a word embedding that denotes the stock names with language modeling. Asthe diﬀerence between entity embedding and word embedding may seem ambiguous, moreimportantly, instead of training the linguistic models with the co-occurrences of the words,our Stock2Vec embedding is trained directly as features through the overall predictive model,6ith the direct objective that minimizes prediction errors, thus illustrating the relationshipsamong entities, while the others are actually ﬁne-tuned subset of the original Word2Veclanguage model. Particularly inspiring for our work are the entity embedding [73] and thetemporal convolutional network [67].

3. Methodology

We focus on predicting the future values of stock market assets given the past. More for-mally speaking, our input consists of a fully observable time series signals y T = ( y , · · · , y T )together with another related multivariate series X T = ( x , · · · , x T ), in which x t ∈ R n − ,and n is the total number of series in the data. We aim at generating the correspond-ing target series ˆ y T +1: T + h = (ˆ y T +1 , · · · , ˆ y T + h ) ∈ R h as the output, where h ≥ θ to obtain a nonlinear mapping from the input state space tothe predicted series, i.e., ˆ y T +1: T + h = f ( X T , y T | θ ), so that the distribution of our outputcould be as close to the true future values distribution as possible. That is, we wish toﬁnd min θ E X , y (cid:80) T + ht = T +1 KL (cid:0) y t || ˆ y t (cid:1) . Here, we use Kullback-Leibler (KL) divergence to mea-sure the diﬀerence between the distributions of the true future values y T +1: T + h and thepredictions ˆ y T +1: T + h . Note that our formulation can be easily extended to multivariateforecasting, in which the output and the corresponding input become multivariate seriesˆ y T +1: T + h ∈ R k × h and y T ∈ R k × h , respectively, where k is the number of forecasting vari-ables, The related input series is then X T ∈ R ( n − k ) × T , and the overall objective becomesmin θ E X T , y T (cid:80) T + ht = T +1 (cid:80) ki =1 KL (cid:0) y i,t || ˆ y i,t (cid:1) . In this paper, in order to increase the sample eﬃ-ciency and maintain a relatively small number of parameters, we will train d separate modelsto forecast each series individually. In machine learning ﬁelds, the categorical variables, if are not ordinal, are often one-hotencoded into a sparse representation. i.e., e : x (cid:55)→ δ ( x, c ) , δ ( x, c ) is the Kronecker delta, in which each dimension represents a possible category.Let the number of categories of x be | C | , then δ ( x, c ) is a vector of length | C | with theonly element set to 1 for x = c , and all others being zero. Note that although providinga convenient and simple way of representing categorical variables with numeric values forcomputation, one-hot encoding has various limitations. First of all, it does not place similarcategories closer to one another in vector space, within one-hot encoded vectors, all cat-egories are orthogonal to each other thus are totally uncorrelated, i.e., it cannot provideany information on similarity or dissimilarity between the categories. In addition, if | C | is large, one-hot encoded vectors can be high-dimensional and often sparse, which meansthat a prediciton model has to involve a large number of parameters resulting in ineﬃcientcomputaitons. For the cross-sectional data that we use for stock market, the number of totalinteractions between all pairs of stocks increases exponentially with the number of symbolswe consider, for example, there are approximately (cid:0) (cid:1) ≈ . O ( n ) for a n × n matrix), and they cannot adapt to minor changesin the data. In addition, the unsupervised transformation based on PCA or SVD do notuse predictor variable, and hence it is possible that the derived components that serve assurrogate predictors provide no suitable relationship with the target. Moreover, since PCAand SVD utilize the ﬁrst and second moments, they rely heavily on the assumption that theoriginal data have approximate Gaussian distribution, which also limits the eﬀectiveness oftheir usage.Neural embedding is another approach to dimensionality reduction. Instead of computing8nd storing global information about the big dataset as in PCA or SVD, neural embeddinglearning provides us a way to learn iteratively on a supervised task directly. In this paper,we present a simple probabilistic method, Stock2Vec, that learns a dense distributionalrepresentation of stocks in a relatively lower dimensional space, and is able to capture thecorrelations and other more complicated relations between stock prices as well.The idea is to design such a model whose parameters are the embeddings. We call amapping φ : x → R D a D -dimensional embedding of x , and φ ( x ) the embedded representa-tion of x . Suppose the transformation is linear, then the embedding representation can bewritten as z = W x = (cid:88) c w c δ x,c . The linear embedding mapping is equivalent to an extra fully-connected layer of neuralnetwork without nonlinearity on top of the one-hot encoded input. Then each output of theextra linear layer is given as z d = (cid:88) c w c,d δ x,c = w d x, where d stands for the index of embedding layer, and w c,d is the weight connecting the one-hot encoding layer to the embedding layer. The number of dimensions D for the embeddinglayer is a hyperparameter that can be tuned based experimental results, usually boundedbetween 1 and | C | . For our Stock2Vec, as we will introduce in Section 5, there are 503diﬀerent stocks, and we will map them into a 50-dimensional space.The assumption of learning a distribuional representation is that the series that havesimilar or opposite movement tend to correlated with each other, which is consistent with theassumption of CAPM, that the return of a stock is correlated with the market return, whichin turn is determined by all stocks’ returns in the market. We will learn the embeddingsas part of the neural network for the target task of stock prediction. In order to learnthe intrinsic relations among diﬀerent stocks, we train the deep learning model on data ofall symbols over the market, where each datum maintains the features for its particularsymbol’s own properties, include the symbol itself as a categorical feature, with the targetto predict next day’s price. The training objective is to minimize the mean squared error ofthe predicted prices as usual. 9 utputDropoutDense LayerConcatenateDropout + BN + ReLUDense Layer Dropout + BN + ReLUDense LayerCategorical Input Continuous Input Embedding

Normalization

Figure 1: Model Architecture of Stock2Vec.

In contrast to standard fully-connected neural networks in which a separate weight de-scribes an interaction between each input and output pair, CNNs share the parameters formultiple mappings. This is achieved by constructing a collection of kernels (aka ﬁlters) withﬁxed size (which is generally much smaller than that of the input), each consisting of a setof trainable parameters, therefore, the number of parameters is greatly reduced. Multiplekernels are usually trained and used together, each specialized in capturing a speciﬁc featurefrom the data. Note that the so-called convolution operation is technically a cross-correlationin general, which generates linear combinations of a small subset of input, thus focusing onlocal connectivity. In CNNs we generally assume that the input data has some grid-liketopology, and the same characteristic of the pattern would be the same for every location,i.e., yields the property of equivariance to translation [76]. The size of the output wouldthen not only depend on the size of the input, but also on several settings of the kernels: thestride, padding, and the number of kernels. The stride s denotes the interval size betweentwo consecutive convolution centers, and can be thought of as downsampling the output.Whereas with padding, we add values (zeros are used most often) at the boundary of theinput, which is primarily used to control the output size, but as we will show later, it canalso be applied to manage the starting position of the convolution operation on the input.The number of kernels adds another dimensionality on the output, and is often denoted asthe number of channels. 10 .3.1. 1D Convolutional Networks Sequential data often display long-term correlations and can be though of as a 1D gridwith samples taken at regular time intervals. CNNs have shown success in time seriesapplications, in which the 1D convolution is simply an operation of sliding dot productsbetween the input vector and the kernel vector. However, we make several modiﬁcations totraditional 1D convolutions according to recent advances. The detailed building blocks ofour temporal CNN components are illustrated in the following subsections.

As we mentioned above, in a traditional 1D convolutional layer, the ﬁlters are slided acrossthe input series. As a result, the output is related to the connection structure between theinputs before and after it. As shown in Figure 2(a), by applying a ﬁlter of width 2 withoutpadding, the predicted outputs ˆ x , · · · , ˆ x T are generated using the input series x , · · · , x T .The most severe problem within this structure is that we use the future to predict the past,e.g., we have used x to generate ˆ x , which is not appropriate in time series analysis. Toavoid the issue, causal convolutions are used, in which the output x t is convoluted only withinput data which are earlier and up to time t from the previous layer. We achieve this byexplicitly zero padding of length ( kernel size −

1) at the beginning of input series, as a result,we actually have shifted the outputs for a number of time steps. In this way, the predictionat time t is only allowed to connect to historical information, i.e., in a causal structure,thus we have prohibited the future aﬀecting the past and avoided information leakage. Theresulted causal convolutions is visualized in Figure 2(b). Time series often exhibits long-term autoregressive dependencies. With neural networkmodels hence, we require for the receptive ﬁeld of the output neuron to be large. That is, theoutput neuron should be connected with the neurons that receive the input data from manytime steps in the past. A major disadvantage of the aforementioned basic causal convolutionis that in order to have large receptive ﬁeld, either very large sized ﬁlters are required, orthose need to be stacked in many layers. With the former, the merit of CNN architecture islost, and with the latter, the model can become computationally intractable. Following [61],11 a) standard (non-causal) (b) causal

Figure 2: Visualization of a stack of 1D convolutional layers, non-causal v.s. causal. we adopted the dilated convolutions in our model instead, which is deﬁned as F ( s ) = ( x ∗ d f )( s ) = k − (cid:88) i =0 f ( i ) · x s − d × i , where x ∈ R T is a 1-D series input, and f : { , · · · , k − } → N is a ﬁlter of size k , d iscalled the dilation rate, and ( s − d × i ) accounts for the direction of the past. In a dilatedconvolutional layer, ﬁlters are not convoluted with the inputs in a simple sequential manner,but instead skipping a ﬁxed number ( d ) of inputs in between. By increasing the dilationrate multiplicatively as the layer depth (e.g., a common choice is d = 2 j at depth j ), weincrease the receptive ﬁeld exponentially, i.e., there are 2 l − k input in the ﬁrst layer that canaﬀect the output in the l -th hidden layer. Figure 3 compares non-dilated and dilated causalconvolutional layers. In traditional neural networks, each layer feeds into the next. In a network with residualblocks, by utilizing skip connections, a layer may also short-cut to jump over several others.The use of residual network (ResNet) [57] has been proven to be very successful and becomethe standard way of building deep CNNs. The core idea of ResNet is the usage of shortcutconnection which skips one or more layers and directly connects to later layers (which is theso-called identity mapping), in addition to the standard layer stacking connection F . Figure4 illustrates a residual block, which is the basic unit in ResNet. A residual block consists ofthe abovementioned two branches, and its output is then g ( F ( x ) + x ), where x denotes the12 utput Dilation = 1

Hidden LayerHidden LayerHidden LayerInput

Dilation = 2Dilation = 4Dilation = 8

OutputHidden LayerHidden LayerHidden LayerInput (a) Non-dilated

Output

Dilation = 1

Hidden LayerHidden LayerHidden LayerInput

Dilation = 2Dilation = 4Dilation = 8

OutputHidden LayerHidden LayerHidden LayerInput (b) Dilated

Figure 3: Visualization of a stack of causal convolutional layers, non-dilated v.s. dilated. input to the residual block, and g is the activation function.By reusing activation from a previous layer until the adjacent layer learns its weights,CNNs can eﬀectively avoid the problem of vanishing gradients. In our model, we implementeddouble-layer skips. Our overall prediction model is constructed as a hybrid, combining Stock2Vec embeddingapproach with an advanced implementation of TCN, schematically represented on Figure5. Compared with Figure 1, it contains an additional TCN module. However, instead ofproducing the ﬁnal prediction outputs of size 1, we let the TCN module output a vector asa feature map that contains information extracted from the temporal series. As a result, it13 ctivation functionweight layeractivation functionweight layer activation functionweight layeractivation functionweight layer (a) A standard block activation functionweight layeractivation functionweight layer activation functionweight layeractivation functionweight layer (b) A residual block

Figure 4: Comparison between a regular block and a residual block. In the latter, the convolution is short-circuited. adds a new source of features, which can be We concatenated with the learned Stock2Vecfeatures. Note that the TCN module can be replaced by any other architecture that learnstemporal patterns, for example, LSTM-type network. Finally, a series of fully-connectedlayers (referred to as “head layers”) are applied to the combined features producing the ﬁnalprediction output. Implementation details are discussed in Section 5.1.Note that in each TCN block, the convolutional layers use dropout in order to limitthe inﬂuence that earlier data have on learning [77, 78]. It is then followed by a batchnormalization layer [79]. Both dropout and batch normalization provide a regularizationeﬀect that avoids overﬁtting. The most widely used activation function, the rectiﬁed linearunit (ReLU) [80] is used after each layer except for the last one.14 utputDropoutDense LayerConcatenateDropout + BN + ReLUDense LayerTCN Feature Map Continuous Feature MapCategorical Feature MapDropout + BN + ReLUDense Layer Dropout + BN + ReLUDense LayerCategorical Input Continuous Input

Embedding

NormalizationSequence Input D r opou t + B N + R eL UC O N V C O N V D r opou t + B N + R eL UC O N V C O N V TCNBlockTCNBlock

Figure 5: The full model architecture of hybrid TCN-Stock2Vec.

4. Data Speciﬁcation

Technical Indicators Category Description

Moving average convergence or divergence (MACD) Trend Reveals price change in strength, direction and trend durationParabolic Stop And Reverse (PSAR) Trend Indicates whether the current trend is to continue or to reverseBollinger Bands (BB R (cid:13) ) Volatility Forms a range of prices for trading decisionsStochastic Oscillator (SO) Momentum Indicates turning points by comparing the price to its rangeRate Of Change (ROC) Momentum Measures the percent change of the pricesOn-Balance Volume (OBV) Volume Accumulates volume on price direction to conﬁrm price movesForce Index (FI) Volume Measures the amount of strength behind price move Table 1: Description of technical indicators used in this study.

Note that the features can also be split into categorical and continuous. Each of thecategorical features is mapped to dense numeric vectors via embedding, in particular, thevectors embedded from the stock name as a categorical feature are called Stock2Vec. Wescale all continuous features (as well as next day’s price as the target) to between 0 and 1,since it is widely accepted that neural networks are hard to train and are sensitive to inputscale [81, 79], while some alternative approaches, e.g., decision trees, are scale-invariant[82]. It is important to note that we performed scaling separately on each asset, i.e., lineartransformation is performed so that the lowest and highest price for asset A over the trainingperiod is 0 and 1 respectively. Also note scaling statistics are obtained with the training16et only, which prevents leakage of information from the test set, avoiding introduction oflook-ahead bias.As a tentative illustration, Figure 6 shows the most important 20 features for predictingnext day’s stock price, according to the XGBoost model we trained for benchmarking.

F score sector_PUBLIC UTILITIESWeek_34news_buzzDay_15sector_ENERGYYear_2018trend_macd_signalDayofweek_0Week_5momentum_rocindustry_REAL ESTATEvolume_obvvolatility_bbwsentimenttrend_macd_difftrend_macdvolatility_bbmDayofyearnews_volumeAdj Close F e a t u r e s Figure 6: Feature importance plot of XGBoost model.

In our experiments, the data are split into training, validation and test sets. The last126 trading days of data are used as the test set, cover the period from 2019/08/16 to2020/02/18, and include 61000 samples. The rest data are used for training the model, inwhich the last 126 trading days, from 2019/02/15 to 2019/08/15, are used as validation set,while the ﬁrst 499336 samples, cover the period from 2015/01/02 to 2019/02/14, form thetraining set. Table 2 provides a summary of the datasets we used in this research.

Table 2: Dataset summary.Training set Validation set Test setStarting date 2015/01/02 2019/02/15 2019/08/16End date 2019/02/14 2019/08/15 2020/02/18Sample size 499336 61075 61000 . Experimental Results and Discussions In the computational experiments below we compare performance of seven models.. Twomodels are based on time series analysis only (TS-TCN and TS-LSTM), two use staticfeature only (random forest [83] and XGBoost [84]), pure Stock2Vec model and ﬁnally, twoversions of the proposed hybrid model (LSTM-Stock2Vec and TCN-Stock2Vec). This waywe can evaluate the eﬀect of diﬀerent model architectures and data features. Speciﬁcally,we are interested in evaluating whether employing feature embedding leads to improvement(Stock2Vec vs random forest and XGBoost) and whether a further improvement can beachieved by incorporating time-series data in the hybrid models.Random forest and XGBoost are ensemble models that deploy enhanced bagging andgradient boosting, respectively. We pick these two models since both have shown powerfulpredicting ability and achieved state-of-the-art performance in various ﬁelds. Both are tree-based models that are invariant to scales and perform split on one-hot encoded categoricalinputs, which is suitable for comparison with embeddings in our Stock2Vec models. We built100 bagging/boosting trees for these two models. LSTM and TCN models are constructedbased on pure time series data, i.e., the inputs and outputs are single series, without anyother feature as augmented series. In later context, we call these two models TS-LSTMand TS-TCN, respectively. The Stock2Vec model is a fully-connected neural network withembedding layers for all categorical features, it has the exactly same inputs as XGBoost andrandom forest. As we introduced in Section 3.4, our hybrid model combines the Stock2Vecmodel with an extra TCN module to learn the temporal eﬀects. And for comparison purpose,we also evaluated the hybrid model with LSTM as the temporal module. We call them TCN-Stock2Vec and LSTM-Stock2Vec correspondingly.Our deep learning models are implemented in PyTorch [85]. In Stock2Vec, the embeddingsizes are set to be half of the original number of categories, thresholded by 50 (i.e., themaximum dimension of embedding output is 50). These are just heuristics as there is nocommon standard for choosing the embedding sizes. We concatenate the continuous inputwith the outputs from embedding layers, followed by two layers of fully-connected layers,18ith sizes of 1024 and 512, respectively. The dropout rates are set to 0.001 and 0.01 for thetwo hidden layers correspondingly.For the RNN module, we implement two-layer stacked LSTM, i.e., in each LSTM cell(that denotes a single time step), there are two LSTM layers sequentially connected, andeach layer consists of 50 hidden units. We need an extra fully-connected layer to control theoutput size for the temporal module, depending on whether to obtain the ﬁnal prediction asin TS-LSTM (with output size to be 1), or a temporal feature map as in LSTM-Stock2Vec.We set the size of temporal feature map to be 30 in order to compress the information forboth LSTM-Stock2Vec and TCN-Stock2Vec. In TCN, we use another convolutional layer toachieve the same eﬀect. To implement the TCN module, we build a 16-layer dilated causalCNN as the component that focuses on capturing the autoregressive temporal relations fromthe series own history. Each layer contains 16 ﬁlters, and each ﬁlter has a width of 2.Every two consecutive convolutional layers form a residual block after which the previousinputs are added to the ﬂow. The dilation rate increases exponentially along every stackedresidual blocks, i.e., to be 1 , , , , · · · , . − was used to train TS-TCN and TS-LSTM. To trainStock2Vec, we deployed the super-convergence scheme as in [87] and used cyclical learningrate over every 3 epochs, with a maximum value of 10 − . In the two hybrid models, while theweights of the head layers were randomly initialized as usual, we loaded the weights frompre-trained Stock2Vec and TS-TCN/TS-LSTM for the corresponding modules. By doingthis, we have applied transfer learning scheme [88, 89, 90] and wish the transferred moduleshave the ability to eﬀectively process features from the beginning. The head layers weretrained for 2 cycles (each contains 2 epochs) with maximum learning rate of 3 × − whilethe transferred modules were frozen. After this convergence, the entire network was ﬁne-tuned for 10 epochs by standard Adam optimizer with learning rate of 10 − , during which19n early stopping paradigm [91] was applied to retrieve the model with smallest validationerror. We select the hyperparemeters based upon the model performance on the validationset. To evaluate the performance of our forecasting model, three commonly used evaluationcriteria are used in this study: (a) the root mean square error (RMSE), (b) the mean absoluteerror (MAE), (c) the mean absolute percentage error (MAPE), (d) the root mean squarepercentage error (RMSPE): RMSE = (cid:118)(cid:117)(cid:117)(cid:116) H H (cid:88) t =1 ( y t − ˆ y t ) (1)MAE = 1 H H (cid:88) t =1 (cid:12)(cid:12) y t − ˆ y t (cid:12)(cid:12) (2)MAPE = 1 H H (cid:88) t =1 (cid:12)(cid:12)(cid:12) y t − ˆ y t y t (cid:12)(cid:12)(cid:12) ×

100 (3)RMSPE = (cid:118)(cid:117)(cid:117)(cid:116) H H (cid:88) t =1 (cid:12)(cid:12)(cid:12) y t − ˆ y t y t (cid:12)(cid:12)(cid:12) ×

100 (4)where y t is the actual target value for the t -th observation, ˆ y t is the predicted value for thecorresponding target, and H is the forecast horizon.The RMSE is the most popular measure for the error rate of regression models, as n → ∞ ,it converges to the standard deviation of the theoretical prediction error. However, thequadratic error may not be an appropriate evaluation criterion for all prediction problems,especially in the presence of large outliers. In addition, the RMSE depends on scales, andis also sensitive to outliers. The MAE considers the absolute deviation as the loss and isa more “robust” measure for prediction, since the absolute error is more sensitive to smalldeviations and much less sensitive to large ones than the squared error. However, since thetraining process for many learning models are based on squared loss function, the MAEcould be (logically) inconsistent to the model optimization selection criteria. The MAEis also scale-dependent, thus not suitable to compare prediction accuracy across diﬀerentvariables or time ranges. In order to achieve scale independence, the MAPE measures the20rror proportional to the target value, while instead of using absolute values, the RMSPE canbe seen as the root mean squared version of MAPE. The MAPE and RMSPE however, areextremely unstable when the actual value is small (consider the case when the denominator orclose to 0). We will consider all four measures mentioned here to have a more complete viewof the performance of the models considering the limitations of each performance measure.In addtion, we will compare the running time as an additional evaluation criterion. As we introduced in Section 3, the main goal of training Stock2Vec model is to learnthe intrinsic relationships among stocks, where similar stocks are close to each other in theembedding space, so that we can deploy the interactions from cross-sectional data, or morespeciﬁcally, the market information, to make better predictions. To show this is the case,we extract the weights of the embedding layers from the trained Stock2Vec model, map theweights down to two-dimensional space with a manifold by using PCA, and visualize theentities to look at how the embedding spaces look like. Note that besides Stock2Vec, we alsolearned embeddings for other categorical features.Figure 7(a) shows the ﬁrst two principal components of the sectors. Note that herethe ﬁrst two components account for close to 75% of variance. We can generally observethat

Health Care, Technology/Consumer Services and

Finance occupy the opposite cornersof the plot, i.e., represent unique sectors most dissimilar from one another. On the otherhand a collection of more traditional sectors:

Public Utilities, Energy, Consumer Durablesand Non-Durables, Basic Industries generally are grouped closer together. The plot, then,allows for a natural interpretation which is in accordance with our intuition, indicating thatthe learned embedding can be expected to be reasonable.Similarly, from the trained Stock2Vec embeddings, we can obtain a 50-dimensional vectorfor each separate stock. We simialrly visualize the learned Stock2Vec with PCA in Figure8(a), and color each stock by the sector it belongs to. It is important to note that in thiscase, the ﬁrst two components of PCA only account for less than 40% of variance. In otherwords, in this case, the plotted groupings do not represent the learned information as wellas in the previous case. Indeed, when viewed all together, individual assets do not exhibit21 a) Visualization of learned embeddings for sec-tors, projected to 2-D spaces using PCA. number of principle components c u m u l a t i v e e x p l a i n e d v a r i a n c e (b) The cumulative explained variance ratio foreach of the principal components, sorted by thesingular values Figure 7: PCA on the learned embeddings for Sectors readily discernible patterns. This is not necessarily an indicator of deﬁciency of the learnedembedding, and instead suggests that two dimensions are not suﬃcient in this case.However, lots of useful insight can be gained from the distributed representations, forinstance, we could consider the similarities between stocks in the learned vector space is anexample of these beneﬁts as we will show below,To reveal some additional insights from the similarity distance, we sort the pairwise cosinedistance (in the embedded space) between the stocks in the ascending order. In Figure 9a,we plot the ticker “NVDA” (Nvidia) as well as its six nearest neighbors in the embeddingspace. The six companies that are closest to Nvidia, according to the embeddings of learnedweights, are either of the same type (technology companies) with Nvidia: Facebook, Akamai,Cognizant Tech Solutions, Charte; or fast growing during the past ten years (was the casefor Nvidia during the tested period): Monster, Discover Bank. Similarly, we plot the tickerof Wells Fargo (“WFC”) and its 6 nearest neighbors in Figure 9b, all of which are eitherbanks or companies that provide other ﬁnancial services.These observations suggest are another indicator that Stock2Vec can be expected to learnsome useful information, and indeed is capable of coupling together insights from a numberof unrelated sources, in this case, asset sector and it’s performance.The following points must be noted here. First, most of the nearest neighbors are not22 .05 0.00 0.05 0.100.100.050.000.050.10 sector

CAPITAL GOODSTRANSPORTATIONCONSUMER SERVICESTECHNOLOGYHEALTH CAREMISCELLANEOUSCONSUMER NON-DURABLESPUBLIC UTILITIESBASIC INDUSTRIESFINANCECONSUMER DURABLESENERGY (a) Visualization of Stock2Vec (colored by sec-tors), projected to 2-D spaces using PCA. number of principle components c u m u l a t i v e e x p l a i n e d v a r i a n c e (b) The cumulative explained variance ratio foreach of the principal components, sorted by thesingular values Figure 8: PCA on the learned Stock2Vec embeddings the closest points in the two-dimensional plots due to the imprecision of mapping into two-dimensions. Secondly, although the nearest neighbors are meaningful for many companies asthe results either are in the same sector (or industry), or present similar stock price trend inthe last a few years, this insight does not hold true for all companies, or the interpretationcan be hard to discern. For example, the nearest neighbors of Amazon.com (AMZN) includetransportation and energy companies (perhaps due to its heavy reliance on these industriesfor its operation) as well as technology companies. Finally, note that there exist manyother visualization techniques for projection of high dimensional vectors onto 2D spaces thatcould be used here instead of PCA, for example, t-SNE [92] or UMAP [93]. However, neitherprovided visual improvement of the grouping eﬀect over Figure 8(a) and hence we do notpresent those results here.Based on the above observations, Stock2Vec provides several beneﬁts: 1) reducing thedimensionality of categorical feature space, thus the computational performance is improvedwith smaller number of parameter, 2) mapping the sparse high-dimensional one-hot encodedvectors onto dense distributional vector space (with lower dimensionality), as a result, similarcategories are learned to be placed closer to one another in the embedding space, unlike inone-hot encoding vector space where every pairs of categories yield the same distance and areorthogonal to each other. Therefore, the outputs of the embedding layers could be served23 .050 0.025 0.000 0.025 0.050 0.075 0.1000.100.050.000.050.10

AKAMCHTRCTSHDFSFB MNST

NVDA (a) Nearest neighbors of NVDA, which are: 1).’MNST’, Monster, fast growing; 2) ’FB’, Face-book, IT; 3)’DFS’, Discover Bank, fast growing;4) ’AKAM’, Akamai, IT; 5) ’CTSH’, CognizantTech Solutions, IT; 6) ’CHTR’, Charter, com-munication services.

AXP CMAETFC PGRSPGI STT

WFC (b) Nearest neighbors of WFC, which are: 1)’ETFC’: E-Trader, ﬁnancial; 2) ’STT’: StateStreet Corp., bank; 3) ’CMA’: Comerica, bank;4) ’AXP’: Amex, ﬁnancial; 5) ’PGR’: Progres-sive Insurance, ﬁnancial; 6) ’SPGI’, S&P Global,Inc., ﬁnancial & data

Figure 9: Nearest neighbors of Stock2Vec based on similarity between stocks. as more meaningful features, for later layers of neural networks to achieve more eﬀectivelearning. Not only that, the meaningful embeddings can be used for visualization, providesus more interpretability of the deep learning models.

Table 3 and Figure 10 report the overall average (over the individual assets) forecast-ing performance of the out-of-sample period from 2019-08-16 to 2020-02-14. We observethat TS-LSTM and TS-TCN perform worst. We can conlude that this is because thesetwo models only consider the target series and ignore all other features. TCN outperformsLSTM, probably since it is capable of extracting temporal patterns over long history withoutmore eﬀectively gradient vanishing problem. Moreover, the training speed of our 18-layerTCN is about ﬁve times faster than that of LSTM per iteration (aka batch) with GPU, and24he overall training speed (given all overhead included) is also around two to three timesfaster. With learning from all the features, the random forest and XGBoost models performbetter than purely timeseries-based TS-LSTM and TS-TCN, with the XGBoost predictionsare slightly better than that from random forest. This demonstrates the usefulness of ourdata source, especially the external information combined into the inputs. We can then ob-serve that despite having the same input as random forest and XGBoost, the proposed ourStock2Vec model further improves accuracy of the predictions, as the RMSE, MAE, MAPEand RMSPE decrease by about 36%, 38%, 41% and 43% over the XGBoost predictions, re-spectively. This indicates that the use of deep learning models, in particular the Stock2Vecembedding improves the predictions, by more eﬀectively learning from the features over thetree-based ensemble models. With integration of temporal modules, there is again a signif-icant improvement of performance in terms of prediction accuracy. The two hybrid modelsLSTM-Stock2Vec and TCN-Stock2Vec not only learn from features we give explicitly, butalso employ either a hidden state or a convolutional temporal feature mapping to implic-itly learn relevant information from historical data. Our TCN-Stock2Vec achieves the bestperformance across all models, as the RMSE and MAE decreases by about 25%, while theMAPE decreases by 20% and the RMSPE decreases by 14%, comparing with Stock2Vecwithout the temporal module.

Table 3: Average performance comparison.RMSE MAE MAPE(%) RMSPE (%)TS-LSTM 6.35 2.36 1.62 2.07TS-TCN 5.79 2.15 1.50 1.96Random Forest 4.86 1.67 1.31 1.92XGBoost 4.57 1.66 1.28 1.83Stock2Vec 2.94 1.04 0.76 1.05LSTM-Stock2Vec 2.57 0.85 0.68 1.04TCN-Stock2Vec

Figure 10 shows the boxplots of the prediction errors of diﬀerent approaches, from whichwe can see our proposed models achieve smaller absolute prediction errors in terms of notonly the mean also the variance, which indicates more robust forecast. The median absolute25rediction errors (and the interquartile range, i.e., IQR) of our TS-TCN model is around1.01 (1.86), while they are around 0.74 (1.39), 0.45 (0.87), and 0.36 (0.66) for XGBoost,Stock2Vec and TCN-Stock2Vec, respectively.

TS-LSTM TS-TCN RandomForest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2Vec012345 A b s o l u t e E rr o r Figure 10: Boxplot comparison of the absolute prediction errors.

Similarly, we aggregate the metrics on the sector level, and calculate the average per-formance within each sector. We report the RMSE, MAE, MAPE, and RMSPE in TablesA.4, A.5, A.6, and A.7, respectively, from which we can see again our Stock2Vec performsbetter than the two tree-ensemble models for all sectors, and adding the temporal modulewould further improve the forecasting accuracy. TCN-Stock2Vec achieves the best RMSE,MAE, MAPE and RMSPE in all sectors with one exception. Better performance on diﬀerentaggregated levels demonstrates the power of our proposed models.We further showcase the predicted results of 20 symbols to gauge the forecasting perfor-mance of our model under a wide range of industries, volatilities, growth patterns and othergeneral conditions. The stocks have been chosen to evaluate how the proposed methodolo-gies would perform under diﬀerent circumstances. For instance, Amazons (AMZN) stock wasconsistently increasing in price across the analysis period, while the stock price of Verizon(VZ) was very stable, and Chevrons stock (CVX) had both periods of growth and decline. Inaddition, these 20 stocks captured several industries: (a) retail (e.g., Walmart), (b) restau-rants (e.g., McDonalds), (c) ﬁnance and banks (e.g., JPMorgan Chase and Goldman Sachs),26d) energy and oil & gas (e.g., Chevron), (e) techonology (e.g., Facebook), (f) communica-tions (e.g., Verizon), etc. Table B.8, B.9, B.10, B.11 show the out-of-sample RMSE, MAE,MAPE and RMAPE, respectively, from the predictions given by the ﬁve models we discussedabove. Again, Stock2Vec generally performs better than random forest and XGBoost, andthe two hybrid models have quite similar performance which is signiﬁcantly better than thatof others. While there also exist a few stocks on which LSTM-Stock2Vec or even Stock2Vecwithout temporal module produce most accurate predictions, for most of the stocks, TCN-Stock2Vec model performs the best. This demonstrates our models generalize well to mostsymbols.Furthermore, we plot the prediction pattern of the competing models for the abovemen-tioned stocks on the test set in Appendix C, compared to the actual daily prices. Weobserve that the random forest and XGBoost models predict up-and-downs with a lag formost of the time, as the current price plays too much a role as a predictor, probably mainlydue to the correct scaling reason. And there occasionally exist several ﬂat predictions overa period for some stocks (see approximately 2019/09 in Figure C.15, 2020/01 in FigureC.18, and 2019/12 in Figure C.30), which is a typical eﬀect of tree-based methods, indicatesinsuﬃcient splitting and underﬁtting despite so many ensemble trees were used. With en-tity embeddings, our Stock2Vec model can learn from the features much more eﬀectively,its predictions coincide with the actual up-and downs much more accurately. Although itoverestimates the volatility by exaggerating the amplitude as well as the frequency of oscil-lations, the overall prediction errors are getting smaller than the two tree-ensemble models.And our LSTM-Stock2Vec and TCN-Stock2Vec models further beneﬁt from the temporallearning modules by automatically capturing the historical characteristics from time seriesdata, especially the nonlinear trend and complex seasonality that are diﬃcult to be capturedby hand-engineered features such as technical indicators, as well as the common temporalfactors that are shared among all series across the whole market. As a result, with the abilityto extract the autoregressive dependencies over long term both within and across series fromhistorical data, the predictions from these two models alleviate wild oscillations, and aremuch more close to the actual prices, while still correctly predict the up-and-downs for mostof the time with eﬀective learning from input features.27 . Concluded Remarks and Future Work

Our argument that implicitly learning Alphas and Betas upon cross-sectional data fromCAPM perspective is novel, however, it is more of an insight rather than systematic analysis.In this paper, we built a global hybrid deep learning models to forecast the S&P stock prices.We applied the state-of-the-art 1-D dilated causal convolutional layers (TCN) to extract thetemporal features from the historical information, which helps us to reﬁne learning of theAlphas. In order to integrate the Beta information into the model, we learn a single modelthat learns from the data over the whole market, and applied entity embeddings for thecategorical features, in particular, we obtained the Stock2Vec that reveals the relationshipamong stocks in the market, our model can be seen as supervised dimension reduction methodin that point of view. The experimental results show our models improve the forecastingperformance. Although not demonstrated in this work, learning a global model from thedata over the entire market can give us an additional beneﬁt that it can handle the cold-start problem, in which some series may contain very little data (i.e., many missing values),our model has the ability to infer the historical information with the structure learned fromother series as well as the correlation between the cold-start series and the market. It mightnot be accurate, but is much informative than that learned from little data in the singleseries.There are several other directions that we can dive deeper as the future work. First ofall, the stock prices are heavily aﬀected by external information, combining extensive crowd-sourcing, social media and ﬁnancial news data may facilitate a better understanding ofcollective human behavior on the market, which could help the eﬀective decision making forinvestors. These data can be obtained from the internet, we could expand the data source andcombine their inﬂuence in the model as extra features. In addition, although we have shownthat the convolutional layers have several advantages over the most widely used recurrentneural network layers for time series, the temporal learning layers in our model could bereplaced by any other type, for instance, the recent advances of attention models could bea good candidate. Also, more sophisticated models can be adopted to build Stock2Vec, bykeeping the goal in mind that we aim at learning the implicit intrinsic relationship betweenstock series. In addition, learning the relationship over the market would be helpful for us to28uild portfolio aiming at maximizing the investment gain, e.g., by using standard Markowitzportfolio optimization to ﬁnd the positions. In that case, simulation of trading in the marketshould provide us more realistic and robust performance evaluation than those aggregatedlevels we reported above. Liquidity and market impacts can be taken into account in thesimulation, and we can use Proﬁt & Loss (P&L) and the Sharpe ratio as the evaluationmetrics.

References [1] E. F. Fama, The behavior of stock-market prices, The journal of Business 38 (1) (1965)34–105 (1965).[2] W. F. Sharpe, Capital asset prices: A theory of market equilibrium under conditions ofrisk, The journal of ﬁnance 19 (3) (1964) 425–442 (1964).[3] J. Lintner, The valuation of risk assets and the selection of risky investments in stockportfolios and capital budgets, in: Stochastic optimization models in ﬁnance, Elsevier,1975, pp. 131–155 (1975).[4] M. C. Jensen, F. Black, M. S. Scholes, The capital asset pricing model: Some empiricaltests (1972).[5] G. E. Box, G. M. Jenkins, Some recent advances in forecasting and control, Journal ofthe Royal Statistical Society. Series C (Applied Statistics) 17 (2) (1968) 91–109 (1968).[6] R. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting with exponentialsmoothing: the state space approach, Springer Science & Business Media, 2008 (2008).[7] K. Alkhatib, H. Najadat, I. Hmeidi, M. K. A. Shatnawi, Stock price prediction usingk-nearest neighbor (knn) algorithm, International Journal of Business, Humanities andTechnology 3 (3) (2013) 32–44 (2013).[8] Y. Chen, Y. Hao, A feature weighted support vector machine and k-nearest neighboralgorithm for stock market indices prediction, Expert Systems with Applications 80(2017) 340–355 (2017).[9] M. R. Hassan, B. Nath, M. Kirley, A fusion model of hmm, ann and ga for stock marketforecasting, Expert systems with Applications 33 (1) (2007) 171–180 (2007).[10] M. R. Hassan, K. Ramamohanarao, J. Kamruzzaman, M. Rahman, M. M. Hossain, Ahmm-based adaptive fuzzy inference system for stock market forecasting, Neurocom-puting 104 (2013) 10–25 (2013).[11] H. Yang, L. Chan, I. King, Support vector machine regression for volatile stock marketprediction, in: International Conference on Intelligent Data Engineering and AutomatedLearning, Springer, 2002, pp. 391–396 (2002).2912] C.-L. Huang, C.-Y. Tsai, A hybrid sofm-svr with a ﬁlter-based feature selection forstock market forecasting, Expert Systems with Applications 36 (2) (2009) 1529–1539(2009).[13] J.-Z. Wang, J.-J. Wang, Z.-G. Zhang, S.-P. Guo, Forecasting stock indices with backpropagation neural network, Expert Systems with Applications 38 (11) (2011) 14346–14355 (2011).[14] E. Guresen, G. Kayakutlu, T. U. Daim, Using artiﬁcial neural network models in stockmarket index prediction, Expert Systems with Applications 38 (8) (2011) 10389–10397(2011).[15] W. Kristjanpoller, A. Fadic, M. C. Minutolo, Volatility forecast using hybrid neuralnetwork models, Expert Systems with Applications 41 (5) (2014) 2437–2442 (2014).[16] L. Wang, Y. Zeng, T. Chen, Back propagation neural network with adaptive diﬀerentialevolution algorithm for time series forecasting, Expert Systems with Applications 42 (2)(2015) 855–863 (2015).[17] M. G¨o¸cken, M. ¨Oz¸calıcı, A. Boru, A. T. Dosdo˘gru, Integrating metaheuristics andartiﬁcial neural networks for improved stock price prediction, Expert Systems withApplications 44 (2016) 320–331 (2016).[18] J. Patel, S. Shah, P. Thakkar, K. Kotecha, Predicting stock and stock price indexmovement using trend deterministic data preparation and machine learning techniques,Expert systems with applications 42 (1) (2015) 259–268 (2015).[19] A. Booth, E. Gerding, F. Mcgroarty, Automated trading with performance weightedrandom forests and seasonality, Expert Systems with Applications 41 (8) (2014) 3651–3661 (2014).[20] S. Barak, M. Modarres, Developing an approach to evaluate stocks by forecasting ef-fective features with data mining methods, Expert Systems with Applications 42 (3)(2015) 1325–1339 (2015).[21] B. Weng, W. Martinez, Y.-T. Tsai, C. Li, L. Lu, J. R. Barth, F. M. Megahed, Macroe-conomic indicators alone can predict the monthly closing price of major us indices: In-sights from artiﬁcial intelligence, time-series analysis and hybrid models, Applied SoftComputing 71 (2018) 685–697 (2018).[22] J. J. Murphy, Technical analysis of the ﬁnancial markets: A comprehensive guide totrading methods and applications, Penguin, 1999 (1999).[23] E. F. Fama, K. R. French, Common risk factors in the returns on stocks and bonds,Journal of (1993).[24] P. C. Tetlock, M. Saar-Tsechansky, S. Macskassy, More than words: Quantifying lan-guage to measure ﬁrms’ fundamentals, The Journal of Finance 63 (3) (2008) 1437–1467(2008). 3025] Q. Li, Y. Chen, L. L. Jiang, P. Li, H. Chen, A tensor-based information framework forpredicting the stock market, ACM Transactions on Information Systems (TOIS) 34 (2)(2016) 1–30 (2016).[26] B. Weng, L. Lu, X. Wang, F. M. Megahed, W. Martinez, Predicting short-term stockprices using ensemble methods and online data sources, Expert Systems with Applica-tions 112 (2018) 258–273 (2018).[27] J. Bollen, H. Mao, X. Zeng, Twitter mood predicts the stock market, Journal of com-putational science 2 (1) (2011) 1–8 (2011).[28] N. Oliveira, P. Cortez, N. Areal, The impact of microblogging data for stock marketprediction: Using twitter to predict returns, volatility, trading volume and survey sen-timent indices, Expert Systems with Applications 73 (2017) 125–144 (2017).[29] Q. Wang, W. Xu, H. Zheng, Combining the wisdom of crowds and technical analysis forﬁnancial market prediction using deep random subspace ensembles, Neurocomputing299 (2018) 51–61 (2018).[30] Y. LeCun, L. Bottou, Y. Bengio, P. Haﬀner, Gradient-based learning applied to docu-ment recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324 (1998).[31] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classiﬁcation with deep convolu-tional neural networks, in: Advances in neural information processing systems, 2012,pp. 1097–1105 (2012).[32] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale imagerecognition, arXiv preprint arXiv:1409.1556 (2014).[33] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Uniﬁed, real-time object detection, in: Proceedings of the IEEE conference on computer vision andpattern recognition, 2016, pp. 779–788 (2016).[34] T. Mikolov, K. Chen, G. Corrado, J. Dean, Eﬃcient estimation of word representationsin vector space, arXiv preprint arXiv:1301.3781 (2013).[35] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representa-tion, in: Proceedings of the 2014 conference on empirical methods in natural languageprocessing (EMNLP), 2014, pp. 1532–1543 (2014).[36] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectionaltransformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).[37] A. M. Rather, A. Agarwal, V. Sastry, Recurrent neural network and a hybrid model forprediction of stock returns, Expert Systems with Applications 42 (6) (2015) 3234–3241(2015).[38] T. Fischer, C. Krauss, Deep learning with long short-term memory networks for ﬁnancialmarket predictions, European Journal of Operational Research 270 (2) (2018) 654–669(2018). 3139] O. B. Sezer, A. M. Ozbayoglu, Algorithmic ﬁnancial trading with deep convolutionalneural networks: Time series to image conversion approach, Applied Soft Computing70 (2018) 525–538 (2018).[40] G. Hu, Y. Hu, K. Yang, Z. Yu, F. Sung, Z. Zhang, F. Xie, J. Liu, N. Robertson,T. Hospedales, et al., Deep stock representation learning: From candlestick charts toinvestment decisions, in: 2018 IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP), IEEE, 2018, pp. 2706–2710 (2018).[41] O. B. Sezer, A. M. Ozbayoglu, Financial trading model with stock bar chart image timeseries with deep convolutional neural networks, arXiv preprint arXiv:1903.04610 (2019).[42] E. Hoseinzade, S. Haratizadeh, Cnnpred: Cnn-based stock market prediction using adiverse set of variables, Expert Systems with Applications 129 (2019) 273–285 (2019).[43] W. Long, Z. Lu, L. Cui, Deep learning-based feature engineering for stock price move-ment prediction, Knowledge-Based Systems 164 (2019) 163–173 (2019).[44] Z. Jiang, D. Xu, J. Liang, A deep reinforcement learning framework for the ﬁnancialportfolio management problem, arXiv preprint arXiv:1706.10059 (2017).[45] L. Di Persio, O. Honchar, Artiﬁcial neural networks architectures for stock price pre-diction: Comparisons and applications, International journal of circuits, systems andsignal processing 10 (2016) (2016) 403–413 (2016).[46] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks,in: Advances in neural information processing systems, 2014, pp. 3104–3112 (2014).[47] K. Cho, B. Van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,Y. Bengio, Learning phrase representations using rnn encoder-decoder for statisticalmachine translation, arXiv preprint arXiv:1406.1078 (2014).[48] H. Sak, A. W. Senior, F. Beaufays, Long short-term memory recurrent neural networkarchitectures for large scale acoustic modeling (2014).[49] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model,Journal of machine learning research 3 (Feb) (2003) 1137–1155 (2003).[50] D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, Deepar: Probabilistic forecastingwith autoregressive recurrent networks, International Journal of Forecasting (2019).[51] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski,Deep state space models for time series forecasting, in: Advances in neural informationprocessing systems, 2018, pp. 7785–7794 (2018).[52] P. J. Werbos, Backpropagation through time: what it does and how to do it, Proceedingsof the IEEE 78 (10) (1990) 1550–1560 (1990).[53] R. Pascanu, T. Mikolov, Y. Bengio, On the diﬃculty of training recurrent neural net-works, in: International conference on machine learning, 2013, pp. 1310–1318 (2013).3254] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8)(1997) 1735–1780 (1997).[55] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrentneural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).[56] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D.Jackel, Backpropagation applied to handwritten zip code recognition, Neural computa-tion 1 (4) (1989) 541–551 (1989).[57] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,pp. 770–778 (2016).[58] R. K. Srivastava, K. Greﬀ, J. Schmidhuber, Training very deep networks, in: Advancesin neural information processing systems, 2015, pp. 2377–2385 (2015).[59] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolu-tional networks, in: Proceedings of the IEEE conference on computer vision and patternrecognition, 2017, pp. 4700–4708 (2017).[60] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. J. Lang, Phoneme recognitionusing time-delay neural networks, IEEE transactions on acoustics, speech, and signalprocessing 37 (3) (1989) 328–339 (1989).[61] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch-brenner, A. Senior, K. Kavukcuoglu, Wavenet: A generative model for raw audio, arXivpreprint arXiv:1609.03499 (2016).[62] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convo-lutional networks, in: International conference on machine learning, 2017, pp. 933–941(2017).[63] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequenceto sequence learning, arXiv preprint arXiv:1705.03122 (2017).[64] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networksfor action segmentation and detection, in: proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, 2017, pp. 156–165 (2017).[65] M. Binkowski, G. Marti, P. Donnat, Autoregressive convolutional neural networks forasynchronous time series, in: International Conference on Machine Learning, 2018, pp.580–589 (2018).[66] Y. Chen, Y. Kang, Y. Chen, Z. Wang, Probabilistic forecasting with temporal convolu-tional neural network, Neurocomputing (2020).[67] S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional andrecurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271 (2018).3368] Y. Bengio, S. Bengio, Modeling high-dimensional discrete data with multi-layer neuralnetworks, in: Advances in Neural Information Processing Systems, 2000, pp. 400–406(2000).[69] A. Paccanaro, G. E. Hinton, Learning distributed representations of concepts usinglinear relational embedding, IEEE Transactions on Knowledge and Data Engineering13 (2) (2001) 232–244 (2001).[70] G. E. Hinton, et al., Learning distributed representations of concepts, in: Proceedingsof the eighth annual conference of the cognitive science society, Vol. 1, Amherst, MA,1986, p. 12 (1986).[71] O. Barkan, N. Koenigstein, Item2vec: neural item embedding for collaborative ﬁltering,in: 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing(MLSP), IEEE, 2016, pp. 1–6 (2016).[72] E. Choi, M. T. Bahadori, E. Searles, C. Coﬀey, M. Thompson, J. Bost, J. Tejedor-Sojo,J. Sun, Multi-layer representation learning for medical concepts, in: Proceedings ofthe 22nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining, 2016, pp. 1495–1504 (2016).[73] C. Guo, F. Berkhahn, Entity embeddings of categorical variables, arXiv preprintarXiv:1604.06737 (2016).[74] D. L. Minh, A. Sadeghi-Niaraki, H. D. Huy, K. Min, H. Moon, Deep learning approachfor short-term stock trends prediction based on two-stream gated recurrent unit network,Ieee Access 6 (2018) 55392–55404 (2018).[75] Q. Wu, Z. Zhang, A. Pizzoferrato, M. Cucuringu, Z. Liu, A deep learning frameworkfor pricing ﬁnancial instruments, arXiv preprint arXiv:1909.04497 (2019).[76] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016 (2016).[77] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: asimple way to prevent neural networks from overﬁtting, The journal of machine learningresearch 15 (1) (2014) 1929–1958 (2014).[78] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing modeluncertainty in deep learning, in: international conference on machine learning, 2016,pp. 1050–1059 (2016).[79] S. Ioﬀe, C. Szegedy, Batch normalization: Accelerating deep network training by reduc-ing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).[80] V. Nair, G. E. Hinton, Rectiﬁed linear units improve restricted boltzmann machines,in: ICML, 2010 (2010).[81] X. Glorot, Y. Bengio, Understanding the diﬃculty of training deep feedforward neu-ral networks, in: Proceedings of the thirteenth international conference on artiﬁcialintelligence and statistics, 2010, pp. 249–256 (2010).3482] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations,in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 191–198 (2016).[83] L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32 (2001).[84] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings ofthe 22nd acm sigkdd international conference on knowledge discovery and data mining,2016, pp. 785–794 (2016).[85] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,L. Antiga, A. Lerer, Automatic diﬀerentiation in pytorch (2017).[86] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprintarXiv:1412.6980 (2014).[87] L. N. Smith, N. Topin, Super-convergence: Very fast training of neural networks usinglarge learning rates, in: Artiﬁcial Intelligence and Machine Learning for Multi-DomainOperations Applications, Vol. 11006, International Society for Optics and Photonics,2019, p. 1100612 (2019).[88] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on knowledgeand data engineering 22 (10) (2009) 1345–1359 (2009).[89] Y. Bengio, Deep learning of representations for unsupervised and transfer learning, in:Proceedings of ICML workshop on unsupervised and transfer learning, 2012, pp. 17–36(2012).[90] M. Long, H. Zhu, J. Wang, M. I. Jordan, Deep transfer learning with joint adaptationnetworks, in: International conference on machine learning, 2017, pp. 2208–2217 (2017).[91] Y. Yao, L. Rosasco, A. Caponnetto, On early stopping in gradient descent learning,Constructive Approximation 26 (2) (2007) 289–315 (2007).[92] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of machine learningresearch 9 (Nov) (2008) 2579–2605 (2008).[93] L. McInnes, J. Healy, J. Melville, Umap: Uniform manifold approximation and projec-tion for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).

Appendix A. Sector Level Performance Comparison able A.4: Sector level RMSE comparison Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecBasic Industries 1.70 1.61 1.06 0.85

Capital Goods 11.46 10.30 6.25 6.01

Consumer Durables 1.78 1.67 0.99 0.93

Consumer Non-Durables 1.57 1.55 0.98 0.87

Consumer Services 4.75 4.69 3.34 2.76

Energy 1.50 1.44 0.76 0.76

Finance 2.08 2.06 1.39 1.05

HealthCare 3.44 3.37 1.95 1.98

Miscellaneous 8.23 7.96 5.22 4.14

Public Utilities 0.94 0.95 0.64

Transportation 2.00 1.90 1.15 1.03

Table A.5: Sector level MAE comparison

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecBasic Industries 1.06 1.03 0.64 0.52

Capital Goods 3.13 3.07 1.93 1.57

Consumer Durables 1.21 1.18 0.71 0.63

Consumer Non-Durables 0.96 0.93 0.57 0.52

Consumer Services 1.83 1.84 1.19 0.98

Energy 0.98 0.95 0.50 0.51

Finance 1.19 1.17 0.79 0.55

HealthCare 1.99 1.96 1.15 1.10

Miscellaneous 3.18 3.18 2.08 1.56

Public Utilities 0.63 0.64 0.44

Transportation 1.26 1.23 0.74 0.65

Table A.6: Sector level MAPE (%) comparison

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecBasic Industries 1.34 1.31 0.74 0.65

Capital Goods 1.21 1.24 0.76 0.59

Consumer Durables 1.30 1.26 0.73 0.68

Consumer Non-Durables 1.48 1.32 0.76 0.85

Consumer Services 1.24 1.23 0.71 0.66

Energy 2.04 1.88 0.97 1.08

Finance 1.18 1.16 0.74

Miscellaneous 1.23 1.23 0.81 0.66

Public Utilities 0.88 0.90 0.57 0.49

Technology 1.44 1.43 0.83 0.68

Transportation 1.26 1.23 0.71 0.66

Table A.7: Sector level RMSPE (%) comparison

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecBasic Industries 1.86 1.80 0.98 0.91

Capital Goods 1.63 1.65 1.01 0.83

Consumer Durables 1.79 1.68 0.96 0.95

Consumer Non-Durables 2.41 2.02 1.13 1.37

Consumer Services 1.88 1.82 0.99 1.07

Energy 2.89 2.66 1.29 1.51

Finance 1.60 1.56 1.00 0.78

HealthCare 2.17 2.00 1.15 1.24

Miscellaneous 1.66 1.63 1.05 0.95

Public Utilities 1.25 1.23 0.74 0.71

Technology 2.09 2.00 1.13 1.04

Transportation 1.76 1.68 0.98 0.95 ppendix B. Performance comparison of diﬀerent models for the one-day aheadforecasting on diﬀerent symbols Table B.8: RMSE comparison of diﬀerent models for the one-day ahead forecasting on diﬀerent symbols

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecAAPL (Apple) 4.71 4.52 2.86 2.16

AFL (Aﬂac) 0.59 0.62 0.46

AMZN (Amazon.com) 29.91 28.47 23.80 17.73

BA (Boeing) 6.00 6.44 3.98 3.83

CVX (Chevron) 1.42 1.62 1.03 0.75

DAL (Delta Air Lines) 0.79 0.77 0.48 0.40

DIS (Walt Disney) 1.95 1.91 1.17 1.10

FB (Facebook) 3.51 5.54 2.15 1.72

GE (General Electric) 0.39 0.30

GS (Goldman Sachs Group) 3.11 3.00 1.86

JPM (JPMorgan Chase) 1.72 1.63 1.59

MCD (McDonald’s) 2.67 2.50 1.51 1.26

NKE (Nike) 1.27 1.23 1.01

VZ (Verizon Communications) 0.54 0.55 0.46 0.29

WMT (Walmart) 1.34 1.43 1.06 0.55

Table B.9: MAE comparison of diﬀerent models for the one-day ahead forecasting on diﬀerent symbols

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecAAPL (Apple) 3.63 3.56 2.15 1.72

AFL (Aﬂac) 0.45 0.44 0.35

BA (Boeing) 4.59 5.10 2.87 2.87

CVX (Chevron) 1.07 1.22 0.75 0.57

DAL (Delta Air Lines) 0.59 0.58 0.36 0.29

DIS (Walt Disney) 1.37 1.40 0.87 0.77

FB (Facebook) 2.54 3.80 1.65 1.16

GE (General Electric) 0.30 0.22

GS (Goldman Sachs Group) 2.48 2.37 1.31

JPM (JPMorgan Chase) 1.34 1.23 1.17

MCD (McDonald’s) 1.99 1.96 1.26 0.89

NKE (Nike) 0.97 0.98 0.77

VZ (Verizon Communications) 0.43 0.42 0.36 0.22

WMT (Walmart) 1.02 1.10 0.87 able B.10: MAPE (%) comparison of diﬀerent models for the one-day ahead forecasting on diﬀerent symbols Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecAAPL (Apple) 1.43 1.39 0.80 0.68

AFL (Aﬂac) 0.88 0.86 0.66

BA (Boeing) 1.33 1.47 0.82 0.83

CVX (Chevron) 0.94 1.06 0.65 0.50

DAL (Delta Air Lines) 1.03 1.02 0.63 0.51

DIS (Walt Disney) 0.99 1.01 0.61 0.55

FB (Facebook) 1.29 1.92 0.82 0.59

GE (General Electric) 2.99 2.13

GS (Goldman Sachs Group) 1.14 1.09 0.59

JPM (JPMorgan Chase) 1.08 1.00 0.90

VZ (Verizon Communications) 0.73 0.71 0.60 0.37

WMT (Walmart) 0.88 0.94 0.73

Table B.11: RMAPE (%) comparison of diﬀerent models for the one-day ahead forecasting on diﬀerentsymbols

Random Forest XGBoost Stock2Vec LSTM-Stock2Vec TCN-Stock2VecAAPL (Apple) 1.89 1.76 1.04 0.85

AFL (Aﬂac) 1.15 1.19 0.87 0.60

AMZN (Amazon.com) 1.60 1.55 1.28 0.95

BA (Boeing) 1.74 1.85 1.13 1.11

CVX (Chevron) 1.25 1.42 0.88 0.65

DAL (Delta Air Lines) 1.39 1.36 0.83 0.71

DIS (Walt Disney) 1.41 1.38 0.81 0.79

FB (Facebook) 1.77 2.75 1.06 0.85

GE (General Electric) 3.96 2.89

GS (Goldman Sachs Group) 1.44 1.39 0.84

JPM (JPMorgan Chase) 1.40 1.33 1.20

MCD (McDonald’s) 1.30 1.22 0.73 0.62

NKE (Nike) 1.38 1.34 1.03

VZ (Verizon Communications) 0.93 0.93 0.76 0.49

WMT (Walmart) 1.15 1.23 0.89 0.47

Appendix C. Plots of the actual versus predicted prices of diﬀerent models onthe test data Date P r i c e $ Actual v.s. Out-Of-Sample Predictions for AAPL