Alpha Discovery Neural Network, the Special Fountain of Financial Trading Signals
Jie Fang*
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]

Shutao Xia†
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]

Jianwu Lin
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]

Zhikang Xia
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]

Xiang Liu
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]

Yong Jiang
Tsinghua Shenzhen International Graduate School, Shenzhen, China
[email protected]
ABSTRACT
Genetic programming (GP) is the state of the art in the financial automated feature construction task. It employs reverse Polish expressions to represent features and then conducts an evolution process. However, with the development of deep learning, more powerful feature extraction tools have become available. This paper proposes the Alpha Discovery Neural Network (ADNN), a tailored neural network structure that can automatically construct diversified financial technical indicators based on prior knowledge. We make three main contributions. First, we use domain knowledge in quantitative trading to design the sampling rules and the objective function. Second, pre-training and model pruning are used to replace genetic programming, because they conduct a more efficient evolution process. Third, the feature extractors in ADNN can be replaced by different feature extractors, producing different functions. The experimental results show that ADNN can construct more informative and diversified features than GP, which effectively enriches the current factor pool. The fully-connected network and the recurrent network are better at extracting information from financial time series than the convolutional neural network. In real practice, features constructed by ADNN consistently improve a multi-factor strategy's revenue, Sharpe ratio, and max drawdown, compared with investment strategies without these factors.
CCS CONCEPTS
• Applied computing → Economics; • Computing methodologies → Machine learning algorithms.

*First Author
†Corresponding Author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
KDD-MLF-2020, August 24, 2020, San Diego, CA, USA © 2020 Association for Computing Machinery.
KEYWORDS
Automated Financial Feature Construction, Genetic Programming, Deep Neural Network, Quantitative Trading, Evolution Process
ACM Reference Format:
Jie Fang, Shutao Xia, Jianwu Lin, Zhikang Xia, Xiang Liu, and Yong Jiang. 2020. Alpha Discovery Neural Network, the Special Fountain of Financial Trading Signals. In Proceedings of KDD-MLF-2020. ACM, San Diego, CA, USA, 8 pages.
In quantitative trading, predicting the future return of stocks is one of the most important and challenging tasks. Various factors can be used to predict the future return of stocks, such as historical price, volume, and the company's financial data. Normally, researchers call features constructed from price and volume technical indicators, and features constructed from the company's financial data fundamental factors. For this task, various famous multi-factor models have been proposed, and many classical technical and fundamental factors have been constructed. For example, the Fama-French Three-Factor Model [6] leverages three important factors that provide the majority of the information needed to explain stock returns. Later came the Fama-French Five-Factor Model [7] and many other factors constructed by human experts. However, there are two shortcomings. First, it is very expensive to hire human experts. Second, a human cannot construct certain nonlinear features from high-dimensional data. Thus, both academic researchers and institutional investors have paid more and more attention to the automated financial feature construction task [23].

Feature construction is a process that discovers the relationships between features and augments the space of features by inferring or creating new features. In this process, new features can be generated from a combination of existing features [21]. A more straightforward description is that the algorithm uses operators, hyper-parameters, and existing features to construct a new feature. Sometimes feature construction and feature selection are merged into one procedure. These methods consist of the wrapper, filtering, and embedded approaches [3]. Filtering is easy but achieves poor performance; it utilizes only some criteria to choose a feature, and sometimes it can help us to monitor the feature construction process. The wrapper performs well by directly applying the model's results as an objective function; thus, it can treat an individually trained model as a newly constructed feature. However, a considerable amount of computational resources and time is required. The embedded approach uses generalized factors and a pruning technique to select or combine features, which serves as a middle choice between filtering and the wrapper. The most well-known and frequently employed automated feature construction method is Genetic Programming (GP), a kind of wrapper method that uses reverse Polish expressions to represent features and then simulates an evolution process. However, different domains require different objective functions, and the input data's structure may differ [16]; thus, it is very important to do this task within a specific domain. This method has been shown to work well in many industries, such as object detection [17], finance [24], and database management [29]. However, its drawback is that the constructed formulas are very similar and may cause co-linearity. In the financial feature construction task, the benchmark is the genetic programming algorithm, which conducts the evolution process of formulaic factors [1] [27]. WorldQuant [15] made public 101 formulaic alpha factors, which were also constructed using this method. However, this method does not produce diversified features.
The constructed features are similar, and they do not carry a very high level of information.

With the development of deep learning, more and more researchers have begun to use neural networks to extract features from raw data and then add a fully-connected layer to reshape the feature's output. Similarly, a trained model represents a newly constructed feature. Yang Zhong [30] leverages this idea on pattern recognition tasks: he employs a CNN model to construct facial descriptors, and this method produces features that carry considerably more information than past methods. K. Shan [25] conducts experiments on this task and employs a deeper and wider convolutional neural network. Hidasi B. [12] uses a recurrent neural network to pre-locate the feature-rich region and successfully constructs more purified features. In a text classification task, Botsis T. [2] leverages recurrent neural networks to build a rule-based classifier over text data, in which each classifier represents a part of the text. S. Lai [18] proposes a network structure that uses both a recurrent neural network and a convolutional neural network to extract text information. With the help of a neural network's strong fitting ability, we can produce highly informative features by tailoring the network structure to different industries. In financial feature construction tasks, researchers have begun to use neural networks to give an embedding representation of financial time series. More specifically, Fuli Feng [8] leverages LSTM to embed various stock time series, and then uses adversarial training to make a binary classification of a stock's future return. Leonardo [22] adopts a well-designed LSTM to extract features from unstructured news data and forms a continuous embedding; the experimental results show that these unstructured data provide much information and are very helpful for event-driven trading. Zhige Li [?] leverages a Skip-gram architecture to learn stock embeddings inspired by a valuable knowledge repository formed by fund managers' collective investment behaviors; this embedding can better represent the different affinities over technical indicators.

With a similar idea, we use a neural network to give a brief embedding of a long financial time series. This embedding helps to summarize the most important information in high-dimensional data. Different from previous work, we make three main contributions in this paper. First, we strictly design the sampling rules: all the stocks on the same trading day are regarded as one batch, which meets economic principles. Second, we do not simply use the stock return as the objective function; instead, we use the Spearman coefficient between the stocks' returns and the stocks' feature values. We are the first to use this objective function in a neural network, and we also fix its non-differentiability problem. Third, we adopt pre-training and model pruning to add enough diversity to the constructed features, which helps this system produce more diversified features than the benchmark. In this paper, we propose a novel network structure called ADNN, which is tailored for stock time series. This framework can use different deep neural networks to automatically construct financial factors. ADNN outperforms the benchmark on this task from the perspective of all frequently compared indicators. What's more, we find some interesting differences between different feature extractors on this task, and we conduct experiments to understand them.
In quantitative trading, investors commonly construct factors and regard these factors as trading signals. In the automated financial feature construction task, we want an algorithm to automatically construct new factors, i.e., to determine the variables, operators, and hyper-parameters.

The benchmark on this task is GP. It uses a reverse Polish expression to represent the feature's formula and then leverages a binary tree to store its explicit expression. In each training iteration, researchers leverage GP to conduct the evolution process. This evolution process includes merging different formulas, cutting some parts of a formula, changing some parts of a formula, etc. The training process is shown in Figure 1.

Figure 1: GP's evolution process. Each tree represents a formulaic factor, and the right tree survives according to the objective function.

As shown in Figure 1, researchers add diversity to the constructed features by changing a part of the reverse Polish expression. For example, we have a frequently used factor 1, shown in formula 1, and then we make a small change to factor 1 in order to construct a new factor, shown in formula 2.
$$Factor_1 = \frac{\mathit{high\ price} - \mathit{low\ price}}{\mathit{volume}.\mathrm{shift}(\cdot)} \quad (1)$$

$$Factor_2 = \frac{\mathit{high\ price} - \mathit{volume}}{\mathit{volume}.\mathrm{shift}(\cdot)} \quad (2)$$

Factor 1 measures the relative strength of price compared with volume, which has economic meaning. However, factor 2 is totally different from factor 1 and is really hard to explain, because in this algorithm the parent factor and the child factor have little in common. The parent factor has a high IC, but the child factor may not successfully inherit the good characteristics of its parent. As a result, we think GP is not a good method to construct new factors, due to its inefficient evolution process on this task.

The network structure of ADNN is shown in Figure 2. The major contributions of this novel network structure are: 1) ADNN uses the Spearman correlation as its loss function, which mimics human practices in quantitative investment, and its sampling rules also meet economic principles; 2) a meaningful differentiable kernel function is proposed to replace the non-differentiable operator rank(); 3) we use pre-training and pruning to replace GP's evolution process, which is more efficient.

Figure 2: Alpha discovery neural network's structure.

As shown in Figure 2, in each back-propagation, ADNN randomly samples D trading days' data and then calculates the Spearman coefficient of factor value and factor return on each trading day. D should be larger than 3, and taking D trading days' information into account helps the neural network reach a more stable convergence. Quantitative investors care more about the relative strength of each stock on the same trading day than about its absolute strength. Thus, doing the calculation within each trading day and using the Spearman coefficient as the loss function is reasonable.

In each batch, we assume that there are m stocks belonging to this trading day. The input tensor's shape is (m, 5, n), because there are m samples and 5 types of time series: the open price, high price, low price, close price, and volume. Each time series' input length is n. We name the output tensor the factor value, with shape (m, 1). The factor return tensor's shape is (m, 1); it is the revenue that we can earn from this asset over a holding period of length a. Here, we assume that all the feature extractors in Figure 2 are multi-layer perceptrons (MLP), which makes it easy to give a general mathematical description; in the experiment part, we show results based on more complicated and diversified feature extractors. $w_i$ denotes the kernel matrix in the i-th layer, $b_i$ the bias matrix in the i-th layer, $a_i$ the activation function in the i-th layer, and there are p layers in total.

$$x = l_p = a_p(w_p^T l_{p-1} + b_p), \quad l_1 = a_1(w_1^T \mathit{Input} + b_1). \quad (3)$$

$$y = \mathit{Factor\ Return} = \frac{\mathit{close\ price}_{t+a}}{\mathit{close\ price}_t} - 1 \quad (4)$$
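To make the data layout concrete, here is a minimal Python sketch of one training batch under these definitions (formula 4 and the (m, 5, n) input shape). The container `series` and the defaults n = 30 and a = 5 are illustrative assumptions, not the authors' code.

```python
import numpy as np

def make_batch(series: dict, t: int, n: int = 30, a: int = 5):
    """One ADNN batch for trading day t: series[s] is a (5, T) array of
    [open, high, low, close, volume] for stock s."""
    stocks = sorted(series)
    # Input tensor: one (5, n) look-back window per stock -> (m, 5, n).
    x = np.stack([series[s][:, t - n + 1 : t + 1] for s in stocks])
    # Factor return (formula 4): close[t + a] / close[t] - 1 -> (m, 1).
    y = np.array([[series[s][3, t + a] / series[s][3, t] - 1.0]
                  for s in stocks])
    return x, y
```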
We apply the Spearman correlation to calculate the correlation between a factor value and a factor return. This setting helps us obtain powerful features that are suitable for forecasting future stock returns, and it also makes our batch size and sampling rules meaningful: only data belonging to the same trading day should be involved in the same batch. However, the Spearman correlation uses the operator rank() to get rid of some anomalies in financial time series. Rank() is not differentiable, which is not acceptable for a neural network. Thus, we use a differentiable kernel function g(x) to replace rank().

$$g(x) = \frac{1}{1 + \exp\left(-p \cdot \frac{x - \bar{x}}{2\,\mathrm{std}(x)}\right)} \quad (5)$$

As shown in formula 5, g first projects x onto a zero-centered, standardized distribution. Next, it uses a hyper-parameter p to make sure that 2.5%-97.5% of the data lies in the range [mean − std, mean + std]; this gives p = 1.83. For example, for an outlier $x_i = \bar{x} + \mathrm{std}(x)$, we have $\frac{g(x_i) - g(\bar{x})}{g(\bar{x})} \le \frac{x_i - \bar{x}}{\bar{x}}$, which yields $\mathrm{std} \le 0.362\,\bar{x}$. This means that if a distribution's standard deviation is large, larger than $0.362\,\bar{x}$, g(x) shortens the distance between outliers and the central point. If the distribution's standard deviation is very small, g(x) makes it worse; however, even in this case, 95% of the points remain within [mean − std, mean + std], which is acceptable. The final objective function is defined in formula 6, where E(x) represents the expected value of x and $\bar{x}$ the average value of x. In each batch, we calculate the average over q trading days, which makes the optimization process more stable.

$$IC(x, y) = \frac{E\big[(g(x) - \overline{g(x)})(g(y) - \overline{g(y)})\big]}{\sqrt{E\big[(g(x) - \overline{g(x)})^2\big]\,E\big[(g(y) - \overline{g(y)})^2\big]}}, \quad \mathit{Loss} = -\frac{1}{q}\sum_{i=1}^{q} IC(x_i, y_i). \quad (6)$$

Combining model stealing [14] with pruning on the input data can improve the signals' diversity. Model stealing means that if the input x and the output y are known, our network can obtain a suitable parameter w to fit this projection. However, this technique is not always helpful for learning a distribution without tailoring the network structure. If we have a fixed network structure and no idea about the target distribution, techniques such as removing outliers are very helpful for continuous prior knowledge; using a high temperature T also works for discrete prior knowledge.

Pre-training uses $f(x) = a(w^T x + b)$ to embed the input data and then uses this embedded layer to mimic the prior knowledge. The data is embedded by an MLP (several fully-connected layers with tanh and relu activation functions; the number of neurons in each layer should be decided by the length of the input time series), where w is the kernel matrix, b the bias matrix, and a the activation function. In this part, we use the mean squared error as the objective function.

$$\arg\min_{a,b,w} \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2 \quad (7)$$

Almost all technical indicators can easily be learned by an MLP. Here, MSE or MAE cannot represent the real pre-training performance, because all factor values are really small, which makes all MSE values very small. To measure the performance better, the error rate $\frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - f(x_i)}{y_i}\right|$ is used.
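A minimal PyTorch sketch of the objective in formulas 5 and 6, assuming the reconstructed form of g(x) above; the epsilon terms are our additions for numerical stability.

```python
import torch

P = 1.83  # hyper-parameter p from formula 5

def g(x: torch.Tensor) -> torch.Tensor:
    # Formula 5: a sigmoid over standardized values, a differentiable
    # stand-in for rank() that squashes outliers toward the centre.
    z = (x - x.mean()) / (2.0 * x.std() + 1e-8)  # eps is our stability addition
    return torch.sigmoid(P * z)

def ic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Formula 6: Pearson correlation of g(factor value) and g(factor return)
    # over the m stocks of one trading day.
    gx = g(x) - g(x).mean()
    gy = g(y) - g(y).mean()
    return (gx * gy).mean() / torch.sqrt(
        (gx ** 2).mean() * (gy ** 2).mean() + 1e-12)

def adnn_loss(days_x, days_y) -> torch.Tensor:
    # Negative IC averaged over the q sampled trading days.
    return -torch.stack([ic(x, y) for x, y in zip(days_x, days_y)]).mean()
```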
Some classical technical indicators, such as MA, EMA, MACD, RSI, BOLL, and other typical financial descriptors, are selected as prior knowledge for pre-training, as shown in Table 1.

Table 1: Formulas of some classical technical indicators and financial descriptors that serve as prior knowledge for ADNN. Close refers to the stock close price, volume to the stock volume, and AdjClose to the adjusted close price.

Technical Indicator | Mathematical Expression
MA | $MA_N(x_n) = \frac{1}{N}\sum_{k=0}^{N-1} x_{n-k}$
EMA | $EMA_N(x_n) = \frac{2}{N+1}\sum_{k=0}^{\infty}\left(\frac{N-1}{N+1}\right)^k x_{n-k}$
MACD | $MACD = EMA_m(i) - EMA_n(i)$
PVT | $PVT(i) = PVT(i-1) + \mathit{volume}(i) \cdot \big(\mathit{close}(i) - \mathit{close}(i-1)\big)/\mathit{close}(i-1)$
TOP10 | $MA = MA(\mathit{Close}, 10), \quad TOP10 = MA / MA_{top} - 1$
DC | $H = MA(\mathit{High} \times \mathit{AdjClose}/\mathit{Close}, n), \; L = MA(\mathit{Low} \times \mathit{AdjClose}/\mathit{Close}, n), \; M = (H + L)/2, \; DC = \mathit{AdjClose}/M$
BOLL | $\mathit{Stdv} = MStdv(\mathit{Close}, n), \; \mathit{Mean} = MA(\mathit{Close}, n), \; LB = \mathit{Mean} - 2\,\mathit{Stdv}, \; BBL = LB/\mathit{Close}$
MStdv | $MStdv_{n,t} = \mathit{Stdv}(\mathit{Close}_{t-n:t})$

Descriptors with different parameters, such as DC(5) and DC(15), are regarded as different prior knowledge, because they give enough diversity to ADNN. The testing error rate of pre-training the factors shown in Table 1 is … ± …. We think this error rate is acceptable, and it brings enough diversity into the network.

Why is pre-training with prior knowledge needed? Because knowledge is the source of diversity, and we should keep it. According to the concept of multi-task learning, pre-training keeps some part of the domain knowledge in the network. In order to keep more diversity after the pre-training process, pruning is needed. Permanently pruning the useless elements in the embedding matrix helps us keep the diversity and filter out noisy signals from the prior knowledge. A high pruning rate loses too much information, while a low pruning rate can hardly keep the diversity; the ideal pruning rate is about 0.2-0.5, which means 20%-50% of the elements in the mask matrix should be 0. All the settings are the same as [9]; here are more explanations. After embedding the data as f(x), we get its parameter matrix w. Then we create a mask matrix m to prune the parameters. For example, if $w_{ij}$ in the parameter matrix is relatively small, this element is useless, and we set $m_{ij} = 0$ to prune it; if $w_{ij}$ is not useless, we set $m_{ij} = 1$. This method helps us further keep the diversity in the neural network and lets the network focus on improving the current situation. The pruning process is shown in formula 8, where $m * w$ denotes the Hadamard product.

$$f(x) = (m * w)^T x + b \quad (8)$$

After pre-training and pruning the network, we use the objective function shown in formula 6 to train ADNN. We simply reshape the input data into a picture and then use a saliency map to see how the raw data contributes to the final constructed factor. The training process is shown in Figure 3; the y-axis is [open price, high price, low price, close price, volume], and the x-axis is the length of the input time series.

Figure 3: How ADNN leads the prior knowledge to become a better technical indicator. ADNN adjusts the raw data's contribution according to the objective function and improves it compared with its initial state.

Prior knowledge performs like a seed: it is the source of diversity in this system. Although the features constructed by ADNN are not explicit and are hard to explain compared with GP, ADNN has several strong points. First, as mentioned above, ADNN conducts a more efficient evolution process; after warming up the system, we know how much difference has been put into the constructed factors. Second, although we cannot fully understand a factor's formula, we at least know whether the factor is momentum or reverse. For example, $close[t] - close[t-1]$ is a momentum factor, but $close[t-1] - close[t]$ is a reverse factor. Third, unlike traditional factors, whose raw data contribution is discrete, the raw data's contribution in ADNN is continuous, which helps it extract high-dimensional information. Human experts cannot construct factors by extracting high-dimensional data, and this huge difference helps to avoid factor crowding. After all, trading opportunities are limited, and people cannot share the same trading signals [10].
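Returning to formulas 7 and 8, the pre-train-then-prune step can be sketched as follows, assuming a single linear embedding layer for brevity; the layer sizes and the pruning rate of 0.3 (one point inside the 0.2-0.5 range above) are illustrative, not the authors' settings.

```python
import torch
import torch.nn as nn

layer = nn.Linear(150, 1)  # w and b of f(x) = a(w^T x + b); activation omitted
optimizer = torch.optim.Adam(layer.parameters())

def pretrain(x: torch.Tensor, y: torch.Tensor, epochs: int = 100) -> None:
    # Formula 7: fit the embedding to a classical indicator y with MSE.
    for _ in range(epochs):
        optimizer.zero_grad()
        nn.functional.mse_loss(layer(x), y).backward()
        optimizer.step()

def prune_mask(w: torch.Tensor, rate: float = 0.3) -> torch.Tensor:
    # Build the mask m of formula 8: zero out the smallest-magnitude
    # `rate` share of the weights, keep the rest.
    k = max(1, int(rate * w.numel()))
    threshold = w.abs().flatten().kthvalue(k).values
    return (w.abs() > threshold).float()

mask = prune_mask(layer.weight.data)
layer.weight.data *= mask  # f(x) = (m * w)^T x + b, pruned permanently
```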
We conduct experiments on different feature extractors, with two motivations. First, different feature extractors require different input data structures. After performing a literature review and consulting professional experts in the market, we discovered many different ways to organize the input data, but none of them can be proven best; thus, experiments on these structures should be performed. The second motivation is that different extractors have their own strengths and shortcomings. Some of them aim at extracting temporal information while others aim at spatial information; some are designed for long time series, while others are designed for quick training. We think all these differences can make our factor pool more diversified.

We use daily trading data from the Chinese A-share stock market (hereafter, A-share market data), including the daily open price, high price, low price, close price, and trading volume over the past 30 trading days. The raw data is standardized using its time-series mean and standard deviation, both calculated from the training set. We attempt to use these inputs to predict the stock return over the next 5 trading days (using 3-15 trading days is recommended). Moreover, we obey the market policy when forming a trading strategy.

We have done many experiments to select reasonable hyper-parameters. For each experiment, 250 trading days serve as the training set, the following 30 trading days serve as the validation set, and the following 90 trading days serve as the testing set. The constructed factors keep a high IC during the next 90 trading days. Most importantly, we want to stress a counter-intuitive setting: the training period should be no longer than 250 trading days, because financial features are non-stationary. If we request a feature that works well for a very long period of time, we will only find such a feature in an over-fitting situation. Thus, we design a rolling forecast structure that automatically finds powerful features for each trading day. Each automatically constructed feature has its own highlight time on that trading day. What's more, these factors not only work well on this single day; their performance can last several trading days, with moderate decay.

To make a fair comparison, the same setting is deployed for the GP algorithm, whose logic follows related work [27] and [1]; the input data's period and type are also the same. In this paper, we analyze the constructed features' performance from different perspectives. Normally, institutional investors use the Information Coefficient (IC), shown in formula 6, to measure how much information is carried by a feature. For diversity, the cross-entropy is used to measure the distance between two different features' distributions on the same trading day.

$$\mathit{Distance}(f_1, f_2) = -\sum \mathrm{softmax}(f_1)\,\log\,\mathrm{softmax}(f_2) \quad (9)$$

In formula 9, $f_1$ and $f_2$ refer to different features' distributions on the same trading day. The softmax function helps us get rid of the effect of scale without losing rank information. k-means is then used to cluster the matrix of these relative distances between features, and the average distance between the cluster centers represents the diversity of the algorithm on that trading day.
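A sketch of this diversity measurement (formula 9 plus the k-means step), assuming `factors` holds each feature's values over the same m stocks on one trading day; the library calls are standard scipy/scikit-learn, and the paper's exact clustering details may differ.

```python
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans

def distance(f1: np.ndarray, f2: np.ndarray) -> float:
    # Formula 9: cross-entropy between two factors' softmax distributions.
    p, q = softmax(f1), softmax(f2)
    return float(-(p * np.log(q + 1e-12)).sum())

def diversity(factors: np.ndarray, k: int = 3) -> float:
    # Cluster the pairwise distance matrix, then report the mean distance
    # between cluster centres as the day's diversity score.
    n = len(factors)
    d = np.array([[distance(factors[i], factors[j]) for j in range(n)]
                  for i in range(n)])
    centres = KMeans(n_clusters=k, n_init=10).fit(d).cluster_centers_
    return float(np.mean([np.linalg.norm(centres[i] - centres[j])
                          for i in range(k) for j in range(i + 1, k)]))
```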
Besides IC and diversity, the performance of a trading strategy based on the constructed features is also measured, through indicators such as absolute return, max drawdown [5], and Sharpe ratio [26]. All these indicators are really important for assessing a feature's performance.

The network structure shown in Figure 2 can equip ADNN with different deep neural networks. To show the general situation, we equip ADNN with 4 fully-connected layers. Each layer has 128 neurons, a tanh activation function, L2 regularization, and dropout. This general and simple setting is enough to beat GP.
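Read literally, this vanilla extractor could look like the following sketch; the input length of 5 × 30, the dropout rate, and the weight-decay value are our assumptions.

```python
import torch
import torch.nn as nn

class FCNExtractor(nn.Module):
    """Four fully-connected layers of 128 neurons with tanh and dropout,
    mapping an (m, 5, n) window to an (m, 1) factor value."""
    def __init__(self, input_len: int = 5 * 30, dropout: float = 0.5):
        super().__init__()
        layers, d = [], input_len
        for _ in range(4):
            layers += [nn.Linear(d, 128), nn.Tanh(), nn.Dropout(dropout)]
            d = 128
        layers.append(nn.Linear(128, 1))  # output head: the factor value
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.flatten(1))  # (m, 5, n) -> (m, 5n) -> (m, 1)

model = FCNExtractor()
# L2 regularization applied through the optimizer's weight decay.
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)
```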
We put forward three schemes to show how ADNN beats GP: Only GP means using only genetic programming; Only ADNN means using only ADNN to construct factors; GP&ADNN means using GP's values to initialize ADNN and then constructing factors. All the experiments are conducted out of sample, and the results are summarized in Table 2.

Table 2: The performance of different schemes.
Scheme | Information Coefficient | Diversity
Only GP | |
GP&ADNN | |
Only ADNN | |
Only ADNN is better than Only GP, which means ADNN outperforms GP on this task. We also find that GP&ADNN is the best, which means our method can even improve the performance of GP.

In real practice, we leverage the constructed factors to form a multi-factor strategy and compare its performance with GP. The specific strategy setting is the same as in section 3.4, and we have repeated this experiment over different periods of time. The long-term backtest results are shown in Table 3: Only ADNN always performs better than Only GP, which shows that ADNN also beats the SOTA in real practice. Similar to the conclusions above, if we combine the two methods, the combined factors' strategy has the best backtesting performance.

All the results shown above are based on the most basic feature extractors. Will more powerful feature extractors discover more knowledge from financial time series? And what is the suitable input data structure for financial time series?
Table 3: Strategy's absolute return for each scheme.

Time | Only GP | GP&ADNN | Only ADNN | ZZ500
Train: 2015.01-2015.12, Test: 2016.02-2016.03 | +2.59% | +5.74% | +4.52% | +1.67%
Train: 2016.01-2016.12, Test: 2017.02-2017.03 | +5.40% | +10.26% | +8.33% | +2.53%
Train: 2017.01-2017.12, Test: 2018.02-2018.03 | −5.27% | −4.95% | −4.16% | −6.98%
Train: 2018.01-2018.12, Test: 2019.02-2019.03 | +13.00% | +15.62% | +15.41% | +13.75%
All experiments are conducted in the same setting mentioned in section 3.1, and the results are summarized after generating 50 features. For hardware, we use a GPU (NVIDIA 1080Ti) with 20 GB of memory and a CPU (Intel Xeon E5-2680 v2, 10 cores) with 786 GB of memory. Based on this setting, we report the time needed to train 50 neural networks. Moreover, restoring 50 trained networks and obtaining their feature values is substantially faster than computing traditional features, because most traditional features are constructed with complicated explicit formulas that are not suitable for matrix computing. Representing features as neural networks turns feature evaluation into matrix computation, which yields a faster testing speed.
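The testing-speed point can be illustrated with a short sketch: once each feature is a trained network, evaluating 50 of them on a day's cross-section is plain batched tensor computation. The file names and shapes below are placeholders, not the authors' artifacts.

```python
import torch

@torch.no_grad()
def factor_values(models, x):
    # x: (m, 5, n) window for one trading day; output: (m, 50) factor matrix.
    return torch.cat([net(x) for net in models], dim=1)

# Restore 50 trained extractors (file names are placeholders) and score
# a 3000-stock cross-section in one batched pass.
models = [torch.load(f"adnn_{i}.pt").eval() for i in range(50)]
values = factor_values(models, torch.randn(3000, 5, 30))
```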
Table 4: The higher the information coefficient (IC) and diversity, the better the performance. Normally, a good feature's long-term IC should be higher than 0.05, but it cannot be higher than 0.2 in the A-share market.

Type | Network | IC | Diversity | Time
Baseline | GP | 0.072 | 17.532 | 0.215 hours
Vanilla | FCN | 0.124 | 22.151 | 0.785 hours
Spatial | Le-net | 0.123 | 20.194 | 1.365 hours
Spatial | Resnet-50 | 0.108 | 21.403 | 3.450 hours
Temporal | LSTM | 0.170 | 24.469 | 1.300 hours
Temporal | TCN | 0.105 | 21.139 | 2.725 hours
Temporal | Transformer | 0.111 | 25.257 | 4.151 hours

As shown in Table 4, basically all neural networks produce more diversified features than GP, but temporal extractors such as LSTM [13] and the Transformer [28] are especially good at producing diversified features. As for TCN [19], the authors who put forward this network structure prove its ability to capture the temporal rules buried in data. However, there is a notable difference: TCN relies on a convolutional neural network, while LSTM and the Transformer still contain recurrent structures (normally, the Transformer uses a recurrent neural network to embed the input data). The existence of a recurrent network structure may contribute to the difference in diversity. As for Le-net [20] and Resnet [11], they do not provide us with more informative features. It appears that the convolutional network structure is not suitable for extracting information from financial time series.

In real practice, we combine traditional factors with the factors constructed by ADNN to form a quantitative investment strategy. We want to see whether ADNN can enrich the factor pool and improve the traditional multi-factor strategy. We form a frequently used multi-factor strategy to test its performance in the real case. In the training set, the samples whose returns rank in the top 30% on each trading day are labeled 1, and the samples whose returns rank in the bottom 30% on each trading day are labeled 0; the remaining samples in the training set are abandoned [6]. After training these features with XGBoost [4] in binary logistic mode, the prediction result reflects the odds that a stock will show outstanding performance in the following 5 trading days.
We define the 50 features constructed by human experts as PK 50, the features constructed by ADNN as New 50, and the features constructed by both GP and PK as GP-PK 50. In separate experiments, we use XGBoost to pre-train both PK 50 and New 50 on the training set and then use XGBoost's weight scores to choose the 50 most important features as Combined 50. This feature selection process happens only once, and only on the training set.
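A compressed sketch of this labeling-and-training scheme, with synthetic placeholders for the factor matrix and forward returns; the day-wise ranking is simplified to a pooled quantile split here for brevity.

```python
import numpy as np
import xgboost as xgb

feat = np.random.randn(10000, 50)  # placeholder (N, 50) factor matrix
ret = np.random.randn(10000)       # placeholder 5-day forward returns

# Keep only the top and bottom 30% of returns; label them 1 and 0.
top, bottom = np.quantile(ret, 0.7), np.quantile(ret, 0.3)
keep = (ret >= top) | (ret <= bottom)
X, y = feat[keep], (ret[keep] >= top).astype(int)

booster = xgb.train({"objective": "binary:logistic"},
                    xgb.DMatrix(X, label=y))
# Weight scores used to keep the 50 most important features (Combined 50).
importance = booster.get_score(importance_type="weight")
```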
Table 5: Back-testing from Jan 2019 to June 2019. The investment target is all A-shares, except stocks that cannot be traded during this period. The strategy's commission fee is 0.5%. SR refers to Sharpe Ratio, MD to Max Drawdown.

Type | Target Group | Revenue | MD | SR
Baseline | ZZ500 Stock Index | 19.60% | 13.50% | 1.982
Baseline | HS300 Stock Index | 18.60% | 20.30% | 1.606
PK | PK 50 | 24.70% | 18.90% | 2.314
GP | GP 50 | 17.60% | 25.30% | 1.435
GP | GP-PK 50 | 25.40% | 14.80% | 2.672
Vanilla FCN | New 50 | 20.60% | 15.80% | 2.189
Vanilla FCN | Combined 50 | 29.60% | 15.70% | 3.167
Spatial Le-net | New 50 | 18.00% | 16.90% | 1.800
Spatial Le-net | Combined 50 | 27.50% | 16.40% | 2.921
Spatial Resnet-50 | New 50 | 19.90% | 15.40% | 1.962
Spatial Resnet-50 | Combined 50 | 29.30% | 17.20% | 2.787
Temporal LSTM | New 50 | 19.50% | 13.00% | 2.205
Temporal LSTM | Combined 50 | 29.90% | 15.00% | 3.289
Temporal TCN | New 50 | 22.40% | 14.70% | 2.440
Temporal TCN | Combined 50 | 26.90% | 16.80% | 2.729
Temporal Transformer | New 50 | 21.10% | 15.90% | 2.203
Temporal Transformer | Combined 50 | 27.20% | 15.10% | 2.806
As shown in Table 5, HS300 and ZZ500 are important stock indices in the A-share stock market. Revenue represents the annualized excess return obtained by longing the portfolio and shorting the index. The max drawdown is the worst loss of the excess return from its peak. The Sharpe ratio is the annually adjusted excess return divided by a certain level of risk. These indicators show the strategy's performance from the perspective of both return and risk.

For the New 50, although they have a higher IC than the PK 50, their overall performance is not always better than the PK 50. Because the overall performance of a multi-factor strategy is determined by both diversity and information volume (IC), we conjecture that the diversity of PK 50 is remarkably higher than the diversity of New 50, and we ran experiments to verify this conjecture. Thus, although every single new factor is better than the old factors, their overall performance is not always better. ADNN's diversity is larger than GP's, but for further research, making ADNN's diversity even larger is still badly needed. In the real-world use case, all investors have their own reliable and secret factor pools; what they want is for the newly constructed factors to bring marginal benefits. Thus, they will use both new and old factors for trading. That is why Combined 50 represents ADNN's contribution in the real situation. In all cases, Combined 50 is better than PK 50 and GP-PK 50, which means that ADNN not only performs better than GP but also enriches investors' factor pools. We also plot the excess return curves of these strategies, shown in Figure 4.

Figure 4: Different feature extractors' excess return in the testing set, hedged against the HS300 Index.

As shown in Figure 4, all these curves are similar, because they all share some factors from PK 50, and all the schemes powered by ADNN perform better than GP. During this period, they beat the market by more than 10 percent. This is reasonable because all the features are constructed only from price and volume data; they do not contain any fundamental or sentiment data. What's more, we obtain a lot of extra information during the feature construction process, and this information is helpful in the feature selection process. That is the main reason why some wrapper methods do feature selection and construction at the same time. For further research, the current structure could be improved to conduct both feature construction and feature selection simultaneously; this paper directly leverages the reasonable and fair feature selection method above, because it focuses only on the feature construction task.

From the experimental results, we have found that different feature extractors perform differently; in this part, we try to understand this result. We construct 50 features using FCN, 50 features from the network focused on spatial information, and 50 features from the network focused on temporal information. The diversity is then clustered into three groups using k-means, as described in section 3.1. To show the distributions more clearly, we cluster them into three groups: we initialize one of the cluster centers at the origin and then determine the other two cluster centers according to their relative distances and a given direction. This direction only influences the look of the graph, not the space shared between two different clusters. In the following experiments, we plot all the factors' distributions to help us understand the characteristics of different types of feature extractors. Here, we focus on the sparsity and the common area shared by each group, because these two indicators help us understand which feature extractor really contributes and how much special information it has discovered.

Figure 5: Clustering different neural networks, spatial network against temporal network.
As shown in Figure 5 (left), the features constructed by the LSTM have the sparsest distribution, which means that the network structure focusing on temporal information is excellent at extracting information from financial time series. However, a large space is shared by FCN and Le-net; we can regard Le-net's information as a subset of FCN's. Combined with the convolutional neural network's poor performance in sections 3.2 and 3.3, it appears that the convolutional network structure does not contribute substantially to extracting information from financial time series. Figure 5 (right) is an extra experiment whose result supports this conclusion as well.

Figure 6: Clustering different types of neural networks; these are more complicated than the networks used in Figure 5.
From Figure 6, we can draw the same conclusions as above. What's more, the networks used in Figure 5 are simple, while the networks used in Figure 6 are relatively more complicated. Comparing Figure 5 and Figure 6, we find that the complex networks take up larger space, which shows that complicated neural networks bring bigger diversity. Thus, we think the complicated neural network's strong point is that it is less likely to suffer from co-linearity. Commonly, a complex network has a large parameter set, and most of the time its impressive performance comes from that large parameter set. A very complicated neural network is helpful for remembering stationary rules, but in the non-stationary stock market, the rules in the training set may differ from the rules in the testing set. If we rely only on a large parameter set to remember the rules, we may incur over-fitting risk. Moreover, most trades are currently still made by humans, which means the majority of trading signals are still linear. Thus, at present, very complicated neural networks cannot deliver a promising performance in the stock market.

However, as the stock market develops, more and more investors crowd into this game, and we think the factor crowding phenomenon will become more and more pronounced. In addition, as more and more trading is done by algorithms, the non-linear part of the trading signals will grow. Thus, for quantitative trading, we believe that complicated and tailored neural network structures will have their supreme moment in the near future.

In this paper, we put forward the alpha discovery neural network, which automatically constructs financial features from raw data. We designed its network structure according to economic principles and equipped it with different advanced feature extractors. The numerical experiments show that ADNN produces more informative and diversified features than the benchmark on this task. In real practice, ADNN also achieves better revenue, Sharpe ratio, and max drawdown than genetic programming. What's more, different feature extractors play different roles; we have done plenty of experiments to verify this and to understand their functions. For further research, we will leverage this framework to automatically construct useful features based on companies' fundamental and sentiment data.

ACKNOWLEDGMENTS
This work is supported in part by the National Natural Science Foundation of China under Grant 61771273 and the R&D Program of Shenzhen under Grant JCYJ20180508152204044.
REFERENCES
[1] Franklin Allen and Risto Karjalainen. 1999. Using genetic algorithms to find technical trading rules. Journal of Financial Economics 51, 2 (1999), 245–271.
[2] Taxiarchis Botsis, Michael D Nguyen, Emily Jane Woo, Marianthi Markatou, and Robert Ball. 2011. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association 18, 5 (2011), 631–638.
[3] Girish Chandrashekar and Ferat Sahin. 2014. A survey on feature selection methods. Computers & Electrical Engineering 40, 1 (2014), 16–28.
[4] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, and Yuan Tang. 2015. Xgboost: extreme gradient boosting. R package version 0.4-2 (2015), 1–4.
[5] Jaksa Cvitanic and Ioannis Karatzas. 1994. On portfolio optimization under "drawdown" constraints. (1994).
[6] Eugene F. Fama and Kenneth R. French. 1993. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33, 1 (1993), 3–56. https://doi.org/10.1016/0304-405X(93)90023-5
[7] Eugene F. Fama and Kenneth R. French. 2015. A five-factor asset pricing model. Journal of Financial Economics 116, 1 (2015), 1–22.
[8] Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. 2019. Enhancing Stock Movement Prediction with Adversarial Training. Papers 1810.09936, arXiv.org. https://EconPapers.repec.org/RePEc:arx:papers:1810.09936
[9] Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
[10] T. Clifton Green, Ruoyan Huang, Quan Wen, and Dexin Zhou. 2019. Crowdsourced employer reviews and stock returns. Journal of Financial Economics (2019).
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[12] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 241–248.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. 2019. PRADA: protecting against DNN model stealing attacks. IEEE, 512–527.
[15] Zura Kakushadze. 2016. 101 Formulaic Alphas. Papers 1601.00991, arXiv.org. https://ideas.repec.org/p/arx/papers/1601.00991.html
[16] Krzysztof Krawiec. 2002. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3, 4 (2002), 329–343.
[17] Mark J Kwakkenbos, Sean A Diehl, Etsuko Yasuda, Arjen Q Bakker, Caroline MM Van Geelen, Michaël V Lukens, Grada M Van Bleek, Myra N Widjojoatmodjo, Willy MJM Bogers, Henrik Mei, et al. 2010. Generation of stable monoclonal antibody-producing B cell receptor-positive human memory B cells by genetic programming. Nature Medicine 16, 1 (2010), 123.
[18] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[19] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 156–165.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[21] Huan Liu and Hiroshi Motoda. 1998. Feature Extraction, Construction and Selection: A Data Mining Perspective. Vol. 453. Springer Science & Business Media.
[22] Sidra Mehtab and Jaydip Sen. 2019. A Robust Predictive Model for Stock Price Prediction Using Deep Learning and Natural Language Processing. Papers 1912.07700, arXiv.org. https://EconPapers.repec.org/RePEc:arx:papers:1912.07700
[23] Hiroshi Motoda and Huan Liu. 2002. Feature selection, extraction and construction. Communication of IICM (Institute of Information and Computing Machinery, Taiwan) 5 (2002), 67–72.
[24] Cristóbal Romero, Sebastián Ventura, and Paul De Bra. 2004. Knowledge discovery with genetic programming for providing feedback to courseware authors. User Modeling and User-Adapted Interaction 14, 5 (2004), 425–464.
[25] Ke Shan, Junqi Guo, Wenwan You, Di Lu, and Rongfang Bie. 2017. Automatic facial expression recognition based on a deep convolutional-neural-network structure. IEEE, 123–128.
[26] William F Sharpe. 1994. The Sharpe ratio. Journal of Portfolio Management 21, 1 (1994), 49–58.
[27] James D Thomas and Katia Sycara. 1999. The importance of simplicity and validation in genetic programming for data mining in financial data. In Proceedings of the Joint AAAI-1999 and GECCO-1999 Workshop on Data Mining with Evolutionary Algorithms.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[29] Robert J Watts and Alan L Porter. 2007. Mining conference proceedings for corporate technology knowledge management. International Journal of Innovation and Technology Management 4, 02 (2007), 103–119.
[30] Yang Zhong, Josephine Sullivan, and Haibo Li. 2016. Face attribute prediction using off-the-shelf CNN features. In 2016 International Conference on Biometrics (ICB).