AutoAlpha: an Efficient Hierarchical Evolutionary Algorithm for Mining Alpha Factors in Quantitative Investment
Tianping Zhang, Yuanqi Li, Yifei Jin, Jian Li
Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
[email protected], {timezerolyq, yfjin1990}@gmail.com

Abstract
The multi-factor model is a widely used model in quantitative investment. The success of a multi-factor model is largely determined by the effectiveness of the alpha factors used in the model. This paper proposes a new evolutionary algorithm called AutoAlpha to automatically generate effective formulaic alphas from massive stock datasets. Specifically, we first discover an inherent pattern of the formulaic alphas and propose a hierarchical structure to quickly locate the promising part of the space for search. Then we propose a new Quality Diversity search based on Principal Component Analysis (PCA-QD) to guide the search away from the well-explored space for more desirable results. Next, we utilize the warm start method and the replacement method to prevent the premature convergence problem. Based on the formulaic alphas we discover, we propose an ensemble learning-to-rank model for generating the portfolio. The backtests in the Chinese stock market and the comparisons with several baselines further demonstrate the effectiveness of AutoAlpha in mining formulaic alphas for quantitative trading.
1 Introduction

Predicting the future returns of stocks is one of the most challenging tasks in quantitative trading. Stock prices are affected by many factors, such as company performances, investors' sentiment, and new government policies. To explain the fluctuation of stock markets, economists have established several theoretical models. Among the most prominent ones, the Capital Asset Pricing Model (CAPM) [Sharpe, 1964] dictates that the expected return of a financial asset is essentially determined by one factor, the market excess return, while the Arbitrage Pricing Theory (APT) [Ross, 2013] models the return as a linear combination of different risk factors. Since then, several multi-factor models have been proposed and numerous such factors (also called abnormal returns) have been found in the economics and finance literature. For example, the celebrated Fama-French Three-Factor Model [Fama and French, 1993] discovered three important factors that can explain almost 90% of the stock returns. In quantitative trading practice, designing novel factors that can explain and predict future asset returns is of vital importance to the profitability of a strategy. Such factors are usually called alpha factors, or alphas in short.

In 2016, the quantitative investment management firm WorldQuant made public 101 formulaic alpha factors in [Kakushadze, 2016]. Since then, many quantitative trading methods have used these formulaic alphas for stock trend prediction [Chen et al., 2019]. A formulaic alpha, as the name suggests, is a kind of alpha that can be presented as a formula or a mathematical expression. For example, the formulaic alpha

    Alpha = (close - open) / (high - low)

is one of the alpha factors from [Kakushadze, 2016] and is calculated using the open price, the close price, the highest price and the lowest price of stocks on each trading day.
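As an illustration, the example alpha above can be evaluated for a cross-section of stocks on one trading day. The tickers and prices below are made-up illustrative data, not from the paper:

```python
# Evaluating the example alpha (close - open) / (high - low)
# for a cross-section of stocks on a single trading day.

def example_alpha(open_, close, high, low):
    """(close - open) / (high - low) for one stock on one day."""
    day_range = high - low
    return (close - open_) / day_range if day_range else 0.0

day = {  # hypothetical OHLC quotes for three stocks
    "AAA": dict(open_=10.0, close=10.5, high=10.8, low=9.9),
    "BBB": dict(open_=20.0, close=19.6, high=20.4, low=19.4),
    "CCC": dict(open_=5.0, close=5.0, high=5.2, low=4.8),
}
values = {s: example_alpha(**p) for s, p in day.items()}
# A higher value means a stronger close relative to the day's trading range.
```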
This alpha formula reflects the momentum effect that has been observed in different markets (see, e.g., [Jegadeesh and Titman, 1993]). For each day, the alpha gives different values for different stocks. The higher the value, the more likely the stock is to have relatively larger returns in the following days.

There are much more complicated formulaic alphas than the one shown above; Figure 2 shows two examples. Traditionally, practitioners study the existing alphas (such as those in Figure 2) to design new alphas. This way of finding good alphas requires tremendous human labor and expertise, which is not realistic for small firms or individual investors. Therefore, there is an urgent need to develop tools for mining new effective alphas from massive stock datasets automatically.

Our goal is to find as many diverse formulaic alphas with desirable performance as possible within limited computational resources. Unlike many optimization and search problems which aim at finding one single desirable solution, we prefer to look for multiple diverse solutions with high performance and low correlation. (Footnote on the Fama-French model: https://en.wikipedia.org/wiki/Fama%E2%80%93French three-factor model)

Figure 1: The framework of our approach. (Factors such as open, close, high, low, vwap and volume, together with operators such as +, -, *, /, min, max, std, mean and tsrank, feed into AutoAlpha, which combines the warm start method, the replacement method, PCA-QD, and a depth-by-depth hierarchy to produce formulaic alphas; these feed a prediction model for stock ranking prediction and backtests.)

Figure 2: Formulaic alphas. (One example shown: (close - open) / (high - low).)

As shown in Figure 3, a formulaic alpha can be expressed as a tree where the leaves correspond to raw data and the inner nodes correspond to various operators. As the discrete search space is very large, it is natural to use genetic algorithms to search for effective alphas in the form of trees. However, as we argue below, this is not straightforward and there are several challenges we need to address.
Challenge 1: Quickly locate the promising search space. The vanilla genetic algorithm is generally inefficient in mining effective formulaic alphas, because effective alphas are sparse in the huge search space. Therefore, how to quickly locate the promising space for search becomes a critical issue.

Challenge 2: Guide the search away from the explored search space. In order to find many diverse and effective formulaic alphas, we need to run the genetic algorithm several times for more results. However, the vanilla genetic algorithm usually converges to the same local minima.

Challenge 3: Prevent the premature convergence problem. The premature convergence problem [Gupta and Ghafir, 2012] arises in genetic algorithms when some type of effective genes dominates the whole population and destroys the diversity in the population. When premature convergence happens, the population gets stuck at a suboptimal state and we can no longer produce offspring with higher performance.

In this paper, we propose a new model called
AutoAlpha to address the above challenges in a unified framework. The technical contributions of this paper can be summarized as follows:
• We discover an inherent pattern of the formulaic alphas. Based on this, we design a hierarchical structure to quickly locate the promising space for search, which addresses Challenge 1.
• For Challenge 2, we propose a new Quality Diversity method based on Principal Component Analysis (PCA-QD) to guide the search away from the explored space for more diverse and effective formulaic alphas.
• We introduce the warm start method at the initialization step and the replacement method during reproduction to prevent the premature convergence problem. This addresses Challenge 3.

Based on the formulaic alphas we discover, we propose an ensemble learning-to-rank model to predict the stock rankings and develop effective stock trading strategies. We perform backtests in the Chinese stock market for different holding periods. The backtesting results show that our method consistently outperforms several baselines and the market index.
2 Problem Formulation

Mining formulaic alphas can be regarded as a feature extraction problem. We start from an initial set of basic factors (e.g., open, close, volume) and operators (e.g., +, -, *, /, min, std), and then build formulaic alphas that satisfy a certain performance measurement criterion, in order to reveal some inherent patterns of the stock market. The basic factors and the operators we use can be found in [Kakushadze, 2016]. The data is public in Chinese stock markets and can be accessed through multiple resources (e.g., http://tushare.org/). In this section, we formalize the problem of mining formulaic alphas.

The return of a stock is generally determined by the close price of the stock and the holding period. For a given stock s, a given date t and a given holding period h, the return of the stock can be calculated as

    r_{t,s}^{(h)} = (close_{t+h,s} - close_{t,s}) / close_{t,s},

where close_{t,s} is the close price of stock s at date t. Assuming that there are n different stocks in the stock pool, the return vector at date t for holding period h is denoted as r_t^{(h)} = (r_{t,1}^{(h)}, ..., r_{t,n}^{(h)}).

For a given formulaic alpha i, its value for the stock s at date t is denoted a_{t,s}^{(i)}. We use the IC (information coefficient) [Grinold and Kahn, 2000] to evaluate the effectiveness of an alpha in the mining process. For a given formulaic alpha i and a given holding period h, the IC can be calculated as the mean of the IC array:

    IC_i = (1/T) * sum_{t=1}^{T} corr(a_t^{(i)}, r_t^{(h)}),    (1)

where a_t^{(i)} = (a_{t,1}^{(i)}, ..., a_{t,n}^{(i)}) is the value vector of formulaic alpha i at date t, corr is the sample Pearson correlation, {corr(a_t^{(i)}, r_t^{(h)})}_{t=1}^{T} is the IC array, and T is the number of trading days in the training period.

Figure 3: The demonstration of crossover.
The leftmost tree shows the tree representation of the formulaic alpha '(close - open) / (high - low)'. The trees on the right are the two children after crossover. op: operator; f: factor.

The IC of an alpha indicates the relevance between the alpha and the stock returns, and should be as high as possible. The similarity between alphas i and j is calculated as

    sim(i, j) = (1/T) * sum_{t=1}^{T} corr(a_t^{(i)}, a_t^{(j)}).

A group of alphas is diverse if the similarity between any two alphas in the group is lower than a given threshold. In the process of mining formulaic alphas, our goal is to find as many diverse formulaic alphas with high IC as possible within limited computational resources.

3 AutoAlpha

AutoAlpha is a framework based on genetic algorithms [Whitley, 1994]. A genetic algorithm is a kind of metaheuristic optimization algorithm which draws inspiration from the biological process that produces new offspring and evolves a species. The vanilla genetic algorithm uses mechanisms such as reproduction, crossover, mutation and selection to give birth to new offspring. In each step of regeneration, it uses a fitness function to select the best-fit individuals for reproduction. After new offspring are produced through crossover and mutation, the least-fit individuals in the population are replaced with the new individuals, realizing elimination through competition.

In order to apply the genetic algorithm to mining formulaic alphas, we first need to define the genetic representation of a formulaic alpha. As shown in the leftmost tree of Figure 3, a formulaic alpha can be represented as a formulaic tree, which makes it easy to carry out crossover and mutation. Figure 3 shows the crossover between two formulaic alphas of depth 2. We perform the crossover within the same depth level to prevent the depth from increasing; that is, the crossover between gene1 and gene3 in Figure 3 is not allowed.
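The depth-preserving crossover described above can be sketched with a nested-tuple tree encoding. The encoding, the function names, and the second parent formula are our own illustrative choices, not from the paper:

```python
import random

# A formulaic alpha as a nested tuple: (operator, subtree, subtree),
# where each subtree is another tuple or a bare factor name.
parent1 = ('/', ('-', 'close', 'open'), ('-', 'high', 'low'))
parent2 = ('*', ('+', 'vwap', 'close'), ('max', 'volume', 'open'))

def depth(tree):
    """Depth of a formula tree; a bare factor has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

def same_level_crossover(t1, t2, rng=random):
    """Swap one randomly chosen direct child of the root between two trees
    of equal depth, so the depth of the offspring never grows."""
    i = rng.randrange(1, len(t1))   # position of a swapped subtree in t1
    j = rng.randrange(1, len(t2))
    c1, c2 = list(t1), list(t2)
    c1[i], c2[j] = t2[j], t1[i]
    return tuple(c1), tuple(c2)

child1, child2 = same_level_crossover(parent1, parent2, random.Random(0))
```

Because only same-level subtrees are exchanged, the two children keep the depth of their parents, which is the invariant the paper's crossover enforces.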
gene2 and gene3 are called root genes, which are directly attached to the root operator, while gene1 is not.
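The IC of Eq. (1) (and, with the returns replaced by a second alpha's values, the similarity) requires nothing beyond a per-day Pearson correlation. A minimal, dependency-free sketch (the toy numbers are illustrative):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def information_coefficient(alpha_values, returns):
    """IC_i = (1/T) * sum_t corr(a_t, r_t); both arguments are lists of T
    per-stock value vectors, one vector per trading day."""
    T = len(alpha_values)
    return sum(pearson(a, r) for a, r in zip(alpha_values, returns)) / T
```

Replacing `returns` with another alpha's daily value vectors yields sim(i, j) with the same code.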
The search space of trees is huge and the effective alphas are very sparse. In our experiments, we find that the standard genetic algorithm (e.g., the one implemented in the Python package 'gplearn') is generally inefficient in initializing the population and exploring the search space for mining formulaic alphas (see the results in Section 4.4). As a remedy, we propose a novel hierarchical search strategy for the genetic algorithm that is significantly more efficient in the initialization and exploration of the search space. (W.l.o.g., we assume that IC >= 0, since a formulaic alpha with negative IC can be multiplied by -1.)

Figure 4: The IC of the root genes of the top 100 discovered formulaic alphas and its distribution, for (a) depth = 2 and (b) depth = 3. For example, in the left figure, the blue histogram is the density plot of the IC of the formulaic alphas of depth 2 (estimated from 20000 randomly generated samples), and the red histogram is the frequency plot of the IC of the root genes of the top 100 discovered formulaic alphas of depth 3.

Motivation
In the early stage of this research, we used vanilla genetic algorithms for mining formulaic alphas. An interesting phenomenon occurred during the experiments: the algorithm usually converges to similar formulaic alphas with 'vwap/close' as a piece of their genes. 'vwap' is the Volume Weighted Average Price of a stock. The gene vwap/close itself is also an effective alpha, which relates to the phenomenon of mean reversion. While vwap/close by itself is an effective formulaic alpha, the formulaic alphas of higher depth which contain vwap/close as a piece of their genes usually combine this mean-reversion information with other information and have higher effectiveness.

Based on this phenomenon, we propose a hypothesis about the inherent pattern of the formulaic alphas: most of the effective alphas have at least one effective root gene. Intuitively, if we want to obtain formulaic alphas of higher depth, we should search near the effective alphas of lower depth. We design an experiment to verify the hypothesis. First, we use the vanilla genetic algorithm for evolving formulaic alphas. Then we select the top 100 discovered formulaic alphas. For each selected alpha, we further collect its root gene with the highest IC. We use the density plot to show that those root genes are effective and are hard to obtain by random generation. The results are shown in Figure 4.

Based on this analysis, if we maintain a population with diverse and effective root genes in it, the crossover operation attempts to search near effective formulas of lower depth, and thus improves the efficiency in obtaining diverse and effective alphas. In order to establish a diverse and effective gene pool for initializing the population, we use the hierarchical structure as the framework for
AutoAlpha and generate alphas from lower to higher depth iteratively.

3.2 Guide the Search
Now the problem turns to how to generate the formulaic alphas of each depth. Unlike many optimization and search problems which aim at finding one single desirable solution, our goal is to look for as many formulaic alphas with high IC and low correlation as possible. If we have already obtained a group of formulaic alphas, we would like to guide the search into unexplored space for more alphas. However, genetic algorithms usually converge into the same local minima. In the genetic algorithms community, one method to tackle this problem is the Quality Diversity (QD) search [Pugh et al., 2016], which seeks to find a maximally diverse collection of individuals where each individual has desirable performance. If we can calculate the similarity (or distance) between individuals, then we can guide the search by changing the objective landscape through a penalty [Lehman and Stanley, 2011]. For example, if the similarity between a new alpha and any of the alphas in the record exceeds a certain threshold, the fitness of the new alpha is penalized to the lowest value (least fit). However, calculating the similarities is slow, especially when the size of the record grows large. Assuming that the size of the record is p, calculating the similarities between a new alpha and the alphas in the record takes O(npT) time, where T is the number of trading days and n is the number of stocks.

In order to reduce the time complexity, we find a simpler way to approximate the similarity and design our PCA-QD search. First, we use the first principal component vector to represent the information of a formulaic alpha. Specifically, the values of the formulaic alpha i can be presented as a sample matrix A^{(i)} = (a_{t,s}^{(i)})_{T x n}, where each column (stock) can be treated as a feature and each row (date) can be treated as a sample. Then we calculate the first principal component of the sample matrix. Next, we use the Pearson correlation between the first principal components of the two alphas, which we call the PCA-similarity, to approximate the similarity between the two alphas. Calculating the first principal component using the power method takes O(nT + n) time. In this way, we reduce the computational complexity of calculating similarities from O(npT) to O(pT) (in our experiments, n is 300, and we have n < T and n < p).

We run experiments, by random sampling, to see how well the approximation works. The threshold for the PCA-similarity is set to 0.9. When the similarity is above the threshold, the Mean Absolute Error (MAE) between the PCA-similarity and the similarity is small, and the same holds when the PCA-similarity is above the threshold. Since we are using the PCA-similarity instead of the similarity to guide the search away from the explored area, we want the approximation to be accurate both when the similarity is high and when the PCA-similarity is high.
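The PCA-similarity can be sketched as follows. This is our own minimal implementation (function names and the fixed iteration count are ours); it uses power iteration on the sample covariance matrix to get the first principal component scores of each T x n value matrix, then correlates the two score vectors:

```python
import numpy as np

def first_pc_scores(A, iters=200):
    """First-principal-component scores of sample matrix A (T days x n
    stocks), computed by the power method on the covariance matrix."""
    X = A - A.mean(axis=0)                # center each stock (column)
    C = X.T @ X / max(len(X) - 1, 1)      # n x n sample covariance
    v = np.ones(C.shape[0])
    for _ in range(iters):                # power iteration
        v = C @ v
        v /= np.linalg.norm(v)
    return X @ v                          # length-T score vector

def pca_similarity(A, B):
    """Approximate sim(i, j) by the absolute Pearson correlation of the
    first principal components of the two alphas' value matrices."""
    return abs(np.corrcoef(first_pc_scores(A), first_pc_scores(B))[0, 1])
```

Only one length-T vector per alpha has to be stored and correlated, which is where the O(pT) screening cost comes from.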
The premature convergence problem has always been a critical issue in genetic algorithms [Gupta and Ghafir, 2012]. If we always replace the least-fit individuals with new individuals of greater fitness, some type of effective genes may come to dominate the whole population and destroy the genetic diversity. When premature convergence happens, the whole population gets stuck early in a bad local minimum. There are many methods addressing the premature convergence problem [Gupta and Ghafir, 2012]; we select two of them in particular for AutoAlpha.

Warm Start. In the initialization step, instead of randomly generating individuals up to the size of the population, we generate K times the population size and then select the top-ranked individuals according to IC to fill the population as the initialization. In this way, we improve the average effectiveness of the initialized individuals and thus accelerate the evolution. The warm start method in AutoAlpha helps us filter out genes that are not useful in constructing formulaic alphas of higher depth.

Replacement Method. In the reproduction step, instead of comparing the new individuals with the least-fit individuals in the population, we compare the new individuals with their own parents. A pair of parents has two offspring after crossover, and the two offspring replace their parents only when the better of their fitnesses is greater than that of their parents. In this way, each gene in the population has only one copy after reproduction, which helps us prevent the premature convergence problem.
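The two mechanisms above can be condensed into a short sketch. The function names and signatures are our own; `generate`, `fitness` and `crossover` stand in for the alpha generator, the IC-based fitness, and the same-depth crossover:

```python
import random

def warm_start(generate, fitness, pop_size, k):
    """Warm start: generate k * pop_size random candidates, keep the
    pop_size fittest as the initial population."""
    candidates = [generate() for _ in range(k * pop_size)]
    return sorted(candidates, key=fitness, reverse=True)[:pop_size]

def parent_offspring_step(p1, p2, crossover, fitness):
    """Parent-offspring replacement: the two children replace their parents
    only if the best child is fitter than the best parent."""
    c1, c2 = crossover(p1, p2)
    if max(fitness(c1), fitness(c2)) > max(fitness(p1), fitness(p2)):
        return c1, c2
    return p1, p2
```

Because a child can only displace its own parents, no single gene can flood the population through repeated wins against unrelated weak individuals.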
The overall procedure of AutoAlpha is as follows:
1. We enumerate the alphas of depth 1 and select the effective ones to set up the gene pool.
2. We use the warm start method and the gene pool to initialize the population of depth 2. Then we use crossover and the replacement method for reproduction. If a new offspring has higher IC than its parents, we calculate its PCA-similarity with the alphas in the record and determine whether to keep it in the population.
3. We repeat step 2 for more alphas of depth 2 and update the gene pool and the record. Then we can start generating alphas of depth 3, and so forth.

We use the formulaic alphas as input features and use LightGBM [Ke et al., 2017] and XGBoost [Chen and Guestrin, 2016] as learning algorithms to learn to rank the stocks. Then we ensemble the results and use them for stock investment in the testing period. Due to the space limit, we omit the details of the training and the choice of hyper-parameters.
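The three steps above can be condensed into a driver loop. This is only a skeleton under our own naming: `evolve` stands in for one complete genetic-algorithm run at a given depth (warm start, crossover, replacement, and PCA-QD filtering), and `Alpha` is a stand-in record type:

```python
from collections import namedtuple

# An alpha here is just an expression string with a precomputed IC.
Alpha = namedtuple("Alpha", "expr ic")

def mine_alphas(enumerate_depth1, evolve, max_depth, ic_threshold):
    """Hierarchical loop of AutoAlpha (sketch): effective depth-1 alphas
    seed the gene pool, and each further depth is evolved from the pool
    accumulated at the depths below."""
    gene_pool = [a for a in enumerate_depth1() if a.ic >= ic_threshold]
    record = list(gene_pool)          # diverse alphas kept so far
    for depth in range(2, max_depth + 1):
        new_alphas = evolve(gene_pool, record, depth)
        record.extend(new_alphas)     # QD filtering is assumed inside evolve
        gene_pool.extend(a for a in new_alphas if a.ic >= ic_threshold)
    return record
```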
4 Experiments

Tasks.
We conduct the experiments independently for the holding periods of 1 day and 5 days. First, we use AutoAlpha to generate a group of effective formulaic alphas for the given holding period. Then we use the ensemble model to learn to rank the stocks. Finally, we use the predicted rankings to construct the stock portfolio each day and provide backtests to evaluate the effectiveness of the alphas we generate.
Datasets.
We use the stocks in the CSI 300 Index (hs300) as the stock pool when mining the formulaic alphas in the training stage. (CSI 300 is a capitalization-weighted stock market index replicating the performance of the top 300 stocks traded on the Shanghai and Shenzhen stock exchanges.) We perform backtests on the stock pool of the CSI 800 Index (zz800). The overall testing period is from … to …, and the training period is from … to …. We measure the performance of the trading strategies by standard metrics in the literature, such as annualized return (AR), annualized volatility, and Sharpe ratio (SR).

Table 1: Performance of the top formulaic alphas for holding periods h = 1 and h = 5 on the hs300 and zz800 stock pools, in both the training and testing periods. The values outside/inside the brackets are the results in the training/testing period.

Table 2: Comparison with gplearn and Alpha101. n: number of diverse formulaic alphas with IC higher than a fixed threshold. avgIC: average IC of the top 50 discovered formulaic alphas.

             h = 1           h = 5
Method       n     avgIC     n     avgIC
Alpha101     0     1.02%     0     1.25%
gplearn      35    6.10%     7     3.35%
AutoAlpha    434   7.50%     415   6.71%
Annualized return (AR). Annualized return calculates the rate of return for a given holding period, scaled to a 12-month period:

    AR = exp{ (T / T') x log(S_{T'} / S_0) } - 1,

where T' is the number of trading days in the period, T is the number of trading days in one year, and S_t denotes the total wealth at the end of the t-th trading day.

Sharpe ratio (SR).
In finance, the Sharpe ratio is a popular measure of the risk-adjusted performance of an investment strategy: SR = (R_p - R_f) / sigma_p, where R_p is the annualized return of the portfolio, R_f is the risk-free rate (we set the risk-free rate to 0 for simplicity, since there is no consensus on its value), and sigma_p is the annualized volatility, i.e., the standard deviation of the strategy's yearly logarithmic returns (see https://en.wikipedia.org/wiki/Volatility (finance)).

To further demonstrate the effectiveness of our method, we compare it with the following baselines:
• Alpha101: We compare the formulaic alphas we automatically discover with the 101 formulaic alphas in [Kakushadze, 2016]. We use the 101 alphas to train the models and perform backtests under the same settings.
• gplearn: 'gplearn' is a popular Python package which implements the standard genetic algorithm. It can also be used to generate formulaic alphas.
• SFM: [Zhang et al., 2017] proposed SFM (the State Frequency Memory recurrent network), an end-to-end deep learning method, and applied it to stock prediction tasks.
• Market: The market is represented by the CSI 800 Index. We show that our method is able to outperform the market significantly. (The CSI 800 Index comprises all the constituents of the CSI 300 Index and the CSI 500 Index. It is designed to reflect the overall performance of large-, middle- and small-capitalization stocks in Chinese stock markets.)
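The AR and SR metrics defined above can be sketched in code. The 252 trading days per year and the use of the mean yearly log return for R_p are our assumptions; the paper leaves these constants unstated:

```python
import math
import statistics

def annualized_return(wealth, trading_days_per_year=252):
    """AR = exp((T / T') * log(S_T' / S_0)) - 1, where T' is the number of
    trading days covered by `wealth` (a series of daily total wealth) and
    T is the assumed number of trading days in a year."""
    t_prime = len(wealth) - 1
    return math.exp(trading_days_per_year / t_prime
                    * math.log(wealth[-1] / wealth[0])) - 1

def sharpe_ratio(yearly_log_returns, risk_free_rate=0.0):
    """SR = (R_p - R_f) / sigma_p, with sigma_p the standard deviation of
    the yearly logarithmic returns; R_f is 0 as in the paper."""
    r_p = statistics.mean(yearly_log_returns)
    sigma_p = statistics.stdev(yearly_log_returns)
    return (r_p - risk_free_rate) / sigma_p
```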
Table 1 shows the IC of the top 5 generated formulaic alphas. We show that the alphas not only generalize to the testing period, but also generalize to the different stock pool of the CSI 800 Index. Table 2 shows the comparison between the alphas generated by AutoAlpha, the alphas generated by gplearn, and the 101 alphas in [Kakushadze, 2016]. We use two metrics to compare the results from both quantitative and qualitative aspects. One is the number of diverse formulaic alphas with IC higher than a fixed threshold. The other is the average IC of the top 50 diverse formulaic alphas. We use stratified backtests to show the alphas' capability in ranking stocks. The stock pool is divided equally into 10 folds each day according to the ratings the alpha gives, with fold 9 containing the stocks with the highest ratings. Then the i-th strategy always buys the stocks in the i-th fold each day. The results are shown in Figure 5.

Figure 5: The results of stratified backtests of the top 4 alphas (h = 1); one panel per alpha (Alpha1 to Alpha4), one curve per fold (fold_0 to fold_9) plus the market.

For each holding period, we collect the top 150 formulaic alphas generated by
AutoAlpha which have the highest IC in the training period for model training. Then we use the trained model to predict the stock rankings in the testing period. We ensure that the overall procedure does not use future information that is unavailable at trading time. We perform backtests using historical data for holding periods h = 1 and h = 5. For a given holding period h, the investor invests in the top-ranked stocks at the close price according to the predicted rankings each day. Then the stocks are held for h days and sold at the close price at the end of the holding period. The transaction cost is set as is customary. The stock pool for the trading simulation is the CSI 800 Index (zz800). The backtesting results are shown in Figures 6 and 7 and Table 3.

Table 3: Results of backtests and comparisons with baselines. Results relative to the market are in the brackets.

             h = 1                          h = 5
Method       AR               SR            AR               SR
market       -4.1%            -0.20         -6.4%            -0.31
SFM          -60.0% (-58.5%)  -2.05 (-2.89) -23.7% (-17.7%)  -1.01 (-1.53)
gplearn      61.8% (68.7%)    2.34 (4.26)   16.9% (22.6%)    0.72 (1.93)
Alpha101     29.5% (35.3%)    1.06 (2.02)   16.3% (23.3%)    0.70 (2.09)
our method   …                …             …                …

Figure 6: The Profit & Loss graph of backtests, comparing our method with the market (zz800), Alpha101, gplearn and SFM: (a) holding period h = 1; (b) h = 1, relative to the market; (c) h = 5; (d) h = 5, relative to the market.

Figure 7: Comparison of backtesting strategies (longing the top 10, top 20, ..., top 50 stocks) against the market (zz800). Holding period h = 1.

Data mining and machine learning techniques have been used extensively to address a variety of problems in finance. In particular, our work is closely related to feature extraction in machine learning (see, e.g., [Guyon et al., 2008]). Feature extraction has been widely adopted in financial prediction. [Zhang et al.
, 2018] extracted features from the semantic information in stock comment text for reliability modeling of stock comments. [Huang and Wu, 2008] proposed a GA-based model to extract wavelet features for stock index prediction. Our work develops a substantially different GA-based algorithm and extracts general formulaic alpha factors.

How to maintain diversity has always been a critical issue in genetic algorithms [Gupta and Ghafir, 2012]. The methods to encourage diversity include Niching [Sareni and Krahenbuhl, 1998], Crowding [Mahfoud, 1992], Sharing [Goldberg et al., 1987], etc. The replacement method we use derives from the Steady State Genetic Algorithms (SSGAs) [Engelbrecht, 2007]. A number of replacement methods have been developed for SSGAs, including the parent-offspring competition we use [Smith and Vavak, 1999]. Quality Diversity (QD) is designed to illuminate diverse individuals with high performance, and many quality diversity algorithms have been designed for different kinds of problems [Pugh et al., 2016]. Common examples include Novelty Search with Local Competition (NSLC) [Lehman and Stanley, 2011] and the Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) [Mouret and Clune, 2015].
In this paper, we propose
AutoAlpha, an efficient algorithm that automatically discovers effective and diverse formulaic alphas for quantitative investment. We first propose a hierarchical structure to quickly locate the promising space for search. Secondly, we propose a new PCA-QD search to guide the search away from the explored areas. Thirdly, we utilize the warm start method and the parent-offspring replacement method to prevent the premature convergence problem. The backtests and comparisons with several baselines show the effectiveness of our method. Finally, we remark that AutoAlpha can also be viewed as an approach to automatic feature extraction. As the market becomes more efficient, discovering alpha factors becomes more difficult, and automatically extracting effective features is a promising future direction for quantitative investment.

References

[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

[Chen et al., 2019] Chi Chen, Li Zhao, Jiang Bian, Chunxiao Xing, and Tie-Yan Liu. Investment behaviors can tell what inside: Exploring stock intrinsic properties for stock trend prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2376–2384. ACM, 2019.

[Engelbrecht, 2007] Andries P Engelbrecht. Computational Intelligence: An Introduction. John Wiley & Sons, 2007.

[Fama and French, 1993] Eugene F Fama and Kenneth R French. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56, 1993.

[Goldberg et al., 1987] David E Goldberg, Jon Richardson, et al. Genetic algorithms with sharing for multimodal function optimization. In Genetic Algorithms and Their Applications: Proceedings of the Second International Conference on Genetic Algorithms, pages 41–49. Hillsdale, NJ: Lawrence Erlbaum, 1987.

[Grinold and Kahn, 2000] Richard C Grinold and Ronald N Kahn. Active Portfolio Management. 2000.

[Gupta and Ghafir, 2012] Deepti Gupta and Shabina Ghafir. An overview of methods maintaining diversity in genetic algorithms. International Journal of Emerging Technology and Advanced Engineering, 2(5):56–60, 2012.

[Guyon et al., 2008] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lofti A Zadeh. Feature Extraction: Foundations and Applications, volume 207. Springer, 2008.

[Huang and Wu, 2008] Shian-Chang Huang and Tung-Kuang Wu. Integrating GA-based time-scale feature extractions with SVMs for stock index forecasting. Expert Systems with Applications, 35(4):2080–2088, 2008.

[Jegadeesh and Titman, 1993] Narasimhan Jegadeesh and Sheridan Titman. Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1):65–91, 1993.

[Kakushadze, 2016] Zura Kakushadze. 101 formulaic alphas. Wilmott, 2016(84):72–81, 2016.

[Ke et al., 2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

[Lehman and Stanley, 2011] Joel Lehman and Kenneth O Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 211–218. ACM, 2011.

[Mahfoud, 1992] Samir W Mahfoud. Crowding and preselection revisited. In PPSN, volume 2, pages 27–36, 1992.

[Mouret and Clune, 2015] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.

[Pugh et al., 2016] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.

[Ross, 2013] Stephen A Ross. The arbitrage theory of capital asset pricing. In Handbook of the Fundamentals of Financial Decision Making: Part I, pages 11–30. World Scientific, 2013.

[Sareni and Krahenbuhl, 1998] Bruno Sareni and Laurent Krahenbuhl. Fitness sharing and niching methods revisited. IEEE Transactions on Evolutionary Computation, 2(3):97–106, 1998.

[Sharpe, 1964] William F Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance, 19(3):425–442, 1964.

[Smith and Vavak, 1999] Jim E Smith and Frantisek Vavak. Replacement strategies in steady state genetic algorithms: dynamic environments. Journal of Computing and Information Technology, 7(1):49–59, 1999.

[Whitley, 1994] Darrell Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65–85, 1994.

[Zhang et al., 2017] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2141–2149. ACM, 2017.

[Zhang et al., 2018] Chen Zhang, Yijun Wang, Can Chen, Changying Du, Hongzhi Yin, and Hao Wang. Stockassistant: A stock AI assistant for reliability modeling of stock comments. In