Covariance matrix filtering with bootstrapped hierarchies
Christian Bongiorno and Damien Challet
Université Paris-Saclay, CentraleSupélec, Laboratoire de Mathématiques et Informatique pour les Systèmes Complexes, 91190, Gif-sur-Yvette, France
Cleaning covariance matrices is a highly non-trivial problem, yet of central importance in the statistical inference of dependence between objects. We propose here a probabilistic hierarchical clustering method, named Bootstrapped Average Hierarchical Clustering (BAHC), that is particularly effective in the high-dimensional case, i.e., when there are more objects than features. When applied to DNA microarray data, our method yields distinct hierarchical structures that cannot be accounted for by usual hierarchical clustering. We then use global minimum-variance risk management to test our method and find that BAHC leads to significantly smaller realized risk than state-of-the-art linear and nonlinear filtering methods in the high-dimensional case. Spectral decomposition shows that BAHC better captures the persistence of the dependence structure between asset price returns in the calibration and the test periods.
Covariance matrix inference is a cornerstone of the inference of dependence between objects. This kind of matrix suffers, however, from the curse of dimensionality: it becomes very noisy when the number of objects is similar to the number of features. Even worse, unfiltered covariance matrices are pathological in the high-dimensional case, i.e., when the number of objects exceeds the number of features. This case is frequent, e.g., in biological data and in multivariate dynamical systems such as financial markets, in which only the most recent history is likely to be relevant.

Given its importance, covariance matrix filtering has a long history. A popular approach is to obtain filtered covariance matrices from the corresponding correlation matrices. Two types of approaches stand out: (i) spectral methods, e.g., Random Matrix Theory, Rotationally Invariant Estimators [1], and shrinkage [2, 3]; (ii) ansätze for the correlation matrix, e.g., block-diagonal [4] or hierarchical [5].

The usual setting is to have n objects and t features and to compute the correlation matrix between these n objects. Recent results on Rotationally Invariant Estimators [6] propose nonlinear shrinkage methods able to correct the eigenvalue spectrum of covariance matrices optimally: the inversion of the QuEST function [7], the Cross-Validated (CV) eigenvalue shrinkage [8] and the IW-regularization [1], the latter being valid only in the low-dimensional regime q = n/t < 1, i.e., when there are more features than objects. Eigenvector filtering is more complex. However, ansätze for the shape of the true correlation matrix impose constraints on the structure of the eigenvectors and of the eigenvalues. Such an ansatz should be simple enough to clean noise but flexible enough to account for fine relevant details. The popular hierarchical clustering ansatz (HC hereafter) is indeed simple: it assumes that correlations are nested [5], which is equivalent to assuming that dependencies are described by a dendrogram (a tree). In practice, it is hard to find statistically validated hierarchical structures [9] because the fitted hierarchical structure is highly sensitive to small variations of the data.

An obvious problem of HC occurs when the structure is more complex than a tree: for example, the non-diagonal blocks in Figs. 1 and 2 are ignored by a hierarchical ansatz; one needs more than a single hierarchical structure to describe these empirical dependence structures. As a consequence, a non-negligible part of the dependence structure is left out and, in a dynamic context, the stability of a single hierarchical structure is likely to be poor.

Here, we introduce a more flexible hierarchical ansatz able to capture more of the structure of the eigenvectors. The rationale is to compute filtered hierarchical structures of many bootstrapped copies of the initial data, which yields probabilistic hierarchical structures. Such a procedure describes the structure of correlation and covariance matrices better while keeping the robustness of hierarchical clustering. We illustrate the power of our method with data from two relevant fields. First, in bioinformatics, DNA microarray gene-expression dependence in tissues is frequently characterized by correlation matrices. Hierarchical clustering and its variants are commonly used [10, 11], which helps simplify the covariance matrix by linkage averaging [12] (see Fig. 1).
When there are several strong candidates for the hierarchical structure, this approach selects a single one, which neglects possibly crucial information held by alternative structures. Comparing unfiltered correlation matrices with the filtering yielded by hierarchical clustering with average linkage (HCAL) [5] (Fig. 1) makes it clear that (i) hierarchical clustering does capture some of the structure and (ii) a substantial part of the structure is lost (see the bottom plot). This is because hierarchical clustering imposes too strict a structure, which erases an uncontrolled amount of information.

[arXiv: q-fin.RM]

FIG. 1. Correlation matrix from tissue-gene microarray data of patients affected by lung cancer. The upper left plot is the sample correlation matrix; the upper right plot is the result of hierarchical clustering with average-linkage averaging (HCAL). The bottom plot is the difference between the two: it still has evident structure unaccounted for by HCAL.

FIG. 2. Correlation matrix of US equities price returns in the 2008-01-23 to 2008-11-04 period (left plot) and in the 2008-11-05 to 2009-08-24 period (right plot). The elements of both panels are ordered according to the in-sample HCAL dendrogram of the first period.

Another domain in which covariance matrix filtering plays a central role is risk management. Broadly speaking, the problem amounts to minimizing future uncertainty by determining the fraction of resources to allocate to every possible choice. Risk in this particular context is due to fluctuations of the future value of the choices. The usual procedure consists in minimizing a suitable risk measure in the calibration window and hoping that the future, realized, risk will bear some relationship with the
calibrated risk.

The simplest approach consists in defining risk as the variance of the weighted sum of the choices' values and minimizing it. This is known as global minimum-variance portfolios, a subfield of quadratic portfolio optimization, which has a wide range of applications: investment into technologies [13], energy source mixes for countries [14, 15], wind farm locations [16], and capital allocation in finance [17]. We shall focus on financial risk because data are abundant, which makes it possible to compare the out-of-sample performance of filtering methods. In addition, the high-dimensional regime is particularly relevant in finance: there are many assets to choose from, and the speed with which the dependence structure between asset price returns may change calls for as short a calibration period as possible [18].

In an inference or descriptive context such as DNA microarray data analysis, filtering correlation matrices is meant to bring estimated covariance matrices closer to the ground truth. In a dynamical context, especially for non-stationary systems such as financial markets, what matters is the part of the ground truth that most likely persists after the calibration period, i.e., when one uses the allocation weights computed from the filtered covariance matrix. Thus, ideally, the filtered covariance matrix should contain as much of the persistent structure as possible. The nature of the most likely persistent structure is of course unknown from the calibration window only. Figure 2 shows that there are indeed strongly persistent dependence structures of asset price returns between two non-overlapping periods. Similarly to correlation matrices of DNA microarray data, while a pure HC does capture a sizeable part of the useful structure, the non-diagonal correlation pattern blocks, e.g., around (x, y) = (140, …), are not captured.

RESULTS

Microarray DNA
We first apply the BAHC method to DNA microarray data [20], where the objects are n = 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and the features are the expression intensities of t = 271 genes (q = n/t ≈ 1.2).

FIG. 3. Bidimensional t-SNE projection of the cophenetic distance between the dendrograms of 1000 bootstraps of DNA microarray data [20]. Two main clusters emerge, with further subclusters, corresponding to distinct potential hierarchies of dependence that are compatible with the data. The red crosses indicate the centroids of the two largest clusters, whose structural differences appear in the tanglegram of the right plot.

We use spectral clustering [22] to determine sub-clusters of each main cluster. Typically, sub-clusters within either of the main clusters differ at lower levels of branching. In summary, sub-groups of cancers that lie on far branches of the sample dendrograms could be misclassified as uncorrelated despite being possibly much closer in the dendrograms of many bootstraps.
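The bootstrap-and-filter construction used above can be sketched in a few lines. This is a minimal illustration assuming NumPy/SciPy, not the released BAHC package; it relies on the fact that, for average linkage applied to the distance d_ij = 1 − c_ij, the HCAL-filtered correlation of a pair equals one minus its cophenetic distance, since both are the average correlation between the two clusters merged at the pair's lowest common ancestor.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def hcal_filter(C):
    """Average-linkage (HCAL) filtering of a correlation matrix C.

    With d_ij = 1 - c_ij and average linkage, the filtered correlation of
    (i, j) is one minus the cophenetic distance of the pair."""
    d = squareform(1.0 - C, checks=False)   # condensed distance vector
    Z = linkage(d, method="average")        # HCAL dendrogram
    C_f = 1.0 - squareform(cophenet(Z))     # back to a square matrix
    np.fill_diagonal(C_f, 1.0)
    return C_f

def bahc_correlation(R, m=100, rng=None):
    """BAHC sketch: average the HCAL-filtered correlation matrices of m
    feature-wise bootstrap copies of the n x t data matrix R."""
    rng = np.random.default_rng(rng)
    n, t = R.shape
    acc = np.zeros((n, n))
    for _ in range(m):
        s = rng.integers(0, t, size=t)      # resample features with replacement
        acc += hcal_filter(np.corrcoef(R[:, s]))
    return acc / m
```

Collapsing the two HCAL steps (agglomeration, then cross-cluster averaging) into a single cophenetic-matrix computation keeps the sketch short; the released package may be organized differently.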
Risk minimization
Given the n × (t + 1) matrix of the values p_{i,k} of choice i at time k, and the value returns r_{i,k} = p_{i,k}/p_{i,k−1} − 1, one allocates a fraction w_i of the resources to choice i, the i-th component of the weight vector w. The risk is measured by the standard deviation of the portfolio return, denoted by v_P, with

    v_P = √( wᵀ Σ w ),

where Σ is the n × n covariance matrix of the matrix of returns R. If the weights can be negative, the optimal weights are

    w̃ = Σ⁻¹·1 / (1ᵀ·Σ⁻¹·1),

with the condition Σ_i w_i = 1 imposed in order to avoid the trivial solution w = 0. This situation is called the long-short portfolio in the following. In some situations, e.g., when choosing one's portfolio of energies or products, only positive weights are allowed, in which case one has to solve a quadratic programming problem; we refer to this situation as the long-only portfolio.

The realized (out-of-sample) risk is the relevant performance measure. Using the superscript "out", the realized risk is

    v_P^out = √( (w̃)† Σ^out w̃ ),

where the weights w̃ are computed from the in-sample covariance matrix, filtered or not, and X† is the transpose of matrix X.

All the results reported below use the simulation setup described in the Methods section: in short, we perform 10,000 simulations of n = 100 random assets in random periods. We compare the out-of-sample risk computed from BAHC and several other well-known methods: the classic Ledoit and Wolf linear shrinkage method (LW henceforth) [2] and the more recent nonlinear shrinkage approach based on the inversion of the QuEST function (QuEST) [7]. We also include the Cross-Validated eigenvalue shrinkage (CV) [8] and HCAL [5], denoted by <.

Figure 4 shows that BAHC outperforms all the alternative methods for t_in ≲ … (q = n/t ≳ …), which includes all of the high-dimensional regime q > 1. In particular, for the long-only portfolios, the BAHC method reaches the absolute minimum out-of-sample risk over all t_in and all methods for t_in ≈ … (q ≈ …); for q > 1/2, BAHC is better than all the other methods not only with respect to the average realized risk, but also in probability.

Finally, we vary the length of the test window, t_out. We report the probability that the BAHC method outperforms all its competitors as a function of both t_in and t_out in Fig. 5. Our approach achieves a lower realized risk in more than half of the simulations than any other method tested here as soon as t_in < 226 (q > 100/226 ≈ 0.44) for every t_out in the considered range. Remarkably, as t_out increases, the calibration length below which BAHC has better than 50% chances of outperforming all its competitors only weakly increases. We interpret this result by the fact that our method is able to extract the right kind of persistent structure in these particular data, which is confirmed below by spectral analysis. We found similar results for the Hong Kong equity market (see S.I.).

Spectral Properties
In order to understand why and when our method has a better performance than the other methods based on spectral clustering, it is instructive to compare the in- and out-of-sample persistence of the eigenvalues and eigenvectors produced by all the filtering methods considered here. The spectral decomposition of the correlation matrix C is denoted by C = U Λ U†, where U is the n × n matrix formed by the eigenvectors of C and Λ is the diagonal matrix obtained from the corresponding eigenvalues.

[FIG. 4: out-of-sample risk v_P^out versus t_in, long-short portfolios; in-sample estimator, BAHC, and the alternative methods.]

FIG. 5. Fraction of time BAHC yields a smaller realized risk than all the alternative methods. Left plot: portfolios with positive and negative weights; right plot: portfolios with only positive weights. The dotted line corresponds to q = n/t = 1, and the level curve to a 50% probability. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.

Eigenvector stability

A simple way to characterise eigenvector stability is to compare the empirical out-of-sample correlation matrix C^out with the Oracle correlation estimator defined as Ξ_C^in = U^in Z^in U^in†, where Z^in = diag(U^in† C^out U^in) is the Oracle eigenvalue estimator, the idea being that Ξ_C^in = C^out if the in- and out-of-sample eigenvectors coincide (see S.I.). The Oracle estimator for the covariance matrix, denoted by Ξ_Σ^in, is defined in a similar way.

Figure 6 reports the Frobenius distances (see the Methods section) ‖C^out − Ξ_C^in‖_F^C and ‖Σ^out − Ξ_Σ^in‖_F^Σ as a function of t_in for n = 100 assets.

FIG. 6. Frobenius distance between the out-of-sample matrices and the Oracle estimators obtained with the in-sample eigenvectors (in), the in-sample BAHC-filtered eigenvectors (BAHC) and the in-sample HCAL-filtered eigenvectors (<). Upper panels refer to correlation matrices C, lower panels to covariance matrices Σ. The left panels are the Frobenius norm of the difference between the estimator and the out-of-sample realization; the right panels are the fraction of time BAHC outperforms the alternative estimators. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.

Note that the CV, LW and QuEST methods all use the in-sample eigenvectors and thus do not need separate computations. Generally, our method yields more stable correlation and covariance matrices for t_in < 300 (q > 1/3), all the more so as t_in becomes smaller. The same applies to the comparison between BAHC-filtered and empirical covariance matrices, while HCAL, denoted by <, has a better performance in about 25% of the samples, almost independently of t_in. In short, as soon as q > 1/3, BAHC yields the most stable eigenvectors.

Eigenvalue stability

FIG. 7. Residues ε_hi and ε_low as a function of t_in, for the correlation matrix (upper panel) and the covariance matrix (lower panel). 10,000 simulations with random calibration windows and a random selection of n = 100 assets; t_out = 42 days, US equities.

Since both the covariance matrix Σ and the precision matrix Σ⁻¹ are relevant to minimum-variance optimization, we measure two types of residues that focus on large and small eigenvalues, defined as

    ε_hi = √( (1/n) Σ_{i=1}^{n} (λ_i − z_i)² ),    (1)

    ε_low = √( (1/n) Σ_{i=1}^{n} (1/λ_i − 1/z_i)² ),    (2)

where λ_i = (Λ)_{ii} is the i-th eigenvalue of the in-sample estimator, z_i = (Z^in)_{ii} comes from the Oracle estimator computed with the respective filtered eigenvector matrix, and i is the respective rank of these eigenvalues.
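As a concrete sketch of these two residues (assuming NumPy; `spectral_residues` is an illustrative helper name, not the paper's code), with in-sample and Oracle eigenvalues matched by rank:

```python
import numpy as np

def spectral_residues(lam, z):
    """Residues of Eqs. (1)-(2): eps_hi weights errors on the largest
    eigenvalues, eps_low weights errors on the smallest (precision) ones."""
    lam = np.sort(np.asarray(lam, dtype=float))  # match in-sample and
    z = np.sort(np.asarray(z, dtype=float))      # Oracle eigenvalues by rank
    eps_hi = np.sqrt(np.mean((lam - z) ** 2))
    eps_low = np.sqrt(np.mean((1.0 / lam - 1.0 / z) ** 2))
    return eps_hi, eps_low
```

Because ε_low compares inverse eigenvalues, an error on a near-zero eigenvalue dominates it, which is exactly the part of the spectrum that matters for the precision matrix Σ⁻¹.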
The residue measure ε_hi mainly accounts for the discrepancy between the largest eigenvalues, while the residue measure ε_low attributes more weight to the discrepancy between the smallest eigenvalues.

Figure 7 plots the residues of the correlation and covariance matrices, respectively, as a function of t_in. We compare our approach with the sample estimator, the HCAL-filtered matrix, and the Cross-Validated (CV) eigenvalue distribution. The CV method outperforms all the other methods when t_in ≲ … (q ≳ …); BAHC remains close to it in this regime.

Filtered correlation and covariance matrices

The ultimate test is of course to compare filtered in-sample matrices with out-of-sample matrices. Figure 8 reports the Frobenius distance between the filtered in-sample and the out-of-sample correlation and covariance matrices for all the tested methods. Expectedly, BAHC outperforms all the other methods for t_in ≲ ….

[FIG. 8: Frobenius distance ‖C_filtered^in − C^out‖_F^C versus t_in for all tested methods. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.]

In addition, the distance obtained with BAHC is lower than that obtained with HCAL, which shows that the BAHC method outperforms HCAL filtering for every t_in.

I. DISCUSSION

Filtering covariance and correlation matrices requires taking care of O(n²) coefficients. Focusing on O(n) variables, for example by tweaking the eigenvalues or using a single hierarchical ansatz, works to some extent. Making further progress requires filtering more variables, if possible while keeping an O(n) ansatz. This is what the BAHC method that we introduce achieves: by using m bootstraps and applying an O(n) structure to each of them, BAHC allows some additional flexibility, while keeping the overall structure simple.

Our method both filters out estimation noise and improves the stability of the eigenvectors in a dynamical context. Indeed, the spectral decomposition of BAHC-filtered correlation matrices is close to the optimal CV method with respect to the eigenvalue distribution.
Furthermore, in the dynamical context investigated here, the eigenvectors produced by our method have a higher overlap with the out-of-sample ones than the unfiltered in-sample eigenvectors for reasonably small t/n, i.e., for large q = n/t. This is why our method leads to better minimum-variance portfolios than all the competing filtering methods when the calibration window is small. In particular, if no short selling is allowed, our approach produces, on average, the lowest-risk portfolio.

Future work is needed to better characterize the average dependence structure produced by BAHC, from both theoretical and empirical points of view. In addition, BAHC may still be too strict in some cases and thus leave out valuable information; hence, further refinements of the ansatz will need to be investigated.

MATERIALS AND METHODS

Datasets description

We consider the daily close-to-close returns of US equities, adjusted for dividends, splits, and other corporate events. More precisely, the dataset consists of large-capitalization stocks, from 1992-02-03 to 2018-06-29. The number of stocks with data varies over time: it ranges from 399 on 1992-02-06 to 723 on 2018-06-29 and is roughly constant from 2008 onwards. The list of tickers is reported in the S.I.

DNA microarray data [20] can be downloaded from [23]. The dataset consists of the gene expression intensities of 327 tissues of patients affected by pediatric acute lymphoblastic leukemia, restricted to a subset of 271 genes.

Numerical simulations with financial data

All the simulations are carried out in the same way: each point of each plot is an average over 10,000 simulations, each of which includes an in-sample window of length t_in and an out-of-sample window of length t_out = 42 days (about two trading months), unless otherwise specified; each simulation starts from a random day uniformly chosen in the available dataset.
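The risk computation performed in each simulation can be sketched as follows. This is a minimal sketch assuming NumPy; `min_variance_weights` and `realized_risk` are illustrative helper names implementing the long-short formulas of the Risk minimization section, not the paper's code.

```python
import numpy as np

def min_variance_weights(sigma):
    """Long-short global minimum-variance solution: w proportional to
    Sigma^{-1} 1, normalized so that the weights sum to one."""
    ones = np.ones(sigma.shape[0])
    x = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without explicit inversion
    return x / (ones @ x)

def realized_risk(w, sigma_out):
    """Out-of-sample portfolio volatility sqrt(w^T Sigma_out w)."""
    return float(np.sqrt(w @ sigma_out @ w))
```

In a simulation, `sigma` would be the (possibly BAHC-filtered) in-sample covariance matrix and `sigma_out` the covariance matrix estimated on the test window; solving the linear system instead of inverting Σ is the standard numerically stabler choice.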
To have meaningful in- and out-of-sample windows given the maximum t_in considered, the first day of the out-of-sample window must be after 2000-01-01; each simulation selects n = 100 assets at random among the assets with no missing value in both the in- and out-of-sample windows.

BAHC algorithm

Given a matrix R ∈ ℝ^{n×t}, our method prescribes to create a set of m bootstrap (feature-wise) copies of R, denoted by {R^(1), R^(2), …, R^(m)}. A single bootstrap copy R^(b) ∈ ℝ^{n×t} of the data matrix has elements r^(b)_{ij} = r_{i s^(b)_j}, where s^(b) is a vector of dimension t obtained by random sampling with replacement of the elements of the vector {1, 2, …, t}. The vectors s^(b), b = 1, …, m, are independently sampled.

The Pearson correlation matrix of each bootstrapped data matrix R^(b) is then computed and denoted by C^(b); in turn, the latter is filtered with the hierarchical clustering average linkage (HCAL) procedure proposed in [9], which yields C^(b)_<. In short, HCAL uses two ingredients: the distance D = 1 − C to agglomerate clusters in a hierarchical way, and the averaging of the correlations between clusters (see the S.I. and [9] for more details). Finally, the filtered correlation matrix C_BAHC is the average of the HCAL-filtered matrices C^(b)_<:

    C_BAHC = (1/m) Σ_{b=1}^{m} C^(b)_<.    (3)

To build a BAHC-filtered covariance matrix, we estimate the variance of each bootstrapped row r^(b)_i, denoted by σ^(b)_{ii}, and compute the HCAL-filtered covariance matrices Σ^(b)_< whose elements are defined from the σ^(b)_{ii} and C^(b)_< (see the S.I.); the BAHC-filtered covariance matrix is the average of the Σ^(b)_< over the m bootstraps.

We use rescaled Frobenius norms to account for the fact that the number of assets in our dataset depends on time:

    ‖X‖_F^Σ = √( Σ_{i,j=1}^{n} x_{ij}² / n² ).    (4)

In addition, the CV, LW and QuEST methods do not guarantee the identity on the diagonal of the filtered correlation matrices; therefore, contrarily to BAHC, we do not include the diagonal elements in the metric and thus define

    ‖X‖_F^C = √( Σ_{i>j} x_{ij}² / (n(n−1)/2) ).    (5)

We found that the performance of the CV, LW and QuEST-based correlation estimators is slightly improved by replacing c_ij with c_ij/√(c_ii c_jj), which also ensures that the diagonal elements equal one; we have therefore used this modification in our analysis.

Source code

We have written a BAHC package for both R and Python, available from CRAN and PyPI, respectively.

[1] J. Bun, J.-P. Bouchaud, and M. Potters, Physics Reports, 1 (2017).
[2] O. Ledoit and M. Wolf, Journal of Multivariate Analysis, 365 (2004).
[3] O. Ledoit and M. Wolf, The Review of Financial Studies, 4349 (2017).
[4] S. Begušić and Z. Kostanjčar, in (IEEE, 2019), pp. 301–305.
[5] M. Tumminello, F. Lillo, and R. N. Mantegna, EPL (Europhysics Letters), 30006 (2007).
[6] J. Bun, R. Allez, J.-P. Bouchaud, and M. Potters, IEEE Transactions on Information Theory, 7475 (2016).
[7] O. Ledoit, M. Wolf, et al., The Annals of Statistics, 1024 (2012).
[8] D. Bartz, "Cross-validation based nonlinear shrinkage" (2016), arXiv preprint arXiv:1611.00798.
[9] C. Bongiorno, S. Miccichè, and R. N. Mantegna, "Nested partitions from hierarchical clustering statistical validation" (2019), arXiv preprint arXiv:1906.06908.
[10] J. Quackenbush, Nature Reviews Genetics, 418 (2001).
[11] Z. M. Hira and D. F. Gillies, Advances in Bioinformatics (2015).
[12] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, Vol. 10 (Springer Series in Statistics, New York, 2001).
[13] D. W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business (John Wiley & Sons, 2014).
[14] F. A. Roques, D. M. Newbery, and W. J. Nuttall, Energy Economics, 1831 (2008).
[15] M. Arnesano, A.
Carlucci, and D. Laforgia, Energy, 112 (2012).
[16] J. Dunlop, The Journal of Private Equity, 83 (2004).
[17] H. M. Markowitz and G. P. Todd, Mean-Variance Analysis in Portfolio Choice and Capital Markets, Vol. 66 (John Wiley & Sons, 2000).
[18] C. Bongiorno and D. Challet, "Nonparametric sign prediction of high-dimensional correlation matrix coefficients" (2019), arXiv preprint arXiv:2001.11214.
[19] R. N. Mantegna, The European Physical Journal B - Condensed Matter and Complex Systems, 193 (1999).
[20] E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel, et al., Cancer Cell, 133 (2002).
[21] L. v. d. Maaten and G. Hinton, Journal of Machine Learning Research, 2579 (2008).
[22] A. Y. Ng, M. I. Jordan, and Y. Weiss, in Advances in Neural Information Processing Systems (2002), pp. 849–856.
[23] "St. Jude Children's Research Hospital" (2002).

This publication stems from a partnership between CentraleSupélec and BNP Paribas.

Appendix A: Supporting Information Appendix (SI)

1. Average Linkage Filtered Correlation Matrix

a. The Notation

We describe in this section the strictly hierarchical method of Ref. [5]. Given a generic matrix R ∈ ℝ^{n×t}, a generic element σ_ij of the n × n sample covariance matrix is defined as

    σ_ij = (1/t) Σ_{h=1}^{t} (r_ih − r̄_i)(r_jh − r̄_j),    (A1)

where r̄_i = Σ_{h=1}^{t} r_ih / t is the sample mean. The related Pearson correlation coefficient is defined as

    c_ij = σ_ij / √(σ_ii σ_jj).    (A2)

b. Hierarchical Clustering Average Linkage (HCAL)

Hierarchical clustering is an agglomerative algorithm that recursively clusters groups of objects according to a distance. The latter is defined in the simplest way in Ref. [9]: the Pearson correlation matrix C is transformed into a distance matrix D as follows:

    d_ij = 1 − c_ij,    (A3)

which respects the axioms of a distance.
Then a distance metric among clusters must be defined: in the HCAL case, it is based on the average linkage between clusters p and q,

    ρ_pq = Σ_{i∈C_p} Σ_{j∈C_q} d_ij / (n_p n_q),    (A4)

where C_p and C_q are the sets of elements belonging to clusters p and q, respectively, and n_p and n_q are their cardinalities.

Hierarchical clustering works as follows: initially, each element has its own cluster. Then, the pair of clusters (p, q) with the smallest distance ρ_pq is merged into a new cluster s such that C_s = C_p ∪ C_q. The algorithm recursively joins pairs of clusters until all nodes fall into a single cluster. The genealogy G of the hierarchical clustering can be uniquely identified by the sequence of n − 1 merging operations.

c. The Filtered Matrix

Ref. [5] proposes to clean the correlation sub-matrix defined from the indices F_pq = {(i, j) : i ∈ C_p, j ∈ C_q} by replacing all its elements with their average: mathematically, one builds a matrix C_< whose element c_{<,ij} is the average of the correlations c_hk over (h, k) ∈ F_pq, where (p, q) is the pair of clusters merged at the lowest common ancestor of i and j.

2. Bootstrap Average Linkage Correlation Matrix

To overcome these issues of HCAL filtering while keeping its advantages, we propose here a new approach to filter correlation matrices, based on bootstrap resampling of the feature indices of the data matrix; it therefore better accounts for the influence of randomness on the inferred structure. We call it BAHC, which stands for Bootstrap-Averaged Hierarchical Clustering.

Our recipe prescribes to create a set of m bootstrap copies of the data matrix R, denoted by {R^(1), R^(2), …, R^(m)}. A single bootstrap copy R^(b) ∈ ℝ^{n×t} of the data matrix is defined entry-wise as r^(b)_{ij} = r_{i s^(b)_j}, where s^(b) is a vector of dimension t obtained by random sampling with replacement of the elements of the vector {1, 2, …, t}.
The vectors s^(b), b = 1, …, m, are independently sampled. Each bootstrap copy b of the data matrix has an associated Pearson correlation matrix C^(b), from which we construct the HCAL-filtered matrix C^(b)_<. Finally, each element of the filtered Pearson correlation matrix C_BAHC is defined as the average over the m filtered bootstrap copies, i.e.,

    c_BAHC,ij = (1/m) Σ_{b=1}^{m} c^(b)_{<,ij}.

3. Filter Covariance Matrices

To build BAHC-filtered covariance matrices, we first estimate the bootstrapped univariate variances {σ^(1)_{ii}, σ^(2)_{ii}, …, σ^(m)_{ii}}, where a generic element σ^(b)_{ii} of Σ^(b) is defined as

    σ^(b)_{ii} = (1/t) Σ_{h=1}^{t} ( r^(b)_{ih} − r̄^(b)_i )².    (A7)

Then element (i, j) of the b-th bootstrap covariance matrix is defined as σ^(b)_{<,ij} = c^(b)_{<,ij} √(σ^(b)_{ii} σ^(b)_{jj}).

Appendix B: The Oracle estimator

We recall the concept of the Oracle estimator Ξ: given the spectral decomposition of the in-sample correlation matrix C^in = U^in Λ^in U^in† and the spectral decomposition of the out-of-sample correlation matrix C^out = U^out Λ^out U^out†, where Λ^{in/out} are the diagonal matrices made from the eigenvalues of C^{in/out} and U^{in/out} are the matrices defined by the eigenvectors of C^{in/out}, the Oracle eigenvalue matrix is defined as

    Z^in = ( U^in† C^out U^in )_d,    (B1)

where the superscript "in" indicates that the in-sample eigenvectors are used for its estimation, and the operator (·)_d sets all the off-diagonal elements to zero. The Oracle estimator of the correlation matrix is then defined as

    Ξ^in = U^in Z^in U^in†.    (B2)

Ref. [6] shows that the Oracle eigenvalues are the optimal correction of the in-sample eigenvalues Λ^in, in the sense that they minimize the Frobenius norm of the difference between the out-of-sample correlation matrix and the corrected in-sample one, ‖C^out − Ξ^in‖_F. Although this estimator sounds worryingly tautological, since it requires the knowledge of the out-of-sample correlation matrix to construct the most similar estimator, Ref.
[6] shows that it is possible to obtain Z^in in the t, n → ∞ limit at constant q = n/t without the knowledge of C^out, for a broad set of distributions and noises (multiplicative and additive), provided the system is stationary and t > n (low-dimensional regime). Indeed, it is easy to show that the Oracle estimator is exactly C^out if and only if U^in = U^out, since

    Z^in = (U^in† C^out U^in)_d = (U^in† U^out Λ^out U^out† U^in)_d = (U^out† U^out Λ^out U^out† U^out)_d = (Λ^out)_d = Λ^out.    (B3)

Appendix C: Hong Kong equity market

[FIG. 9: out-of-sample risk v_P^out versus t_in, long-short portfolios, Hong Kong equities; in-sample estimator, BAHC, and the alternative methods.]

Figures 9 and 10 report the out-of-sample risk of the covariance matrix cleaning methods with the same setup for the Hong Kong stock exchange (Fig. 9). The analysis covers 1281 stocks from 2005-10-19 to 2017-06-23. The stocks are not simultaneously listed over the whole period: the number of stocks ranges from 590 on 2008-08-22 to 1277 on 2017-06-14. Results are qualitatively consistent with those observed in the US equity market.

Appendix D: List of large-capitalization assets in the US equities dataset

A, AA, AAN, AAP, AAPL, ABC, ABT, ACGL, ACM, ACN, ACV, ADBE, ADI, ADM, ADP, ADS, ADSK, AEE, AEO, AEP, AES, AET, AFG, AFL, AGCO, AGN, AGO, AHL, AIG, AIV, AIZ, AJG, AKAM, AKS, ALB, ALEX, ALL, ALTR, ALV, ALXN, AMAT, AMD, AME, AMG, AMGN, AMP, AMR, AMT, AMTD, AMZN, AN, ANAT, ANF, ANSS, AON, APA, APC, APD, APH, ARCC, ARE, ARW, ASH, ATI, ATLS, ATO, ATR, ATVI, AVB, AVGO, AVP, AVT, AVX, AVY, AWI, AWK,

FIG. 10. Fraction of time the realized risk of BAHC is smaller than the best performing alternative method. Left plot: portfolios with long and short positions; right plot: portfolios with only long positions. The level curve corresponds to a 50% probability.
10,000 independent simulations per point; n = 100 assets.

AXP, AXS, AZO, BA, BAC, BAX, BBBY, BBT, BBY, BDN, BDX, BEN, BG, BIG, BIIB, BIO, BJ, BK, BKD, BLK, BLL, BMRN, BMS, BMY, BOH, BOKF, BPOP, BR, BRO, BSX, BTU, BWA, BXP, BXS, C, CA, CAG, CAH, CAL, CAT, CB, CBS, CBSH, CBT, CCE, CCI, CCK, CCL, CCO, CDNS, CE, CECO, CELG, CERN, CETV, CF, CFFN, CFR, CHD, CHH, CHK, CHRW, CHS, CI, CIEN, CIM, CINF, CIT, CKH, CL, CLB, CLF, CLGX, CLI, CLR, CLX, CMA, CMC, CMCSA, CME, CMG, CMI, CMP, CMS, CNA, CNP, CNX, COF, COG, COL, COO, COP, COST, CPA, CPB, CPRT, CPT, CPWR, CR, CREE, CRK, CRL, CRM, CRS, CSCO, CSL, CSX, CTAS, CTL, CTSH, CTV, CTXS, CVA, CVG, CVS, CVX, CXO, CXW, CY, CYH, D, DAL, DBD, DCI, DDR, DE, DEI, DF, DFS, DG, DGX, DHI, DHR, DIS, DISCA, DISH, DKS, DLB, DLR, DLTR, DNB, DNR, DO, DOV, DOX, DPS, DRE, DRI, DTE, DTV, DUK, DVA, DVN, EAT, EBAY, ECL, ED, EFX, EGN, EIX, EL, EMN, EMR, ENDP, ENR, EOG, EQIX, EQR, EQT, ESRX, ESS, ETN, ETR, EV, EW, EWBC, EXC, EXP, EXPD, EXPE, F, FAST, FCN, FCX, FDS, FDX, FE, FFIV, FHN, FII, FIS, FISV, FITB, FL, FLIR, FLO, FLR, FLS, FMC, FNF, FOSL, FRO, FRT, FSLR, FTI, FTR, FULT, G, GCI, GD, GDI, GE, GEF, GES, GGG, GGP, GHL, GILD, GIS, GLW, GME, GNTX, GNW, GOOG, GPC, GPN, GPRO, GPS, GRMN, GS, GT, GWW, H, HAL, HAS, HBAN, HBI, HCC, HCP, HD, HE, HES, HI, HIG, HK, HLF, HOG, HOLX, HON, HP, HPQ, HPT, HRB, HRC, HRL, HRS, HSC, HSIC, HST, HSY, HTZ, HUM, HUN, IBKR, IBM, ICE, IDXX, IEX, IFF, IGT, ILMN, INTC, INTU, IP, IPG, IPI, IR, IRM, ISCA, ISRG, IT, ITRI, ITT, ITW, IVZ, JBHT, JBL, JCI, JCP, JEC, JEF, JLL, JNJ, JNPR, JOE, JPM, JWN, K, KAR, KBH, KBR, KEX, KEY, KIM, KLAC, KMB, KMT, KMX, KO, KR, KSS, KSU, L, LAMR, LAZ, LBTYA, LEA, LECO, LEG, LEN, LH, LIFE, LII, LLL, LLY, LM, LMT, LNC, LNT, LOW, LPNT, LRCX, LSI, LSTR, LUV, LVS, M, MA, MAC, MAN, MAR, MAS, MAT, MBI, MCD, MCHP, MCK, MCO, MCY, MD, MDC, MDP, MDR, MDRX, MDT, MDU, MET, MGM, MHK, MKC, MKL, MLM, MMC, MMM, MO, MORN, MOS, MRK, MRO, MRVL, MS, MSFT, MSM, MTB, MTD, MTW, MU, MUR, MXIM, MYGN, MYL, NATI, NAV, NBL, NBR, NCR, NDAQ, NEE, NEM,
NFG, NFLX, NFX, NI, NIHD, NKE, NLY, NOC, NOV, NRG, NSC, NSM, NTAP, NTRS, NUAN, NUE, NVDA, NVR, NWL, NWSA, NYT, O, OC, ODP, OFC, OGE, OI, OII, OIS, OKE, OMC, ORA, ORCL, ORI, ORLY, OSK, OXY, PAYX, PBCT, PBI, PCAR, PCG, PDCO, PDM, PEG, PENN, PEP, PFE, PFG, PG, PGR, PH, PHM, PKG, PKI, PLD, PM, PNC, PNR, PNW, PPG, PPL, PRGO, PRU, PSA, PTEN, PVH, PWR, PX, PXD, QCOM, R, RBC, RCL, RDC, RE, REG, REGN, RF, RGA, RGLD, RHI, RHT, RJF, RL, RMBS, RMD, RNR, ROK, ROP, ROST, RPM, RRC, RRD, RS, RSG, RTN, RYN, S, SATS, SBAC, SBUX, SCCO, SCG, SCHN, SCHW, SCI, SD, SE, SEE, SEIC, SHLD, SHW, SIG, SIRI, SJM, SLAB, SLB, SLG, SLM, SM, SMG, SNA, SNH, SNPS, SNV, SO, SON, SPG, SPN, SPR, SRCL, SRE, STI, STLD, STRA, STT, STX, STZ, SUN, SVU, SWK, SWKS, SWN, SYK, SYMC, SYY, T, TAP, TCO, TDC, TDG, TDS, TDW, TECD, TECH, TER, TEX, TFSL, TFX, TGT, THC, THG, THO, TIF, TJX, TK,