Covariance matrix filtering with bootstrapped hierarchies
Christian Bongiorno and Damien Challet
Université Paris-Saclay, CentraleSupélec, Laboratoire de Mathématiques et Informatique pour les Systèmes Complexes, 91190, Gif-sur-Yvette, France
Cleaning covariance matrices is a highly non-trivial problem, yet of central importance in the statistical inference of dependence between objects. We propose here a probabilistic hierarchical clustering method, named Bootstrapped Average Hierarchical Clustering (BAHC), that is particularly effective in the high-dimensional case, i.e., when there are more objects than features. When applied to DNA microarray data, our method yields distinct hierarchical structures that cannot be accounted for by usual hierarchical clustering. We then use global minimum-variance risk management to test our method and find that BAHC leads to significantly smaller realized risk than state-of-the-art linear and nonlinear filtering methods in the high-dimensional case. Spectral decomposition shows that BAHC better captures the persistence of the dependence structure between asset price returns in the calibration and the test periods.
Covariance matrix inference is a cornerstone of the inference of dependence between objects. This kind of matrix suffers, however, from the curse of dimensionality: it becomes very noisy when the number of objects is similar to the number of features. Even worse, unfiltered covariance matrices are pathological in the high-dimensional case, i.e., when the number of objects exceeds the number of features. This case is frequent, e.g., in biological data and in multivariate dynamical systems such as financial markets, in which only the most recent history is likely to be relevant.

Given its importance, covariance matrix filtering has a long history. A popular approach is to obtain filtered covariance matrices from the corresponding correlation matrices. Two types of approaches stand out: (i) spectral methods, e.g., Random Matrix Theory, Rotationally Invariant Estimators [1], and shrinkage [2, 3]; (ii) ansätze for the correlation matrix, e.g., block-diagonal [4] or hierarchical [5].

The usual setting is to have n objects and t features and to compute the correlation matrix between these n objects. Recent results on Rotationally Invariant Estimators [6] propose nonlinear shrinkage methods able to correct the eigenvalue spectrum of covariance matrices optimally: the inversion of the QuEST function [7], the Cross-Validated (CV) eigenvalue shrinkage [8] and the IW-regularization [1], the latter being valid only in the low-dimensional regime q = n/t < 1, i.e., when there are more features than objects. Eigenvector filtering is more complex. However, ansätze for the shape of the true correlation matrix impose constraints on the structure of the eigenvectors and of the eigenvalues. Such an ansatz should be simple enough to clean noise but flexible enough to account for fine relevant details. The popular hierarchical clustering ansatz (HC hereafter) is indeed simple: it assumes that correlations are nested [5], which is equivalent to assuming that dependencies are described by a dendrogram (a tree). In practice, it is hard to find statistically validated hierarchical structures [9] because the fitted hierarchical structure is highly sensitive to small variations of the data.

An obvious problem of HC occurs when the structure is more complex than a tree: for example, the non-diagonal blocks in Figs. 1 and 2 are ignored by a hierarchical ansatz; one needs more than a single hierarchical structure to describe these empirical dependence structures. As a consequence, a non-negligible part of the dependence structure is left out and, in a dynamic context, the stability of a single hierarchical structure is likely to be poor.

Here, we introduce a more flexible hierarchical ansatz able to capture more of the structure of the eigenvectors. The rationale is to compute filtered hierarchical structures of many bootstrapped copies of the initial data, which yields probabilistic hierarchical structures. Such a procedure describes the structure of correlation and covariance matrices better while keeping the robustness of hierarchical clustering. We illustrate the power of our method with data from two relevant fields. First, in bioinformatics, DNA microarray gene-expression dependence in tissues is frequently characterized by correlation matrices. Hierarchical clustering and its variants are commonly used [10, 11], which helps simplify the covariance matrix by linkage averaging [12] (see Fig. 1).
When there are several strong candidates for the hierarchical structure, this approach selects a single one, which neglects possibly crucial information held by alternative structures. Comparing unfiltered correlation matrices with the filtering yielded by hierarchical clustering with average linkage (HCAL) [5] (Fig. 1) makes it clear that (i) hierarchical clustering does capture some of the structure and (ii) a substantial part of the structure is lost (see the bottom plot). This is because hierarchical clustering imposes too strict a structure, which erases an uncontrolled amount of information.

[arXiv: q-fin.RM]

FIG. 1. Correlation matrix from tissue-gene microarray data of patients affected by lung cancer. The upper left plot is the sample correlation matrix; the upper right plot is the result of hierarchical clustering with average-linkage averaging (HCAL). The bottom plot is the difference between the two: it still has evident structure unaccounted for by HCAL.

FIG. 2. Correlation matrix of US equities price returns in the 2008-01-23 to 2008-11-04 period (left plot) and in the 2008-11-05 to 2009-08-24 period (right plot). The elements of both panels are ordered according to the in-sample HCAL dendrogram of the first period.

Another domain in which covariance matrix filtering plays a central role is risk management. Broadly speaking, the problem amounts to minimizing future uncertainty by determining the fraction of resources to allocate to every possible choice. Risk in this particular context is due to fluctuations of the future value of the choices. The usual procedure consists in minimizing a suitable risk measure in the calibration window and hoping that the future, realized, risk will bear some relationship with the
calibrated risk.

The simplest approach consists in defining risk as the variance of the weighted sum of the choices' values and minimizing it. This is known as global minimum-variance portfolios, a subfield of quadratic portfolio optimization, which has a wide range of applications: investment into technologies [13], energy source mixes for countries [14, 15], wind farm locations [16], and capital allocation in finance [17]. We shall focus on financial risk because data are abundant, which makes it possible to compare the out-of-sample performance of filtering methods. In addition, the high-dimensional regime is particularly relevant in finance: there are many assets to choose from, and the speed with which the dependence structure between asset price returns may change calls for as short a calibration period as possible [18].

In an inference or descriptive context such as DNA microarray data analysis, filtering correlation matrices is meant to bring estimated covariance matrices closer to the ground truth. In a dynamical context, especially for non-stationary systems such as financial markets, what matters is the part of the ground truth that most likely persists after the calibration period, i.e., when one uses the allocation weights computed from the filtered covariance matrix. Thus, ideally, the filtered covariance matrix should contain as much of the persistent structure as possible. The nature of the most likely persistent structure is of course unknown from the calibration window only. Figure 2 shows that there are indeed strongly persistent dependence structures of asset price returns between two non-overlapping periods. Similarly to correlation matrices of DNA microarray data, while a pure HC does capture a sizeable part of the useful structure, the non-diagonal correlation pattern blocks, e.g., around (x, y) = (140, …), are not captured.

RESULTS

Microarray DNA
We first apply the BAHC method to DNA microarray data [20], where the objects are n = 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and the features are the expression intensities of t = 271 genes (q = n/t ≈ 1.2).

FIG. 3. Bidimensional t-SNE projection of the cophenetic distance between the dendrograms of 1000 bootstraps of DNA microarray data [20]. Two main clusters emerge, with further subclusters, corresponding to distinct potential hierarchies of dependence that are compatible with the data. The red crosses indicate the centroids of the two largest clusters, whose structural differences appear in the tanglegram of the right plot.

We use spectral clustering [22] to determine sub-clusters of each main cluster. Typically, sub-clusters within either of the main clusters differ at lower levels of branching. In summary, sub-groups of cancers that lie on far branches of the sample dendrograms could be misclassified as uncorrelated despite being possibly much closer in the dendrograms of many bootstraps.
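The bootstrap-and-filter construction used above can be sketched in a few lines. This is a minimal illustration assuming NumPy/SciPy, not the released BAHC package; it relies on the fact that, for average linkage applied to the distance d_ij = 1 − c_ij, the HCAL-filtered correlation of a pair equals one minus its cophenetic distance, since both are the average correlation between the two clusters merged at the pair's lowest common ancestor.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def hcal_filter(C):
    """Average-linkage (HCAL) filtering of a correlation matrix C.

    With d_ij = 1 - c_ij and average linkage, the filtered correlation of
    (i, j) is one minus the cophenetic distance of the pair."""
    d = squareform(1.0 - C, checks=False)   # condensed distance vector
    Z = linkage(d, method="average")        # HCAL dendrogram
    C_f = 1.0 - squareform(cophenet(Z))     # back to a square matrix
    np.fill_diagonal(C_f, 1.0)
    return C_f

def bahc_correlation(R, m=100, rng=None):
    """BAHC sketch: average the HCAL-filtered correlation matrices of m
    feature-wise bootstrap copies of the n x t data matrix R."""
    rng = np.random.default_rng(rng)
    n, t = R.shape
    acc = np.zeros((n, n))
    for _ in range(m):
        s = rng.integers(0, t, size=t)      # resample features with replacement
        acc += hcal_filter(np.corrcoef(R[:, s]))
    return acc / m
```

Collapsing the two HCAL steps (agglomeration, then cross-cluster averaging) into a single cophenetic-matrix computation keeps the sketch short; the released package may be organized differently.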
Risk minimization
Given the n × (t + 1) matrix of the values p_{i,k} of choice i at time k, and the value returns r_{i,k} = p_{i,k}/p_{i,k−1} − 1, one allocates a fraction w_i of the resources to choice i, the i-th component of the weight vector w. The risk is measured by the standard deviation of the portfolio return, denoted by v_P, with

    v_P = √( wᵀ Σ w ),

where Σ is the n × n covariance matrix of the matrix of returns R. If the weights can be negative, the optimal weights are

    w̃ = Σ⁻¹·1 / (1ᵀ·Σ⁻¹·1),

with the condition Σ_i w_i = 1 imposed in order to avoid the trivial solution w = 0. This situation is called the long-short portfolio in the following. In some situations, e.g., when choosing one's portfolio of energies or products, only positive weights are allowed, in which case one has to solve a quadratic programming problem; we refer to this situation as the long-only portfolio.

The realized (out-of-sample) risk is the relevant performance measure. Using the superscript "out", the realized risk is

    v_P^out = √( (w̃)† Σ^out w̃ ),

where the weights w̃ are computed from the in-sample covariance matrix, filtered or not, and X† is the transpose of matrix X.

All the results reported below use the simulation setup described in the Methods section: in short, we perform 10,000 simulations of n = 100 random assets in random periods. We compare the out-of-sample risk computed from BAHC and several other well-known methods: the classic Ledoit and Wolf linear shrinkage method (LW henceforth) [2] and the more recent nonlinear shrinkage approach based on the inversion of the QuEST function (QuEST) [7]. We also include the Cross-Validated eigenvalue shrinkage (CV) [8] and HCAL [5], denoted by <.

Figure 4 shows that BAHC outperforms all the alternative methods for t_in ≲ … (q = n/t ≳ …), which includes all of the high-dimensional regime q > 1. In particular, for the long-only portfolios, the BAHC method reaches the absolute minimum out-of-sample risk over all t_in and all methods for t_in ≈ … (q ≈ …); for q > 1/2, BAHC is better than all the other methods not only with respect to the average realized risk, but also in probability.

Finally, we vary the length of the test window, t_out. We report the probability that the BAHC method outperforms all its competitors as a function of both t_in and t_out in Fig. 5. Our approach achieves a lower realized risk in more than half of the simulations than any other method tested here as soon as t_in < 226 (q > 100/226 ≈ 0.44) for every t_out in the considered range. Remarkably, as t_out increases, the calibration length below which BAHC has better than 50% chances of outperforming all its competitors only weakly increases. We interpret this result by the fact that our method is able to extract the right kind of persistent structure in these particular data, which is confirmed below by spectral analysis. We found similar results for the Hong Kong equity market (see S.I.).

Spectral Properties
In order to understand why and when our method has a better performance than the other methods based on spectral clustering, it is instructive to compare the in- and out-of-sample persistence of the eigenvalues and eigenvectors produced by all the filtering methods considered here. The spectral decomposition of the correlation matrix C is denoted by C = U Λ U†, where U is the n × n matrix formed by the eigenvectors of C and Λ is the diagonal matrix obtained from the corresponding eigenvalues.

[FIG. 4: out-of-sample risk v_P^out versus t_in, long-short portfolios; in-sample estimator, BAHC, and the alternative methods.]

FIG. 5. Fraction of time BAHC yields a smaller realized risk than all the alternative methods. Left plot: portfolios with positive and negative weights; right plot: portfolios with only positive weights. The dotted line corresponds to q = n/t = 1, and the level curve to a 50% probability. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.

Eigenvector stability

A simple way to characterise eigenvector stability is to compare the empirical out-of-sample correlation matrix C^out with the Oracle correlation estimator defined as Ξ_C^in = U^in Z^in U^in†, where Z^in = diag(U^in† C^out U^in) is the Oracle eigenvalue estimator, the idea being that Ξ_C^in = C^out if the in- and out-of-sample eigenvectors coincide (see S.I.). The Oracle estimator for the covariance matrix, denoted by Ξ_Σ^in, is defined in a similar way.

Figure 6 reports the Frobenius distances (see the Methods section) ‖C^out − Ξ_C^in‖_F^C and ‖Σ^out − Ξ_Σ^in‖_F^Σ as a function of t_in for n = 100 assets.

FIG. 6. Frobenius distance between the out-of-sample matrices and the Oracle estimators obtained with the in-sample eigenvectors (in), the in-sample BAHC-filtered eigenvectors (BAHC) and the in-sample HCAL-filtered eigenvectors (<). Upper panels refer to correlation matrices C, lower panels to covariance matrices Σ. The left panels are the Frobenius norm of the difference between the estimator and the out-of-sample realization; the right panels are the fraction of time BAHC outperforms the alternative estimators. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.

Note that the CV, LW and QuEST methods all use the in-sample eigenvectors and thus do not need separate computations. Generally, our method yields more stable correlation and covariance matrices for t_in < 300 (q > 1/3), all the more so as t_in becomes smaller. The same applies to the comparison between BAHC-filtered and empirical covariance matrices, while HCAL, denoted by <, has a better performance in about 25% of the samples, almost independently of t_in. In short, as soon as q > 1/3, BAHC yields the most stable eigenvectors.

Eigenvalue stability

FIG. 7. Residues ε_hi and ε_low as a function of t_in, for the correlation matrix (upper panel) and the covariance matrix (lower panel). 10,000 simulations with random calibration windows and a random selection of n = 100 assets; t_out = 42 days, US equities.

Since both the covariance matrix Σ and the precision matrix Σ⁻¹ are relevant to minimum-variance optimization, we measure two types of residues that focus on large and small eigenvalues, defined as

    ε_hi = √( (1/n) Σ_{i=1}^{n} (λ_i − z_i)² ),    (1)

    ε_low = √( (1/n) Σ_{i=1}^{n} (1/λ_i − 1/z_i)² ),    (2)

where λ_i = (Λ)_{ii} is the i-th eigenvalue of the in-sample estimator, z_i = (Z^in)_{ii} comes from the Oracle estimator computed with the respective filtered eigenvector matrix, and i is the respective rank of these eigenvalues.
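As a concrete sketch of these two residues (assuming NumPy; `spectral_residues` is an illustrative helper name, not the paper's code), with in-sample and Oracle eigenvalues matched by rank:

```python
import numpy as np

def spectral_residues(lam, z):
    """Residues of Eqs. (1)-(2): eps_hi weights errors on the largest
    eigenvalues, eps_low weights errors on the smallest (precision) ones."""
    lam = np.sort(np.asarray(lam, dtype=float))  # match in-sample and
    z = np.sort(np.asarray(z, dtype=float))      # Oracle eigenvalues by rank
    eps_hi = np.sqrt(np.mean((lam - z) ** 2))
    eps_low = np.sqrt(np.mean((1.0 / lam - 1.0 / z) ** 2))
    return eps_hi, eps_low
```

Because ε_low compares inverse eigenvalues, an error on a near-zero eigenvalue dominates it, which is exactly the part of the spectrum that matters for the precision matrix Σ⁻¹.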
The residue measure ε_hi mainly accounts for the discrepancy between the largest eigenvalues, while the residue measure ε_low attributes more weight to the discrepancy between the smallest eigenvalues.

Figure 7 plots the residues of the correlation and covariance matrices, respectively, as a function of t_in. We compare our approach with the sample estimator, the HCAL-filtered matrix, and the Cross-Validated (CV) eigenvalue distribution. The CV method outperforms all the other methods when t_in ≲ … (q ≳ …); BAHC remains close to it in this regime.

Filtered correlation and covariance matrices

The ultimate test is of course to compare filtered in-sample matrices with out-of-sample matrices. Figure 8 reports the Frobenius distance between the filtered in-sample and the out-of-sample correlation and covariance matrices for all the tested methods. Expectedly, BAHC outperforms all the other methods for t_in ≲ ….

[FIG. 8: Frobenius distance ‖C_filtered^in − C^out‖_F^C versus t_in for all tested methods. 10,000 independent simulations per point; t_out = 42 days, n = 100 assets, US equities.]

In addition, the distance obtained with BAHC is lower than that obtained with HCAL, which shows that the BAHC method outperforms HCAL filtering for every t_in.

I. DISCUSSION

Filtering covariance and correlation matrices requires taking care of O(n²) coefficients. Focusing on O(n) variables, for example by tweaking the eigenvalues or using a single hierarchical ansatz, works to some extent. Making further progress requires filtering more variables, if possible while keeping an O(n) ansatz. This is what the BAHC method that we introduce achieves: by using m bootstraps and applying an O(n) structure to each of them, BAHC allows some additional flexibility, while keeping the overall structure simple.

Our method both filters out estimation noise and improves the stability of the eigenvectors in a dynamical context. Indeed, the spectral decomposition of BAHC-filtered correlation matrices is close to the optimal CV method with respect to the eigenvalue distribution.
Furthermore, in the dynamical context investigated here, the eigenvectors produced by our method have a higher overlap with the out-of-sample ones than the unfiltered in-sample eigenvectors for reasonably small t/n, i.e., for large q = n/t. This is why our method leads to better minimum-variance portfolios than all the competing filtering methods when the calibration window is small. In particular, if no short selling is allowed, our approach produces, on average, the lowest-risk portfolio.

Future work is needed to better characterize the average dependence structure produced by BAHC, from both theoretical and empirical points of view. In addition, BAHC may still be too strict in some cases and thus leave out valuable information; hence, further refinements of the ansatz will need to be investigated.

MATERIALS AND METHODS

Datasets description

We consider the daily close-to-close returns of US equities, adjusted for dividends, splits, and other corporate events. More precisely, the dataset consists of large-capitalization stocks, from 1992-02-03 to 2018-06-29. The number of stocks with data varies over time: it ranges from 399 on 1992-02-06 to 723 on 2018-06-29 and is roughly constant from 2008 onwards. The list of tickers is reported in the S.I.

DNA microarray data [20] can be downloaded from [23]. The dataset consists of the gene expression intensities of 327 tissues of patients affected by pediatric acute lymphoblastic leukemia, restricted to a subset of 271 genes.

Numerical simulations with financial data

All the simulations are carried out in the same way: each point of each plot is an average over 10,000 simulations, each of which includes an in-sample window of length t_in and an out-of-sample window of length t_out = 42 days (about two trading months), unless otherwise specified; each simulation starts from a random day uniformly chosen in the available dataset.
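The risk computation performed in each simulation can be sketched as follows. This is a minimal sketch assuming NumPy; `min_variance_weights` and `realized_risk` are illustrative helper names implementing the long-short formulas of the Risk minimization section, not the paper's code.

```python
import numpy as np

def min_variance_weights(sigma):
    """Long-short global minimum-variance solution: w proportional to
    Sigma^{-1} 1, normalized so that the weights sum to one."""
    ones = np.ones(sigma.shape[0])
    x = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without explicit inversion
    return x / (ones @ x)

def realized_risk(w, sigma_out):
    """Out-of-sample portfolio volatility sqrt(w^T Sigma_out w)."""
    return float(np.sqrt(w @ sigma_out @ w))
```

In a simulation, `sigma` would be the (possibly BAHC-filtered) in-sample covariance matrix and `sigma_out` the covariance matrix estimated on the test window; solving the linear system instead of inverting Σ is the standard numerically stabler choice.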
To have meaningful in- and out-of-sample windows given the maximum t_in considered, the first day of the out-of-sample window must be after 2000-01-01; each simulation selects n = 100 assets at random among the assets with no missing value in both the in- and out-of-sample windows.

BAHC algorithm

Given a matrix R ∈ ℝ^{n×t}, our method prescribes to create a set of m bootstrap (feature-wise) copies of R, denoted by {R^(1), R^(2), …, R^(m)}. A single bootstrap copy R^(b) ∈ ℝ^{n×t} of the data matrix has elements r^(b)_{ij} = r_{i s^(b)_j}, where s^(b) is a vector of dimension t obtained by random sampling with replacement of the elements of the vector {1, 2, …, t}. The vectors s^(b), b = 1, …, m, are independently sampled.

The Pearson correlation matrix of each bootstrapped data matrix R^(b) is then computed and denoted by C^(b); in turn, the latter is filtered with the hierarchical clustering average linkage (HCAL) procedure proposed in [9], which yields C^(b)_<. In short, HCAL uses two ingredients: the distance D = 1 − C to agglomerate clusters in a hierarchical way, and the averaging of the correlations between clusters (see the S.I. and [9] for more details). Finally, the filtered correlation matrix C_BAHC is the average of the HCAL-filtered matrices C^(b)_<:

    C_BAHC = (1/m) Σ_{b=1}^{m} C^(b)_<.    (3)

To build a BAHC-filtered covariance matrix, we estimate the variance of each bootstrapped row r^(b)_i, denoted by σ^(b)_{ii}, and compute the HCAL-filtered covariance matrices Σ^(b)_< whose elements are defined from the σ^(b)_{ii} and C^(b)_< (see the S.I.); the BAHC-filtered covariance matrix is the average of the Σ^(b)_< over the m bootstraps.

We use rescaled Frobenius norms to account for the fact that the number of assets in our dataset depends on time:

    ‖X‖_F^Σ = √( Σ_{i,j=1}^{n} x_{ij}² / n² ).    (4)

In addition, the CV, LW and QuEST methods do not guarantee the identity on the diagonal of the filtered correlation matrices; therefore, contrarily to BAHC, we do not include the diagonal elements in the metric and thus define

    ‖X‖_F^C = √( Σ_{i>j} x_{ij}² / (n(n−1)/2) ).    (5)

We found that the performance of the CV, LW and QuEST-based correlation estimators is slightly improved by replacing c_ij with c_ij/√(c_ii c_jj), which also ensures that the diagonal elements equal one; we have therefore used this modification in our analysis.

Source code

We have written a BAHC package for both R and Python, available from CRAN and PyPI, respectively.

[1] J. Bun, J.-P. Bouchaud, and M. Potters, Physics Reports, 1 (2017).
[2] O. Ledoit and M. Wolf, Journal of Multivariate Analysis, 365 (2004).
[3] O. Ledoit and M. Wolf, The Review of Financial Studies, 4349 (2017).
[4] S. Begušić and Z. Kostanjčar, in (IEEE, 2019), pp. 301–305.
[5] M. Tumminello, F. Lillo, and R. N. Mantegna, EPL (Europhysics Letters), 30006 (2007).
[6] J. Bun, R. Allez, J.-P. Bouchaud, and M. Potters, IEEE Transactions on Information Theory, 7475 (2016).
[7] O. Ledoit, M. Wolf, et al., The Annals of Statistics, 1024 (2012).
[8] D. Bartz, "Cross-validation based nonlinear shrinkage" (2016), arXiv preprint arXiv:1611.00798.
[9] C. Bongiorno, S. Miccichè, and R. N. Mantegna, "Nested partitions from hierarchical clustering statistical validation" (2019), arXiv preprint arXiv:1906.06908.
[10] J. Quackenbush, Nature Reviews Genetics, 418 (2001).
[11] Z. M. Hira and D. F. Gillies, Advances in Bioinformatics (2015).
[12] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, Vol. 10 (Springer Series in Statistics, New York, 2001).
[13] D. W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business (John Wiley & Sons, 2014).
[14] F. A. Roques, D. M. Newbery, and W. J. Nuttall, Energy Economics, 1831 (2008).
[15] M. Arnesano, A.
Carlucci, and D. Laforgia, Energy, 112 (2012).
[16] J. Dunlop, The Journal of Private Equity, 83 (2004).
[17] H. M. Markowitz and G. P. Todd, Mean-Variance Analysis in Portfolio Choice and Capital Markets, Vol. 66 (John Wiley & Sons, 2000).
[18] C. Bongiorno and D. Challet, "Nonparametric sign prediction of high-dimensional correlation matrix coefficients" (2019), arXiv preprint arXiv:2001.11214.
[19] R. N. Mantegna, The European Physical Journal B - Condensed Matter and Complex Systems, 193 (1999).
[20] E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel, et al., Cancer Cell, 133 (2002).
[21] L. v. d. Maaten and G. Hinton, Journal of Machine Learning Research, 2579 (2008).
[22] A. Y. Ng, M. I. Jordan, and Y. Weiss, in Advances in Neural Information Processing Systems (2002), pp. 849–856.
[23] "St. Jude Children's Research Hospital" (2002).

This publication stems from a partnership between CentraleSupélec and BNP Paribas.

Appendix A: Supporting Information Appendix (SI)

1. Average Linkage Filtered Correlation Matrix

a. The Notation

We describe in this section the strictly hierarchical method of Ref. [5]. Given a generic matrix R ∈ ℝ^{n×t}, a generic element σ_ij of the n × n sample covariance matrix is defined as

    σ_ij = (1/t) Σ_{h=1}^{t} (r_ih − r̄_i)(r_jh − r̄_j),    (A1)

where r̄_i = Σ_{h=1}^{t} r_ih / t is the sample mean. The related Pearson correlation coefficient is defined as

    c_ij = σ_ij / √(σ_ii σ_jj).    (A2)

b. Hierarchical Clustering Average Linkage (HCAL)

Hierarchical clustering is an agglomerative algorithm that recursively clusters groups of objects according to a distance. The latter is defined in the simplest way in Ref. [9]: the Pearson correlation matrix C is transformed into a distance matrix D as follows:

    d_ij = 1 − c_ij,    (A3)

which respects the axioms of a distance.
Then a distance metric among clusters must be defined: in the HCAL case, it is based on the average linkage between clusters p and q,

    ρ_pq = Σ_{i∈C_p} Σ_{j∈C_q} d_ij / (n_p n_q),    (A4)

where C_p and C_q are the sets of elements belonging to clusters p and q, respectively, and n_p and n_q are their cardinalities.

Hierarchical clustering works as follows: initially, each element has its own cluster. Then, the pair of clusters (p, q) with the smallest distance ρ_pq is merged into a new cluster s such that C_s = C_p ∪ C_q. The algorithm recursively joins pairs of clusters until all nodes fall into a single cluster. The genealogy G of the hierarchical clustering can be uniquely identified by the sequence of n − 1 merging operations.

c. The Filtered Matrix

Ref. [5] proposes to clean the correlation sub-matrix defined from the indices F_pq = {(i, j) : i ∈ C_p, j ∈ C_q} by replacing all its elements with their average: mathematically, one builds a matrix C_< whose element c_{<,ij} is the average of the correlations c_hk over (h, k) ∈ F_pq, where (p, q) is the pair of clusters merged at the lowest common ancestor of i and j.

2. Bootstrap Average Linkage Correlation Matrix

To overcome these issues of HCAL filtering while keeping its advantages, we propose here a new approach to filter correlation matrices, based on bootstrap resampling of the feature indices of the data matrix; it therefore better accounts for the influence of randomness on the inferred structure. We call it BAHC, which stands for Bootstrap-Averaged Hierarchical Clustering.

Our recipe prescribes to create a set of m bootstrap copies of the data matrix R, denoted by {R^(1), R^(2), …, R^(m)}. A single bootstrap copy R^(b) ∈ ℝ^{n×t} of the data matrix is defined entry-wise as r^(b)_{ij} = r_{i s^(b)_j}, where s^(b) is a vector of dimension t obtained by random sampling with replacement of the elements of the vector {1, 2, …, t}.
The vectors s^(b), b = 1, …, m, are independently sampled. Each bootstrap copy b of the data matrix has an associated Pearson correlation matrix C^(b), from which we construct the HCAL-filtered matrix C^(b)_<. Finally, each element of the filtered Pearson correlation matrix C_BAHC is defined as the average over the m filtered bootstrap copies, i.e.,

    c_BAHC,ij = (1/m) Σ_{b=1}^{m} c^(b)_{<,ij}.

3. Filter Covariance Matrices

To build BAHC-filtered covariance matrices, we first estimate the bootstrapped univariate variances {σ^(1)_{ii}, σ^(2)_{ii}, …, σ^(m)_{ii}}, where a generic element σ^(b)_{ii} of Σ^(b) is defined as

    σ^(b)_{ii} = (1/t) Σ_{h=1}^{t} ( r^(b)_{ih} − r̄^(b)_i )².    (A7)

Then element (i, j) of the b-th bootstrap covariance matrix is defined as σ^(b)_{<,ij} = c^(b)_{<,ij} √(σ^(b)_{ii} σ^(b)_{jj}).

Appendix B: The Oracle estimator

We recall the concept of the Oracle estimator Ξ: given the spectral decomposition of the in-sample correlation matrix C^in = U^in Λ^in U^in† and the spectral decomposition of the out-of-sample correlation matrix C^out = U^out Λ^out U^out†, where Λ^{in/out} are the diagonal matrices made from the eigenvalues of C^{in/out} and U^{in/out} are the matrices defined by the eigenvectors of C^{in/out}, the Oracle eigenvalue matrix is defined as

    Z^in = ( U^in† C^out U^in )_d,    (B1)

where the superscript "in" indicates that the in-sample eigenvectors are used for its estimation, and the operator (·)_d sets all the off-diagonal elements to zero. The Oracle estimator of the correlation matrix is then defined as

    Ξ^in = U^in Z^in U^in†.    (B2)

Ref. [6] shows that the Oracle eigenvalues are the optimal correction of the in-sample eigenvalues Λ^in, in the sense that they minimize the Frobenius norm of the difference between the out-of-sample correlation matrix and the corrected in-sample one, ‖C^out − Ξ^in‖_F. Although this estimator sounds worryingly tautological, since it requires the knowledge of the out-of-sample correlation matrix to construct the most similar estimator, Ref.
[6] shows that it is possible to obtain Z^in in the t, n → ∞ limit at constant q = n/t without the knowledge of C^out, for a broad set of distributions and noises (multiplicative and additive), provided the system is stationary and t > n (low-dimensional regime). Indeed, it is easy to show that the Oracle estimator is exactly C^out if and only if U^in = U^out, since

    Z^in = (U^in† C^out U^in)_d = (U^in† U^out Λ^out U^out† U^in)_d = (U^out† U^out Λ^out U^out† U^out)_d = (Λ^out)_d = Λ^out.    (B3)

Appendix C: Hong Kong equity market

[FIG. 9: out-of-sample risk v_P^out versus t_in, long-short portfolios, Hong Kong equities; in-sample estimator, BAHC, and the alternative methods.]

Figures 9 and 10 report the out-of-sample risk of the covariance matrix cleaning methods with the same setup for the Hong Kong stock exchange (Fig. 9). The analysis covers 1281 stocks from 2005-10-19 to 2017-06-23. The stocks are not simultaneously listed over the whole period: the number of stocks ranges from 590 on 2008-08-22 to 1277 on 2017-06-14. Results are qualitatively consistent with those observed in the US equity market.

Appendix D: List of large-capitalization assets in the US equities dataset

A, AA, AAN, AAP, AAPL, ABC, ABT, ACGL, ACM, ACN, ACV, ADBE, ADI, ADM, ADP, ADS, ADSK, AEE, AEO, AEP, AES, AET, AFG, AFL, AGCO, AGN, AGO, AHL, AIG, AIV, AIZ, AJG, AKAM, AKS, ALB, ALEX, ALL, ALTR, ALV, ALXN, AMAT, AMD, AME, AMG, AMGN, AMP, AMR, AMT, AMTD, AMZN, AN, ANAT, ANF, ANSS, AON, APA, APC, APD, APH, ARCC, ARE, ARW, ASH, ATI, ATLS, ATO, ATR, ATVI, AVB, AVGO, AVP, AVT, AVX, AVY, AWI, AWK,

FIG. 10. Fraction of time the realized risk of BAHC is smaller than the best performing alternative method. Left plot: portfolios with long and short positions; right plot: portfolios with only long positions. The level curve corresponds to a 50% probability.
10,000 independent simulations per point; n = 100 assets.

AXP, AXS, AZO, BA, BAC, BAX, BBBY, BBT, BBY, BDN, BDX, BEN, BG, BIG, BIIB, BIO, BJ, BK, BKD, BLK, BLL, BMRN, BMS, BMY, BOH, BOKF, BPOP, BR, BRO, BSX, BTU, BWA, BXP, BXS, C, CA, CAG, CAH, CAL, CAT, CB, CBS, CBSH, CBT, CCE, CCI, CCK, CCL, CCO, CDNS, CE, CECO, CELG, CERN, CETV, CF, CFFN, CFR, CHD, CHH, CHK, CHRW, CHS, CI, CIEN, CIM, CINF, CIT, CKH, CL, CLB, CLF, CLGX, CLI, CLR, CLX, CMA, CMC, CMCSA, CME, CMG, CMI, CMP, CMS, CNA, CNP, CNX, COF, COG, COL, COO, COP, COST, CPA, CPB, CPRT, CPT, CPWR, CR, CREE, CRK, CRL, CRM, CRS, CSCO, CSL, CSX, CTAS, CTL, CTSH, CTV, CTXS, CVA, CVG, CVS, CVX, CXO, CXW, CY, CYH, D, DAL, DBD, DCI, DDR, DE, DEI, DF, DFS, DG, DGX, DHI, DHR, DIS, DISCA, DISH, DKS, DLB, DLR, DLTR, DNB, DNR, DO, DOV, DOX, DPS, DRE, DRI, DTE, DTV, DUK, DVA, DVN, EAT, EBAY, ECL, ED, EFX, EGN, EIX, EL, EMN, EMR, ENDP, ENR, EOG, EQIX, EQR, EQT, ESRX, ESS, ETN, ETR, EV, EW, EWBC, EXC, EXP, EXPD, EXPE, F, FAST, FCN, FCX, FDS, FDX, FE, FFIV, FHN, FII, FIS, FISV, FITB, FL, FLIR, FLO, FLR, FLS, FMC, FNF, FOSL, FRO, FRT, FSLR, FTI, FTR, FULT, G, GCI, GD, GDI, GE, GEF, GES, GGG, GGP, GHL, GILD, GIS, GLW, GME, GNTX, GNW, GOOG, GPC, GPN, GPRO, GPS, GRMN, GS, GT, GWW, H, HAL, HAS, HBAN, HBI, HCC, HCP, HD, HE, HES, HI, HIG, HK, HLF, HOG, HOLX, HON, HP, HPQ, HPT, HRB, HRC, HRL, HRS, HSC, HSIC, HST, HSY, HTZ, HUM, HUN, IBKR, IBM, ICE, IDXX, IEX, IFF, IGT, ILMN, INTC, INTU, IP, IPG, IPI, IR, IRM, ISCA, ISRG, IT, ITRI, ITT, ITW, IVZ, JBHT, JBL, JCI, JCP, JEC, JEF, JLL, JNJ, JNPR, JOE, JPM, JWN, K, KAR, KBH, KBR, KEX, KEY, KIM, KLAC, KMB, KMT, KMX, KO, KR, KSS, KSU, L, LAMR, LAZ, LBTYA, LEA, LECO, LEG, LEN, LH, LIFE, LII, LLL, LLY, LM, LMT, LNC, LNT, LOW, LPNT, LRCX, LSI, LSTR, LUV, LVS, M, MA, MAC, MAN, MAR, MAS, MAT, MBI, MCD, MCHP, MCK, MCO, MCY, MD, MDC, MDP, MDR, MDRX, MDT, MDU, MET, MGM, MHK, MKC, MKL, MLM, MMC, MMM, MO, MORN, MOS, MRK, MRO, MRVL, MS, MSFT, MSM, MTB, MTD, MTW, MU, MUR, MXIM, MYGN, MYL, NATI, NAV, NBL, NBR, NCR, NDAQ, NEE, NEM,
NFG, NFLX, NFX, NI, NIHD, NKE, NLY, NOC, NOV, NRG, NSC, NSM, NTAP, NTRS, NUAN, NUE, NVDA, NVR, NWL, NWSA, NYT, O, OC, ODP, OFC, OGE, OI, OII, OIS, OKE, OMC, ORA, ORCL, ORI, ORLY, OSK, OXY, PAYX, PBCT, PBI, PCAR, PCG, PDCO, PDM, PEG, PENN, PEP, PFE, PFG, PG, PGR, PH, PHM, PKG, PKI, PLD, PM, PNC, PNR, PNW, PPG, PPL, PRGO, PRU, PSA, PTEN, PVH, PWR, PX, PXD, QCOM, R, RBC, RCL, RDC, RE, REG, REGN, RF, RGA, RGLD, RHI, RHT, RJF, RL, RMBS, RMD, RNR, ROK, ROP, ROST, RPM, RRC, RRD, RS, RSG, RTN, RYN, S, SATS, SBAC, SBUX, SCCO, SCG, SCHN, SCHW, SCI, SD, SE, SEE, SEIC, SHLD, SHW, SIG, SIRI, SJM, SLAB, SLB, SLG, SLM, SM, SMG, SNA, SNH, SNPS, SNV, SO, SON, SPG, SPN, SPR, SRCL, SRE, STI, STLD, STRA, STT, STX, STZ, SUN, SVU, SWK, SWKS, SWN, SYK, SYMC, SYY, T, TAP, TCO, TDC, TDG, TDS, TDW, TECD, TECH, TER, TEX, TFSL, TFX, TGT, THC, THG, THO, TIF, TJX, TK,