A fast Metropolis-Hastings method for generating random correlation matrices
Irene Córdoba, Gherardo Varando, Concha Bielza, Pedro Larrañaga
Department of Artificial Intelligence, Universidad Politécnica de Madrid
Department of Mathematical Sciences, University of Copenhagen
Abstract.
We propose a novel Metropolis-Hastings algorithm to sample uniformly from the space of correlation matrices. Existing methods in the literature are based on elaborate representations of a correlation matrix, or on complex parametrizations of it. By contrast, our method is intuitive and simple, based on the classical Cholesky factorization of a positive definite matrix and Markov chain Monte Carlo theory. We perform a detailed convergence analysis of the resulting Markov chain and show that it benefits from fast convergence, both theoretically and empirically. Furthermore, in numerical experiments our algorithm is shown to be significantly faster than the current alternative approaches, thanks to its simple yet principled design.
Keywords:
Correlation matrices · Random sampling · Metropolis-Hastings
1 Introduction

Correlation matrices are a fundamental tool in statistics and for the analysis of multivariate data. In many application domains, such as signal processing or regression analysis, there is a natural need for tools that generate synthetic, benchmark correlation matrices [3,4,8]. Existing algorithms for this task usually randomly sample either the eigenvalues of the matrix or the elements of its Cholesky decomposition, instead of directly sampling the matrix from its uniform distribution.

Uniform sampling of correlation matrices has received little attention until recently [6,9]. By contrast with the classical methods [4,8], uniform sampling does not assume any a priori information and allows one to obtain an unbiased random correlation matrix. In this paper, we propose a new Metropolis-Hastings algorithm for this task, which is significantly faster than the existing algorithms in the literature [6,9]. We perform a detailed analysis of the convergence properties of the Markov chain that we construct.

Our approach is similar to that of Pourahmadi and Wang [9] in the sense that we rely on the Cholesky factorization of a positive definite matrix; however, they reparametrize the triangular factor with spherical coordinates, resulting in an additional layer of complexity. Lewandowski et al. [6] present two methods based on alternative representations of the correlation matrix, vines and elliptical distributions, which are arguably less direct than the classical Cholesky factorization.

The rest of the paper is organized as follows. In Section 2 we briefly overview the Cholesky factorization of correlation matrices, and other technical results needed for our method. Section 3 contains the details of our Metropolis-Hastings algorithm, whose convergence properties are analyzed in Section 4, both from a theoretical and an experimental point of view. In Section 5 we empirically compare the computational performance of our method with the alternatives in the literature. Finally, we conclude the paper in Section 6.
2 The Cholesky factorization of correlation matrices

Let $R$ be a $p \times p$ correlation matrix, that is, a symmetric positive definite (SPD) matrix with ones on the diagonal. Since $R$ is SPD, it has a unique upper Cholesky factorization $R = UU^t$, with $U$ an upper triangular matrix with positive diagonal entries. Let $\mathcal{U}$ denote the set of upper triangular $p \times p$ matrices with positive diagonal entries. We will define the set of SPD correlation matrices as

$$\mathcal{R} = \{ R = UU^t \ \text{s.t.}\ \operatorname{diag}(R) = \mathbf{1},\ U \in \mathcal{U} \}. \quad (1)$$

The set $\mathcal{R}$ of SPD correlation matrices is known to form a convex body called the elliptope [5], whose volume has been explicitly computed by Lewandowski et al. [6]. Observe that the constraint $\operatorname{diag}(R) = \mathbf{1}$ in Equation (1) simply translates to the rows of $U$ being normalized vectors. Denoting the subset of $\mathcal{U}$ with such normalized rows as $\mathcal{U}_1$, $\mathcal{R}$ can be written more compactly as $\mathcal{R} = \{ R = UU^t \ \text{s.t.}\ U \in \mathcal{U}_1 \}$.

Consider now $\Phi(U) = UU^t$ as a parametrization of $\mathcal{U}_1$ into SPD matrices. In order to sample uniformly from $\mathcal{R}$, which is the image of $\Phi$, we need to compute the Jacobian matrix $J\Phi(U)$ [1]. Then, when sampling in $\mathcal{U}_1$ from a density proportional to the Jacobian $\det(J\Phi(U))$, the distribution induced on $\mathcal{R}$ by $\Phi$ is the uniform measure. In our case, the Jacobian is [2]

$$\det(J\Phi(U)) = 2^p \prod_{i=1}^{p-1} u_{ii}^{i}, \quad (2)$$

where $u_{ii}$ is the $i$-th diagonal element of $U \in \mathcal{U}_1$, and $u_{pp}$ has been omitted because it is equal to $1$.

3 The Metropolis-Hastings algorithm

We will use a Metropolis-Hastings method for sampling from a density proportional to the Jacobian in Equation (2). Observe that the $i$-th row of $U$, denoted as $u_i$ in the remainder, can be sampled independently from all the other rows $\{u_j\}_{j \neq i}$, from a density $f(u_i) \propto u_{ii}^{i}$. Furthermore, $u_i$ is a unitary vector and has its first $i - 1$ components equal to zero; therefore it can be identified with a point on the $(p-i)$-dimensional hemisphere

$$S^{p-i}_+ = \{ v \in \mathbb{R}^{p-i+1} \ \text{s.t.}\ v v^t = 1 \ \text{and}\ v_1 > 0 \},$$

where the positivity constraint ensures that $u_{ii} > 0$. This independent row-wise sampling procedure is described in Algorithm 1.
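As a quick illustration of this representation, consider the following minimal R sketch (ours, for illustration only; it draws each row uniformly on its hemisphere rather than from the target density, so the resulting matrices are valid correlation matrices but not uniformly distributed on $\mathcal{R}$):

```r
# Build an upper triangular U with unit-norm rows and positive diagonal,
# then check that Phi(U) = U U^t is a correlation matrix.
set.seed(1)
p <- 5
U <- matrix(0, p, p)
for (i in 1:p) {
  v <- rnorm(p - i + 1)
  v[1] <- abs(v[1])                # positive diagonal entry u_ii
  U[i, i:p] <- v / sqrt(sum(v^2))  # unit-norm row
}
R <- tcrossprod(U)                 # R = U U^t
diag(R)                            # all ones
min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)  # positive, hence SPD
```

Algorithm 1 below replaces the uniform row draws with draws from $f(u_i) \propto u_{ii}^{i}$, which is exactly what makes the induced distribution on $\mathcal{R}$ uniform.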
Algorithm 1
Uniform sampling in $\mathcal{R}$
Input: Sample size $N$
Output: Uniform sample from $\mathcal{R}$ of size $N$
1: for $n = 1, \dots, N$ do
2:   $U_n \leftarrow I_p$
3:   for $i = 1, \dots, p$ do
4:     $u_{n,i} \leftarrow$ sample from $f(u_i) \propto u_{ii}^{i}$ on $S^{p-i}_+$
5:   end for
6: end for
7: return $\{\Phi(U_1), \dots, \Phi(U_N)\}$

Since each row of $U$ can be sampled independently, in the remainder of this section we will concentrate on how to perform step 4 of Algorithm 1. In order to lighten the notation, we restate our problem as sampling vectors $v$ from the hemisphere $S^{p-i}_+$ with respect to the density $f(v) \propto v_1^{i}$, where $p$ is fixed and $1 \leq i < p$.

In the Metropolis-Hastings algorithm, we need to generate a proposed vector $\tilde{v}$ from the current vector $v$ already sampled from $S^{p-i}_+$. For this, we propose the new state as a normalized perturbation of the current vector, specifically,

$$\tilde{v} = \frac{v + \epsilon}{\lVert v + \epsilon \rVert}, \quad (3)$$

where $\epsilon$ is a Gaussian random vector of dimension $p - i + 1$ with zero mean and component-independent variance $\sigma_\epsilon^2$.

With the transformation of Equation (3), the induced proposal distribution $q(\tilde{v} \mid v)$ is a projected Gaussian over $S^{p-i}_+$ [7], with parameters $v$ and $\sigma_\epsilon^2 I_{p-i+1}$. The expression for the density of this angular distribution is given in the general case by [10]. In our setting, we obtain the simplified expression

$$q(\tilde{v} \mid v) = \frac{\exp\left( \left( (v^t \tilde{v})^2 - 1 \right) / (2\sigma_\epsilon^2) \right)}{(2\pi\sigma_\epsilon^2)^{(p-i+1)/2}} \int_0^\infty s^{p-i} \exp\left( -\frac{1}{2} \left( \frac{s - v^t \tilde{v}}{\sigma_\epsilon} \right)^2 \right) \mathrm{d}s. \quad (4)$$

The density of the proposal $q(\tilde{v} \mid v)$ in Equation (4) is a function of the scalar product $v^t \tilde{v}$; therefore it is symmetric, because the roles of $v$ and $\tilde{v}$ can be exchanged, and we can omit the Hastings correction from the sampling scheme. Thus the acceptance probability at each step of the algorithm becomes

$$\min\left(1, \frac{f(\tilde{v})}{f(v)}\right) = \min\left(1,\ I_{\geq 0}(\tilde{v}_1) \left( \frac{\tilde{v}_1}{v_1} \right)^{i} \right),$$

where $\tilde{v}_1$ is the first component of the proposed vector $\tilde{v}$ and $I_{\geq 0}$ denotes the indicator function of the nonnegative real numbers. The described Metropolis sampling is illustrated in Algorithm 2.

Algorithm 2
Metropolis sampling of vectors $v$ in $S^{p-i}_+$ from $f(v) \propto v_1^{i}$
Input: Sample size $N$, variance $\sigma_\epsilon^2$ and burn-in time $t_b$
Output: Sample of size $N$ from $S^{p-i}_+$
1: $v_0 \leftarrow$ random standard multivariate Gaussian observation of dimension $p - i + 1$
2: $v_{0,1} \leftarrow |v_{0,1}|$
3: $v_0 \leftarrow$ normalize $v_0$
4: for $t = 0, \dots, t_b + N$ do
5:   for $j = 1, \dots, p - i + 1$ do
6:     $\epsilon_j \leftarrow$ random Gaussian observation with zero mean and variance $\sigma_\epsilon^2$
7:   end for
8:   $\tilde{v} \leftarrow v_t + \epsilon$
9:   $\tilde{v} \leftarrow$ normalize $\tilde{v}$
10:  $\delta \leftarrow$ random uniform observation on $[0, 1]$
11:  if $\tilde{v}_1 \geq 0$ and $\delta \leq (\tilde{v}_1 / v_{t,1})^{i}$ then
12:    $v_{t+1} \leftarrow \tilde{v}$
13:  else
14:    $v_{t+1} \leftarrow v_t$
15:  end if
16: end for
17: return $\{ v_{t_b+1}, v_{t_b+2}, \dots, v_{t_b+N} \}$

4 Convergence analysis

In this section we will analyze, both theoretically and empirically, the convergence properties of the proposed Algorithm 2, following [12].
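To make the procedure concrete, the following minimal R sketch combines Algorithms 1 and 2 (our illustrative transcription, not the authors' released implementation; the names sample_row and rcorr_mh are ours, and the defaults for $\sigma_\epsilon$ and $t_b$ are arbitrary). Since a single matrix requires one draw per row, the chain is run for the burn-in period only and its final state is kept:

```r
# Algorithm 2 (single draw): Metropolis chain on the hemisphere S^{p-i}_+
# targeting f(v) proportional to v_1^i; returns the state after t_b steps.
sample_row <- function(i, p, sigma_eps = 0.1, t_b = 1000) {
  d <- p - i + 1
  v <- rnorm(d)
  v[1] <- abs(v[1])                      # start on the positive hemisphere
  v <- v / sqrt(sum(v^2))
  for (t in seq_len(t_b)) {
    v_new <- v + rnorm(d, sd = sigma_eps)
    v_new <- v_new / sqrt(sum(v_new^2))  # normalized perturbation, Equation (3)
    # accept with probability min(1, (v_new[1] / v[1])^i), rejecting v_new[1] < 0
    if (v_new[1] >= 0 && runif(1) <= (v_new[1] / v[1])^i) v <- v_new
  }
  v
}

# Algorithm 1: fill the rows of U and return Phi(U) = U U^t.
rcorr_mh <- function(p, ...) {
  U <- diag(p)                           # row p is already (0, ..., 0, 1)
  for (i in seq_len(p - 1)) U[i, i:p] <- sample_row(i, p, ...)
  tcrossprod(U)
}

R <- rcorr_mh(10)
all.equal(diag(R), rep(1, 10))           # unit diagonal, as required
```

Drawing $N$ matrices amounts to repeating the call; alternatively, as in Algorithm 2, a single chain per row can be kept running after burn-in to produce all $N$ draws.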
4.1 Theoretical analysis

A minimal requirement for a Metropolis chain with proposal $q$ to have the target density $f$ as its stationary distribution is the following relationship between the supports:

$$\operatorname{supp}(f) \subseteq \bigcup_{v \in \operatorname{supp}(f)} \operatorname{supp}(q(\cdot \mid v)).$$

Since in our case the support of $q(\cdot \mid v)$ is the $(p-i)$-dimensional unit sphere for all $v \in S^{p-i}_+$, this condition is automatically satisfied.

Given the above minimal requirement, if the chain additionally is $f$-irreducible and aperiodic, then it converges to its stationary distribution (Theorem 7.4 in [12]). The first condition holds in our case because the proposal density is strictly positive for all $v, \tilde{v} \in S^{p-i}_+$. A sufficient condition for aperiodicity is that the probability of remaining in the same state at the next step is strictly positive, that is, $P(f(v) \geq f(\tilde{v})) > 0$. In our case, we have

$$P(f(v) \geq f(\tilde{v})) = P\left( v_1^{i} \geq I_{\geq 0}(\tilde{v}_1)\, \tilde{v}_1^{i} \right) = P(v_1 \geq \tilde{v}_1,\ \tilde{v}_1 \geq 0) + P(\tilde{v}_1 \leq 0).$$

Expanding the second summand and using the fact that $v_1 \geq 0$, we obtain

$$P(\tilde{v}_1 \leq 0) = P(\epsilon_1 \leq -v_1) = \frac{1}{2} - P(-v_1 \leq \epsilon_1 \leq 0) = \frac{1}{2} - \int_0^{v_1} \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} e^{-s^2/(2\sigma_\epsilon^2)}\, \mathrm{d}s. \quad (5)$$

Therefore $P(f(v) \geq f(\tilde{v}))$ is strictly positive, the chain is aperiodic, and our proposed algorithm converges to $f$.
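Equation (5) is easy to check numerically; in the following illustrative snippet (the values of $v_1$ and $\sigma_\epsilon$ are arbitrary), note that the normalization in Equation (3) does not change the sign of the first component:

```r
# Monte Carlo check of Equation (5): P(v_tilde_1 <= 0) for a fixed v_1 > 0.
set.seed(1)
v1 <- 0.4; sigma_eps <- 0.3
eps1 <- rnorm(1e6, sd = sigma_eps)
mean(v1 + eps1 <= 0)                     # simulated probability
1/2 - (pnorm(v1, sd = sigma_eps) - 1/2)  # closed form from Equation (5)
```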
Some additional insight can be gained on the convergence of the algorithm when the variance $\sigma_\epsilon^2$ increases. From Equation (4), we observe that in this scenario the proposal density approaches a constant, yielding an independent Metropolis algorithm [12]. The expression for the limiting proposal, which coincides with the inverse of the sphere volume, is

$$\lim_{\sigma_\epsilon \to \infty} q(\tilde{v} \mid v) = \frac{\Gamma((p-i+1)/2)}{2\,\pi^{(p-i+1)/2}}. \quad (6)$$

In this scenario, denoting as $C_f$ and $C_q$ the integration constants for $f$ and $q$ respectively, and taking $M \geq (C_q C_f)^{-1}$, we have $f(v) \leq M q(v)$ for all $v \in S^{p-i}_+$. Therefore the chain is uniformly ergodic, and $2(1 - M^{-1})^n$ is an upper bound for the total variation norm between the transition kernel after $n$ iterations and the target distribution $f$ (Theorem 7.8 in [12]). Furthermore, $M^{-1}$ is also a lower bound for the expected acceptance probability.

4.2 Empirical analysis

The above theoretical analysis assures the convergence of the proposed Algorithm 2. Such convergence can also be empirically monitored in order to gain insight on how to tune its hyper-parameters: the burn-in time $t_b$ and the perturbation variance $\sigma_\epsilon^2$. This will be the focus of this subsection. For this task, the most challenging matrices are arguably the high-dimensional ones; therefore we will focus on the case where $p = 1000$.

There is no standard assessment scheme to follow that will guarantee an expected behaviour for the Metropolis chain [12]. However, we can study the behaviour of some characteristic quantities. We have chosen to focus on the acceptance ratio, that is, the percentage of times that the proposed value has been accepted, which can be thought of as an approximation of $P(f(v) \leq f(\tilde{v}))$. Whether a high acceptance ratio is desirable or not depends on the particular chain designed. For our case, this quantity is depicted in Figure 1 as a function of the row number and, complementarily, of the perturbation variance $\sigma_\epsilon^2$.
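The acceptance ratio is straightforward to estimate empirically; a possible probe, in the spirit of the sketch of Section 3 (the helper acc_ratio and the iteration counts are ours), is:

```r
# Fraction of accepted proposals for row i, dimension p and variance sigma_eps.
acc_ratio <- function(i, p, sigma_eps, n_iter = 5000) {
  d <- p - i + 1
  v <- rnorm(d); v[1] <- abs(v[1]); v <- v / sqrt(sum(v^2))
  accepted <- 0
  for (t in seq_len(n_iter)) {
    v_new <- v + rnorm(d, sd = sigma_eps)
    v_new <- v_new / sqrt(sum(v_new^2))
    if (v_new[1] >= 0 && runif(1) <= (v_new[1] / v[1])^i) {
      v <- v_new
      accepted <- accepted + 1
    }
  }
  accepted / n_iter
}

# Middle row of a 1000-dimensional matrix, for three perturbation variances.
sapply(c(0.001, 0.01, 0.1), function(s) acc_ratio(i = 500, p = 1000, sigma_eps = s))
```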
Fig. 1. Acceptance ratio as a function of the row number $i$ (left) and of the perturbation variance $\sigma_\epsilon^2$ (right).

We observe how, as $\sigma_\epsilon^2$ increases, the proposed value is rejected more often. This could already be expected by looking at Equation (5) above, where we see that the second term goes to zero as $\sigma_\epsilon$ increases, yielding $\lim_{\sigma_\epsilon \to \infty} P(f(v) \leq f(\tilde{v})) \leq 1/2$. Furthermore, as $\sigma_\epsilon$ increases the proposal distribution becomes more similar to the uniform density on the $(p-i)$-dimensional sphere (Equation (6)), which also hints at the higher rejection rate.

The row number $i$ also has a significant influence on the acceptance ratio. Recall that $1 \leq i \leq p - 1$, $v \in S^{p-i}_+$ and $f(v) \propto v_1^{i}$; therefore, as the row number $i$ increases, the target distribution $f$ approaches a delta function and the dimensionality of $v$ decreases. Therefore it is reasonable to assume that the larger $i$ is, the smaller $\sigma_\epsilon$ should be for achieving a high acceptance ratio, since this means that we are proposing new states that are, with high probability, very close to the current state.

The above conclusions are further illustrated in Figure 2, where we have plotted the contour lines of the acceptance ratio surface as a function of $\sigma_\epsilon$ and the row number $i$. Observe that small values of $\sigma_\epsilon$ always lead to high acceptance ratios; however, this might not always be desirable, since it can be a sign of slow convergence. By contrast, a low acceptance ratio can be expected when approaching a delta function in moderately high dimensions, as is the case for row numbers approximately between 250 and 750.

Fig. 2. Contour lines of the acceptance ratio surface as a function of the perturbation variance $\sigma_\epsilon$ and the row number $i$.
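A coarse version of this surface can be tabulated with the probe above (an illustrative grid, ours; finer grids simply trade computation for resolution):

```r
# Acceptance ratio over a (sigma_eps, row number) grid, as in Figure 2.
sigmas <- c(0.001, 0.005, 0.01, 0.05, 0.1)
rows   <- c(1, 250, 500, 750, 999)
surface <- outer(sigmas, rows,
                 Vectorize(function(s, i) acc_ratio(i, p = 1000, sigma_eps = s,
                                                    n_iter = 1000)))
dimnames(surface) <- list(paste0("sigma=", sigmas), paste0("i=", rows))
round(surface, 2)
```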
5 Performance comparison

In this section we will compare our method, in terms of computational performance, with the existing approaches in the literature for the same task: uniform sampling of correlation matrices. We will generate 5000 correlation matrices of dimension $p = 10, \dots, 100$ using our algorithm, the vine and onion methods of Lewandowski et al. [6], and the polar parametrization of Pourahmadi and Wang [9].

Our algorithm has been implemented in R [11]. The vine and onion methods are available in the function genPositiveDefMat from the R package clusterGeneration (https://CRAN.R-project.org/package=clusterGeneration), provided by the authors. Since we have not found an implementation of the polar parametrization method of Pourahmadi and Wang [9], we have developed our own function, also in R, mimicking the method described therein. For our method, based on the analysis of the previous section, we have fixed $\sigma_\epsilon = 0.$ and $t_b = 1000$, which have empirically provided good convergence results. The experiment has been executed on a machine equipped with an Intel Core i7-5820K, 3.30 GHz × 12, and 16 GB of RAM.
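For intuition, a minimal timing harness along these lines might look as follows (illustrative only: rcorr_mh is the hypothetical sketch from Section 3, and we assume the genPositiveDefMat interface of clusterGeneration; the scripts actually used in the paper are linked below):

```r
# Wall-clock time to draw n matrices of dimension p with each method.
library(clusterGeneration)   # genPositiveDefMat, covMethod "onion" / "c-vine"

p <- 50; n <- 100
t_chol  <- system.time(replicate(n, rcorr_mh(p)))["elapsed"]
t_onion <- system.time(replicate(n,
  genPositiveDefMat(dim = p, covMethod = "onion")$Sigma))["elapsed"]
t_cvine <- system.time(replicate(n,
  genPositiveDefMat(dim = p, covMethod = "c-vine")$Sigma))["elapsed"]
c(chol = t_chol, onion = t_onion, cvine = t_cvine)
```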
The results of the experiment are shown in Figure 3. We observe that our method is faster than all of the existing approaches in the literature. The polar parametrization method has the worst performance, several orders of magnitude slower than the other algorithms. This can be explained by its use of inverse transformation sampling for simulating the angles. By contrast, our method achieves highly competitive results by taking advantage of the direct representation provided by the Cholesky factorization, as well as of the simple form of the target distribution and of the proposed values at each iteration. The scripts used for generating the data and figures described throughout the paper, as well as the implementation of the algorithms described, are publicly available (https://github.com/irenecrsn/rcor), so all the above experiments can be replicated.

Fig. 3. Execution time of the available methods for uniform sampling of correlation matrices, in linear (left) and logarithmic (right) scale. chol: our proposal; c-vine, onion: methods by [6]; polar: method by [9].

6 Conclusions

In this paper we have proposed a Metropolis-Hastings method for uniform sampling of correlation matrices. We have studied its properties, both theoretically and empirically, and shown fast convergence to the target uniform distribution. We have also executed a comparative performance study, where our approach has yielded faster results than all of the related approaches in the literature.

In the future, we would like to further explore variants of our Markov chain algorithm, such as the independent Metropolis or adaptive schemes. We would also like to expand on the theoretical convergence analysis of such variants, as well as to extend the empirical convergence monitoring to other relevant quantities apart from the acceptance ratio.
Acknowledgements. This work has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness through the Cajal Blue Brain project (C080020-09; the Spanish partner of the EPFL Blue Brain initiative) and the TIN2016-79684-P project; by the Regional Government of Madrid through the S2013/ICE-2845-CASI-CAM-CM project; and by Fundación BBVA grants to Scientific Research Teams in Big Data 2016. I. Córdoba has been supported by the predoctoral grant FPU15/03797 from the Spanish Ministry of Education, Culture and Sports. G. Varando has been partially supported by research grant 13358 from VILLUM FONDEN.
References
1. Diaconis, P., Holmes, S., Shahshahani, M.: Sampling from a manifold. IMS Collections, vol. 10, pp. 102–125. Institute of Mathematical Statistics (2013)
2. Eaton, M.L.: Multivariate Statistics: A Vector Space Approach. Wiley (1983)
3. Fallat, S., Lauritzen, S., Sadeghi, K., Uhler, C., Wermuth, N., Zwiernik, P.: Total positivity in Markov structures. The Annals of Statistics 45(3), 1152–1184 (2017)
4. Holmes, R.: On random correlation matrices. SIAM Journal on Matrix Analysis and Applications 12(2), 239–272 (1991)
5. Laurent, M., Poljak, S.: On the facial structure of the set of correlation matrices. SIAM Journal on Matrix Analysis and Applications 17(3), 530–547 (1996)
6. Lewandowski, D., Kurowicka, D., Joe, H.: Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis 100(9), 1989–2001 (2009)
7. Mardia, K., Jupp, P.: Directional Statistics. Wiley (1999)
8. Marsaglia, G., Olkin, I.: Generating correlation matrices. SIAM Journal on Scientific and Statistical Computing 5(2), 470–475 (1984)
9. Pourahmadi, M., Wang, X.: Distribution of random correlation matrices: Hyperspherical parameterization of the Cholesky factor. Statistics & Probability Letters 106, 5–12 (2015)
10. Pukkila, T.M., Rao, C.R.: Pattern recognition based on scale invariant discriminant functions. Information Sciences 45(3), 379–389 (1988)
11. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
12. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer (2004)