Householder Dice: A Matrix-Free Algorithm for Simulating Dynamics on Gaussian and Random Orthogonal Ensembles
Yue M. Lu∗

Abstract
This paper proposes a new algorithm, named Householder Dice (HD), for simulating dynamics on dense random matrix ensembles with translation-invariant properties. Examples include the Gaussian ensemble, the Haar-distributed random orthogonal ensemble, and their complex-valued counterparts. A "direct" approach to the simulation, where one first generates a dense n × n matrix from the ensemble, requires at least O(n²) resource in space and time. The HD algorithm overcomes this O(n²) bottleneck by using the principle of deferred decisions: rather than fixing the entire random matrix in advance, it lets the randomness unfold with the dynamics. At the heart of this matrix-free algorithm is an adaptive and recursive construction of (random) Householder reflectors. These orthogonal transformations exploit the group symmetry of the matrix ensembles, while simultaneously maintaining the statistical correlations induced by the dynamics. The memory and computation costs of the HD algorithm are O(nT) and O(nT²), respectively, with T being the number of iterations. When T ≪ n, which is nearly always the case in practice, the new algorithm leads to significant reductions in runtime and memory footprint. Numerical results demonstrate the promise of the HD algorithm as a new computational tool in the study of high-dimensional random systems.

1 Introduction

To do research involving large random systems, one must make a habit of experimenting on the computer. Indeed, computer simulations help verify theoretical results and provide new insights, not to mention that they can also be incredibly fun. For many problems in statistical learning, random matrix theory, and statistical physics, the simulations that one encounters are often given as an iterative process in the form of

x_{t+1} = f_t(M_t x_t, x_t, x_{t−1}, …, x_{t−d}), for 1 ≤ t ≤ T.  (1)

Here, M_t is either Q or Q^T, where Q is a random matrix; f_t(·) denotes some general vector-valued function that maps M_t x_t and a few previous iteration vectors {x_{t−i}}_{0 ≤ i ≤ d} to the next one x_{t+1}; and T is the total number of iterations.

With suitable definitions of the mappings f_t(·), the formulation in (1) includes many well-known algorithms as its special cases. A classical example is to use iterative methods [1] to compute the extremal eigenvalues/eigenvectors of a (spiked) random matrix [2, 3]. Other examples include approximate message passing on dense random graphs [4–8], and gradient descent algorithms for solving learning and estimation problems with random design [9, 10]. In this paper, we show that all of these algorithms can be simulated by an efficient matrix-free scheme, if the random matrix Q is drawn from an ensemble with translation-invariant properties. Examples of such ensembles include the i.i.d. Gaussian (i.e., the rectangular Ginibre) ensemble, the Haar-distributed random orthogonal ensemble, the Gaussian orthogonal ensemble, and their complex-valued counterparts.

What is wrong with the standard way of simulating (1), where we first draw a sample Q from the matrix ensemble and then carry through the iterations?

∗ Y. M. Lu is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA (e-mail: [email protected]). The initial part of this work was done during his sabbatical at the École normale supérieure (ENS) in Paris, France in Fall 2019.
He thanks colleagues at the ENS for their hospitality and stimulating discussions. This work was supported by the Harvard FAS Dean's Fund for Promising Scholarship, by the chaire CFM-ENS "Science des données", and by the US National Science Foundation under grants CCF-1718698 and CCF-1910410.

This direct approach is straightforward to implement, but it cannot handle large dimensions. To see this, suppose that Q ∈ R^{m×n} with m ≍ n. We shall also assume that the computational cost of the nonlinear mapping f_t(·) is O(n). It follows that, at each iteration of (1), most of the computation is spent on the matrix-vector multiplication M_t x_t, at a cost of O(n²) work. It is not at all obvious that one can do much better: merely generating an n × n Gaussian matrix already requires O(n²) resource in computation and storage. When n is large, n² is huge. In practice, this O(n²) bottleneck means that one cannot simulate (1) at a dimension much larger than n = 10⁴ on a standard computer (in a reasonable amount of time). However, there are many occasions, especially in the study of high-dimensional random systems, where one does wish to simulate large random matrices. A common workaround is to choose a moderate dimension (e.g., n = 1000), repeat the simulation over many independent trials, and then average the results to reduce statistical fluctuations. In addition to having to spend extra time on the repeated trials, this strategy can still suffer from strong finite size effects, making it a poor approximation of the true high-dimensional behavior of the underlying random systems. (An example is given in Section 2.2 to illustrate this issue.)

In this paper, we propose a new algorithm, named Householder Dice (HD), for simulating the dynamics in (1) on the Gaussian, Haar, and other related random matrix ensembles. Our new approach is statistically equivalent to the direct approach discussed above, but the memory and computation costs of the HD algorithm are O(nT) and O(nT²), respectively, where T is the number of iterations. In many problems, T is much smaller than n; typically, T = O(polylog(n)). In such cases, the new algorithm leads to significant reductions in runtime and memory footprint. In the numerical examples presented in Section 2, we show that the crossover value of n at which the HD algorithm outperforms the direct approach can be as low as 500, and the speedup becomes orders of magnitude greater as n grows. Moreover, the HD algorithm expands the limits of what can be done on standard computers by making it tractable to perform dense random matrix experiments in dimensions as large as n = 10⁷.

The basic idea of the HD algorithm follows the so-called principle of deferred decisions [11]. Intuitively, each iteration of (1) only probes Q in a one-dimensional space spanned by x_t. Thus, if the total number of iterations T ≪ n, we only need to expose the randomness of Q over a few low-dimensional subspaces. It is then clearly wasteful to fix and store in memory the full matrix in advance. The situation is analogous to that of simulating a simple random walk for T steps: we can let the random choices gradually unfold with the progress of the walk, fixing only the randomness that must be revealed at any given step. The challenge in our problem, though, is that the dynamics in (1) can create a complicated dependence structure between the random matrix Q and the iteration vectors x_t, x_{t−1}, …, x_1.
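To fix ideas, here is a minimal Julia sketch of the direct approach for a square Ginibre matrix; the function name `direct_sim` and the choice d = 0 are ours, for illustration only. The dense draw of Q is exactly the O(n²) bottleneck discussed above.

```julia
using LinearAlgebra

# Direct simulation of the dynamics (1) with d = 0 and a square Q:
# x_{t+1} = f_t(M_t x_t, x_t), with M_t ∈ {Q, Qᵀ} chosen per step.
function direct_sim(f, x1::Vector{Float64}, T::Int; use_transpose = falses(T))
    n = length(x1)
    Q = randn(n, n)                 # O(n²) memory and time, up front
    x = copy(x1)
    for t in 1:T
        y = use_transpose[t] ? Q' * x : Q * x   # O(n²) work per step
        x = f(y, x, t)
    end
    return x
end

# Example: power-iteration-like dynamics f_t(y, x, t) = y/‖y‖.
xT = direct_sim((y, x, t) -> y / norm(y), randn(1000), 50)
```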
Nevertheless, we show that this dependence structure can be exactly accounted for by an adaptive and recursive construction of (random) Householder reflectors [12, 13], which exploit the inherent group symmetry of the matrix ensembles.

Using Householder reflectors to speed up random matrix experiments is not a new idea. It is well known [14, 15] that a Haar-distributed random orthogonal matrix can be factorized as a product of Householder reflectors. This leads to an efficient way of generating a random orthogonal matrix with O(n²) operations (rather than the O(n³) cost associated with a full QR decomposition on a Gaussian matrix). Householder reflectors have also been applied to reduce a Gaussian matrix to a particularly simple random bidiagonal form [16, 17]. This clever factorization leads to an O(n²) algorithm for simulating the spectral densities of Gaussian and Wishart matrices. (Recall that a standard eigenvalue decomposition on a dense matrix requires O(n³) work in practice.) The proposed HD algorithm differs from the previous work in that it is a truly matrix-free construction. With the progress of the dynamics, it gradually builds a recursive set of (random) Householder reflectors based on the current iteration vector x_t and the history of the iterations up to this point. This adaptive, "on-the-fly" construction is essential for us to capture the correlation structures generated by the dynamics without fixing the matrix in advance.

The rest of the paper is organized as follows. We first present in Section 2 a few motivating examples to showcase the applications of the HD algorithm. Section 3 contains the main technical results of this paper. After a brief review of the basic properties of the Haar measure (on classical matrix groups) and Householder reflectors, we present the construction of the proposed algorithm for the Gaussian and random orthogonal ensembles. Theorems 1 and 2 establish the statistical equivalence of the HD algorithm and the direct approach to simulating (1). Generalizations to complex-valued and other related ensembles are discussed in Section 3.4. We conclude the paper in Section 4.

2 Numerical Examples
Before delving into technical details, it is helpful to go through a few motivating applications that show how the HD algorithm can significantly speed up the simulation tasks.

2.1 Lasso and the ISTA dynamics

In the first example, we consider the simulation of the lasso estimator widely used in statistics and machine learning. The goal is to estimate a sparse vector β∗ ∈ R^n from its noisy linear observation given by

y = Qβ∗ + w,

where Q ∈ R^{m×n} is a design (or covariate) matrix, and w ∼ N(0, σ_w² I) denotes the noise in y. The lasso estimator is formulated as an optimization problem

β̂ = arg min_β ½‖y − Qβ‖² + λ‖β‖₁,  (2)

where β̂ is an estimate of β∗ and λ > 0 is a regularization parameter. A standard way to solve (2) is the iterative shrinkage-thresholding algorithm (ISTA) [19], which iterates

x_{t+1} = η_{λτ}[x_t + τ Q^T(y − Qx_t)], 0 ≤ t < T,  (3)

where τ > 0 is a step size and η_{λτ}(x) = sign(x) max{|x| − λτ, 0} is an element-wise soft-thresholding operator. In many theoretical studies of the lasso, one assumes that the design matrix is random with i.i.d. normal entries, i.e., Q_ij ∼ N(0, 1/m) independently. In this case, ISTA is an iterative process on a Gaussian matrix Q and its transpose. With some change of variables, we can rewrite (3) as a special case of the general dynamics given in (1), with one iteration of (3) mapped to two iterations of (1).

We simulate the ISTA dynamics using both the proposed HD algorithm and the direct simulation approach that fixes the Gaussian matrix Q in advance. In our experiments, the target sparse vector β∗ has i.i.d. entries drawn from the Bernoulli-Gaussian prior

β∗_i ∼ ρ δ₀(β) + (1 − ρ) (1/√(2πσ_s²)) exp{−β²/(2σ_s²)},

where 0 < ρ < 1 controls the sparsity level and σ_s > 0 sets the scale of the nonzero entries. The design matrix Q is of size m × n, with the aspect ratio m/n held fixed as n varies.

Figure 1(a) shows the mean-squared error (MSE) e(t) := (1/n)‖x_t − β∗‖² at each iteration of the dynamics, obtained by averaging over 10 independent trials. The dimension here is n = 1000. The results from the HD algorithm (the red circles in the figure) match those from the standard approach (the blue line). This is expected, since Householder Dice is designed to be statistically equivalent to the direct approach. However, the two simulation approaches behave very differently in runtime and memory footprint, as shown in Figure 1(b). When we increase the dimension n, the runtime of the standard approach exhibits a quadratic growth rate O(n²), whereas the runtime of the HD algorithm scales linearly with n. For comparison, we also plot in the figure the runtime for merely generating an i.i.d. Gaussian matrix Q of size m × n. For small dimensions (250 ≤ n < 500), the direct approach is faster, but the HD algorithm overtakes it from n ≈ 500 onward, and its advantage grows quickly with n. For the largest dimensions in our experiment, Householder Dice becomes the only feasible method, as implementing the direct approach would require more memory than available on the test computer (equipped with 32 GB of RAM). Finally, for n = 10⁷, the runtime of the HD algorithm is 92 seconds, whereas by extrapolation the direct approach would have taken about 7.7 × 10⁶ seconds (approximately 89 days).¹

¹ All of the numerical experiments presented in this section have been done in Julia [18]. The source code implementing the HD algorithm is available online at https://github.com/yuelusip/HouseholderDice.
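For concreteness, below is a minimal Julia sketch of the direct simulation of the ISTA iteration (3). This is not the HD implementation from the repository in the footnote, and all parameter values are placeholders chosen for illustration.

```julia
using LinearAlgebra

# Soft-thresholding operator η_θ(x) = sign(x) max{|x| - θ, 0}.
soft(x, θ) = sign.(x) .* max.(abs.(x) .- θ, 0)

# Direct simulation of ISTA (3): each sweep uses Q and Qᵀ once,
# i.e., one step of (3) corresponds to two steps of the dynamics (1).
function ista(Q, y, λ, τ, T)
    x = zeros(size(Q, 2))
    for _ in 1:T
        x = soft(x + τ * (Q' * (y - Q * x)), λ * τ)
    end
    return x
end

# Toy lasso instance with Q_ij ~ N(0, 1/m); the aspect ratio, sparsity,
# and algorithm parameters here are arbitrary placeholders.
n = 1000; m = n ÷ 2
Q = randn(m, n) / sqrt(m)
β = [rand() < 0.8 ? 0.0 : 2.0 * randn() for _ in 1:n]  # Bernoulli-Gaussian
yobs = Q * β + 0.1 * randn(m)
x̂ = ista(Q, yobs, 2.0, 0.1, 50)
```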
[Figure 1 here. Panel (a): MSE versus iteration t, comparing Direct Simulation and Householder Dice. Panel (b): runtime (sec) versus dimension n for Direct Simulation, Generating Gaussian Matrix, and Householder Dice, with O(n²) and O(n) scaling references and out-of-memory markers.]
Figure 1: Simulating the ISTA dynamics (3) using two approaches: the standard approach where the random matrix Q is generated in advance, and the proposed HD algorithm. (a) The time-varying MSE averaged over 10 independent trials, with the results from the two approaches matching. (b) Runtime versus the matrix dimension n, shown in log-log scale. In all the experiments, the parameters are set to T = 50, λ = 2, σ_s = 2, with τ, ρ, and σ_w each fixed at values below 1.

2.2 A spectral method for generalized linear measurements

In the second example, we consider a spectral method [20–23] with applications in signal estimation and exploratory data analysis. Let ξ be an unknown vector in R^n and {a_i}_{1≤i≤m} a set of sensing vectors. We seek to estimate ξ from a number of generalized linear measurements {y_i = f(a_i^T ξ)}_{1≤i≤m}, where f(·) is some function modeling the acquisition process. The spectral method works as follows. Let

D := (1/m) A diag{y_1, …, y_m} A^T,  (4)

where A = [a_1, a_2, …, a_m] is a matrix whose columns are the sensing vectors. Denote by x_1 a normalized eigenvector associated with the largest eigenvalue of D. This vector x_1 is then our estimate of ξ, up to a scaling factor. The performance of the spectral method is usually given in terms of the squared cosine similarity

ρ(ξ, x_1) = (ξ^T x_1)² / (‖ξ‖² ‖x_1‖²).

Asymptotic limits of ρ(ξ, x_1) have been derived for the cases where A is an i.i.d. Gaussian matrix [22, 23] or a subsampled random orthogonal matrix [24]. In our experiment, we consider the latter setting.
Assume m = ⌊αn⌋ for some α > 1. We can write

A = [I_n 0_{n×(m−n)}] Q,

where Q ∈ R^{m×m} is a random orthogonal matrix drawn from the Haar distribution.

We simulate the spectral method and compare its empirical performance with the asymptotic limit given in [24]. In our experiment, the measurement model is set to y_i = tanh(|a_i^T ξ|). We compute the leading eigenvector x_1 by using the Krylov–Schur algorithm [1], which involves the repeated multiplication of D with some vectors. With the forms of D and A given above, this algorithm can again be regarded as a special case of the general dynamics in (1). We use the HD algorithm for the simulation and show the results in Figure 2 for two different matrix dimensions: n = 10⁴ and n = 10⁶. Observe that, at n = 10⁴, there are still noticeable fluctuations between the actual performance of the spectral method (shown as green dots in the figure) and the theoretical prediction (the blue line). To get a better match, the standard practice is to do many independent trials (2000 in our experiment) and average the results. This gives us the green curve in the figure. Averaging can indeed reduce statistical fluctuations, but there are still strong finite size effects, especially near the phase transition point. This is a case where the capability of the proposed HD algorithm to handle large matrices becomes particularly attractive: when we increase the dimension to n = 10⁶, the empirical results match the theoretical curve very closely in any (typical) trial, with no need for averaging over repeated simulations. In terms of runtime, it takes the HD algorithm less than 4 seconds on average to obtain an extremal eigenvalue/eigenvector of D for n = 10⁶.

[Figure 2 here: squared cosine similarity versus α = m/n, with the phase-transition location marked; curves for the Asymptotic Limit, n = 10⁴ with 2000 trials, n = 10⁴ with 1 trial, and n = 10⁶ with 1 trial.]

Figure 2: Simulating the spectral method given in (4) and comparing the empirical results against the asymptotic predictions given in [24]. The result for n = 10⁴ shows strong statistical fluctuations. This can be reduced by averaging over multiple independent trials, but the average curve still suffers from strong finite size effects, especially near the phase transition point. At n = 10⁶, the match between the empirical results and the theoretical curve is nearly perfect in any (typical) trial.
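To see why the eigensolver fits the template (1), note that it touches D only through matrix-vector products, each of which uses A and A^T once. Below is a minimal Julia sketch of this product, with a dense Gaussian A and plain power iteration standing in for the subsampled Haar model and the Krylov–Schur solver used in the actual experiments:

```julia
using LinearAlgebra

# Matrix-vector product with D in (4): D v = (1/m) A (y .* (Aᵀ v)).
Dmul(A, y, v) = A * (y .* (A' * v)) / length(y)

function power_iter(A, y, iters)
    x = normalize(randn(size(A, 1)))
    for _ in 1:iters
        x = normalize(Dmul(A, y, x))    # one step of the dynamics (1)
    end
    return x
end

n, m = 500, 1500
ξ = normalize(randn(n))
A = randn(n, m)                          # stand-in for the A = [I 0]Q model
y = tanh.(abs.(A' * ξ))                  # measurements y_i = tanh(|a_iᵀξ|)
x1 = power_iter(A, y, 200)
ρ = dot(ξ, x1)^2 / (norm(ξ)^2 * norm(x1)^2)   # squared cosine similarity
```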
3 The Householder Dice Algorithm

Notation: In what follows, e_i denotes the i-th natural basis vector, and Z_i := I − e_i e_i^T. For i ≤ j, we use Z_{i:j} as a shorthand notation for ∏_{i≤k≤j} Z_k. The dimension of Z_i and Z_{i:j} is either m × m or n × n, which will be made clear from the context. For any v ∈ R^n, the "slicing" operation that takes a subset of v is denoted by v[i:j] := [v_i, v_{i+1}, …, v_j]^T, where 1 ≤ i ≤ j ≤ n. We use O(n) := {M ∈ R^{n×n} : M M^T = I_n} to denote the set of n × n orthogonal matrices, and U(n) := {M ∈ C^{n×n} : M M* = I_n} its complex-valued counterpart. We will be mainly focusing on two real-valued random matrix ensembles: Ginibre(m, n) represents the ensemble of m × n matrices with i.i.d. standard normal entries, and Haar(n) represents the ensemble of random orthogonal matrices drawn from the Haar measure on O(n). The generalizations to the complex-valued cases and other closely related ensembles will be discussed in Section 3.4.

3.1 Preliminaries: the Haar Measure and Householder Reflectors

The ensembles Ginibre(m, n) and Haar(n) share an important property: they are both invariant with respect to multiplications by orthogonal matrices. For example, for any G drawn from Ginibre(m, n), it is easy to verify that

G ∼ Ginibre(m, n) ⟹ U G V ∼ Ginibre(m, n),  (5)

where U ∈ O(m), V ∈ O(n) are any two deterministic or random orthogonal matrices independent of G.

Translation-invariant properties similar to (5) are actually what defines the Haar measure. We call a probability measure µ on O(n) a Haar measure if

µ(A) = µ(U ∘ A) = µ(A ∘ U)  (6)

for any measurable subset A ⊂ O(n) and any fixed U ∈ O(n). Here, U ∘ A := {U V : V ∈ A}, and A ∘ U is defined similarly. The classical Haar's theorem [25, 26] shows that there is one, and only one, translation-invariant probability measure in the sense of (6) on O(n). In fact, the theorem holds in much greater generality. For example, it remains true for any compact Lie group, which includes O(n) [and U(n)] as its special case.

An additional property of O(n), U(n) (and compact Lie groups in general) is that left-invariance [the first equality in (6)] implies right-invariance (the second equality), and vice versa. This then allows us to have a simplified characterization of the Haar measure on O(n). Specifically, to show that a random orthogonal matrix Q ∼ Haar(n), it is sufficient to verify that

Q ≐ U Q

for any fixed U ∈ O(n), where ≐ means that the two random variables have the same distribution. We will use this convenient characterization in Section 3.3, when we establish the statistical equivalence between the proposed HD algorithm and the direct simulation of (1).

Finally, we recall the construction of Householder reflectors [12, 13] from numerical linear algebra, as they will play important roles in our subsequent discussions. Given a vector v ∈ R^n, how can we build an orthogonal matrix H such that Hv = ‖v‖e_1? This is exactly the problem addressed by Householder reflectors, defined here as

H(v) := −sign(v_1) (I − 2uu^T / (u^T u)),  (7)

where u = v + sign(v_1)‖v‖e_1, and sign(v_1) = 1 if v_1 ≥ 0 and sign(v_1) = −1 otherwise.
It is easy to verify that H(v) is a symmetric matrix whose eigenvalues are equal to ±1. It follows that H(v) ∈ O(n). Moreover, we can verify from direct calculations that

H(v) e_1 = v/‖v‖ and H(v) v = ‖v‖ e_1.  (8)

Geometrically, H(v) represents a reflection across the exterior (or interior) angle bisector of v/‖v‖ and e_1. It is widely used in numerical linear algebra thanks to its low memory/computational costs: the matrix H(v) itself can be efficiently represented with O(n) space, and matrix-vector multiplications involving H(v) only require O(n) work.

For any p ∈ R^n and 1 ≤ k ≤ n, we define a generalized Householder reflector as

H_k(p) := blockdiag(I_{k−1}, H(p[k:n])),  (9)

where H(·) is the reflector defined in (7), and p[k:n] denotes the subvector obtained by removing the first k − 1 entries of p. The construction in (7) requires that the reflecting vector p[k:n] be nonzero. In order for (9) to be always well-defined, we set H_k(p) = I_n if p[k:n] = 0. Recall the notation Z_{i:j} introduced at the beginning of the section. It is easy to verify that

Z_{1:k} H_k(p) p = 0,  (10)

which means that the orthogonal transformation H_k(p) can turn the last n − k entries of p to zero. We will use this property in the construction of the HD algorithm.
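Neither H(v) in (7) nor H_k(p) in (9) needs to be formed as a matrix; each is determined by one reflection vector and can be applied in O(n) time. A minimal Julia sketch with our own function names, including checks of (8) and of the support property behind (10):

```julia
using LinearAlgebra

# Reflection data for H(v) in (7): H(v) = -s (I - 2uuᵀ/uᵀu),
# with u = v + s‖v‖e₁ and s = sign(v₁) (s = 1 when v₁ = 0).
function reflector(v::Vector{Float64})
    s = v[1] >= 0 ? 1.0 : -1.0
    u = copy(v); u[1] += s * norm(v)
    return u, s
end

# Apply H(v) to z in O(n) time, without forming the matrix.
apply_H(u, s, z) = -s .* (z .- (2 * dot(u, z) / dot(u, u)) .* u)

# Apply the generalized reflector H_k(p) of (9): act on coordinates k..n.
function apply_Hk(p::Vector{Float64}, k::Int, z::Vector{Float64})
    all(iszero, p[k:end]) && return copy(z)   # convention: H_k(p) = I
    u, s = reflector(p[k:end])
    w = copy(z)
    w[k:end] = apply_H(u, s, z[k:end])
    return w
end

v = randn(5)
u, s = reflector(v)
@assert apply_H(u, s, v) ≈ norm(v) .* [1.0, 0, 0, 0, 0]  # (8): H(v)v = ‖v‖e₁
p = randn(5)
hp = apply_Hk(p, 3, p)
@assert all(abs.(hp[4:end]) .< 1e-10)   # H_k(p)p is supported on coords 1..k
```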
3.2 The HD Algorithm for the Ginibre Ensemble

We start by considering the case where the random matrix Q in the dynamics (1) has i.i.d. Gaussian entries, i.e., Q ∼ Ginibre(m, n). In addition, we shall always assume that Q is independent of the initial condition {x_1, x_0, …, x_{1−d}}.

Suppose that the first step of (1) is in the form of x_2 = f_1(Qx_1, x_1, …, x_{1−d}), i.e., M_1 = Q. How do we simulate this step without generating the entire Gaussian matrix Q? This can be achieved by a simple observation:

Q ≐ g_1 e_1^T + G_1 Z_1 ≐ (g_1 e_1^T + G_1 Z_1) R_1 ∼ Ginibre(m, n),  (11)

where Z_1 = I − e_1 e_1^T, R_1 = H_1(x_1) is a (generalized) Householder reflector defined in (9), g_1 ∼ Ginibre(m, 1) is a Gaussian vector, and G_1 ∼ Ginibre(m, n) is an independent Gaussian matrix. Here and subsequently, whenever we generate new random vectors and matrices, they are always independent of each other and of the σ-algebra generated by all the other random variables constructed up to that point. For example, g_1 and G_1 in (11) are understood to be independent of the initial condition {x_1, x_0, …, x_{1−d}}. In (11), the first equality (in distribution) is obvious, and the second equality is due to the translation invariance of the Ginibre ensemble. [Recall (5) and the fact that R_1 is an orthogonal matrix.]

The new representation

Q^(1) = (g_1 e_1^T + G_1 Z_1) R_1  (12)

looks like a rather convoluted way of writing an i.i.d. Gaussian matrix, but it turns out to be the right choice for efficient simulations. To see this, we use the property of the Householder reflector [see (8)] which gives us R_1 x_1 = H_1(x_1) x_1 = ‖x_1‖ e_1 and thus Z_1 R_1 x_1 = 0. It follows that

Q^(1) x_1 = ‖x_1‖ g_1.

Thus, to simulate the first step of the dynamics, we only need to generate a Gaussian vector g_1. The more expensive Gaussian matrix G_1 does not need to be revealed (yet), as it is invisible to x_1.

It is helpful to consider two more iterations to see how this idea can be applied recursively. Suppose that the second iteration takes the form of x_3 = f_2(Qx_2, x_2, …, x_{2−d}). In general, x_2 will have a nonzero component in the space orthogonal to x_1, and thus the Gaussian matrix G_1 in (12) is no longer invisible to x_2, meaning that G_1 Z_1 R_1 x_2 ≠ 0. However, we can use the trick in (11) again by writing

G_1 ≐ (g_2 e_2^T + G_2 Z_2) R_2 ∼ Ginibre(m, n),  (13)

where g_2 ∼ Ginibre(m, 1), G_2 ∼ Ginibre(m, n), and R_2 = H_2(R_1 x_2) is again a generalized Householder reflector in (9). The subscript in H_2 should not be overlooked, as it signifies the precise way the matrix is constructed. [Recall (9) for the notation convention we use.]

Observe that R_2 commutes with Z_1. Substituting (13) into (12) then allows us to write

Q^(2) = u_1 v_1^T + u_2 v_2^T + G_2 Z_{1:2} R_2 R_1 ∼ Ginibre(m, n),  (14)

where u_1 = g_1, u_2 = g_2, v_1 = R_1 e_1, and v_2 = R_1 R_2 e_2. Just like what happens in (12), there is again no need to explicitly generate the dense Gaussian matrix G_2 in (14). To see this, we note that Z_{1:2} R_2 R_1 x_2 = Z_{1:2} H_2(R_1 x_2) R_1 x_2 = 0, where the second equality is due to (10). It follows that

Q^(2) x_2 = (v_1^T x_2) u_1 + (v_2^T x_2) u_2.

So far we have only been considering the case where we access Q from the right. For the third iteration, let us suppose that we access Q from the left, i.e., x_4 = f_3(Q^T x_3, x_3, …, x_{3−d}). The idea is similar. Let

G_2 ≐ L_1 (e_1 g_3^T + Z_1 G_3) ∼ Ginibre(m, n),  (15)

where L_1 = H_1(x_3), g_3 ∼ Ginibre(n, 1), and G_3 ∼ Ginibre(m, n). Substituting (15) into (14) gives us

Q^(3) = ∑_{1≤i≤3} u_i v_i^T + L_1 Z_1 G_3 Z_{1:2} R_2 R_1 ∼ Ginibre(m, n),

where u_3 = L_1 e_1 and v_3 = R_1 R_2 Z_{1:2} g_3. Moreover, [Q^(3)]^T x_3 = ∑_{i≤3} (u_i^T x_3) v_i.

The general idea should now be clear. Rather than fixing the entire Gaussian matrix in advance, we let the random choices gradually unfold as the iteration goes on, generating only the randomness that must be revealed at each step. Continuing this process for T steps, we reach the HD algorithm for the Ginibre ensemble, summarized in Algorithm 1.
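The distributional identity (11) is easy to test numerically. A quick Monte Carlo sketch (ours, for illustration): for a fixed x_1, the one-step output ‖x_1‖g_1 produced by the deferred-decision representation (12) should match Qx_1 under a dense Q ∼ Ginibre(m, n), namely N(0, ‖x_1‖²I_m).

```julia
using LinearAlgebra, Statistics

m, n, trials = 4, 8, 100_000
x1 = randn(n)
# Direct approach: draw a fresh dense Q each trial and apply it to x₁.
direct = [randn(m, n) * x1 for _ in 1:trials]
# Deferred decisions: Q⁽¹⁾x₁ = ‖x₁‖ g₁ with g₁ ~ Ginibre(m, 1); see (12).
hd = [norm(x1) * randn(m) for _ in 1:trials]
# Both empirical covariances should approach ‖x₁‖² I_m.
C_direct = mean(y * y' for y in direct)
C_hd = mean(y * y' for y in hd)
println(norm(C_direct - C_hd))      # small, O(1/√trials)
println(norm(x1)^2)                 # common diagonal value
```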
Its memory and computational costs can be determined as follows. During its operation, Algorithm 1 keeps track of 2T vectors {u_t ∈ R^m, v_t ∈ R^n}_{t≤T} and T Householder reflectors {L_i ∈ R^{m×m}}_{i≤ℓ_T} and {R_i ∈ R^{n×n}}_{i≤r_T}, where ℓ_T (resp. r_T) records the number of times we have used Q^T (resp. Q) in the T iterations of the dynamics. Clearly, r_T + ℓ_T = T. Thanks to the structures of the Householder reflectors in (7), the total memory footprint of Algorithm 1 is O((m + n)T).

Algorithm 1 Simulating (1) on Ginibre(m, n) using Householder Dice
Require: The initial condition {x_1, x_0, …, x_{1−d}}, and the number of iterations T ≤ min{m, n}
1: Set r = 0, ℓ = 0, L_0 = I_m, and R_0 = I_n
2: for t = 1, …, T do
3:   if M_t = Q then
4:     r ← r + 1
5:     Generate g_t ∼ Ginibre(m, 1)
6:     R_r = H_r(R_{r−1} ⋯ R_2 R_1 x_t)
7:     u_t = L_1 L_2 ⋯ L_ℓ Z_{1:ℓ} g_t
8:     v_t = R_1 R_2 ⋯ R_r e_r
9:     y_t = ∑_{i≤t} (v_i^T x_t) u_i
10:  else
11:    ℓ ← ℓ + 1
12:    Generate g_t ∼ Ginibre(n, 1)
13:    L_ℓ = H_ℓ(L_{ℓ−1} ⋯ L_2 L_1 x_t)
14:    u_t = L_1 L_2 ⋯ L_ℓ e_ℓ
15:    v_t = R_1 R_2 ⋯ R_r Z_{1:r} g_t
16:    y_t = ∑_{i≤t} (u_i^T x_t) v_i
17:  end if
18:  x_{t+1} = f_t(y_t, x_t, x_{t−1}, …, x_{t−d})
19: end for

At each iteration, computations mainly take place in lines 6–9 (or lines 13–16 if M_t = Q^T). Since the matrices used there are always products of Householder reflectors, these steps require O((m + n)t) operations. As t ranges from 1 to T, the computational complexity of Algorithm 1 is thus O((m + n)T²).
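To make the construction concrete, below is a compact Julia sketch of Algorithm 1 for the special case where M_t = Q at every iteration (so ℓ = 0 and u_t = g_t); the Q^T branch is symmetric. All names here (GHR, ghr, apply!, hd_ginibre) are ours, the sketch trades the blocked representations of [27] for plain loops, and the official implementation linked in Section 2 should be preferred in practice.

```julia
using LinearAlgebra

# A generalized Householder reflector H_k(p) from (9), stored implicitly
# by its reflection vector; an empty vector encodes H_k(p) = I.
struct GHR
    k::Int
    u::Vector{Float64}
    s::Float64
end

function ghr(p::Vector{Float64}, k::Int)
    u = p[k:end]                       # copy of p[k:n]
    all(iszero, u) && return GHR(k, Float64[], 1.0)
    s = u[1] >= 0 ? 1.0 : -1.0
    u[1] += s * norm(u)                # u = p[k:n] + sign(p_k)‖p[k:n]‖e₁
    return GHR(k, u, s)
end

# z ← H_k(p) z in O(n) time, using H(v) = -sign(v₁)(I - 2uuᵀ/uᵀu).
function apply!(H::GHR, z::Vector{Float64})
    isempty(H.u) && return z
    zk = view(z, H.k:length(z))
    zk .= -H.s .* (zk .- (2 * dot(H.u, zk) / dot(H.u, H.u)) .* H.u)
    return z
end

# Householder Dice (Algorithm 1) specialized to M_t = Q for all t:
# simulates x_{t+1} = f(Q x_t, x_t, t) with Q ~ Ginibre(m, n), matrix-free.
function hd_ginibre(f, x1::Vector{Float64}, m::Int, T::Int)
    n = length(x1)
    Rs = GHR[]                           # reflectors R₁, …, R_t
    us = Vector{Float64}[]               # u_i = g_i ∈ R^m     (line 7, ℓ = 0)
    vs = Vector{Float64}[]               # v_i = R₁⋯R_i e_i    (line 8)
    x = copy(x1)
    for t in 1:T
        p = copy(x)
        foreach(R -> apply!(R, p), Rs)   # p = R_{t-1}⋯R₁ x_t
        push!(Rs, ghr(p, t))             # R_t = H_t(p)        (line 6)
        push!(us, randn(m))
        v = zeros(n); v[t] = 1.0
        for R in reverse(Rs)             # v_t = R₁⋯R_t e_t
            apply!(R, v)
        end
        push!(vs, v)
        y = sum(dot(vs[i], x) * us[i] for i in 1:t)   # y_t   (line 9)
        x = f(y, x, t)                   # f must return a vector in R^n
    end
    return x
end

# Example: 50 steps of y ↦ y/‖y‖ dynamics at n = m = 10^5.
xT = hd_ginibre((y, x, t) -> y / norm(y), randn(10^5), 10^5, 50)
```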
Remark 1. In line 6 and line 13, Algorithm 1 recursively constructs two products of (generalized) Householder reflectors. Readers familiar with numerical linear algebra will recognize that this process is essentially the Householder algorithm for QR factorization [13, Lecture 10]. Special data structures have been developed (see, e.g., [27]) to efficiently represent and operate on such products of reflectors.
We can now establish the statistical equivalence of the HD algorithm and the direct simulation approach.
Theorem 1.
Fix T ≤ min{m, n}, and let {x_t : 1 − d ≤ t ≤ T + 1} be a sequence of vectors generated by Algorithm 1. Let {x̃_t : 1 − d ≤ t ≤ T + 1} be another sequence obtained by the direct approach to simulating (1), where we use the same initial condition (i.e., x̃_t = x_t for 1 − d ≤ t ≤ 1) but generate a full matrix Q ∼ Ginibre(m, n) in advance. The joint probability distribution of {x_t} is equivalent to that of {x̃_t}.

Proof. We start by describing the general structure of the algorithm. At the t-th iteration, the algorithm keeps the following representation of the matrix Q:

Q^(t) = ∑_{i≤t} u_i v_i^T + L_1 L_2 ⋯ L_{ℓ_t} Z_{1:ℓ_t} G_t Z_{1:r_t} R_{r_t} ⋯ R_2 R_1,  (16)

where the two products of reflectors play the roles of the left and right Householder factors, G_t ∼ Ginibre(m, n) is a Gaussian matrix independent of the σ-algebra generated by all the other random variables constructed up to this point, and ℓ_t (resp. r_t) denotes the number of times we have used Q^T (resp. Q) in the first t iterations of the dynamics. To lighten the notation, we will omit the subscripts in the remainder of the proof and simply write them as ℓ and r.

The vectors {u_i, v_i} and the Householder reflectors {L_i}, {R_i} in (16) are constructed recursively, as follows. We start with L_0 = I_m and R_0 = I_n. At the t-th iteration (for 1 ≤ t ≤ T), if M_t = Q (i.e., if we need to compute Qx_t), we add a new Householder reflector

R_r = H_r(R_{r−1} ⋯ R_2 R_1 x_t)

and two new "basis" vectors

u_t = L_1 L_2 ⋯ L_ℓ Z_{1:ℓ} g_t and v_t = R_1 R_2 ⋯ R_r e_r,

where g_t ∼ Ginibre(m, 1) is a new Gaussian vector. The case M_t = Q^T is completely analogous: we add a new Householder reflector L_ℓ (on the left) and construct the basis vectors u_t, v_t accordingly.

It is important to note that the Gaussian matrix G_t in (16) is never explicitly constructed in the algorithm. Assume without loss of generality that M_t = Q. Let p = R_{r−1} ⋯ R_2 R_1 x_t. We then have

Z_{1:r} R_r ⋯ R_2 R_1 x_t = Z_{1:r} H_r(p) p = 0,

where the second equality is due to (10). Consequently, G_t remains invisible to x_t, and

Q^(t) x_t = ∑_{i≤t} (v_i^T x_t) u_i.

To prove the assertion of the theorem, it suffices to show that, for all 1 ≤ t ≤ T, Q^(t) has the correct distribution, namely Q^(t) ∼ Ginibre(m, n) and Q^(t) is independent of the initial condition {x_1, x_0, …, x_{1−d}}. This is clearly true for t = 1, based on our discussions around (12). Now suppose that the condition on the distribution has been verified for Q^(t) for some t ≥ 1.
To go to t + 1, we rewrite the Gaussian matrix G_t in (16) by using a decomposition similar to (13). Specifically, if M_{t+1} = Q, we write

G_t ≐ (g_{t+1} e_{r+1}^T + G_{t+1} Z_{r+1}) R_{r+1} ∼ Ginibre(m, n),  (17)

where g_{t+1} ∼ Ginibre(m, 1), G_{t+1} ∼ Ginibre(m, n), and R_{r+1} := H_{r+1}(R_r ⋯ R_2 R_1 x_{t+1}). (The decomposition for the case where M_{t+1} = Q^T is completely analogous.)

That the new representation on the right-hand side of (17) has the same distribution as G_t is due to the translation-invariant property of the Ginibre ensemble [see (5)]. Substituting (17) into (16) allows us to conclude that the matrix

∑_{i≤t} u_i v_i^T + L_1 ⋯ L_ℓ Z_{1:ℓ} (g_{t+1} e_{r+1}^T + G_{t+1} Z_{r+1}) R_{r+1} Z_{1:r} R_r ⋯ R_1  (18)

satisfies the required condition on its distribution. By construction, R_{r+1} commutes with Z_{1:r}. [Recall (9).] This simple observation allows us to check that the matrix in (18) is exactly Q^(t+1). By induction on t from 1 to T, we then complete the proof. ∎

3.3 The HD Algorithm for the Random Orthogonal Ensemble

We now turn to the case where Q is a Haar-distributed random orthogonal matrix. The construction of the HD algorithm relies on the following factorization of the Haar measure on O(n).

Lemma 1.
Let g ∼ Ginibre(n, 1), Q_{n−1} ∼ Haar(n − 1), and v ∈ R^n, all of which are independent. Then

H(g) · blockdiag(1, Q_{n−1}) · H(v) ∼ Haar(n).  (19)

Proof.
Let M denote the left-hand side of (19). It is sufficient to show that M ≐ U M for any fixed U ∈ O(n). The statement of the lemma then follows from the fact that the Haar measure is the unique (left) translation-invariant measure on O(n).

For any nonzero vector x ∈ R^n, we denote by B(x) ∈ R^{n×(n−1)} the submatrix consisting of the last n − 1 columns of H(x). It is also useful to notice that the first column of H(x) is x/‖x‖. Thus, H(x) = [x/‖x‖ | B(x)]. The following observation is easy to verify: for any fixed U ∈ O(n), there exists some R ∈ O(n − 1) such that

U B(x) = B(U x) R.

Applying this to B(g) [in H(g)] then allows us to write

U M = H(U g) · blockdiag(1, R Q_{n−1}) · H(v),
where R is an orthogonal matrix independent of Q_{n−1} and v. Since the joint distribution of (U g, R Q_{n−1}, v) is equal to that of (g, Q_{n−1}, v) in (19), we must have M ≐ U M. ∎

The HD algorithm exploits the factorization in (19) to speed up the simulation of Haar random matrices. Before presenting the algorithm in its full generality, we first illustrate how it unfolds in the first two iterations of (1). For simplicity, we assume that M_1 = M_2 = Q. For the first iteration, we use (19) to write Q as

Q^(1) = L_1 · blockdiag(1, Q_{n−1}) · R_1 ∼ Haar(n),  (20)

where R_1 = H_1(x_1), L_1 = H_1(g_1), g_1 ∼ Ginibre(n, 1),
and Q_{n−1} ∼ Haar(n − 1). It is easy to verify that

Q^(1) x_1 = ‖x_1‖ H_1(g_1) e_1 = (‖x_1‖/‖g_1‖) g_1.

Notice that only a Gaussian vector g_1 is needed here, and that the matrix Q_{n−1} is invisible.

To simulate the second iteration, we apply the factorization (19) recursively to write Q_{n−1} as

Q_{n−1} ≐ H(g_2[2:n]) · blockdiag(1, Q_{n−2}) · H(p_2[2:n]) ∼ Haar(n − 1),  (21)

where g_2 ∼ Ginibre(n, 1), Q_{n−2} ∼ Haar(n − 2), and p_2 = R_1 x_2. Substituting (21) into (20) then gives us

Q^(2) = L_1 L_2 · blockdiag(I_2, Q_{n−2}) · R_2 R_1,  (22)

where L_2 = H_2(g_2) and R_2 = H_2(p_2). By construction, the vector R_2 R_1 x_2 has nonzero entries only in the first two coordinates. It follows that Q^(2) x_2 = L_1 L_2 R_2 R_1 x_2, with Q_{n−2} in (22) remaining invisible.

Continuing this process, we see a simple pattern emerging. We summarize it in Algorithm 2.

Algorithm 2 Simulating (1) on Haar(n) using Householder Dice
Require: The initial condition {x_1, x_0, …, x_{1−d}}, and the number of iterations T ≤ n
1: Set L_0 = I_n and R_0 = I_n
2: for t = 1, …, T do
3:   Generate g_t ∼ Ginibre(n, 1)
4:   if M_t = Q then
5:     p_t = R_{t−1} ⋯ R_1 R_0 x_t
6:     R_t = H_t(p_t)
7:     L_t = H_t(g_t)
8:     y_t = L_1 ⋯ L_t R_t p_t
9:   else
10:    p_t = L_{t−1} ⋯ L_1 L_0 x_t
11:    L_t = H_t(p_t)
12:    R_t = H_t(g_t)
13:    y_t = R_1 ⋯ R_t L_t p_t
14:  end if
15:  x_{t+1} = f_t(y_t, x_t, x_{t−1}, …, x_{t−d})
16: end for

In general, the algorithm recursively constructs two sequences of Householder reflectors {L_t, R_t}_{t≤T}, starting from L_0 = R_0 = I_n. At the t-th iteration, we first generate a new Gaussian vector g_t ∼ Ginibre(n, 1). If M_t = Q, we compute

p_t = R_{t−1} ⋯ R_1 R_0 x_t  (23)

and add two reflectors R_t = H_t(p_t) and L_t = H_t(g_t). The algorithm then proceeds to the next iteration by letting x_{t+1} = f_t(y_t, x_t, …, x_{t−d}), where y_t = L_1 ⋯ L_t R_t p_t. The steps the algorithm takes if M_t = Q^T are exactly symmetric, with the roles of {R_i} and {L_i} switched. The computational and memory complexity of Algorithm 2 is similar to that of Algorithm 1. Specifically, the Householder reflectors can be efficiently represented by the corresponding reflection vectors, at a cost of O(nT) space. At each iteration, the matrix-vector multiplications in lines 5, 8, 10 and 13 can all be implemented in O(nt) operations (thanks to the Householder structure). Therefore, the total computational complexity is O(nT²).
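For completeness, here is a Julia sketch of Algorithm 2 in the same style, again specialized to M_t = Q at every step. It reuses the GHR, ghr, and apply! helpers from the Algorithm 1 sketch above (so it is not self-contained on its own), and the names are ours rather than the reference implementation's.

```julia
# Householder Dice for Q ~ Haar(n) (Algorithm 2), M_t = Q at every step.
# Assumes the GHR, ghr, and apply! helpers from the Algorithm 1 sketch.
function hd_haar(f, x1::Vector{Float64}, T::Int)
    n = length(x1)
    Ls = GHR[]; Rs = GHR[]
    x = copy(x1)
    for t in 1:T
        g = randn(n)                       # g_t ~ Ginibre(n, 1)  (line 3)
        p = copy(x)
        foreach(R -> apply!(R, p), Rs)     # p_t = R_{t-1}⋯R₁ x_t (line 5)
        push!(Rs, ghr(p, t))               # R_t = H_t(p_t)       (line 6)
        push!(Ls, ghr(g, t))               # L_t = H_t(g_t)       (line 7)
        y = apply!(Rs[end], copy(p))       # y ← R_t p_t
        for L in reverse(Ls)               # y ← L₁⋯L_t y         (line 8)
            apply!(L, y)
        end
        x = f(y, x, t)
    end
    return x
end

# Sanity check: Q is orthogonal, so a single step preserves the norm.
x1 = randn(1000)
y1 = hd_haar((y, x, t) -> y, x1, 1)
@assert norm(y1) ≈ norm(x1)
```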
Finally, we establish the statistical "correctness" of Algorithm 2 in the following theorem.

Theorem 2. Fix T ≤ n, and let {x_t : 1 − d ≤ t ≤ T + 1} be a sequence of vectors generated by Algorithm 2. Let {x̃_t : 1 − d ≤ t ≤ T + 1} be another sequence obtained by the direct approach to simulating (1), where we use the same initial condition (i.e., x̃_t = x_t for 1 − d ≤ t ≤ 1) but generate a random orthogonal matrix Q ∼ Haar(n) in advance. The joint probability distribution of {x_t} is equivalent to that of {x̃_t}.

Proof. The proof is very similar to that of Theorem 1. At the t-th iteration, the algorithm has constructed a representation of the random orthogonal matrix Q as

Q^(t) = L_1 L_2 ⋯ L_t · blockdiag(I_t, Q_{n−t}) · R_t ⋯ R_2 R_1,  (24)

where {L_i, R_i}_{i≤t} is a collection of Householder reflectors, and Q_{n−t} ∼ Haar(n − t) is an (n − t) × (n − t) random orthogonal matrix independent of the σ-algebra generated by all the other random variables constructed up to this point. We shall have established the theorem if we prove the following two claims for 1 ≤ t ≤ T: (a) Q^(t) ∼ Haar(n) and Q^(t) is independent of the initial condition {x_t}_{1−d ≤ t ≤ 1}; (b) If M_t = Q in (1), then

Q^(t) x_t = L_1 ⋯ L_t R_t p_t,  (25)

where p_t is as defined in (23). If M_t = Q^T, then [Q^(t)]^T x_t = R_1 R_2 ⋯ R_t L_t ⋯ L_2 L_1 x_t.

Claim (a) can be proved by induction. We have already established its correctness for t = 1. [See (20).] To propagate the claim from iteration t to t + 1, we simply apply Lemma 1 to rewrite Q_{n−t} in (24) as

Q_{n−t} ≐ H(g_{t+1}[t + 1 : n]) · blockdiag(1, Q_{n−t−1}) · H(p_{t+1}[t + 1 : n]) ∼ Haar(n − t),

where g_{t+1} ∼ Ginibre(n, 1), Q_{n−t−1} ∼ Haar(n − t − 1), and p_{t+1} = R_t ⋯ R_2 R_1 x_{t+1}. (This is for the case of M_{t+1} = Q, but the treatment for the case of M_{t+1} = Q^T is completely analogous.) Substituting this equivalent representation into (24) gives us Q^(t+1).

To establish Claim (b), we again assume without loss of generality that M_t = Q. By the definition of p_t in (23) and that of R_t, we have

Q^(t) x_t = L_1 L_2 ⋯ L_t · blockdiag(I_t, Q_{n−t}) · H_t(p_t) p_t.

Using (10), we can then verify the expression given in (25). ∎
3.4 Other Related Ensembles

The Gaussian and Haar ensembles studied above can serve as building blocks for simulating other related random matrix ensembles. For example, consider the classical Gaussian orthogonal ensemble (GOE). A symmetric n × n matrix G is drawn from GOE(n) if {G_ij}_{1≤i≤j≤n} are independent random variables, with G_ii ∼ N(0, 2) and G_ij ∼ N(0, 1) for i < j. Clearly,

Q ∼ Ginibre(n, n) ⟹ (1/√2)(Q + Q^T) ∼ GOE(n).

It follows that a single matrix-vector multiplication involving G ∼ GOE(n) can be simulated via two matrix-vector multiplications involving a nonsymmetric Gaussian matrix, i.e.,

y = G x ⟹ ŷ = Q x and y = (Q^T x + ŷ)/√2.
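In code, this reduction is a two-line wrapper around the two Ginibre products (shown here with a dense Q for clarity; under Householder Dice, the two products become two steps of Algorithm 1 on the same matrix):

```julia
using LinearAlgebra

# y = G x for G ~ GOE(n), simulated via G = (Q + Qᵀ)/√2, Q ~ Ginibre(n, n).
n = 1000
Q = randn(n, n)
x = randn(n)
ŷ = Q * x                    # first product: uses Q
y = (Q' * x + ŷ) / sqrt(2)   # second product: uses Qᵀ
```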
As a second example, we consider random matrices in the form of

Q = U Σ V,  (26)

where U ∼ Haar(m) and V ∼ Haar(n) are two independent random orthogonal matrices, and Σ ∈ R^{m×n} is a rectangular diagonal matrix independent of U, V. Matrices like these often appear in the study of free probability theory [28]. They are also used as a convenient model for matrices whose singular vectors are generic [6–8]. Strictly speaking, Theorem 2 only applies to the case where the dynamics operates on a single random orthogonal matrix. However, it is obvious from the proof that the idea applies to more general dynamics involving a finite number of independent random orthogonal matrices. Thus, Algorithm 2 can be easily adapted to handle the matrix ensemble given in (26).

Finally, we note that the constructions of the HD algorithm can be generalized to the complex-valued cases, with the random matrices drawn from the complex Ginibre ensemble, the Haar ensemble on the unitary group U(n), and the Gaussian unitary ensemble, respectively. We avoid repetitions, as most changes in such generalizations are straightforward (such as replacing M^T by M*). In what follows, we only present the formula for a complex version of the Householder reflector, as it might be less well known.

Let v ∈ C^n be a nonzero vector. Write v_1/‖v‖ = r e^{iθ}, where r is a nonnegative real number. (When v_1 = 0, we have r = 0 and set θ = 0.) We define a unitary reflector [29, pp. 48–49] as

H(v) = (−e^{−iθ}) [ I_n − (v/‖v‖ + e^{iθ} e_1)(v/‖v‖ + e^{iθ} e_1)* / (1 + r) ].  (27)

It is easy to check that H(v) is a unitary matrix such that H(v) v = ‖v‖ e_1 and H*(v) e_1 = v/‖v‖. Moreover, if v is real, then (27) reduces to the Householder reflector given in (7).

4 Conclusion

We proposed a new algorithm called Householder Dice for simulating dynamics on several dense random matrix ensembles with translation-invariant properties. Rather than fixing the entire random matrix in advance, the new algorithm is matrix-free, generating only the randomness that must be revealed at any given step of the dynamics. The name of the algorithm highlights the central role played by an adaptive and recursive construction of (random) Householder reflectors. These orthogonal transformations exploit the group symmetry of the matrix ensembles, while simultaneously maintaining the statistical correlations induced by the dynamics. Numerical results demonstrate the promise of the HD algorithm as a new computational tool in the study of high-dimensional random systems.
References

[1] G. W. Stewart, "A Krylov–Schur algorithm for large eigenproblems," SIAM J. Matrix Anal. Appl., vol. 23, no. 3, pp. 601–614, 2002.

[2] J. Baik, G. Ben Arous, and S. Péché, "Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices," Ann. Probab., vol. 33, pp. 1643–1697, Sept. 2005.

[3] F. Benaych-Georges and R. R. Nadakuditi, "The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices," Adv. Math., vol. 227, pp. 494–521, May 2011.

[4] E. Bolthausen, "An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model," Commun. Math. Phys., vol. 325, pp. 333–366, 2014.

[5] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inf. Theory, vol. 57, pp. 764–785, Feb. 2011.

[6] M. Opper, B. Cakmak, and O. Winther, "A theory of solving TAP equations for Ising models with general invariant random matrices," J. Phys. A, vol. 49, no. 11, p. 114002, 2016.

[7] S. Rangan, P. Schniter, and A. K. Fletcher, "Vector approximate message passing," IEEE Trans. Inf. Theory, vol. 65, pp. 6664–6684, Oct. 2019.

[8] Z. Fan, "Approximate message passing algorithms for rotationally invariant matrices," arXiv:2008.11892, 2020.

[9] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms,"
IEEE Trans. Inf. Theory, vol. 61, no. 4, pp. 1985–2007, 2015.

[10] S. Goldt, M. S. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup," in Advances in Neural Information Processing Systems 32, 2019.

[11] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[12] A. S. Householder, "Unitary triangularization of a nonsymmetric matrix," J. ACM, vol. 5, no. 4, pp. 339–342, 1958.

[13] L. N. Trefethen and D. Bau III, Numerical Linear Algebra. Philadelphia, PA: SIAM, 1997.

[14] G. W. Stewart, "The efficient generation of random orthogonal matrices with an application to condition numbers," SIAM J. Numer. Anal., vol. 17, no. 3, pp. 403–425, 1980.

[15] F. Mezzadri, "How to generate random matrices from the classical compact groups," Notices of the AMS, vol. 54, pp. 592–604, May 2007.

[16] J. W. Silverstein, "The smallest eigenvalue of a large dimensional Wishart matrix," Ann. Probab., vol. 13, no. 4, pp. 1364–1368, 1985.

[17] A. Edelman, Eigenvalues and Condition Numbers of Random Matrices. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, May 1989.

[18] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, "Julia: A fresh approach to numerical computing," SIAM Rev., vol. 59, no. 1, pp. 65–98, 2017.

[19] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems,"
SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.

[20] K.-C. Li, "On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma," J. Am. Stat. Assoc., vol. 87, no. 420, pp. 1025–1039, 1992.

[21] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," in Advances in Neural Information Processing Systems, pp. 2796–2804, 2013.

[22] Y. M. Lu and G. Li, "Phase transitions of spectral initialization for high-dimensional nonconvex estimation," Information and Inference, vol. 9, pp. 507–541, September 2020.

[23] M. Mondelli and A. Montanari, "Fundamental limits of weak recovery with applications to phase retrieval," in Proceedings of Machine Learning Research, vol. 75, 2018.

[24] R. Dudeja, M. Bakhshizadeh, J. Ma, and A. Maleki, "Analysis of spectral methods for phase retrieval with random orthogonal matrices," IEEE Trans. Inf. Theory, vol. 66, pp. 5182–5203, Aug. 2020.

[25] A. Haar, "Der Massbegriff in der Theorie der kontinuierlichen Gruppen," Ann. Math., vol. 34, pp. 147–169, January 1933.

[26] E. S. Meckes, The Random Matrix Theory of the Classical Compact Groups. Cambridge, UK: Cambridge University Press, 2019.

[27] R. Schreiber and C. Van Loan, "A storage-efficient WY representation for products of Householder transformations," SIAM J. Sci. Stat. Comput., vol. 10, no. 1, pp. 53–57, January 1989.

[28] J. A. Mingo and R. Speicher, Free Probability and Random Matrices. New York, NY: Springer Science & Business Media, 2017.

[29] J. H. Wilkinson, The Algebraic Eigenvalue Problem. Oxford, UK: Clarendon Press, 1965.