A randomized blocked algorithm for efficiently computing rank-revealing factorizations of matrices
Per-Gunnar Martinsson and Sergey Voronin
Department of Applied Mathematics, University of Colorado at Boulder
Abstract:
This manuscript describes a technique for computing partial rank-revealing factorizations, such as, e.g., a partial QR factorization or a partial singular value decomposition. The method takes as input a tolerance ε and an m × n matrix A, and returns an approximate low-rank factorization of A that is accurate to within precision ε in the Frobenius norm (or some other easily computed norm). The rank k of the computed factorization (which is an output of the algorithm) is in all examples we examined very close to the theoretically optimal ε-rank. The proposed method is inspired by the Gram-Schmidt algorithm, and has the same O(mnk) asymptotic flop count. However, the method relies on randomized sampling to avoid column pivoting, which allows it to be blocked, and hence accelerates practical computations by reducing communication. Numerical experiments demonstrate that the accuracy of the scheme is for every matrix that was tried at least as good as column-pivoted QR, and is sometimes much better. Computational speed is also improved substantially, in particular on GPU architectures.

1. Introduction
1.1. Problem formulation.
This manuscript describes an algorithm based on randomized sampling for computing an approximate low-rank factorization of a given matrix. To be precise, given a real or complex matrix A of size m × n, and a computational tolerance ε, we seek to determine a matrix A_approx of low rank such that

(1)    ‖A − A_approx‖ ≤ ε.

For any given k ∈ {1, 2, . . . , min(m, n)}, a rank-k approximation to A that is in many ways optimal is given by the partial singular value decomposition (SVD),

(2)    A_k = U_k Σ_k V_k^*,

where U_k (of size m × k) and V_k (of size n × k) are orthonormal matrices whose columns consist of the first k left and right singular vectors, respectively, and Σ_k is a k × k diagonal matrix whose diagonal entries {σ_j}_{j=1}^k are the leading k singular values of A, ordered so that σ_1 ≥ σ_2 ≥ · · · ≥ σ_k ≥ 0. The Eckart-Young theorem [3] states that for the spectral norm and the Frobenius norm, the residual error is minimal,

‖A − A_k‖ = inf { ‖A − C‖ : C has rank k }.

However, computing the factors in (2) is computationally expensive. In contrast, our objective is to find an approximant A_approx that is cheap to compute, and close to optimal. The method we present is designed for situations where A is sufficiently large that computing the full SVD is not economical. The method is designed to be highly communication efficient, and to execute efficiently on both shared and distributed memory machines. It has been tested numerically for situations where the matrix fits in RAM on a single machine. We will, without loss of generality, assume that m ≥ n. For the most part we discuss real matrices, but the generalization to complex matrices is straight-forward.

1.2. A greedy template.
A standard approach in computing low-rank factorizations is to employ a greedy algorithm to build, one vector at a time, an orthonormal basis {q_j}_{j=1}^k that approximately spans the columns of A. To be precise, given an m × n matrix A and a computational tolerance ε, our objective is to determine a rank k, and an m × k matrix Q_k = [q_1 · · · q_k] with orthonormal column vectors, such that ‖A − Q_k B_k‖ ≤ ε, where B_k = Q_k^* A. The matrices Q_k and B_k may be constructed jointly via the following procedure:

Algorithm 1
(1)  Q_0 = [ ];  B_0 = [ ];  A^(0) = A;  j = 0;
(2)  while ‖A^(j)‖ > ε
(3)      j = j + 1
(4)      Pick a unit vector q_j ∈ ran(A^(j−1)).
(5)      b_j = q_j^* A^(j−1)
(6)      Q_j = [Q_{j−1}  q_j]
(7)      B_j = [B_{j−1}; b_j]
(8)      A^(j) = A^(j−1) − q_j b_j
(9)  end while
(10) k = j.

Note that A^(j) can overwrite A^(j−1). It can be verified that if the algorithm is executed in exact arithmetic, then the matrices generated satisfy

(3)    A^(j) = A − Q_j Q_j^* A,   and   B_j = Q_j^* A.

The performance of the greedy scheme is determined by how we choose the vector q_j on line (4). If we pick q_j as simply the largest column of A^(j−1), scaled to yield a vector of unit length, then we recognize the scheme as the column-pivoted Gram-Schmidt algorithm for computing a QR factorization. This method often works very well, but can lead to sub-optimal factorizations. Reference [5] discusses this in detail, and also provides an improved pivoting technique that can be proved to yield closer to optimal results. However, both standard Gram-Schmidt (see, e.g., [4, Sect. 5.2]) and the improved version in [5] are challenging to implement efficiently on modern multicore processors since they cannot readily be blocked. Expressed differently, they rely on BLAS2 operations rather than BLAS3.

Another natural choice for q_j on line (4) is to pick the unit vector that minimizes ‖A^(j−1) − q q^* A^(j−1)‖. This in fact leads to an optimal factorization, with the vectors {q_j}_{j=1}^k being left singular vectors of A. However, finding the minimizer tends to be computationally expensive.

In this manuscript, we propose a scheme that is more computationally efficient than column-pivoted Gram-Schmidt, and often yields close to minimal approximation errors. The idea is to choose q_j as a random linear combination of the columns of A^(j−1). To be precise, we propose the following mechanism for choosing q_j:

(4a) Draw a random vector ω whose entries are iid Gaussian random variables.
(4b) Set y = A^(j−1) ω.
(4c) Normalize so that q_j = y / ‖y‖.

This scheme is mathematically very close to the low-rank approximation scheme proposed in [6], but is slightly different in the stopping criterion used (the scheme of [6] does not explicitly update the matrix, and therefore relies on a probabilistic stopping criterion), and in its performance when executed with finite precision arithmetic. We argue that choosing the vector q_j using randomized sampling leads to performance very comparable to traditional column pivoting, but has a decisive advantage in that the resulting algorithm is easy to block. We will demonstrate substantial practical speed-up on both multicore CPUs and GPUs.
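To make the template concrete, the following is a minimal Python/NumPy sketch of Algorithm 1 with the randomized choice (4a)–(4c); the function name, interface, and tolerance handling are our own illustration, not the reference implementation accompanying the manuscript.

    import numpy as np

    def greedy_randqb(A, eps):
        """Greedy single-vector QB factorization (Algorithm 1), choosing each
        new direction as a random linear combination of the remaining columns."""
        A = np.array(A, dtype=float)       # A will hold the remainder A^(j)
        m, n = A.shape
        Q_cols, B_rows = [], []
        while np.linalg.norm(A, 'fro') > eps:
            omega = np.random.randn(n)     # (4a) Gaussian random vector
            y = A @ omega                  # (4b) random combination of columns
            q = y / np.linalg.norm(y)      # (4c) normalize
            b = q @ A                      # b_j = q_j^* A^(j-1)
            A -= np.outer(q, b)            # A^(j) = A^(j-1) - q_j b_j
            Q_cols.append(q)
            B_rows.append(b)
        Q = np.column_stack(Q_cols) if Q_cols else np.zeros((m, 0))
        B = np.vstack(B_rows) if B_rows else np.zeros((0, n))
        return Q, B

As in the exact-arithmetic argument above, this sketch does not re-orthonormalize q_j against the previously computed columns; the blocked variants discussed later add an explicit re-projection step for that purpose.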
Remark 1. The factorization scheme described in this section produces an approximate factorization of the form A ≈ Q_k B_k, where Q_k is orthonormal, but no conditions are a priori imposed on B_k. Once the factors Q_k and B_k are available, it is simple to compute many standard factorizations such as the low-rank QR, SVD, or CUR factorizations. For details, see Section 3.3.

2. Technical preliminaries
2.1. Notation.
Throughout the paper, we measure vectors in R^n using their Euclidean norm. The default norm for matrices will be the Frobenius norm

‖A‖ = ( Σ_{i,j} |A(i, j)|^2 )^{1/2},

although other norms will also be discussed. We use the notation of Golub and Van Loan [4] to specify submatrices. In other words, if B is an m × n matrix with entries b_{ij}, and I = [i_1, i_2, . . . , i_k] and J = [j_1, j_2, . . . , j_ℓ] are two index vectors, then we let B(I, J) denote the k × ℓ matrix whose (p, q) entry is b_{i_p j_q}. We let B(I, :) denote the matrix B(I, [1, 2, . . . , n]), and define B(:, J) analogously. The transpose of B is denoted B^*, and we say that a matrix U is orthonormal if its columns form an orthonormal set, so that U^* U = I.

2.2. The singular value decomposition (SVD).
The SVD was introduced briefly in the introduction. Here we define it again, with some more detail added. Let A denote an m × n matrix, and set r = min(m, n). Then A admits a factorization

(4)    A = U Σ V^*,

where U (of size m × r) and V (of size n × r) are orthonormal, and Σ (of size r × r) is diagonal. We let {u_i}_{i=1}^r and {v_i}_{i=1}^r denote the columns of U and V, respectively. These vectors are the left and right singular vectors of A. As in the introduction, the diagonal elements {σ_j}_{j=1}^r of Σ are the singular values of A. We order these so that σ_1 ≥ σ_2 ≥ · · · ≥ σ_r ≥ 0. We let A_k denote the truncation of the SVD to its first k terms, as defined by (2). It is easily verified that

(5)    ‖A − A_k‖_spectral = σ_{k+1},   and   ‖A − A_k‖ = ( Σ_{j=k+1}^{min(m,n)} σ_j^2 )^{1/2},

where ‖A‖_spectral denotes the operator norm of A and ‖A‖ denotes the Frobenius norm of A. Moreover, the Eckart-Young theorem [3] states that these errors are the smallest possible errors that can be incurred when approximating A by a matrix of rank k.

2.3. The QR factorization.
Any m × n matrix A admits a QR factorization of the form

(6)    A P = Q R,

where r = min(m, n), Q is an m × r orthonormal matrix, R is an r × n upper triangular matrix, and P is an n × n permutation matrix. The permutation matrix P can more efficiently be represented via a vector J_c ∈ Z_+^n of column indices such that P = I(:, J_c), where I is the n × n identity matrix. Then (6) can be written

(7)    A(:, J_c) = Q R.

The QR factorization is often computed via column pivoting combined with either the Gram-Schmidt process, Householder reflectors, or Givens rotations [4]. The resulting upper triangular factor R then satisfies various decay conditions [4]. These techniques are all incremental, and can be stopped after the first k terms have been computed to obtain a “partial QR factorization of A”:

(8)    A(:, J_c) ≈ Q_k R_k,

where Q_k is of size m × k and R_k is of size k × n. The main drawback of the classical partial pivoted QR approximation is the difficulty of obtaining substantial speedups on multi-processor architectures.
2.4. Orthonormalization.
Given an m × ℓ matrix X, with m ≥ ℓ, we introduce the function

Q = orth(X)

to denote orthonormalization of the columns of X. In other words, Q will be an m × ℓ orthonormal matrix whose columns form a basis for the column space of X. In practice, this step is typically achieved most efficiently by a call to a packaged QR factorization (e.g., in Matlab, we would write Q = qr(X, 0)).
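In NumPy the same operation can be carried out with a reduced (economy-size) QR factorization; a small sketch, with the helper name orth chosen by us for consistency with the text:

    import numpy as np

    def orth(X):
        # Orthonormalize the columns of X via an economy-size QR factorization.
        Q, _ = np.linalg.qr(X)     # mode='reduced' is the default
        return Q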
3. Construction of low-rank approximations via randomized sampling

3.1. A basic randomized scheme.
Let A be a given m × n matrix whose singular values exhibit some decay, and suppose that we seek a matrix Q with ℓ orthonormal columns such that

(9)    A ≈ Q B,   where B = Q^* A.

In other words, we seek a matrix Q whose columns form an approximate orthonormal basis for the column space of A. A randomized procedure for solving this task was proposed in [7], and later analyzed and elaborated in [8, 6]. A basic version of the scheme, which we call “randQB”, is given in Figure 1. Once randQB has been executed to produce the factors Q and B in (9), standard factorizations such as the QR factorization or the truncated SVD can easily be obtained, as described in Section 3.3.
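A minimal NumPy sketch of the randQB scheme of Figure 1 might read as follows; the function name and interface are our own illustration.

    import numpy as np

    def randqb(A, ell):
        """Basic randomized range finder: returns Q (m x ell, orthonormal)
        and B (ell x n) such that A is approximately Q @ B."""
        m, n = A.shape
        Omega = np.random.randn(n, ell)    # Gaussian test matrix
        Q, _ = np.linalg.qr(A @ Omega)     # orthonormal basis for the sample A @ Omega
        B = Q.T @ A                        # project A onto that basis
        return Q, B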
3.2. Over-sampling and theoretical performance guarantees. The algorithm randQB described in Section 3.1 produces close to optimal results for matrices whose singular values decay rapidly, provided that some slight over-sampling is done. To be precise, if we seek to match the minimal error for a factorization of rank k, then choose ℓ in randQB as ℓ = k + s, where s is a small integer (say s = 10). It was shown in [6, Thm. 10.5] that if s ≥ 2, then

E[ ‖A − QB‖ ] ≤ ( 1 + k/(s−1) )^{1/2} ( Σ_{j=k+1}^{min(m,n)} σ_j^2 )^{1/2},

where E denotes expectation. Recall from equation (5) that ( Σ_{j=k+1}^{min(m,n)} σ_j^2 )^{1/2} is the theoretically minimal error in approximating A by a matrix of rank k, so we miss the optimal bound only by a factor of ( 1 + k/(s−1) )^{1/2} (except for the over-sampling, of course). Moreover, it can be shown that the likelihood of a substantial deviation from the expectation is extremely small [6, Sec. 10.3].

function [Q, B] = randQB(A, ℓ)
(1)  Ω = randn(n, ℓ)
(2)  Q = orth(A Ω)                         C_mm m n ℓ + C_qr m ℓ^2
(3)  B = Q^* A                             C_mm m n ℓ

Figure 1. The most basic version of the randomized range finder. The algorithm takes as input an m × n matrix A and a target rank ℓ, and produces factors Q and B of sizes m × ℓ and ℓ × n, respectively, such that A ≈ QB. Text in blue refers to computational cost, see Section 4.4 for notation.
Remark 2. When errors are measured in the spectral norm, as opposed to the Frobenius norm, the randomized scheme is slightly further removed from optimality. Theorem 10.6 of [6] states that

(10)    E[ ‖A − QB‖_spectral ] ≤ ( 1 + ( k/(s−1) )^{1/2} ) σ_{k+1} + ( e (k+s)^{1/2} / s ) ( Σ_{j=k+1}^{min(m,n)} σ_j^2 )^{1/2},

where e = 2.718... is the base of the natural logarithm. We observe that in cases where the singular values decay slowly, the right hand side of (10) is substantially larger than the theoretically optimal value of σ_{k+1}. For such situations, the “power scheme” described in Section 5.1 should be used.

3.3. Computing standard factorizations.
The output of the randomized factorization scheme in Figure 1 is a factorization A ≈ QB where Q is orthonormal, but no constraints have been placed on B. It turns out that standard factorizations can efficiently be computed from the factors Q and B; in this section we describe how to obtain the QR, the SVD, and “interpolatory” factorizations.

3.3.1. Computing the low rank SVD.
To get a low rank SVD, cf. Section 2.2, we perform the full SVD on the ℓ × n matrix B, to obtain a factorization B = Û D̂ V̂^*. Then

A ≈ QB = Q Û D̂ V̂^*.

We can now choose a rank k to use based on the decaying singular values of D̂. Once a suitable rank has been chosen, we form the low rank SVD factors

U_k = Q Û(:, 1:k),   Σ_k = D̂(1:k, 1:k),   and   V_k = V̂(:, 1:k),

so that A ≈ U_k Σ_k V_k^*. Observe that the truncation undoes the over-sampling that was done and detects a numerical rank k that is typically very close to the optimal ε-rank.
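A sketch of this post-processing step in NumPy (the truncation rank k is chosen by the caller; the helper name is ours):

    import numpy as np

    def qb_to_svd(Q, B, k):
        """Convert A ~ Q @ B into a truncated SVD A ~ U_k @ diag(s_k) @ Vt_k."""
        Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)  # SVD of the small factor B
        U_k = Q @ Uhat[:, :k]      # leading left singular vectors of the approximation
        s_k = s[:k]                # leading singular values
        Vt_k = Vt[:k, :]           # leading right singular vectors (transposed)
        return U_k, s_k, Vt_k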
3.3.2. Computing the partial pivoted QR factorization. To obtain the factorization AP ≈ QR, cf. Section 2.3, from the QB decomposition, perform a column-pivoted QR factorization of the ℓ × n matrix B to obtain BP = Q̃ R. Then set Q̂ = Q Q̃ to obtain

A P ≈ Q B P = Q Q̃ R = Q̂ R.
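A sketch of this conversion using SciPy's pivoted QR routine (the function name and interface are our own illustration):

    import numpy as np
    from scipy.linalg import qr

    def qb_to_pivoted_qr(Q, B):
        """Convert A ~ Q @ B into a partial pivoted QR factorization
        A[:, perm] ~ Qhat @ R, where perm holds the chosen column indices."""
        Qtilde, R, perm = qr(B, mode='economic', pivoting=True)  # B[:, perm] = Qtilde @ R
        Qhat = Q @ Qtilde                                        # A[:, perm] ~ Qhat @ R
        return Qhat, R, perm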
3.3.3. Computing interpolatory and CUR factorizations. In applications such as data interpretation, it is often of interest to determine a subset of the rows/columns of A that form a good basis for its row/column space. For concreteness, suppose that A is an m × n matrix of rank k, and that we seek to determine an index set J of length k, and a matrix Y of size k × n such that

(11)    A ≈ A(:, J) Y.

One can prove that there always exists such a factorization for which every entry of Y is bounded in modulus by 1 (which is to say that the columns in A(:, J) form a well-conditioned basis for the range of A) and for which Y(:, J) is the k × k identity matrix [2]. Now suppose that we have available a factorization A = QB where B is of size ℓ × n. Then determine J and Y such that

(12)    B ≈ B(:, J) Y.

This can be done using the techniques in, e.g., [2] or [5]. Then (11) holds automatically for the index set J and the matrix Y that were constructed. Using similar ideas, one can determine a set of rows that form a well-conditioned basis for the row space, and also the so-called CUR factorization

A ≈ C U R,

where C (of size m × k) and R (of size k × n) consist of subsets of the columns and rows of A, respectively, and U is k × k, cf. [12].
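As an illustration only, the following sketch computes a column interpolative decomposition of B via a plain column-pivoted QR factorization. Note that this is not the strong rank-revealing QR of [5] or the specific scheme of [2], so the entries of Y are only typically (not provably) modest in size; the helper names are ours.

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def column_id_from_b(B, k):
        """Sketch of a column ID  B ~ B[:, J] @ Y  from a pivoted QR of B."""
        _, R, perm = qr(B, mode='economic', pivoting=True)   # B[:, perm] = Qtilde @ R
        J = perm[:k]                                         # skeleton column indices
        T = solve_triangular(R[:k, :k], R[:k, k:])           # expansion coefficients
        Y = np.empty((k, B.shape[1]))
        Y[:, perm[:k]] = np.eye(k)                           # Y(:, J) is the identity
        Y[:, perm[k:]] = T
        return J, Y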
4. A blocked version of the randomized range finder

In this section, we describe and analyze a blocked version of the basic randomized scheme described in Figure 1. By blocking, we improve computational efficiency and simplify implementation on parallel machines. Blocking also allows for adaptive rank determination to be incorporated for situations where the rank is not known in advance. While blocking greatly helps with computational efficiency, it also creates some issues in terms of aggregation of round-off errors; this problem can be eliminated using techniques described in Section 4.3.

The algorithm described in this section is directly inspired by Algorithm 4.2 of [6]; besides blocking, the scheme proposed here is different in that the matrix A is updated in a manner analogous to “modified” column-pivoted Gram-Schmidt. This updating allows the randomized stopping criterion employed in [6] to be replaced with a precise deterministic stopping criterion.

4.1. Blocking.
Converting the basic scheme in Figure 1 to a blocked scheme is in principle straight-forward. Suppose that in addition to an m × n matrix A and a rank ℓ, we have set a block size b such that ℓ = sb for some integer s. Then draw an n × ℓ Gaussian random matrix Ω, and partition it into slices {Ω_j}_{j=1}^s, each of size n × b, so that

(13)    Ω = [Ω_1, Ω_2, . . . , Ω_s].

We analogously partition the matrices Q and B in groups of b columns and b rows, respectively,

Q = [Q_1, Q_2, . . . , Q_s]   and   B = [B_1; B_2; . . . ; B_s].

The blocked algorithm then proceeds to build the matrices {Q_i}_{i=1}^s and {B_i}_{i=1}^s one at a time. We first initiate the algorithm by setting

(14)    A^(0) = A.

We then step forwards, computing for i = 1, 2, . . . , s the matrices

(15)    Q_i = orth( A^(i−1) Ω_i ),
(16)    B_i = Q_i^* A^(i−1),
(17)    A^(i) = A^(i−1) − Q_i B_i.

We will next prove that the matrix Q̄_i = [Q_1, Q_2, . . . , Q_i] constructed is indeed orthonormal, and that the matrix A^(i) defined by (17) is the “remainder” after i steps, in the sense that

A^(i) = A − [Q_1, Q_2, . . . , Q_i] [Q_1, Q_2, . . . , Q_i]^* A = A − Q̄_i Q̄_i^* A.

To be precise, we will prove the following proposition:
Proposition 4.1.
Let A be an m × n matrix. Let b denote a block size, and let s denote the number of steps. Suppose that the rank of A is at least sb. Let Ω be a Gaussian random matrix of size n × sb, partitioned as in (13), with each Ω_j of size n × b. Let {A^(j)}_{j=0}^i, {Q_j}_{j=1}^i, and {B_j}_{j=1}^i be defined by (14)–(17). Set

(18)    P_i = Σ_{j=1}^i Q_j Q_j^*,

and

(19)    Q̄_i = [Q_1, Q_2, . . . , Q_i],   B̄_i = [B_1^*, B_2^*, . . . , B_i^*]^*,   Ȳ_i = [AΩ_1, AΩ_2, . . . , AΩ_i].

Then for every i = 1, 2, . . . , s, it is the case that:
(a) The matrix Q̄_i is ON, so P_i is an orthogonal projection.
(b) A^(i) = (I − P_i) A = (I − Q̄_i Q̄_i^*) A and B̄_i = Q̄_i^* A.
(c) R(Q̄_i) = R(Ȳ_i).

Proof. The proof is by induction. We will several times use that if C is a matrix with full column rank, and we set Q = orth(C), then R(Q) = R(C), where R(X) denotes the range of X. We will also use the fact that if Ω is a Gaussian random matrix of size n × ℓ, and E is a matrix of size m × n with rank at least ℓ, then the rank of EΩ is with probability 1 precisely ℓ [6].

Direct inspection of the definitions shows that (a), (b), (c) are all true for i = 1. Suppose all statements are true for i − 1. We will prove that then (a), (b), (c) hold for i.

To prove that (a) holds for i, we use that (b) holds for i − 1 to write

(20)    Q_i = orth( (I − P_{i−1}) A Ω_i ).

Then observe that P_{i−1} is the orthogonal projection onto a space of dimension b(i − 1), so A^(i−1) = (I − P_{i−1}) A has rank at least bs − b(i − 1) = b(s − i + 1) ≥ b. Consequently, A^(i−1) Ω_i has rank precisely b. This shows that

R(Q_i) ⊆ R(I − P_{i−1}) = R(P_{i−1})^⊥ = R([Q_1, Q_2, . . . , Q_{i−1}])^⊥.

It follows that Q_j^* Q_i = 0 whenever j < i, which shows that Q̄_i is ON. Next,

B_i = Q_i^* A^(i−1) = Q_i^* (I − Q_{i−1} Q_{i−1}^*) A^(i−2) = Q_i^* A^(i−2) = · · · = Q_i^* A^(0) = Q_i^* A.

It follows that

Q̄_i^* A = [Q_1, . . . , Q_i]^* A = [Q_1^* A; . . . ; Q_i^* A] = [B_1; . . . ; B_i] = B̄_i.

Thus, (a) holds for i.

Proving (b) is a simple calculation. Combining (16) and (17) we get

A^(i) = A^(i−1) − Q_i Q_i^* A^(i−1) = (I − Q_i Q_i^*) A^(i−1),

which by the induction assumption (b) equals

(I − Q_i Q_i^*)(I − P_{i−1}) A = (I − (P_{i−1} + Q_i Q_i^*)) A,

where in the last step we used that Q_i^* P_{i−1} = 0. Since P_i = P_{i−1} + Q_i Q_i^*, this proves (b).

To prove (c), we observe that (20) implies that

(21)    R(Q_i) ⊆ R([AΩ_i, P_{i−1} A Ω_i]).

The induction assumption (c) tells us that

(22)    R(P_{i−1} A Ω_i) ⊆ R([Q_1, Q_2, . . . , Q_{i−1}]) = R([AΩ_1, AΩ_2, . . . , AΩ_{i−1}]).

Combining (21) and (22), we find

(23)    R(Q_i) ⊆ R([AΩ_1, AΩ_2, . . . , AΩ_{i−1}, AΩ_i]).

Equation (23) together with the induction assumption (c) imply that R([Q_1, Q_2, . . . , Q_i]) ⊆ R([AΩ_1, AΩ_2, . . . , AΩ_i]). But both of these spaces have dimension precisely bi, so the fact that one is a subset of the other implies that they must be identical. □

Let us next compare the blocked algorithm defined by relations (14)–(17) to the unblocked algorithm described in Figure 1. For a fixed Gaussian matrix Ω, let the output of the blocked version be {Q, B} and let the output of the unblocked method be {Q̃, B̃}. These two pairs of matrices need not be identical. (They depend on how exactly the QR factorizations are implemented, for instance.) However, Proposition 4.1 demonstrates that the projectors QQ^* and Q̃Q̃^* are identical. To be precise, both of these matrices represent the orthogonal projection onto the space R(AΩ). This means that the errors resulting from the two algorithms are also identical:

A − QQ^*A  (error from the blocked algorithm)  =  A − Q̃Q̃^*A  (error from the non-blocked algorithm).

Consequently, all theoretical results given in [6], cf. Section 3.2, directly apply to the output of the blocked algorithm too.
4.2. Adaptive rank determination.
The blocked algorithm defined by (14)–(17) was in Section 4.1 presented for the case where the rank ℓ of the approximation is given in advance. A perhaps more common situation in practical applications is that a precision ε > 0 is prescribed, and the task is to determine an approximation of as low rank as possible whose error is bounded by ε. Observe that in the algorithm defined by (14)–(17), we proved that after step i has been completed, we have

‖A^(i)‖ = ‖A − P_i A‖ = ‖A − [Q_1 Q_2 · · · Q_i] [Q_1 Q_2 · · · Q_i]^* A‖.

In other words, A^(i) holds precisely the residual remaining after step i. This means that incorporating adaptive rank determination is now trivial: we simply compute ‖A^(i)‖ after completing step i, and break once ‖A^(i)‖ ≤ ε. The resulting algorithm is shown as randQB_b in Figure 2. (The purpose of line (3’) will be explained in Section 4.3.)
function [Q, B] = randQB_b(A, ε, b)
(1)  for i = 1, 2, 3, . . .
(2)      Ω_i = randn(n, b)
(3)      Q_i = orth(A Ω_i)                                C_mm m n b + C_qr m b^2
(3’)     Q_i = orth( Q_i − Σ_{j=1}^{i−1} Q_j Q_j^* Q_i )   2(i−1) C_mm m b^2 + C_qr m b^2
(4)      B_i = Q_i^* A                                     C_mm m n b
(5)      A = A − Q_i B_i                                   C_mm m n b
(6)      if ‖A‖ < ε then stop
(7)  end for
(8)  Set Q = [Q_1 · · · Q_i] and B = [B_1^* · · · B_i^*]^*.

Figure 2. A blocked version of the randomized range finder, cf. Figure 1. The algorithm takes as input an m × n matrix A, a block size b, and a tolerance ε. Its output are factors Q and B such that ‖A − QB‖ ≤ ε. Note that if the algorithm is executed in exact arithmetic, then line (3’) does nothing. Text in blue refers to computational cost, see Section 4.4 for notation.

Remark 3. Recall that our default norm in this manuscript, the Frobenius norm, is simple to compute, which means that the check on whether to break the loop on line (6) in Figure 2 hardly adds at all to the execution time. If circumstances warrant the use of a norm that is more expensive to compute, then some modification of the algorithm would be required. Suppose, for instance, that we seek an approximation in the spectral norm. We could then use the fact that the Frobenius norm is an upper bound on the spectral norm, keep the Frobenius norm as the breaking condition, and then eliminate any “superfluous” degrees of freedom that were included in the post-processing step, cf. Section 3.3.1. (This approach would only be practicable for matrices whose singular values exhibit reasonable decay, as otherwise the discrepancy in the ε-ranks would be prohibitively large.)
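For reference, a NumPy sketch of randQB_b in Figure 2, including the re-projection step (3’) and the adaptive stopping criterion; the interface is our own, and the stopping check is performed at the top of the loop rather than at the bottom.

    import numpy as np

    def randqb_b(A, eps, b):
        """Blocked, adaptive randomized range finder: Q, B with ||A - Q @ B||_F <= eps."""
        A = np.array(A, dtype=float)              # A is overwritten with the residual A^(i)
        m, n = A.shape
        Q_blocks, B_blocks = [], []
        while np.linalg.norm(A, 'fro') > eps:
            Omega_i = np.random.randn(n, b)
            Q_i, _ = np.linalg.qr(A @ Omega_i)
            if Q_blocks:                          # line (3'): re-project away from earlier blocks
                Q_prev = np.hstack(Q_blocks)
                Q_i, _ = np.linalg.qr(Q_i - Q_prev @ (Q_prev.T @ Q_i))
            B_i = Q_i.T @ A
            A -= Q_i @ B_i                        # update the residual, as in (17)
            Q_blocks.append(Q_i)
            B_blocks.append(B_i)
        Q = np.hstack(Q_blocks) if Q_blocks else np.zeros((m, 0))
        B = np.vstack(B_blocks) if B_blocks else np.zeros((0, n))
        return Q, B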
4.3. Floating point arithmetic. When the algorithm defined by (14)–(17) is carried out in finite precision arithmetic, a serious problem often arises in that round-off errors accumulate and cause loss of orthonormality among the columns of {Q_1, Q_2, . . . }. The problem is that as the computation proceeds, the columns of each computed matrix Q_i will, due to round-off errors, drift into the span of the columns of {Q_1, Q_2, . . . , Q_{i−1}}. To fix this problem, we explicitly reproject Q_i away from the span of the previously computed basis vectors [1]. Line (3’) in Figure 2 represents the re-projection that is done to combat the accumulation of round-off errors. (Note that if the algorithm is carried out in exact arithmetic, then Q_j^* Q_i = 0 whenever j < i, so line (3’) would have no effect.)

4.4. Comparison of execution times.
Let us compare the computational cost of algorithms randQB (Figure 1) and randQB_b (Figure 2). To this end, let C_mm and C_qr denote the scaling constants for the cost of executing a matrix-matrix multiplication and a full QR factorization, respectively. (While the algorithms only need the function orth, cf. Section 2.4, this cost is for practical purposes the same as the cost of a QR factorization.) In other words, we assume that:

• Multiplying two matrices of sizes m × n and n × r costs C_mm m n r.
• Performing a QR factorization of a matrix of size m × n, with m ≥ n, costs C_qr m n^2.

Note that these are rough estimates. Actual costs depend on the actual sizes (note that the costs are dominated by data movement rather than flops), but this model is still instructive. The execution time for the algorithm in Figure 1 is easily seen to be

(24)    T_randQB ∼ 2 C_mm m n ℓ + C_qr m ℓ^2.

For the blocked algorithm of Figure 2, we assume that it stops after s steps and set ℓ = sb. Then

T_randQB_b ∼ Σ_{i=1}^s [ 3 C_mm m n b + 2(i−1) C_mm m b^2 + 2 C_qr m b^2 ] ∼ 3 s C_mm m n b + s^2 C_mm m b^2 + 2 s C_qr m b^2.

Using that sb = ℓ we find

(25)    T_randQB_b ∼ 3 C_mm m n ℓ + C_mm m ℓ^2 + (2/s) C_qr m ℓ^2.

Comparing (24) and (25), we see that the blocked algorithm involves one additional term of C_mm m n ℓ, but on the other hand spends less time executing full QR factorizations, as expected.
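The two cost models are easy to evaluate for concrete sizes; a small sketch (the constants c_mm and c_qr below are arbitrary placeholders, not measured values):

    def cost_randqb(m, n, ell, c_mm=1.0, c_qr=3.0):
        # Cost model (24) for the non-blocked scheme.
        return 2 * c_mm * m * n * ell + c_qr * m * ell**2

    def cost_randqb_b(m, n, ell, s, c_mm=1.0, c_qr=3.0):
        # Cost model (25) for the blocked scheme run for s blocks (block size ell/s).
        return 3 * c_mm * m * n * ell + c_mm * m * ell**2 + (2.0 / s) * c_qr * m * ell**2

In words, the blocked scheme pays one extra C_mm m n ℓ term while reducing the QR work by roughly a factor of s.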
Remark 4. All blocked algorithms that we present share the characteristic that they slightly increase the amount of time spent on matrix-matrix multiplication, while reducing the amount of time spent performing QR factorization. This is a good trade-off on many platforms, but becomes particularly useful when the algorithm is executed on a GPU. These massively multicore processors are particularly efficient at performing matrix-matrix multiplications, but struggle with communication-intensive tasks such as a QR factorization.

5. A version of the method with enhanced accuracy
5.1. Randomized sampling of a power of the matrix.
The accuracy of the basic randomized approximation scheme described in Section 3, and of the blocked version of Section 4, is well understood. The analysis of [6] (see the synopsis in Section 3.2) shows that the error ‖A − A_approx‖ depends strongly on the quantity ( Σ_{j=k+1}^{min(m,n)} σ_j^2 )^{1/2}. This implies that the scheme is highly accurate for matrices whose singular values decay rapidly, but less accurate when the “tail” singular values have substantial weight. The problem becomes particularly pronounced for large matrices. Happily, it was demonstrated in [9] that this problem can easily be resolved, when given a matrix with slowly decaying singular values, by simply applying a power of the matrix to be analyzed to the random matrix. To be precise, suppose that we are given an m × n matrix A, a target rank ℓ, and a small integer P (say P = 1 or P = 2). Then the following formula will produce an ON matrix Q whose columns form an approximation to the range:

Ω = randn(n, ℓ),   and   Q = orth( (A A^*)^P A Ω ).

The key observation here is that the matrix (A A^*)^P A has exactly the same left singular vectors as A, but its singular values are σ_j^{2P+1} (observe that our objective is to build an ON basis for the range of A, and the optimal such basis consists of the leading left singular vectors). Even a small value of P will typically provide enough decay that highly accurate results are attained. For a theoretical analysis, see [6, Sec. 10.4].

When the “power scheme” idea is to be executed in floating point arithmetic, substantial loss of accuracy happens whenever the singular values of A have a large dynamic range. To be precise, if ε_mach denotes the machine precision, then any singular components smaller than σ_1 ε_mach^{1/(2P+1)} will be lost. This problem can be resolved by orthonormalizing the “sample matrix” between each application of A and A^*. This results in the scheme we call randQB_p, shown in Figure 3. (Note that this scheme is virtually identical to classical subspace iteration started with a random Gaussian matrix [10].)

function [Q, B] = randQB_p(A, ℓ, P)
(1)  Ω = randn(n, ℓ).
(2)  Q = orth(A Ω).                         C_mm m n ℓ + C_qr m ℓ^2
(3)  for j = 1 : P
(4)      Q = orth(A^* Q).                   C_mm m n ℓ + C_qr m ℓ^2
(5)      Q = orth(A Q).                     C_mm m n ℓ + C_qr m ℓ^2
(6)  end for
(7)  B = Q^* A                              C_mm m n ℓ
Figure 3. An accuracy enhanced version of the basic randomized range finder in Figure 1. The algorithm takes as input an m × n matrix A, a rank ℓ, and a “power” P (see Section 5.1). The output are matrices Q and B of sizes m × ℓ and ℓ × n such that A ≈ QB. Higher P leads to better accuracy, but also higher cost. Setting P = 1 or P = 2 is often sufficient.

function [Q, B] = randQB_pb(A, ε, P, b)
(1)  for i = 1, 2, 3, . . .
(2)      Ω_i = randn(n, b).
(3)      Q_i = orth(A Ω_i).                               C_mm m n b + C_qr m b^2
(4)      for j = 1 : P
(5)          Q_i = orth(A^* Q_i).                         C_mm m n b + C_qr m b^2
(6)          Q_i = orth(A Q_i).                           C_mm m n b + C_qr m b^2
(7)      end for
(8)      Q_i = orth( Q_i − Σ_{j=1}^{i−1} Q_j Q_j^* Q_i )   2(i−1) C_mm m b^2 + C_qr m b^2
(9)      B_i = Q_i^* A                                     C_mm m n b
(10)     A = A − Q_i B_i                                   C_mm m n b
(11)     if ‖A‖ < ε then stop
(12) end for
(13) Set Q = [Q_1 · · · Q_i] and B = [B_1^* · · · B_i^*]^*.

Figure 4. A blocked and adaptive version of the accuracy enhanced algorithm shown in Figure 3. Its input and output are identical, except that we now provide a tolerance ε as an input (instead of a rank), and also a block size b.
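A NumPy sketch of randQB_p in Figure 3, with re-orthonormalization between each application of A and A^*; the interface is our own, and A.T should be replaced by the conjugate transpose for complex matrices. The blocked variant randQB_pb combines this power loop with the block update and re-projection of the earlier randqb_b sketch.

    import numpy as np

    def randqb_p(A, ell, P):
        """Power-scheme randomized range finder: A ~ Q @ B, with improved
        accuracy for matrices whose singular values decay slowly."""
        m, n = A.shape
        Omega = np.random.randn(n, ell)
        Q, _ = np.linalg.qr(A @ Omega)
        for _ in range(P):
            W, _ = np.linalg.qr(A.T @ Q)    # orthonormalize after applying A^*
            Q, _ = np.linalg.qr(A @ W)      # ... and again after applying A
        B = Q.T @ A
        return Q, B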
5.2. The blocked version of the power scheme. A blocked version of randQB_p is easily obtained by a process analogous to the one described in Section 4.1, resulting in the algorithm “randQB_pb” in Figure 4. Line (8) combats the problem of incremental loss of orthonormality when the algorithm is executed in finite precision arithmetic, cf. Section 4.3.
5.3. Computational complexity.
When comparing the computational cost of randQB_p (cf. Figure 3) versus randQB_pb (cf. Figure 4), we use the notation that was introduced in Section 4.4. By inspection, we directly find that

T_randQB_p ∼ (2 + 2P) C_mm m n ℓ + (1 + 2P) C_qr m ℓ^2.

For the blocked scheme, inspection tells us that

T_randQB_pb ∼ Σ_{i=1}^s [ (3 + 2P) C_mm m n b + 2(i−1) C_mm m b^2 + (2 + 2P) C_qr m b^2 ].

Executing the sum, and utilizing that ℓ = sb, we get

T_randQB_pb ∼ (3 + 2P) C_mm m n ℓ + C_mm m ℓ^2 + (1/s)(2 + 2P) C_qr m ℓ^2.

In other words, the blocked algorithm again spends slightly more time executing matrix-matrix multiplications, and quite a bit less time on QR factorizations. This trade is often favorable, and particularly so when the algorithm is executed on a GPU (cf. Remark 4). On the other hand, when ℓ ≪ n, the benefit of saving time on QR factorizations is minor.

5.4. Is re-orthonormalizing truly necessary?
Looking at algorithm randQB_p, it is very tempting to skip the intermediate QR factorizations and simply execute steps (2) – (6) as:

(2)  Y = A Ω.
(3)  for j = 1 : P
(4)      Y = A (A^* Y).
(5)  end for
(6)  Q = orth(Y)

This simplification does speed things up substantially, but as we mentioned earlier, it can lead to loss of accuracy. In this section we state some conjectures about when re-orthonormalization is necessary. These conjectures indicate that the blocked scheme is much more resilient to skipping re-orthonormalization.

To describe the issue, let us fix a (small) integer P, and define the matrix A_P = (A A^*)^P A. If the SVD of A is A = U Σ V^*, then the SVD of A_P is A_P = U Σ^{2P+1} V^*. In computing Y = A_P Ω, we lose all information about any singular mode i for which σ_i^{2P+1} ≤ σ_1^{2P+1} ε_mach, where ε_mach is the machine precision. In other words, in order to accurately resolve the first k singular modes, re-orthonormalization is needed if

(26)    σ_1 / σ_k > ε_mach^{−1/(2P+1)}.

As an example, with P = 2 and ε_mach = 10^{−16}, we find that ε_mach^{1/(2P+1)} = 10^{−3.2}, so re-orthonormalization is imperative to resolve any components smaller than σ_1 · 10^{−3.2}. Moreover, if we skip re-orthonormalization, we are likely to see an overall loss of accuracy affecting singular values and singular vectors associated with larger singular values.

Next consider the blocked scheme. The crucial observation is that now, instead of trying to extract the whole range of singular values {σ_j}_{j=1}^k (and their associated singular vectors) at once, we extract them in s groups of b modes each, where k ≈ sb. This means that we can expect to get reasonable accuracy as long as

(27)    σ_{(i−1)b+1} / σ_{ib} ≤ ε_mach^{−1/(2P+1)},   for i = 1, 2, . . . , s.

Comparing (26) and (27), we see that (27) is a much milder condition, in particular when the block size b is much smaller than k.

All claims in this section are heuristics. However, while they have not been rigorously proven, they are supported by extensive numerical experiments, see Section 6.3.
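When an estimate of the spectrum is available, conditions (26) and (27) are easy to check numerically; the following small helper is our own illustration, not part of the manuscript.

    import numpy as np

    def reorth_needed(sigma, k, P, b=None, eps_mach=1e-16):
        """Heuristic check of (26)/(27): does resolving the first k singular modes
        in the power scheme require re-orthonormalization?  sigma holds the
        singular values in decreasing order."""
        threshold = eps_mach ** (-1.0 / (2 * P + 1))
        if b is None:                              # non-blocked scheme, condition (26)
            return sigma[0] / sigma[k - 1] > threshold
        s = int(np.ceil(k / b))                    # blocked scheme, condition (27)
        return any(sigma[(i - 1) * b] / sigma[min(i * b, k) - 1] > threshold
                   for i in range(1, s + 1))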
6. Numerical experiments
In this section, we present numerical examples that illustrate the computational efficiency and the accuracy of the proposed scheme, see Sections 6.1 and 6.2, respectively. The codes we used are available at http://amath.colorado.edu/faculty/martinss/main_codes.html and we encourage any interested reader to try the methods out, and to explore different parameter sets than those included here.
6.1. Comparison of execution speeds.
We first compare the run times of different techniques for computing a partial (rank k) QR factorization of a given matrix A of size n × n. Observe that the choice of matrix is immaterial for a run time comparison (we investigate accuracy in Section 6.2). We compared three sets of techniques:

• Truncating a full QR factorization, computed using the Intel MKL libraries.
• Taking k steps of a column pivoted Gram-Schmidt process. The implementation was accelerated by using MKL library functions whenever practicable.
• The blocked “QB” scheme, followed by postprocessing of the factors to obtain a QR factorization. We used the “power method” described in Section 5 with parameters P = 0, 1, 2.

The results are shown in Figures 5 and 6. Our implementation of column-pivoted Gram-Schmidt is far slower than the built-in QR factorization in the MKL libraries. Even for as low of a rank as k = 100, we do not break even with a full factorization until n = 8 000. This implies that column pivoting can be implemented far more efficiently than we were able to. The point is that in order to attain the efficiency of the MKL libraries, very careful coding that is customized to any particular computing platform would have to be done. In contrast, our blocked code is able to exploit the very high efficiency of the MKL libraries with minimal effort.

Finally, it is worth noting how particularly efficient our blocked algorithms are when executed on a GPU. We gain a substantial integer factor speed-up over CPU speed in every test we conducted.

6.2. Accuracy of the randomized scheme.
We next investigate the accuracy of the randomized schemes versus column-pivoted QR on the one hand (easy to compute, not optimal) and versus the truncated SVD on the other (expensive to compute, optimal). We used 5 classes of test matrices that each have different characteristics:
Matrix 1 (fast decay):
Let A_1 denote an m × n matrix of the form A_1 = U D V^*, where U and V are randomly drawn matrices with orthonormal columns (obtained by performing a QR factorization of a random Gaussian matrix), and where D is a diagonal matrix with entries roughly given by d_j = g_j β^{j−1}, where g_j is a random number drawn from a uniform distribution on [0, 1] and β = 0.65. To machine precision, the rank of A_1 is about 75.

Matrix 2 (slow decay):
The matrix A_2 is formed just like A_1, but now the diagonal entries of D decay very slowly, with d_j = (1 + 200(j − 1))^{−1/2}.
Figure 5. Timing results for different algorithms on CPU and GPU for the partial (k = 100) QR factorization of an n × n matrix, with block size b = 20. The integer P denotes the parameter in the “power scheme” described in Section 5.
Figure 6. Timing results for the blocked QB scheme for different block sizes b. The integer P denotes the parameter in the “power scheme” described in Section 5.

Matrix 3 (sparse):
The matrix A_3 is a sparse matrix given by A_3 = Σ_{j=1}^{10} (2/j) x_j y_j^* + Σ_{j=11}^{min(m,n)} (1/j) x_j y_j^*, where x_j and y_j are random sparse vectors generated by the Matlab command sprand. With m = 800 and n = 600 this resulted in a matrix with roughly 6% non-zero elements. This matrix was borrowed from Sorensen and Embree [11] and is an example of a matrix for which column pivoted Gram-Schmidt performs particularly well.

Matrix 4 (Kahan):
This is a variation of the “Kahan counter-example”, which is a matrix designed so that Gram-Schmidt performs particularly poorly. The matrix here is formed via the matrix-matrix product S K, where S is the diagonal matrix S = diag(1, ζ, ζ^2, ζ^3, . . . ) and K is the upper triangular matrix with ones on the diagonal and every entry above the diagonal equal to −φ, with random ζ, φ > 0 such that ζ^2 + φ^2 = 1. Then S K is upper triangular, and for many choices of ζ and φ, classical column pivoting will yield poor performance, as the different column norms will be similar and pivoting will generally fail. The rank-k approximation resulting from column pivoted QR is substantially less accurate than the optimal rank-k approximation resulting from truncating the full SVD [5]. However, we obtain much better results than QR with the QB algorithm.

Matrix 5 (S shaped decay):
The matrix A_5 is built in the same manner as A_1 and A_2, but now the diagonal entries of D are chosen to first hover around 1, then decay rapidly, and then level out at a relatively high plateau, cf. Figure 11.

We compare four different techniques for computing a rank-k approximation to our test matrices:

SVD:
We computed the full SVD (using the Matlab command svd) and then truncated to the first k components.

Column-pivoted QR:
We implemented this using modified Gram-Schmidt with reorthogonalization to ensure that orthonormality is strictly maintained in the columns of Q.

randQB — single vector: This is the greedy algorithm labeled “Algorithm 1” in Section 1.2, implemented with q_j on line (4) chosen as q_j = y/‖y‖, where y = (A A^*)^P A ω and where ω is a random Gaussian vector.

randQB — blocked: This is the algorithm randQB_pb shown in Figure 4.

The results are shown in Figures 7 – 11. We make three observations: (1) When the “power method” described in Section 5 is used, the accuracy of randQB_pb exceeds that of column-pivoted QR in every example we tried, even for as low of a power as P = 1. (2) Blocking appears to lead to no loss of accuracy. In most cases, there is no discernible difference in accuracy between the blocked and the non-blocked versions. (3) The accuracy of randQB_pb is particularly good when errors are measured in the Frobenius norm. In almost all cases we investigated, essentially optimal results are obtained even for P = 1.

Figures 7 – 11 report the errors resulting from a single instantiation of the randomized algorithm. Appendix A provides more details on the statistical distribution of errors.
[Figures 7 – 11 each show the approximation error ‖A − A_k‖ as a function of k, in the spectral norm (left panel) and the Frobenius norm (right panel), for the truncated SVD, column-pivoted QR, and randQB / randQB_blocked with P = 0, 1, 2.]

Figure 7. Errors for the 800 × 600 “Matrix 1” whose singular values decay very rapidly. The block size is b = 10.

Figure 8. Errors for the 800 × 600 “Matrix 2” whose singular values decay slowly. The block size is b = 10.

Figure 9. Errors for the 800 × 600 “Matrix 3.” This is a sparse matrix for which column-pivoted Gram-Schmidt performs exceptionally well. However, randQB still gives better accuracy whenever a power P ≥ 1 is used.

Figure 10.
Errors for the 1 000 × 1 000 “Matrix 4” (the Kahan example). The block size is b = 20.
Figure 11. Errors for the 800 × 600 “Matrix 5” whose singular values show an “S-shaped” decay. Here b = 15.

6.3. When re-orthonormalization is required.
We claimed in Section 5.4 that the blocked scheme is more robust towards loss of orthonormality than the non-blocked scheme presented in [6]. To test this hypothesis, we tested what happens if we skip the re-orthonormalization between applications of A and A^* in the algorithms shown in Figures 3 and 4. The results are shown in Figure 12. The key observation here is that the blocked versions of randQB still always yield excellent precision. When the block size is large, the convergence is slowed down a bit compared to the more meticulous implementation, but essentially optimal accuracy is nevertheless obtained relatively quickly.
Remark 5. The numerical results in Figure 12 substantiate the claim that for the unblocked version, the best accuracy attainable is σ_1 ε_mach^{1/(2P+1)}. In all examples, we have σ_1 = 1, so the prediction is that for P = 1 the maximum precision is (10^{−16})^{1/3} ≈ 10^{−5.3}, and for P = 2 it is (10^{−16})^{1/5} ≈ 10^{−3.2}. The results shown precisely follow this pattern. Observe that for A_5, no loss of accuracy is seen at all, since the singular values we are interested in level out at a plateau well above this threshold.
Figure 12.
Errors incurred when not re-orthonormalizing between applications of A and A^* in the “power method,” cf. Sections 5.4 and 6.3. The non-blocked scheme (red) performs precisely as predicted, and cannot resolve anything beyond precision 10^{−5.3} when P = 1 and 10^{−3.2} when P = 2. The blocked version converges slightly slower when skipping re-orthonormalization but always reaches full precision. (The four panels show spectral-norm errors for “Matrix 1” with P = 1, “Matrix 1” with P = 2, “Matrix 2” with P = 2, and “Matrix 5” with P = 1, comparing svds, the unblocked randQB, and the blocked randQB with b = 10, 20, 30.)

7. Concluding remarks
We have described a randomized algorithm for the low-rank approximation of matrices. The algorithm is based on the randomized sampling paradigm described in [7, 9, 6, 8]. In this article, we introduce a blocking technique which allows us to incorporate adaptive rank determination without sacrificing computational efficiency, and an updating technique that allows us to replace the randomized stopping criterion proposed in [6] with a deterministic one. Through theoretical analysis and numerical examples, we demonstrate that while the blocked scheme is mathematically equivalent to the non-blocked scheme of [7, 9, 6, 8] when executed in exact arithmetic, the blocked scheme is slightly more robust towards accumulation of round-off errors.

The updating strategy that we propose is directly inspired by a classical scheme for computing a partial QR factorization via the column pivoted Gram-Schmidt process. We demonstrate that the randomized version that we propose is more computationally efficient than this classical scheme (since it is hard to block the column pivoting scheme). Our numerical experiments indicate that the randomized version not only improves speed, but also leads to higher accuracy. In fact, in all examples we present, the errors resulting from the blocked randomized scheme are very close to the optimal error obtained by truncating a full singular value decomposition. In particular, when errors are measured in the Frobenius norm, there is almost no loss of accuracy at all compared to the optimal factorization, even for matrices whose singular values decay slowly.

The scheme described can output any of the standard low-rank factorizations of matrices such as, e.g., a partial QR or SVD factorization. It can also with ease produce less standard factorizations such as the “CUR” and “interpolative decomposition (ID)” factorizations, cf. Section 3.3.

Acknowledgements:
The research reported was supported by DARPA, under the contract N66001-13-1-4050, and by the NSF, under the contract DMS-1407340.
References

[1] Å. Björck, Numerics of Gram-Schmidt orthogonalization, Linear Algebra Appl. 197/198 (1994), 297–316. Second Conference of the International Linear Algebra Society (ILAS) (Lisbon, 1992). MR 1275620 (95b:65060)
[2] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, On the compression of low rank matrices, SIAM Journal on Scientific Computing 26 (2005), no. 4, 1389–1404.
[3] Carl Eckart and Gale Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936), no. 3, 211–218.
[4] Gene H. Golub and Charles F. Van Loan, Matrix computations, fourth ed., Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 2013. MR 3024913
[5] Ming Gu and Stanley C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput. 17 (1996), no. 4, 848–869. MR 97h:65053
[6] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review 53 (2011), no. 2, 217–288.
[7] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert, A randomized algorithm for the approximation of matrices, Tech. Report YALEU/DCS/RR-1361, Computer Science Department, Yale University, 2006.
[8] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert, A randomized algorithm for the decomposition of matrices, Appl. Comput. Harmon. Anal. 30 (2011), no. 1, 47–68. MR 2737933 (2011i:65066)
[9] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert, A randomized algorithm for principal component analysis, SIAM Journal on Matrix Analysis and Applications 31 (2009), no. 3, 1100–1124.
[10] Youcef Saad, Overview of Krylov subspace methods with applications to control problems, Signal processing, scattering and operator theory, and numerical methods (Amsterdam, 1989), Progr. Systems Control Theory, vol. 5, Birkhäuser Boston, Boston, MA, 1990, pp. 401–410. MR 1115470
[11] Danny C. Sorensen and Mark Embree, A DEIM induced CUR factorization, arXiv preprint arXiv:1407.5516 (2014).
[12] S. Voronin and P.-G. Martinsson, A CUR factorization algorithm based on the interpolative decomposition, arXiv e-prints (2014).
Appendix A. Distribution of errors
The output of our randomized blocked approximation algorithms is a random variable, since it depends on the drawing of a Gaussian matrix Ω. It has been proven (see, e.g., [6]) that due to concentration of mass, the variation in this random variable is tiny. The output is for practical purposes always very close to the expectation of the output. For this reason, when we compared the accuracy of the randomized method to classical methods in Section 6.2, we simply presented the results from one particular draw of Ω. In this section, we provide some more detailed numerical experiments that illuminate exactly how little variation there is in the output of the algorithm.

We present the results from all matrices considered in Section 6.2 except Matrix 4 (the so-called “Kahan counter-example”), since this is an artificial example concocted specifically to give poor results for column-pivoted Gram-Schmidt.

Figures 13 through 20 provide more information about the statistics of the outcome for the experiments reported for a single instantiation in Figures 7 through 11. For each experiment, we show both the empirical expectation of the accuracy, and the error paths from 25 different instantiations. We observe that the errors are in all cases tightly clustered, in particular for P = 1 and P = 2. We also observe that when the singular values decay slowly, the clustering is stronger in the Frobenius norm than in the spectral norm.

In our final set of experiments, we increased the number of experiments from 25 to 2 000. To keep the plots legible, we plot the errors only for a fixed value of k, see Figure 21. These experiments further substantiate our claim that the results are tightly clustered, in particular when the “power method” is used, and when the Frobenius norm is used.
Figure 13. The empirical mean errors from 25 instantiations of the randomized factorization algorithm applied to Matrix 1.

Figure 14. The error paths for 25 instantiations of the randomized factorization algorithm applied to Matrix 1.

Figure 15. The empirical mean errors from 25 instantiations of the randomized factorization algorithm applied to Matrix 2.

Figure 16. The error paths for 25 instantiations of the randomized factorization algorithm applied to Matrix 2.

Figure 17. The empirical mean errors from 25 instantiations of the randomized factorization algorithm applied to Matrix 3.

Figure 18. The error paths for 25 instantiations of the randomized factorization algorithm applied to Matrix 3.

Figure 19. The empirical mean errors from 25 instantiations of the randomized factorization algorithm applied to Matrix 5.

Figure 20. The error paths for 25 instantiations of the randomized factorization algorithm applied to Matrix 5.
Figure 21.
Each blue cross in the graphs represents one instantiation of the randomized blocked algorithm. The x- and y-coordinates show the relative errors in the spectral and Frobenius norms, respectively. For reference, we also include the error from classical column-pivoted Gram-Schmidt (the magenta diamond), and the error incurred by the truncated SVD. The dashed lines are the horizontal and vertical lines cutting through the point representing the SVD; since these errors are minimal, every other dot must be located above and to the right of these lines. (The four panels show Matrices 1, 2, 3, and 5, each with 2 000 instantiations, at k = 60 for Matrices 1 – 3 and k = 75 for Matrix 5.)