Randomized Projection for Rank-Revealing Matrix Factorizations and Low-Rank Approximations
JED A. DUERSCH† AND MING GU‡

Abstract.
Rank-revealing matrix decompositions provide an essential tool in spectral analysis of matrices, including the Singular Value Decomposition (SVD) and related low-rank approximation techniques. QR with Column Pivoting (QRCP) is usually suitable for these purposes, but it can be much slower than the unpivoted QR algorithm. For large matrices, the difference in performance is due to increased communication between the processor and slow memory, which QRCP needs in order to choose pivots during decomposition. Our main algorithm, Randomized QR with Column Pivoting (RQRCP), uses randomized projection to make pivot decisions from a much smaller sample matrix, which we can construct to reside in a faster level of memory than the original matrix. This technique may be understood as trading vastly reduced communication for a controlled increase in uncertainty during the decision process. For rank-revealing purposes, the selection mechanism in RQRCP produces results that are the same quality as the standard algorithm, but with performance near that of unpivoted QR (often an order of magnitude faster for large matrices). We also propose two formulas that facilitate further performance improvements. The first efficiently updates sample matrices to avoid computing new randomized projections. The second avoids large trailing updates during the decomposition in truncated low-rank approximations. Our truncated version of RQRCP also provides a key initial step in our truncated SVD approximation, TUXV. These advances open up a new performance domain for large matrix factorizations that will support efficient problem-solving techniques for challenging applications in science, engineering, and data analysis.
Key words.
QR factorization, column pivoting, rank-revealing, random sampling, sample update, blocked algorithm, low-rank approximation, truncated SVD
AMS subject classifications.
1. Introduction.
QR with Column Pivoting (QRCP) is a fundamental kernel in numerical linear algebra that broadly supports scientific analysis. As a rank-revealing matrix factorization, QRCP provides the first step in efficient implementations of spectral methods such as the eigenvalue decomposition and Principal Component Analysis (PCA), also called the Singular Value Decomposition (SVD) [16]. QRCP also plays a key role in least-squares approximation [6] and stable basis extraction [33, 15, 10] for other important algorithms. These methods allow us to form compressed representations of linear operators by truncation while retaining dominant features that facilitate analytic capabilities that would otherwise be impractical for large matrices. In the field of data science, PCA and its generalizations [35] support unsupervised machine learning techniques to extract salient features in two-dimensional numerical arrays. Randomized methods have been extended further to support analysis of multidimensional data using tensor decompositions [1, 17].

QRCP builds on the QR decomposition, which expresses a matrix $A$ as the product of an orthogonal matrix $Q$ and a right-triangular factor $R$ as $A = QR$. Unlike the LU decomposition, or Gaussian elimination, QR always exists and may be stably computed regardless of the conditioning of $A$. Furthermore, finely tuned library implementations use a blocked algorithm that operates with the communication efficiency of matrix-matrix multiply, or level-3 kernels in the Basic Linear Algebra Subprograms (BLAS-3).

∗Revised from Randomized QR with Column Pivoting for submission to SIGEST.
Funding:
The original work was supported by NSF award 1319312. This revised version was supported by NSF award 1760316 and by the Laboratory Directed Research and Development program at Sandia National Laboratories.
†Sandia National Laboratories, Livermore, CA 94550, United States ([email protected]).
‡Department of Mathematics, University of California, Berkeley, CA 94720 ([email protected]).
The standard QR decomposition is not, however, suitable for rank detection or low-rank approximations. These applications require a column permutation scheme to process more representative columns of $A$ earlier in the decomposition [5, 4]. A permutation matrix $P$ encodes these pivoting decisions so that the decomposition becomes $AP = QR$.

The basic QRCP approach selects an unfactorized column with a maximal 2-norm of components that do not reside within the span of previously factorized columns. This heuristic is a greedy algorithm attempting to maximize the sequence of partial determinants in $R$, which is typically adequate for rank detection with a few notable rare exceptions, such as the Kahan matrix [12]. The critical drawback, however, is that these computations suffer a substantial increase in communication between slow (large) and fast (small) levels of memory, especially for large matrices. Each pivot decision requires at least one matrix-vector multiply in series, thus limiting overall efficiency to that of level-2 kernels in the Basic Linear Algebra Subprograms (BLAS-2). Our primary motivation to examine efficient rank-revealing algorithms is that, for large matrices or algorithms that require frequent use of QRCP, the communication bottleneck becomes prohibitively expensive and impedes utilization of this important set of analytic tools.

Randomized projection allows us to reduce communication complexity at the cost of increasing uncertainty regarding latent 2-norms (and inner products) among columns in a large matrix. Randomized QR with Column Pivoting (RQRCP), shown in Algorithm 3.2, harnesses this trade-off to obtain full blocks of pivot decisions that can be applied to $A$ all at once. We do this by drawing a Gaussian Independent Identically Distributed (GIID) matrix, $\Omega$, and compressing columns of $A$ into a sample matrix, $B = \Omega A$. The sample matrix retains enough information about the original matrix for us to obtain a full block of pivoting decisions, while requiring far less communication to do so. Because we construct the sample to contain far fewer rows than the original, it resides in a faster level of memory than the full matrix. Having a full block of pivots allows us to update the matrix that becomes $R$ with blocked BLAS-3 operations in the same fashion as unpivoted QR, thus entirely eliminating the communication bottleneck of BLAS-2 operations encountered in the standard algorithm.

We show how rank-revealing decompositions are well suited to this approach because each pivot decision must merely avoid selecting a column containing a relatively small component orthogonal to the span of previous pivots. As many columns are often suitable for this purpose, the use of precise 2-norm computations in standard QRCP is unnecessary. Our approach has been adopted in subsequent work in computing matrix factorization-based low-rank approximations on sequential and parallel platforms [11, 24, 25, 27, 37].

We also propose a sample update formula that reduces the number of BLAS-3 operations required to process a full matrix. Given a block of pivots, updating $R$ with blocked Householder reflections requires two matrix-matrix multiplications. Without a formula to update the sample, we would need a new sample matrix after each block and one additional matrix-matrix multiply.
Instead, we utilize computations that are needed to update $R$ to also update $B$ into a suitable sample for the next decision block.

RQRCP naturally extends to a truncated formulation, described in Algorithm 4.1, that further reduces communication by avoiding trailing updates. For low-rank matrix approximations, this algorithm requires only one block matrix-matrix multiply per update. We accomplish this by storing reflector information in the compact WY notation [29], which allows us to construct each block of $R$ without intermediate updates. Moreover, this algorithm serves as a key initial step in our approximation of the truncated SVD described in Algorithm 4.2, which is a variant of Stewart's QLP algorithm [34].

Section 2 will discuss the nature of the communication bottleneck and related approaches to address it. Section 3 will analyze sample-based pivoting as the maximization of expected utility and derive our main result, RQRCP. Section 4 will derive and explain our truncated algorithms for low-rank approximation. Section 5 will provide numerical experiments that explore the performance and decomposition quality of these approaches. Section 6 will offer concluding remarks.
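Before developing the method formally, the following toy sketch (our own illustration in Python with NumPy, not code from the paper) shows the sampling idea at work: a GIID compression preserves column 2-norms up to the scale factor $\ell$, well enough to rank columns against one another.

```python
# Illustration only: a GIID sketch B = Omega @ A preserves column 2-norms
# of A up to the scale factor ell, so pivot decisions can be made from
# the much smaller matrix B.
import numpy as np

rng = np.random.default_rng(0)
m, n, ell = 10_000, 50, 40            # ell = b + p rows in the sample
A = rng.standard_normal((m, n))
Omega = rng.standard_normal((ell, m))
B = Omega @ A                          # sample matrix, ell x n

true_sq = (A * A).sum(axis=0)          # ||a_j||^2
est_sq = (B * B).sum(axis=0) / ell     # E[||b_j||^2] = ell * ||a_j||^2
# typical relative error is on the order of sqrt(2/ell)
print(np.median(np.abs(est_sq / true_sq - 1.0)))
```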
2. Related Work.
In order to understand how our algorithms improve performance, we first review the reasons why additional communication could not be avoided with previous approaches. Given a large matrix $A \in \mathbb{R}^{m \times n}$, the QRCP decomposition can be computed using Algorithm 2.1. This algorithm is composed of a sequence of Householder reflections. To review, a reflector $y$ is formed from a particular column, say $a$, by subtracting the desired reflection (which must have the same 2-norm) from the current form, as $y = a - (-\operatorname{sign}(a_1)\,\|a\|\, e_1)$. The negative sign of the leading element of $a$ is used to ensure that the reflector satisfies $\|y\| \geq \|a\|$ for numerical stability. The corresponding reflection coefficient is $\tau = 2 / (y^T y)$, which yields the Householder reflection $H = I - y\,\tau\,y^T$.

In the algorithms that follow, an intermediate state of an array or operator at the end of iteration $j$ is denoted by superscript $(j)$ to emphasize when updates overwrite a previous state. In contrast, an element computed on iteration $j$ that remains accessible in some form is denoted with a simple subscript.
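As a concrete illustration of the formulas above, here is a minimal sketch of reflector formation in Python with NumPy (our own sketch; LAPACK's DLARFG performs this step with more careful scaling and handling of degenerate cases):

```python
import numpy as np

def house(a):
    """Return (y, tau) with H = I - tau y y^T and H a = -sign(a1)||a|| e1."""
    y = a.astype(float).copy()
    sigma = np.sign(y[0]) if y[0] != 0 else 1.0
    y[0] += sigma * np.linalg.norm(a)   # y = a - (-sign(a1) ||a|| e1)
    tau = 2.0 / (y @ y)
    return y, tau

a = np.array([3.0, 1.0, 5.0, 1.0])      # ||a|| = 6
y, tau = house(a)
print(a - tau * y * (y @ a))             # H a = [-6, 0, 0, 0]
```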
Algorithm 2.1: QRCP with BLAS-2 Householder reflections.
Require: $A$ is $m \times n$.
Ensure: $Q$ is an $m \times m$ orthogonal matrix. $R$ is an $m \times n$ right triangular matrix with diagonals of nonincreasing magnitude. $P$ is an $n \times n$ permutation matrix such that $AP = QR$.
1: function $[Q, R, P] = \text{QRCP}(A)$
2:   Compute initial column 2-norms, which become trailing column norms.
3:   for $j = 1, 2, \ldots, k$, where $k = \min(m, n)$, do
4:     Find index $p_j$ of the column with maximum trailing 2-norm.
5:     Swap columns $j$ and $p_j$ with permutation $P_j$.
6:     Form Householder reflection $H_j = I - y_j \tau_j y_j^T$ from the new column.
7:     Apply reflection $A^{(j)} = H_j (A^{(j-1)} P_j)$.
8:     Update trailing column norms by removing the contribution of row $j$.
9:   end for
10:  $Q = H_1 H_2 \cdots H_k$ is the product of all reflections. $R = A^{(k)}$. $P = P_1 P_2 \cdots P_k$ is the aggregate column permutation.
11: end function

At the end of iteration $j$ we can represent the matrix $A$ as a partial factorization using the permutation $P^{(j)} = P_1 \cdots P_j$, the composition of column swaps so far, and the analogous composition of Householder reflections, $Q^{(j)} = H_1 \cdots H_j$, to obtain
$$AP^{(j)} = Q^{(j)} \begin{bmatrix} R_{11}^{(j)} & R_{12}^{(j)} \\ 0 & \hat{A}^{(j)} \end{bmatrix}.$$
The leading $j-1$ elements of the reflector $y_j$ formed in line 6 of Algorithm 2.1 are 0. The upper-left submatrix $R_{11}^{(j)} \in \mathbb{R}^{j \times j}$ is right-triangular. Likewise, $R_{12}^{(j)} \in \mathbb{R}^{j \times (n-j)}$ completes the leading $j$ rows of $R$. It only remains to process the trailing matrix $\hat{A}^{(j)}$. On the next iteration, the 2-norm of the selected column within $\hat{A}^{(j)}$ becomes the magnitude of the next diagonal element in $R_{11}^{(j+1)}$. We may understand QRCP as a greedy procedure intended to maximize the magnitude of the determinant of $R_{11}^{(j+1)}$. The new determinant magnitude is
$$\big|\det R_{11}^{(j+1)}\big| = \big|\det R_{11}^{(j)}\big| \, \big\| \hat{A}^{(j)}(:, p_{j+1}) \big\|,$$
so this scheme has selected the pivot that multiplies the previous determinant by the largest factor. Note that true determinant maximization would require exchanging prior columns and adjusting the factorization accordingly [13].

Early implementations of QR also relied on BLAS-2 kernels and, thus, gave similar performance results until reflector blocking was employed in QR [2, 32]. To process $b$ columns of an $m \times n$ matrix with BLAS-2 kernels, $O(bmn)$ elements must pass from slow to fast memory. Blocking improves performance by reducing communication between these layers of memory using matrix-matrix multiply. Instead of updating the entire matrix with each Householder reflection, transformations are collected into a matrix representation that can be applied using two BLAS-3 matrix-matrix multiplies, which reduces communication complexity to $O(bmn/M^{1/2})$, where $M$ is the size of fast memory.

In order to produce a correct pivot decision at iteration $j+1$ in QRCP, however, trailing column norms must be updated to remove the contribution of row $j$, which depends on the Householder transformation $H_j$. At first glance, this update appears to require two BLAS-2 operations on the trailing matrix per iteration. The first operation computes scaled inner products $w_j^T = \tau_j y_j^T A^{(j-1)} P_j$ and the second operation modifies the trailing matrix with the rank-1 update $A^{(j)} = A^{(j-1)} P_j - y_j w_j^T$. These two operations cause the communication bottleneck in QRCP. Quintana-Ortí, Sun, and Bischof [30] were able to halve BLAS-2 operations with the insight that the trailing norm update does not require completing the full rank-1 update on each iteration. Instead, reflections can be gathered into blocks, as in QR.
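As a numerical aside, the following small check (our own sketch, assuming NumPy) confirms that a block of $b$ Householder reflections applied with two matrix-matrix products in the compact WY form $Q = I - YTY^T$ matches $b$ individual rank-1 updates:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, b = 300, 120, 32

# Unit-lower-trapezoidal reflectors; tau = 2/(y^T y) makes each H orthogonal.
Y = np.tril(rng.standard_normal((m, b)), -1)
Y[np.arange(b), np.arange(b)] = 1.0
tau = 2.0 / (Y * Y).sum(axis=0)

# Standard forward recurrence for the triangular connection matrix T,
# so that H_1 H_2 ... H_b = I - Y T Y^T.
T = np.zeros((b, b))
for j in range(b):
    T[j, j] = tau[j]
    T[:j, j] = -tau[j] * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))

A = rng.standard_normal((m, n))

# BLAS-2 reference: apply reflections one at a time (b rank-1 updates).
A_ref = A.copy()
for j in range(b):
    y = Y[:, j]
    A_ref -= np.outer(y, tau[j] * (y @ A_ref))

# BLAS-3 version: W^T = T^T (Y^T A), then A <- A - Y W^T (two large GEMMs).
W_T = T.T @ (Y.T @ A)
A_blk = A - Y @ W_T
print(np.allclose(A_ref, A_blk))   # True
```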
The approach of Quintana-Ortí, Sun, and Bischof appears in Algorithm 2.2. At the end of iteration $j$, the algorithm has collected a block of reflectors $Y^{(j)}$. Reflector $y_i$, for $i \leq j$, appears in column $i$ from the diagonal down. This forms a block reflection $Q^{(j)} = I - Y^{(j)} T^{(j)} Y^{(j)T}$, where $T^{(j)}$ is an upper triangular $j \times j$ connection matrix that can be solved from $Y^{(j)}$ so that $Q^{(j)}$ is orthogonal. This algorithm must also collect each corresponding scaled inner product, $w_i^T$, which appears as row $i$ in a matrix $W^{(j)T}$. In the compact WY notation, this block of inner products is $W^{(j)T} = T^{(j)T} Y^{(j)T} A P^{(j)}$, which provides enough information to update row $j$ alone and adjust trailing column norms to prepare for the next pivot selection:
$$A^{(j)}(j,:) = A^{(j-1)}(j,:)\, P_j - Y^{(j)}(j,:)\, W^{(j)T}.$$

Note, however, that this construction complicates reflector formation. As before, the next pivot index $p_{j+1}$ is selected and swapped into column $j+1$. Let this new column be $a_{j+1}$. From rows $j+1$ down, elements of $a_{j+1}$ have not been updated with the current block of reflectors. Before we can form $y_{j+1}$, prior transformations must be applied to these rows from the formula $\hat{a}_{j+1} = a_{j+1} - Y^{(j)} W^{(j)T}(:, p_{j+1})$. An additional step is also required to compute reflector inner products because they must account for reflections that have not been applied to the trailing matrix. The adjusted formula for these inner products is
$$w_{j+1}^T = \tau_{j+1} \left( y_{j+1}^T A^{(j)} - (y_{j+1}^T Y^{(j)})\, W^{(j)T} \right) P_{j+1}.$$
The reflector and inner product blocks are then updated:
$$Y^{(j+1)} = \begin{bmatrix} Y^{(j)} & y_{j+1} \end{bmatrix} \quad\text{and}\quad W^{(j+1)T} = \begin{bmatrix} W^{(j)T} P_{j+1} \\ w_{j+1}^T \end{bmatrix}.$$
Unfortunately, the remaining BLAS-2 operations $y_{j+1}^T A^{(j)}$ and $y_{j+1}^T Y^{(j)}$ in the inner product computation still dominate slow communication complexity for large matrices. The entire trailing matrix must pass from slow to fast memory once per iteration. Consequently, even heavily optimized implementations of blocked QRCP still run substantially slower than blocked QR on both sequential and parallel architectures.
Algorithm 2.2: QRCP with BLAS-3 reflection blocking.
Require: $A$ is $m \times n$.
Ensure: $Q$ is an $m \times m$ orthogonal matrix. $R$ is an $m \times n$ right triangular matrix with diagonals in nonincreasing magnitude order. $P$ is an $n \times n$ permutation matrix such that $AP = QR$.
function $[Q, R, P] = \text{QRCP}(A)$
  Compute initial column 2-norms, which will become trailing column norms.
  for $i = 0, b, 2b, \ldots$, where $b$ is the block size, do
    for $j = i+1, i+2, \ldots, \min(i+b, k)$, where $k = \min(m, n)$, do
      Find index $p_j$ of the column with maximum trailing 2-norm.
      Apply permutation $P_j$ swapping column $j$ with $p_j$.
      Update column $j$ with prior reflections in this block.
      Form reflector $y_j$ and $\tau_j$ from new column $j$.
      Compute adjusted reflector inner products $w_j^T$.
      Update row $j$ with all reflections in this block.
      Update trailing column norms by removing the contribution of row $j$.
    end for
    Apply block reflection to trailing matrix.
  end for
  $Q = I - Y_k T_k Y_k^T$, where $T_k$ can be recovered from $Y_k$ and $\tau_1, \ldots, \tau_k$. $R = A^{(k)}$. $P = P_1 P_2 \cdots P_k$ is the aggregate column permutation.
end function
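The norm downdate in the inner loop rests on the identity that each factored row of $R$ accounts exactly for the mass removed from every trailing column. A small check of that identity (our own sketch, assuming NumPy/SciPy, not an implementation of the interleaved algorithm itself):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
A = rng.standard_normal((300, 80))

nrm2 = (A * A).sum(axis=0)                 # squared column norms of A
Q, R, piv = qr(A, mode='economic', pivoting=True)
for j in range(5):
    nrm2_piv = nrm2[piv]                   # norms in pivoted order
    # After j+1 factored rows, the trailing squared norms equal the
    # originals minus the contributions of rows 0..j of R.
    downdated = nrm2_piv - (R[:j + 1, :] ** 2).sum(axis=0)
    print(np.allclose(downdated[j + 1:],
                      (R[j + 1:, j + 1:] ** 2).sum(axis=0)))   # True
```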
Several mechanisms have been put forward to avoid repeating full passes over the trailing matrix on each iteration. Bischof [3] proposed pivoting restricted to local blocks, and Demmel et al. [7, 8] propose a procedure called Communication Avoiding Rank-Revealing QR (CARRQR). CARRQR proceeds by partitioning the trailing matrix into $P$ subsets of columns that are processed independently and possibly simultaneously. From within each column subset, $b$ candidate pivots are selected using QRCP. Adjacent subsets of candidates are then combined to form $P/2$ subsets of $2b$ candidates. This procedure continues, using QRCP to filter $b$ candidates per subset followed by merging results into $P/4$ subsets and so on, until only one subset of $b$ candidates remains. The trailing matrix is then updated as before, with blocked reflections.

We now examine several practical constraints in implementing CARRQR. First, the reflectors $Y$, inner products $W^T$, and leading rows of $R$ must be stored separately from the original matrix for each independently processed subset of columns. Furthermore, one must employ a version of QRCP that avoids the trailing update, because the final reflectors are unknown until the last selection stage. Any intermediate changes to the original columns would have to be undone before the final transformations can be correctly processed. In contrast, QRCP can be written to convert columns into reflectors, storing the results in the same array as the input on the strictly lower triangular portion of the matrix. Likewise, $R$ can be stored in place of the input on the upper triangle.

Depending on the initial column partition, CARRQR performs up to 2 times as many inner products as QRCP per block iteration. Note that as the reflector index $j$ increases, the total number of inner products of the form $y_{j+1}^T y_{j+1}$, $y_{j+1}^T Y^{(j)}$, and $y_{j+1}^T A^{(j)}$ remains constant. Therefore, if the $i$th subset contains $n_i$ columns, $bn_i$ inner products will be required to produce $b$ candidates. Letting $n_1 + n_2 + \cdots + n_P = n$ on the first stage of refinement, summing over all of the subsets gives $bn$ inner products to produce $P$ subsets of $b$ candidates. QRCP requires the same computational complexity to produce $b$ final pivots. Assuming that the number of candidates is at least halved for each subsequent stage of refinement in CARRQR, it easily follows that no more than $2bn$ inner products will be computed in total.

Despite increased computational complexity, CARRQR is intended to benefit from better memory utilization and better parallel scalability. If each column subset is thin enough to fit in fast memory, then slow communication is eliminated between iterations of $j$. The only remaining slow communication transmits pivot candidates between stages of refinement.

Unfortunately, writing and tuning CARRQR is nontrivial. We implemented this algorithm and found that it ran slightly slower than the LAPACK implementation of Algorithm 2.2, called DGEQP3, on a shared-memory parallel machine. We believe this was mainly due to inefficient parallelization in the final stages of refinement. We assigned each column subset to a different processor, which then worked independently to produce candidates. This approach was attractive because it did not require communication between processors during each filtration stage. However, despite communication efficiency, this technique can only engage as many processors as there are column subsets. Most processors are left idle during the final stages of refinement. A second problem with this approach occurs when the matrix is too tall. In such cases, it is not possible to select column subsets that are thin enough to fit into fast memory. An efficient implementation would need alternative or additional workload-splitting tactics to use all processors at every stage of refinement.

As we will discuss in the next section, the method we propose also gathers pivots into blocks, which are then applied to the trailing matrix. Our method, however, improves performance by reducing both the communication and computational complexity needed to form a block of pivots.
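For intuition, the following rough sketch (our own, assuming NumPy/SciPy, and not the authors' implementation) mimics the CARRQR tournament: column subsets nominate candidates by QRCP and adjacent subsets are merged and refiltered until one set of $b$ pivots remains. For simplicity it filters at each merge rather than pre-filtering every subset first.

```python
import numpy as np
from scipy.linalg import qr

def tournament_pivots(A, b, subsets=8):
    """Select b pivot columns by pairwise-merge tournament (sketch)."""
    n = A.shape[1]
    groups = [list(g) for g in np.array_split(np.arange(n), subsets)]
    while len(groups) > 1:
        merged = []
        for g1, g2 in zip(groups[::2], groups[1::2]):
            cand = g1 + g2
            # QRCP on the merged candidates keeps the best b of them
            _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
            merged.append([cand[i] for i in piv[:b]])
        if len(groups) % 2 == 1:          # carry an unpaired subset forward
            merged.append(groups[-1])
        groups = merged
    return groups[0][:b]

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 64)) @ rng.standard_normal((64, 256))
print(tournament_pivots(A, b=16))
```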
3. Randomized Projection for Sample Pivoting.
Sampling via randomized projection has proven to be beneficial for a variety of applications in numerical linear algebra [22, 36, 31, 14, 23, 26]. Sampling reduces communication complexity via dimensional reduction, while simultaneously maintaining a safe degree of uncertainty in the approximations that follow. This technique is typically framed using the Johnson-Lindenstrauss lemma [19]. Let $a_j$ represent the $j$th column of $A$ for $j = 1, 2, \ldots, n$. There exists a distribution over $\ell \times m$ matrices such that a randomly drawn matrix $\Omega$ yields a lower dimensional embedding $b_j = \Omega a_j$, with high probability of preserving distances within a relative error $\varepsilon$. Because we would like to use the sample to obtain a block of $b$ pivots while controlling the degree of uncertainty in the true norms of $A$, we include padding $p$ in the sample rank so that $\ell = b + p$. When $\Omega$ is GIID, the expected 2-norms and variance are
$$\mathbb{E}_\Omega\left[ \|b_j\|^2 \right] = \ell\, \|a_j\|^2 \quad\text{and}\quad \mathrm{Var}_\Omega\left[ \|b_j\|^2 \right] = 2\ell\, \|a_j\|^4.$$
If we let $a_0 = 0$ and $b_0 = 0$, then we can express the probability of satisfying the relative error bounds as
$$P\left( \left| \frac{\|b_j - b_i\|^2}{\ell\, \|a_j - a_i\|^2} - 1 \right| \leq \varepsilon \right) \geq 1 - 2\exp\left( -\frac{(\varepsilon^2 - \varepsilon^3)\,\ell}{4} \right),$$
where $0 < \varepsilon < 1$ and $i, j = 0, 1, \ldots, n$. Capturing the norm of the difference between columns also implies coherence among inner product approximations by taking $(b_1 - b_2)^T (b_1 - b_2) = \|b_1\|^2 - 2\, b_1^T b_2 + \|b_2\|^2$. Thus, each component of $B$ within a subspace defined by a few of its columns approximates the corresponding component of $A$.

Bayesian inference is not often used in numerical linear algebra. In this case, it allows us to obtain an exact expression for the uncertainty in column norms, which rigorously frames the amount of padding $p$ that is needed to satisfy a relative error bound with a desired probability of success. We begin by expressing a single column $a$ as its 2-norm multiplied by a unit vector $q_1$, so that $a = \|a\|\, q_1$. Because a GIID matrix is invariant in distribution under independent orthogonal transformations, we can construct an orthogonal matrix $Q = [\, q_1 \;\; Q_\perp \,]$ and write $\Omega$ as
$$\Omega = \begin{bmatrix} \hat{\omega} & \hat{\Omega} \end{bmatrix} \begin{bmatrix} q_1^T \\ Q_\perp^T \end{bmatrix},$$
where both the leading column $\hat{\omega}$ and remaining columns $\hat{\Omega}$ are GIID. It easily follows that each element of the sample column, $b = \|a\|\, \hat{\omega}$, is normally distributed with mean 0 and latent variance $\|a\|^2$.

Inferring variance from a normal distribution with a known mean is a standard problem in Bayesian statistics. The likelihood probability distribution is written $p(b \mid \|a\|^2) \equiv \mathcal{N}(b \mid 0, \|a\|^2 I)$, where $I$ is the $\ell \times \ell$ identity. Jeffreys proposed the maximally uninformative prior for an unknown variance, $p(\|a\|^2) \equiv \|a\|^{-2}$, which is invariant under scaling and power transformations [20]. We apply Bayes' theorem, $p(\|a\|^2 \mid b) \propto p(b \mid \|a\|^2)\, p(\|a\|^2)$, to obtain the posterior distribution.
Normalization results in the inverse gamma distribution
$$p(\|a\|^2 \mid \|b\|^2) \equiv \frac{\left( \|b\|^2/2 \right)^{\ell/2}}{\Gamma(\ell/2)} \, \|a\|^{-\ell-2} \exp\left( -\frac{\|b\|^2}{2\|a\|^2} \right).$$
Consequently, we can cast each pivoting decision as the maximizer of expected utility, where utility is taken to be the latent 2-norm squared:
$$\mathbb{E}_{p(\|a\|^2 \mid \|b\|^2)}\left[ \|a\|^2 \right] = \frac{\|b\|^2}{\ell - 2}.$$
Since expected utility is monotonic in the sample column norms, we simply choose the maximum as in QRCP.

More importantly, the posterior distribution clearly relates sample rank $\ell$ to uncertainty. In order to capture the probability distribution of the latent relative error, we can change variables to express the latent column norm as a fraction $\phi$ of the expectation, $\|a\|^2 = \phi\, \|b\|^2/(\ell - 2)$. This gives analytic expressions for both the probability distribution function and the cumulative distribution function of the relative scaling:
$$p(\phi \mid \ell) \equiv \frac{\left( (\ell-2)/2 \right)^{\ell/2}}{\Gamma(\ell/2)} \, \phi^{-\ell/2 - 1} \exp\left( -\frac{\ell - 2}{2\phi} \right) \quad\text{and}\quad P(\phi < \tau) = \frac{\Gamma\!\left( \tfrac{\ell}{2}, \tfrac{\ell-2}{2\tau} \right)}{\Gamma(\ell/2)},$$
respectively. The numerator of the cumulative distribution is the upper incomplete gamma function, and normalization results in the regularized gamma function.

Fig. 3.1. Cumulative distribution function of the latent relative scaling factor $\phi$ for several sample ranks $\ell$.

Rank-revealing decompositions must avoid selecting columns that are already well approximated by components in the span of previous pivots, i.e., columns with small trailing norms. As such, we only care about the probability that a sample column radically overestimates the true column norm. The CDF plotted in Figure 3.1 shows the probability that the relative scaling $\phi$ falls below a specified upper bound $\tau$ for several choices of $\ell$. Our experiments provide good results using padding $p = 8$ with a block size $b = 32$, so that the effective sample rank satisfies $8 < \ell \leq 40$ for each pivot decision.
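The cumulative distribution above is directly computable as a regularized upper incomplete gamma function; a brief check (assuming SciPy) of how the sample rank controls the probability of a severe overestimate:

```python
# P(phi < tau) = Gamma(l/2, (l-2)/(2 tau)) / Gamma(l/2), i.e., the
# regularized upper incomplete gamma function, scipy's gammaincc.
from scipy.special import gammaincc

def prob_overestimate(ell, tau):
    """Posterior probability that the latent squared norm falls below a
    fraction tau of its expectation ||b||^2 / (ell - 2)."""
    return gammaincc(ell / 2.0, (ell - 2.0) / (2.0 * tau))

for ell in (4, 8, 16, 32, 40):
    # probability the sample overestimates the true norm by a factor of 4
    print(ell, prob_overestimate(ell, 0.25))
```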
Note that this analysis also holds for linear combinations of columns in $A$ and $B$, provided that those linear combinations are independent of $B$. An in-depth reliability and probability analysis has appeared in Xiao, Gu, and Langou [37]. In particular, they show that in the case of decaying singular values in the matrix $A$, a nearly optimal low-rank approximation can be computed with the QR factorization with a slight modification in the strategy used to choose $P$.

We now show how the sample matrix $B = \Omega A$ can be used to select a full block of $b$ pivots. If QRCP is performed on the sample matrix, then at iteration $j$ we can examine $B$ as a partial factorization. We let $P^{(j)}$ be the aggregate permutation so far and represent the accumulated orthogonal transformations applied to $B$ as $U^{(j)}$, with corresponding intermediate triangular factor $S^{(j)}$, as shown below. We also consider a partial factorization of $A$ using the same pivots that were applied to $B$. The corresponding factors of $A$ are $Q^{(j)}$ and $R^{(j)}$:
$$BP^{(j)} = U^{(j)} \begin{bmatrix} S_{11}^{(j)} & S_{12}^{(j)} \\ 0 & S_{22}^{(j)} \end{bmatrix} \quad\text{and}\quad AP^{(j)} = Q^{(j)} \begin{bmatrix} R_{11}^{(j)} & R_{12}^{(j)} \\ 0 & R_{22}^{(j)} \end{bmatrix}.$$
Both $S_{11}^{(j)}$ and $R_{11}^{(j)}$ are upper triangular. $\Omega$ can then be expressed as elements $\hat{\Omega}$ in the bases given by $U^{(j)}$ and $Q^{(j)}$:
$$\Omega = U^{(j)} \begin{bmatrix} \hat{\Omega}_{11}^{(j)} & \hat{\Omega}_{12}^{(j)} \\ \hat{\Omega}_{21}^{(j)} & \hat{\Omega}_{22}^{(j)} \end{bmatrix} Q^{(j)T}.$$
Noting that $BP^{(j)} = \Omega A P^{(j)}$, we have
$$(3.1)\qquad \begin{bmatrix} S_{11}^{(j)} & S_{12}^{(j)} \\ 0 & S_{22}^{(j)} \end{bmatrix} = \begin{bmatrix} \hat{\Omega}_{11}^{(j)} R_{11}^{(j)} & \hat{\Omega}_{11}^{(j)} R_{12}^{(j)} + \hat{\Omega}_{12}^{(j)} R_{22}^{(j)} \\ \hat{\Omega}_{21}^{(j)} R_{11}^{(j)} & \hat{\Omega}_{21}^{(j)} R_{12}^{(j)} + \hat{\Omega}_{22}^{(j)} R_{22}^{(j)} \end{bmatrix}.$$
If $S_{11}^{(j)}$ is nonsingular, then both $\hat{\Omega}_{11}^{(j)}$ and $R_{11}^{(j)}$ are also nonsingular. It follows that $\hat{\Omega}_{11}^{(j)} = S_{11}^{(j)} R_{11}^{(j)-1}$ is upper triangular and $\hat{\Omega}_{21}^{(j)} = 0$. In other words, we have implicitly formed a QR factorization of $\Omega Q^{(j)}$ using the same orthogonal matrix $U^{(j)}$. Finally, the trailing matrix in the sample simplifies to $S_{22}^{(j)} = \hat{\Omega}_{22}^{(j)} R_{22}^{(j)}$, which is a sample of the trailing matrix $R_{22}^{(j)}$ using the compression matrix $\hat{\Omega}_{22}^{(j)}$. If the permutation $P^{(j)}$ were independent of the sample, then $Q^{(j)}$ would be formed from $A$, independent of $\Omega$. Likewise, $U^{(j)}$ only depends on the leading $j$ columns of the sample, which are independent of the trailing columns. Thus $\hat{\Omega}_{22}^{(j)}$ would be GIID. As such, we may use column norms of $S_{22}^{(j)}$ to approximate column norms of $R_{22}^{(j)}$ when we select the $(j+1)$st pivot, so that a full block of pivots can be selected without interleaving any references to $A$ or $R$ memory. We note, however, that the permutation may exhibit a subtle dependence on the pivots, which we briefly discuss in subsection 3.3.

Once $b$ pivots have been selected from the sample matrix $B$, the corresponding columns of $A$ are permuted and processed all at once, as is done in BLAS-3 QR, thus reducing both the communication and computational complexity associated with selecting a block of $b$ pivots by a factor of $\ell/m$.
More significantly, if the sample matrix $B$ fits in fast memory, then slow communication between consecutive pivot decisions is eliminated within each block iteration. As in BLAS-3 QR, the remaining communication costs are due to the matrix-matrix multiplications needed to perform block reflections. As such, RQRCP satisfies the BLAS-3 performance standard.

Algorithm 3.1 outlines the full procedure for a sample-based block permutation. It is acceptable for very low-rank approximations, wherein the required sample is small enough to maintain communication efficiency. That is, when the desired approximation rank $k$ is small enough to be a single block, $b = k$. For larger approximations we will resort to a more comprehensive algorithm that includes a sample update formulation that subsumes this version. Since the single-sample algorithm illuminates the performance advantage gained from this approach, we examine it first.
Algorithm 3.1: Single-Sample Randomized QRCP.
Require: $A$ is $m \times n$. $k$ is the desired approximation rank, $k \ll \min(m, n)$.
Ensure: $Q$ is an $m \times m$ orthogonal matrix in the form of $k$ reflectors. $R$ is a $k \times n$ truncated upper trapezoidal matrix. $P$ is an $n \times n$ permutation matrix such that $AP \approx Q(:, 1\!:\!k)\, R$.
function $[Q, R, P] = \text{SingleSampleRQRCP}(A, k)$
  Set sample rank $\ell = k + p$ as needed for acceptable uncertainty.
  Generate random $\ell \times m$ GIID matrix $\Omega$.
  Form the sample $B = \Omega A$.
  Get $k$ column pivots from the sample: $[\,\cdot\,, \,\cdot\,, P] = \text{QRCP}(B)$.
  Apply permutation $A^{(1)} = AP$.
  Construct $k$ reflectors from the new leading columns: $[Q, R(:, 1\!:\!k)] = \text{QR}(A^{(1)}(:, 1\!:\!k))$.
  Finish $k$ rows of $R$ in the remaining columns: $R(:, k\!+\!1\!:\!n) = Q(:, 1\!:\!k)^T A^{(1)}(:, k\!+\!1\!:\!n)$.
end function

3.3. Sample Bias. The bias of an estimator is the difference between the expected value of the estimator and the true value of the quantity being estimated. In this case, sample column norms are used to estimate true column norms. As illustrated in the following thought experiment, a subtle form of bias occurs when we use the maximum to select a pivot.

Suppose Jack and Jill roll one six-sided die each. The expected outcome for each of them is 3.5. We then learn that Jill rolled 5, which was greater than Jack's roll. This new information reduces Jack's expectation to 2.5, because it is no longer possible for him to have rolled 5 or 6. Similarly, the act of selecting the largest sample norm creates a dependency with remaining samples by truncating their plausible outcomes. This effect becomes less pronounced, however, if the decision is more likely to be determined by the true value being estimated rather than a chance sample outcome. For example, now suppose that Jill rolls three dice and Jack rolls one. If we only know each person's sum, we can still infer the number of dice each person rolled and select the expected maximum as before. In 90.7% of trials, Jill's sum will be 7 or greater, driven by the fact that three dice were rolled. In most cases, this leaves Jack's potential outcomes unconstrained and his expectation unbiased after we observe the maximum. Analogously, when a selected column exhibits a norm that is an order of magnitude greater than others, the bias effect is negligible and we may proceed as though the pivot decision was independent of the sample.

The original version of this paper included analysis of this distribution truncation effect [9]. For our purposes, we proceed as though the progression of sample updates is independent of the pivot decisions, which is also the approach Xiao et al. [37] take. This assumption is potentially problematic when multiple trailing columns have similar norms, but any such column provides a suitable pivot in that scenario.
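Before moving to blocked sample updates, here is a compact sketch of Algorithm 3.1 (our own, assuming NumPy/SciPy; the helper name and the test matrix are illustrative):

```python
import numpy as np
from scipy.linalg import qr

def single_sample_rqrcp(A, k, p=8, rng=None):
    """Sketch of Algorithm 3.1: one sample, one block of k pivots."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((k + p, m))
    B = Omega @ A                               # sample matrix
    _, _, piv = qr(B, mode='economic', pivoting=True)
    perm = np.concatenate([piv[:k], np.setdiff1d(np.arange(n), piv[:k])])
    AP = A[:, perm]
    Q, R11 = qr(AP[:, :k], mode='economic')     # k reflectors
    R12 = Q.T @ AP[:, k:]                       # finish k rows of R
    return Q, np.hstack([R11, R12]), perm

# Demo: an exactly rank-40 matrix is captured exactly with k = 40.
rng = np.random.default_rng(4)
A = rng.standard_normal((400, 40)) @ rng.standard_normal((40, 300))
Q, R, perm = single_sample_rqrcp(A, k=40)
print(np.linalg.norm(A[:, perm] - Q @ R) / np.linalg.norm(A))  # ~1e-14
```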
The sample matrix in Algorithm 3.1 is formulated to have rank $\ell = k + p$, where $k$ was both the desired approximation rank and the permutation block size, $b$. As $k$ increases, however, it becomes inefficient to simply increase the sample rank $\ell$. In the extreme case, a full decomposition would require a sample just as big as the original matrix. If we require a decomposition with a larger rank than that which can be efficiently sampled and blocked, that is, if the sample rank cannot exceed $\ell = b + p$ but we require $k > b$, then we need to update the sample matrix after each block. Martinsson [28] developed an approach in which one simply processes each subsequent block by drawing a new random matrix $\Omega$ and applying it to each new trailing matrix. We propose a sample update formulation that does not require multiplying the trailing matrix by a new compression matrix and reduces BLAS-3 communication in the overall factorization by at least one third.

The update formula we derive is an extension of the implicit update mechanism described in the previous section. Both Algorithm 3.2 and Algorithm 4.1, the truncated variation, will proceed in blocks of pivots. Bracket superscripts denote the results of a computation that occurred on the indicated block-iteration. At entry to the first block-iteration, the sample is $B^{[0]} = \Omega^{[0]} A^{[0]}$, where $A^{[0]}$ is the original matrix. At the end of block-iteration $J$, the sample will be in the transformed state
$$(3.2)\qquad \begin{bmatrix} S_{11}^{[J]} & S_{12}^{[J]} \\ 0 & S_{22}^{[J]} \end{bmatrix} = \begin{bmatrix} \hat{\Omega}_{11}^{[J]} & \hat{\Omega}_{12}^{[J]} \\ 0 & \hat{\Omega}_{22}^{[J]} \end{bmatrix} \begin{bmatrix} R_{11}^{[J]} & R_{12}^{[J]} \\ 0 & A^{[J]} \end{bmatrix},$$
just as in (3.1). $S_{11}^{[J]}$ is the leading upper triangle from the partial factorization of the sample, and $S_{22}^{[J]}$ gives the trailing sample columns. Likewise, $R_{11}^{[J]}$ and $A^{[J]}$, respectively, give the leading upper triangle and trailing columns that would be obtained by factorizing the original matrix with the same pivots. By absorbing the transformations $U^{[J]T}$ and $Q^{[J]}$ into $\Omega^{[J-1]}$, we obtain an effective compression matrix $\hat{\Omega}_{22}^{[J]}$, which has already been implicitly applied to the trailing columns: $S_{22}^{[J]} = \hat{\Omega}_{22}^{[J]} A^{[J]}$. The difficulty is that $S_{22}^{[J]}$ only has rank $p$. In order to construct a rank-$\ell$ ($\ell = b + p$) sample of the trailing matrix $A^{[J]}$, we need to include $\hat{\Omega}_{12}^{[J]}$ in the updated compression matrix
$$\Omega^{[J]} = \begin{bmatrix} \hat{\Omega}_{12}^{[J]} \\ \hat{\Omega}_{22}^{[J]} \end{bmatrix} \quad\text{giving}\quad B^{[J]} = \Omega^{[J]} A^{[J]} = \begin{bmatrix} S_{12}^{[J]} - \hat{\Omega}_{11}^{[J]} R_{12}^{[J]} \\ S_{22}^{[J]} \end{bmatrix}.$$
In other words, the new compression matrix $\Omega^{[J]}$ is simply $U^{[J]T} \Omega^{[J-1]} Q^{[J]}$ with the leading $b$ columns removed. This new compression matrix does not need to be explicitly formed or applied to the trailing columns $A^{[J]}$. Instead, we form the result implicitly by removing $\hat{\Omega}_{11}^{[J]} R_{12}^{[J]}$ from $S_{12}^{[J]}$. Both $R_{11}^{[J]}$ and $R_{12}^{[J]}$ will be computed in blocked matrix multiply operations using the previous $b$ pivots of $A$. Since $\hat{\Omega}_{11}^{[J]}$ can then be recovered from $S_{11}^{[J]}$, we can avoid any direct computations on $\Omega$.
To obtain a modified version that employs repeated sampling simply replacethe sample update with a new sample of the trailing matrix, B [ J ] = Ω [ J ] A [ J ] .When QRCP is applied to the sample matrix B , only a partial decompositionis necessary. The second argument b in the subroutine call QRCP ( B [ J ] , b ) indicatesthat only b column permutations are required. It is relatively simple to modify theQR algorithm to avoid any unnecessary computation. After sample pivots have beenapplied to the array containing both A and R , we perform QR factorization on thenew leading b columns of the trailing matrix. Although this is stated as returning Q [ J ] for convenience, it can be implemented efficiently with the blocked Householderreflections described in subsection 2.1. We can then apply reflectors to the trailingmatrix and form the sample update B [ J ] to prepare for the next iteration. Algorithm 3.2
Algorithm 3.2: Randomized QR with Column Pivoting (RQRCP).
Require: $A$ is $m \times n$. $k$ is the desired factorization rank, $k \leq \min(m, n)$.
Ensure: $Q$ is an $m \times m$ orthogonal matrix in the form of $k$ reflectors. $R$ is a $k \times n$ upper trapezoidal (or triangular) matrix. $P$ is an $n \times n$ permutation matrix such that $AP \approx Q(:, 1\!:\!k)\, R$.
function $[Q, R, P] = \text{RQRCP}(A, k)$
  Set sample rank $\ell = b + p$ as needed for acceptable uncertainty.
  Generate random $\ell \times m$ GIID matrix $\Omega^{[0]}$.
  Form the initial sample $B^{[0]} = \Omega^{[0]} A^{[0]}$.
  for $J = 1, 2, \ldots, \lceil k/b \rceil$ do
    Get $b$ column pivots from the sample: $[U^{[J]}, S^{[J]}, P^{[J]}] = \text{QRCP}(B^{[J-1]}, b)$.
    Permute $A^{[J-1]}$ and completed rows in $R$ with $P^{[J]}$.
    Construct $b$ reflectors: $[Q^{[J]}, R_{11}^{[J]}] = \text{QR}(A^{[J-1]}(:, 1\!:\!b))$.
    Finish $b$ rows: $R_{12}^{[J]} = Q^{[J]}(:, 1\!:\!b)^T A^{[J-1]}(:, b\!+\!1\!:\!\text{end})$.
    Update the trailing matrix: $A^{[J]} = Q^{[J]}(:, b\!+\!1\!:\!\text{end})^T A^{[J-1]}(:, b\!+\!1\!:\!\text{end})$.
    Update the sample: $B_1^{[J]} = S_{12}^{[J]} - S_{11}^{[J]} R_{11}^{[J]-1} R_{12}^{[J]}$ and $B_2^{[J]} = S_{22}^{[J]}$.
  end for
  $Q = Q^{[1]} \begin{bmatrix} I_b & \\ & Q^{[2]} \end{bmatrix} \cdots \begin{bmatrix} I_{(\lceil k/b \rceil - 1)b} & \\ & Q^{[\lceil k/b \rceil]} \end{bmatrix}$ and $P = P^{[1]} \begin{bmatrix} I_b & \\ & P^{[2]} \end{bmatrix} \cdots \begin{bmatrix} I_{(\lceil k/b \rceil - 1)b} & \\ & P^{[\lceil k/b \rceil]} \end{bmatrix}$.
end function

4. Truncated Factorizations for Low-Rank Approximations. The trailing matrix is unnecessary for low-rank applications. We can reformulate RQRCP to avoid the trailing update, rather than computing it and discarding it.
Provided the approximation rank is small ($k \ll \min(m, n)$), the truncated reformulation (TRQRCP) reduces large matrix multiplications by half and completes in roughly half the time.

Our technique is analogous to the method Quintana-Ortí, Sun, and Bischof used to halve BLAS-2 operations in QRCP. In their version of QRCP, all reflector inner products are computed, but rows and columns are only updated as needed. In order to compute correct reflector inner products without having updated the trailing matrix, we need to formulate blocked reflector compositions
$$(I - Y_1 T_1 Y_1^T)(I - Y_2 T_2 Y_2^T) = I - Y T Y^T, \quad\text{where}\quad Y = \begin{bmatrix} Y_1 & Y_2 \end{bmatrix} \;\text{and}\; T = \begin{bmatrix} T_1 & -T_1 Y_1^T Y_2 T_2 \\ 0 & T_2 \end{bmatrix}.$$
The corresponding reflector inner products $W^T = T^T Y^T A$ become
$$W^T = \begin{bmatrix} W_1^T \\ W_2^T \end{bmatrix} \quad\text{with}\quad W_1^T = T_1^T Y_1^T A \;\text{and}\; W_2^T = T_2^T \left( Y_2^T A - (Y_2^T Y_1)\, W_1^T \right).$$
If we store these reflector inner products, then we can construct any submatrix of the accumulated transformation $\hat{A}^{[J]} = A - Y^{[J]} W^{[J]T}$ as needed. Columns that are selected by sample pivots are constructed just before becoming the next reflectors, and corresponding rows of $R$ are constructed just before being used to update the sample, as outlined in Algorithm 4.1.

TRQRCP naturally extends to an approximation of the truncated SVD by following the QLP method proposed by Stewart. The QLP decomposition proceeds by first applying QRCP to obtain $AP_0 = Q_0 R_0$. Then the right triangular matrix $R_0$ is factored again using an LQ factorization, $P_1 R_0 = L Q_1$, where row-pivoting is an optional safeguard (otherwise $P_1 = I$), giving the factorization $A = (Q_0 P_1^T)\, L\, (Q_1 P_0^T)$. The diagonal elements of $L$ approximate the singular values of $A$. Huckaby and Chan [18] provide convergence analysis.

The approximate truncated SVD we propose (TUXV) simply adapts low-rank versions of the steps in QLP. The rank-$k$ approximation that results is exactly the same as the truncated approximation that would be obtained if QLP had been processed to completion using RQRCP, without secondary row-pivoting, and then truncated to a rank-$k$ approximation.

We begin by using TRQRCP to produce $k$ left reflectors, which defines the initial left orthogonal matrix $U^{(0)}$. Superscript $(0)$ refers to the initial state of an array upon entry to the first iteration of the main loop. We can compare our results to what would have been obtained from full RQRCP-based QLP:
$$AP^{(0)} \approx \begin{bmatrix} U_1^{(0)} & U_2^{(0)} \end{bmatrix} \begin{bmatrix} R_{11}^{(0)} & R_{12}^{(0)} \\ 0 & 0 \end{bmatrix} \quad\text{vs.}\quad AP^{(0\ast)} = \begin{bmatrix} U_1^{(0)} & U_2^{(0\ast)} \end{bmatrix} \begin{bmatrix} R_{11}^{(0)} & R_{12}^{(0\ast)} \\ 0 & R_{22}^{(0\ast)} \end{bmatrix}.$$
The asterisk denotes additional pivoting produced from the full factorization. The first $k$ pivots in $P^{(0)}$ and corresponding reflectors in $U^{(0)}$ are the same, as are corresponding rows in $R^{(0)}$ modulo additional column permutations. We reverse these permutations to construct the $k \times n$ matrix $Z^{(0)} = R^{(0)} P^{(0)T}$ so that
$$A \approx \begin{bmatrix} U_1^{(0)} & U_2^{(0)} \end{bmatrix} \begin{bmatrix} Z_{11}^{(0)} & Z_{12}^{(0)} \\ 0 & 0 \end{bmatrix} \quad\text{versus}\quad A = \begin{bmatrix} U_1^{(0)} & U_2^{(0\ast)} \end{bmatrix} \begin{bmatrix} Z_{11}^{(0)} & Z_{12}^{(0)} \\ Z_{21}^{(0\ast)} & Z_{22}^{(0\ast)} \end{bmatrix}.$$
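The blocked reflector composition introduced at the start of this section is easy to verify numerically. A small check (our own sketch, assuming NumPy) that the merged connection matrix $T$ reproduces the product of two reflector blocks; the $W$ update follows the same algebra:

```python
import numpy as np

rng = np.random.default_rng(6)
m, b = 40, 5

# One unit-lower-trapezoidal array holds both blocks of reflectors.
Y = np.tril(rng.standard_normal((m, 2 * b)), -1)
Y[np.arange(2 * b), np.arange(2 * b)] = 1.0
tau = 2.0 / (Y * Y).sum(axis=0)

def connection(Ysub, tausub):
    """Triangular T with H_1 ... H_c = I - Ysub T Ysub^T."""
    c = Ysub.shape[1]
    T = np.zeros((c, c))
    for j in range(c):
        T[j, j] = tausub[j]
        T[:j, j] = -tausub[j] * (T[:j, :j] @ (Ysub[:, :j].T @ Ysub[:, j]))
    return T

Y1, Y2 = Y[:, :b], Y[:, b:]
T1, T2 = connection(Y1, tau[:b]), connection(Y2, tau[b:])
Q1 = np.eye(m) - Y1 @ T1 @ Y1.T
Q2 = np.eye(m) - Y2 @ T2 @ Y2.T

# Merged connection matrix: T = [[T1, -T1 (Y1^T Y2) T2], [0, T2]].
T = np.block([[T1, -T1 @ (Y1.T @ Y2) @ T2],
              [np.zeros((b, b)), T2]])
Qm = np.eye(m) - Y @ T @ Y.T
print(np.allclose(Q1 @ Q2, Qm))   # True
```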
Algorithm 4.1: Truncated RQRCP without trailing update.
Require: $A$ is $m \times n$. $k$ is the approximation rank, $k \ll \min(m, n)$.
Ensure: $Q$ is an $m \times m$ orthogonal matrix in the form of $k$ reflectors. $R$ is a $k \times n$ upper trapezoidal matrix. $P$ is an $n \times n$ permutation matrix such that $AP \approx Q(:, 1\!:\!k)\, R$.
function $[Q, R, P] = \text{TruncatedRQRCP}(A, k)$
  Set the sample rank $\ell = b + p$ as needed for acceptable uncertainty.
  Generate an $\ell \times m$ random matrix $\Omega^{[0]}$ and sample $B^{[0]} = \Omega^{[0]} A^{[0]}$.
  for $J = 1, 2, \ldots, \lceil k/b \rceil$ do
    Obtain $b$ pivots: $[U^{[J]}, S^{[J]}, P^{[J]}] = \text{QRCP}(B^{[J-1]}, b)$.
    Permute $A^{[J]} = A^{[J-1]} P^{[J]}$, $W^{[J]T} = W^{[J-1]T} P^{[J]}$, and the leading rows of $R$.
    Construct the selected columns $\hat{A}^{[J]}$ from $A^{[J]} - Y^{[J-1]} W^{[J]T}$.
    Form reflectors $Y_2^{[J]}$ using $[Q^{[J]}, R_{11}^{[J]}] = \text{QR}(\hat{A}^{[J]})$.
    Form inner products $W_2^{[J]T} = T_2^{[J]T} \left( Y_2^{[J]T} A^{[J]} - (Y_2^{[J]T} Y^{[J-1]})\, W^{[J]T} \right)$.
    Augment $Y^{[J]} = [\, Y^{[J-1]} \;\; Y_2^{[J]} \,]$ and $W^{[J]T} = \begin{bmatrix} W_1^{[J]T} \\ W_2^{[J]T} \end{bmatrix}$.
    Construct new rows of $R$ from $A^{[J]} - Y^{[J]} W^{[J]T}$.
    Update the sample: $B_1^{[J]} = S_{12}^{[J]} - S_{11}^{[J]} R_{11}^{[J]-1} R_{12}^{[J]}$ and $B_2^{[J]} = S_{22}^{[J]}$.
  end for
  $Q = Q^{[1]} \begin{bmatrix} I_b & \\ & Q^{[2]} \end{bmatrix} \cdots \begin{bmatrix} I_{(\lceil k/b \rceil - 1)b} & \\ & Q^{[\lceil k/b \rceil]} \end{bmatrix}$ and $P = P^{[1]} \begin{bmatrix} I_b & \\ & P^{[2]} \end{bmatrix} \cdots \begin{bmatrix} I_{(\lceil k/b \rceil - 1)b} & \\ & P^{[\lceil k/b \rceil]} \end{bmatrix}$.
end function

Taking the LQ factorization from $Z^{(0)}$, so that $L^{(1)} V^{(1)T} = Z^{(0)}$, instead of from $R^{(0)}$ simply absorbs the permutation $P^{(0)T}$ into the definition of $V^{(1)T}$, so that
$$A \approx \begin{bmatrix} U_1^{(0)} & U_2^{(0)} \end{bmatrix} \begin{bmatrix} L_{11}^{(1)} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^{(1)T} \\ V_2^{(1)T} \end{bmatrix} \quad\text{versus}\quad A = \begin{bmatrix} U_1^{(0)} & U_2^{(0\ast)} \end{bmatrix} \begin{bmatrix} L_{11}^{(1)} & 0 \\ L_{21}^{(1\ast)} & L_{22}^{(1\ast)} \end{bmatrix} \begin{bmatrix} V_1^{(1)T} \\ V_2^{(1\ast)T} \end{bmatrix}.$$
For consistency with the following algorithm, we label the $k \times k$ connecting matrix $L_{11}^{(1)}$ as $X^{(0)}$. We will discuss the connecting matrix further after explaining the rest of the algorithm. Provided that no secondary row-pivoting is considered, the leading $k$ reflectors in $V^{(1)}$ and $V^{(1\ast)}$ are identical because they are only computed from the leading $k$ rows of $Z^{(0)}$. At this point, the rank-$k$ approximation of RQRCP-based QLP would require $L_{21}^{(1\ast)}$, which is unknown. Fortunately, the leading $k$ columns of $U^{(0\ast)} L^{(1\ast)}$ can be reconstructed with one matrix multiply. We label this $m \times k$ matrix $Z^{(1)}$, which is
$$Z^{(1)} = A V_1^{(1)} = \begin{bmatrix} U_1^{(0)} & U_2^{(0\ast)} \end{bmatrix} \begin{bmatrix} L_{11}^{(1)} \\ L_{21}^{(1\ast)} \end{bmatrix}$$
in both cases. We can then take the QR factorization $U^{(1)} X^{(1)} = Z^{(1)}$ to produce
the approximation
$$(4.1)\qquad A \approx U^{(1)} \begin{bmatrix} X^{(1)} & 0 \\ 0 & 0 \end{bmatrix} V^{(1)T}.$$
Further iterations can be computed to produce subsequent $k \times k$ connection matrices $X^{(2)}$, $X^{(3)}$, etc., which flip-flop between upper triangular and lower triangular forms. To do this, we simply multiply the leading rows of $U^T$ or columns of $V$ on the left and right of $A$, respectively, as outlined in Algorithm 4.2. The leading singular values of $A$ are approximated on the diagonals of $X^{(j)}$. Since the connection matrix is small, however, it is feasible to obtain slightly better approximations by taking the SVD of $X^{(j)}$. One could also insert mechanisms to iterate until a desired level of convergence is obtained, but Stewart observed that only one QRCP-LQ iteration is needed to produce a reasonable approximation of the SVD. One subtle point of possible confusion is that by setting $j_{\max} = 1$ our algorithm might appear to produce a truncated approximation from the sequence RQRCP-LQ-QR. It is true that the diagonal elements in $X$ correspond to that sequence; however, the resulting factorization is equivalent to that which would be obtained by keeping only the leading columns of $L$ after RQRCP-LQ. The final QR factorization simply extracts an orthogonal basis $U$. In the next section, we test the performance of TUXV with $j_{\max} = 1$ for both timing and quality experiments.
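A compact sketch of this iteration (our own, assuming NumPy/SciPy, with a library pivoted QR standing in for TRQRCP, thin factors in place of full orthogonal matrices, and $j_{\max} \geq 1$; compare Algorithm 4.2 below):

```python
import numpy as np
from scipy.linalg import qr

def tuxv(A, k, jmax=1):
    """Sketch of the TUXV iteration: rank-k pivoted QR, then LQ-QR steps."""
    # Initial rank-k column selection (stand-in for TRQRCP).
    _, R0, piv = qr(A, mode='economic', pivoting=True)
    Z = np.empty_like(R0[:k, :])
    Z[:, piv] = R0[:k, :]                  # Z(0) = R(0) P(0)^T
    V, Xt = qr(Z.T, mode='economic')       # LQ of Z via QR of Z^T
    for j in range(jmax):
        Z = A @ V[:, :k]                   # Z(j) = A V(:, 1:k)
        U, X = qr(Z, mode='economic')      # QR step
        if j + 1 == jmax:
            break
        Z = U[:, :k].T @ A                 # next LQ step
        V, Xt = qr(Z.T, mode='economic')
    return U, X, V

rng = np.random.default_rng(7)
A = rng.standard_normal((300, 45)) @ rng.standard_normal((45, 250))
U, X, V = tuxv(A, k=30)
err = np.linalg.norm(A - U @ X @ V[:, :30].T) / np.linalg.norm(A)
s = np.linalg.svd(A, compute_uv=False)
opt = np.sqrt((s[30:] ** 2).sum()) / np.linalg.norm(A)
print(err, opt)   # TUXV error is close to the optimal rank-30 error
```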
Algorithm 4.2: TUXV approximation of the truncated SVD.
Require: $A$ is an $m \times n$ matrix to approximate. $k$ is the approximation rank, $k \ll \min(m, n)$. $j_{\max}$ is the number of LQ-QR iterations; we set $j_{\max} = 1$.
Ensure: $U$ is an orthogonal $m \times m$ matrix. $V$ is an orthogonal $n \times n$ matrix. $X$ is a $k \times k$ upper or lower triangular matrix. $A \approx U(:, 1\!:\!k)\, X\, V(:, 1\!:\!k)^T$.
function $[U, X, V] = \text{TUXV}(A, k, j_{\max})$
  TRQRCP-factorize $[U^{(0)}, R^{(0)}, P^{(0)}] = \text{TRQRCP}(A, k)$.
  Restore original column order: $Z^{(0)} = R^{(0)} P^{(0)T}$.
  LQ-factorize $[V^{(1)}, X^{(0)T}] = \text{QR}(Z^{(0)T})$.
  for $j = 1, 2, 3, \ldots$ do
    $Z^{(j)} = A\, V^{(j)}(:, 1\!:\!k)$.
    QR-factorize $[U^{(j+1)}, X^{(j)}] = \text{QR}(Z^{(j)})$.
    If $j = j_{\max}$, then break.
    $Z^{(j+1)} = U^{(j+1)}(:, 1\!:\!k)^T A$.
    LQ-factorize $[V^{(j+2)}, X^{(j+1)T}] = \text{QR}(Z^{(j+1)T})$.
    If $j + 1 = j_{\max}$, then break.
  end for
end function

5. Experiments. Our first Fortran version of RQRCP used simple calls to BLAS and LAPACK subroutines without directly managing workloads among available cores. Library implementations of BLAS and LAPACK subroutines automatically distribute the computation to available cores using OpenMP. Although we knew RQRCP should have nearly the same communication complexity as blocked QR, that version did not compete well with library calls to the LAPACK subroutine DGEQRF, the BLAS-3 QR factorization. We believe this was due to poor automatic memory coordination between large blocked matrix operations. In order to provide a convincing demonstration of the efficiency of RQRCP, we had to carefully manage workloads using OpenMP within each phase of the main algorithm. The following experiments show that our RQRCP and TRQRCP subroutines can be written to require substantially less computation time than the optimized QRCP implementation DGEQP3 available through Intel's Math Kernel Library.
These tests examine how factorization times scale with various problem dimensions for several full matrix decompositions. Since we wanted to understand how different sizes and shapes of matrices could affect performance, we separately tested order scaling, row scaling, and column scaling. In order scaling, we vary the number of rows and columns together so that the matrix remains square. We also wanted to see how the performance of each algorithm scales as we increase the number of cores engaged. Unless the experiment specifies otherwise, each matrix has 12000 rows, 12000 columns, and the algorithm engages 24 cores. The algorithms we tested are listed in Table 5.1. Figure 5.1 shows performance results.
Table 5.1
Full decomposition experiments compare these algorithms. Rank-revealing subroutines are DGESVD, DGEQP3, RSRQRCP, and RQRCP. DGEQRF demonstrates the limit of performance without pivoting. DGEQR2 shows the historical evolution of these algorithms.
Subroutine   Description
DGEQR2       LAPACK BLAS-2 implementation of QR
DGESVD       LAPACK singular value decomposition
DGEQP3       LAPACK competing implementation of QRCP
RSRQRCP      Algorithm 3.2 modified for repeated sampling
RQRCP        Algorithm 3.2 with sample update (unmodified)
DGEQRF       LAPACK BLAS-3 implementation of QR
Fig. 5.1. Full decomposition benchmarks. Top-left: 24 cores; rows and columns scaled equally. Top-right: 24 cores, 12000 rows; columns scaled. Bottom-left: 24 cores; rows scaled; 12000 columns. Bottom-right: cores scaled; 12000 rows and columns. RQRCP consistently performs almost as well as DGEQRF, QR without pivoting.
These tests show that RQRCP performs nearly as well as DGEQRF, the LAPACK implementation of BLAS-3 QR without pivoting. We further note that our implementation even outperforms DGEQRF in some column scaling cases. These experiments were run on a single node of the NERSC machine Edison. Each node has two 12-core Intel processors. Subroutines were linked with Intel's Math Kernel Library. Each test matrix was randomly generated, and the same matrix was submitted to each algorithm.
These experiments compare truncated approximation performance. In addition to probing how the matrix shape affects execution time, we examine the effect of varying the truncation rank. As before, we test parallelization by varying the number of cores engaged. Unless specified otherwise, each algorithm engages 24 cores and operates on a 12000 × 12000 matrix.

Table 5.2
These algorithms are compared in truncated decomposition scaling experiments. Comparing RSRQRCP with RQRCP reveals the cost of repeated sampling. Comparing RQRCP with QR reveals the cost of pivot selection from samples. Comparing RQRCP with TRQRCP reveals the cost of the trailing matrix update. This version of QR is identical to RQRCP after eliminating sample operations and pivoting.
Subroutine   Description
QRCP         Algorithm 2.2, QRCP with trailing update
RSRQRCP      Algorithm 3.2 modified for repeated sampling
TUXV         Algorithm 4.2, approximation of the truncated SVD
RQRCP        Algorithm 3.2 with sample update and trailing update (unmodified)
QR           QR (no pivoting) with trailing update
TRQRCP       Algorithm 4.1 with sample update and no trailing update
Fig. 5.2. Truncated decomposition benchmarks. Top-left: truncation rank scaled. Top-right: columns scaled. Bottom-left: rows scaled. Bottom-right: cores scaled. RQRCP and QR perform similarly. TRQRCP is fastest. TUXV requires only modest additional cost.
Since TRQRCP uses the sample update formula and avoids the trailing update, it is nearly always fastest. Our TUXV experiments use $j_{\max} = 1$. As such, TUXV performs just one additional matrix multiply. These results show that TUXV requires only a modest increase in processing time over optimized truncated QR. Furthermore, the next set of experiments shows that TUXV gains a significant improvement in approximation quality over both QRCP and RQRCP.

The pivots that result from randomized sampling are not the same as those obtained from QRCP. In order to compare factorization quality, we construct sequences of partial factorizations and compute the corresponding low-rank approximations. The resulting relative error in the Frobenius norm is plotted in Figure 5.3 against the corresponding approximation rank. For both RQRCP and TUXV, we perform 100 runs and plot the median, as well as both the minimum and maximum error bounds. Random samples used padding size $p = 8$ and block size $b = 32$. Matrix decomposition quality is compared for the proposed algorithms using test cases from the San Jose State University Singular Matrix Database: FIDAP/ex33, HB/lock2232, and LPnetlib/lpi_gran. We also test a matrix corresponding to a gray-scale image of a differential gear (image credit: Alex Kovach [21]). Plot axes have been chosen to magnify the differences among these algorithms.
Fig. 5.3. Truncated decomposition quality experiments. Top-left: FIDAP/ex33 (1733 × 1733). Top-right: HB/lock2232 (2232 × 2232). Bottom-left: differential gear. Bottom-right: LPnetlib/lpi_gran. Both RQRCP and TUXV show the minimum and maximum truncation errors over 100 runs. The range of outcomes for RQRCP is narrow and holds close to QRCP. TUXV outcomes are even narrower and only slightly above the truncated SVD.

At the top of each plot we have QR without pivoting. In order to produce competitive results, QR was applied after presorting columns in order of descending 2-norms. Despite this modification, QR performs poorly (as expected), with approximation error dropping much more slowly than the other approximations, thus demonstrating why pivoting is necessary to prioritize representative columns in the truncated decomposition. Below QR we have RQRCP, which achieves results that are consistent with QRCP in each case. The QRCP-like low-rank approximations are further improved by TUXV and the exact truncated SVD, which is theoretically optimal. In each case, TUXV produces approximation error closer to the SVD, with very low variation among the 100 runs.
In Figure 5.4, we compare approximation quality by reconstructing the image of the differential gear using low-rank approximations. The original grayscale image has 2442 rows.

Fig. 5.4. Low-rank image reconstruction comparison. Reconstructions are computed from the grayscale differential gear matrix. Truncated QR is computed with columns presorted by descending 2-norm. TRQRCP dramatically improves visual approximation quality. Likewise, TUXV further improves fine details.
6. Conclusion.
We have shown that RQRCP achieves strong parallel scalability and the pivoting quality of QRCP at BLAS-3 performance, often an order of magnitude faster than the standard approach. By using randomized projection to construct a small sample matrix $B$ from a much larger original matrix $A$, it becomes possible to substantially reduce the communication complexity associated with a series of algorithmic decisions. Our analysis of latent column norms, inferred from the sample, justifies the selection computations that we use to obtain full blocks of column pivots. Having blocks of pivots allows our algorithm to factorize $A$ using matrix-matrix multiplications instead of interleaving multiple series of matrix-vector operations, making RQRCP the algorithm of choice for numerical rank determination.

Critically, we have shown how to leverage intermediate block transformations of $A$ to update $B$ with a sample update formula. This technique allows us to avoid computing a new randomized projection for each block operation, thus substantially reducing the matrix-matrix multiplication work needed to process each block.

We have extended this method of factorization to produce truncated low-rank approximations. TRQRCP, the truncated formulation of RQRCP, avoids block updates to the trailing matrix during factorization, which reduces the leading contribution to communication for low-rank approximations. Moreover, TRQRCP provides an efficient initial operation in our approximation of the truncated SVD, TUXV.

Our algorithms are implemented in Fortran with OpenMP. Numerical experiments compare performance with LAPACK subroutines, linked with the Intel Math Kernel Library, using a 24-core system. These experiments demonstrate that the computation time of RQRCP is nearly as short as that of unpivoted QR and substantially shorter than QRCP. Problems that have been too large to process with QRCP-dependent subroutines may now become feasible. Other applications that had to settle for QR, due to performance constraints, may find improved numerical stability at little cost by switching to RQRCP. For low-rank approximations, TRQRCP offers an additional performance advantage, and TUXV improves approximation quality at a modest additional cost. These algorithms open a new performance domain for large matrix factorizations that we believe will be useful in science, engineering, and data analysis.

Randomized methods harness the fact that it is possible to make good algorithmic decisions in the face of uncertainty. By permitting controlled uncertainty, we can substantially improve algorithm performance. Future work will address how we may understand the foundations of credible uncertainty in predictions. Improving this understanding will facilitate the development of efficient predictive algorithms and support our ability to make good decisions from limited information.
Acknowledgments.
We sincerely thank Chris Melgaard, Laura Grigori, and James Demmel for several helpful discussions, and the anonymous reviewers for their hard work and insightful feedback. We also deeply appreciate feedback on this revision from both Thomas Catanach and Jacquilyn Weeks.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.

This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
REFERENCES

[1] C. Battaglino, G. Ballard, and T. G. Kolda, A practical randomized CP tensor decomposition, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 876–901.
[2] C. Bischof and C. Van Loan, The WY representation for products of Householder matrices, SIAM Journal on Scientific and Statistical Computing, 8 (1987), pp. s2–s13.
[3] C. H. Bischof, A parallel QR factorization algorithm with controlled local pivoting, SIAM Journal on Scientific and Statistical Computing, 12 (1991), pp. 36–57.
[4] C. H. Bischof and P. C. Hansen, Structure-preserving and rank-revealing QR-factorizations, SIAM Journal on Scientific and Statistical Computing, 12 (1991), pp. 1332–1350, https://doi.org/10.1137/0912073.
[5] T. F. Chan, Rank revealing QR factorizations, Linear Algebra and its Applications, 88-89 (1987), pp. 67–82, https://doi.org/10.1016/0024-3795(87)90103-0.
[6] T. F. Chan and P. C. Hansen, Some applications of the rank revealing QR factorization, SIAM Journal on Scientific and Statistical Computing, 13 (1992), pp. 727–741.
[7] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM Journal on Scientific Computing, 34 (2012), pp. A206–A239.
[8] J. W. Demmel, L. Grigori, M. Gu, and H. Xiang, Communication avoiding rank revealing QR factorization with column pivoting, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 55–89.
[9] J. A. Duersch and M. Gu, Randomized QR with column pivoting, SIAM Journal on Scientific Computing, 39 (2017), pp. C263–C291.
[10] J. A. Duersch, M. Shao, C. Yang, and M. Gu, A robust and efficient implementation of LOBPCG, SIAM Journal on Scientific Computing, 40 (2018), pp. C655–C676.
[11] N. B. Erichson, S. Voronin, S. L. Brunton, and J. N. Kutz, Randomized matrix decompositions using R, Journal of Statistical Software, 89 (2019), pp. 1–48.
[12] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2013.
[13] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal on Scientific Computing, 17 (1996), pp. 848–869.
[14] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217–288.
[15] U. Hetmaniuk and R. Lehoucq, Basis selection in LOBPCG, Journal of Computational Physics, 218 (2006), pp. 324–332.
[16] N. J. Higham, QR factorization with complete pivoting and accurate computation of the SVD, Linear Algebra and its Applications, 309 (2000), pp. 153–174.
[17] D. Hong, T. G. Kolda, and J. A. Duersch, Generalized canonical polyadic tensor decomposition, SIAM Review, 62 (2020), pp. 133–163.
[18] D. A. Huckaby and T. F. Chan, On the convergence of Stewart's QLP algorithm for approximating the SVD, Numerical Algorithms, 32 (2003), pp. 287–316.
[19] W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics, 26 (1984), p. 1.
[20] R. E. Kass and L. Wasserman, Formal rules for selecting prior distributions: A review and annotated bibliography, Journal of the American Statistical Association, 435 (1996), pp. 1343–1370.
[21] A. Kovach, Differential gear, 2016, https://commons.wikimedia.org/wiki/File:Diff_gear.jpg.
[22] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, Randomized algorithms for the low-rank approximation of matrices, Proceedings of the National Academy of Sciences, 104 (2007), pp. 20167–20172.
[23] M. W. Mahoney et al., Randomized algorithms for matrices and data, Foundations and Trends® in Machine Learning, 3 (2011), pp. 123–224.
[24] P.-G. Martinsson, G. Q. Ortí, and N. Heavner, randUTV: A blocked randomized algorithm for computing a rank-revealing UTV factorization, ACM Transactions on Mathematical Software, 45 (2019), pp. 1–26.
[25] P.-G. Martinsson, G. Q. Ortí, N. Heavner, and R. van de Geijn, Householder QR factorization with randomization for column pivoting (HQRRP), SIAM Journal on Scientific Computing, 39 (2017), pp. C96–C115.
[26] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the decomposition of matrices, Applied and Computational Harmonic Analysis, 30 (2011), pp. 47–68, https://doi.org/10.1016/j.acha.2010.02.003.
[27] P.-G. Martinsson and J. Tropp, Randomized numerical linear algebra: Foundations & algorithms, Manuscript, (2020), https://arxiv.org/pdf/2002.01387.pdf.
[28] P.-G. Martinsson and S. Voronin, A randomized blocked algorithm for efficiently computing rank-revealing factorizations of matrices, SIAM Journal on Scientific Computing, 38 (2016), pp. S485–S507.
[29] C. Puglisi, Modification of the Householder method based on the compact WY representation, SIAM Journal on Scientific and Statistical Computing, 13 (1992), pp. 723–726.
[30] G. Quintana-Ortí, X. Sun, and C. H. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM Journal on Scientific Computing, 19 (1998), pp. 1486–1494.
[31] V. Rokhlin, A. Szlam, and M. Tygert, A randomized algorithm for principal component analysis, SIAM Journal on Matrix Analysis and Applications, 31 (2010), pp. 1100–1124.
[32] R. Schreiber and C. Van Loan, A storage-efficient WY representation for products of Householder transformations, SIAM Journal on Scientific and Statistical Computing, 10 (1989), pp. 53–57.
[33] A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23 (2002), pp. 2165–2182.
[34] G. Stewart, The QLP approximation to the singular value decomposition, SIAM Journal on Scientific Computing, 20 (1999), pp. 1336–1348.
[35] R. Vidal, Y. Ma, and S. Sastry, Generalized principal component analysis (GPCA), IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), pp. 1945–1959.
[36] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation of matrices, Applied and Computational Harmonic Analysis, 25 (2008), pp. 335–366.
[37] J. Xiao, M. Gu, and J. Langou, Fast parallel randomized QR with column pivoting algorithms for reliable low-rank matrix approximations, 2017 IEEE 24th International Conference on High Performance Computing (HiPC), 2017.