A Block Bidiagonalization Method for Fixed-Accuracy Low-Rank Matrix Approximation
ERIC HALLMAN†

Abstract.
We present randUBV, a randomized algorithm for matrix sketching based on the block Lanczos bidiagonalization process. Given a matrix $A$, it produces a low-rank approximation of the form $UBV^T$, where $U$ and $V$ have orthonormal columns in exact arithmetic and $B$ is block bidiagonal. In finite precision, the columns of both $U$ and $V$ will be close to orthonormal. Our algorithm is closely related to the randQB algorithms of Yu, Gu, and Li (2018) in that the entries of $B$ are incrementally generated and the Frobenius norm approximation error may be efficiently estimated. It is therefore suitable for the fixed-accuracy problem, and so is designed to terminate as soon as a user-input error tolerance is reached. Numerical experiments suggest that the block Lanczos method is generally competitive with or superior to algorithms that use power iteration, even when $A$ has significant clusters of singular values.

Key words. randomized algorithm, low-rank matrix approximation, fixed-accuracy problem, block Lanczos
AMS subject classifications.
1. Introduction.
In this paper we consider the problem of finding a quality low-rank approximation $\tilde{A}_r$ to a given matrix $A \in \mathbb{R}^{m \times n}$, where we assume that $m \ge n$. In particular we consider the fixed-accuracy problem, where the desired truncation rank $r$ is not known in advance, but we instead want to find the smallest possible $r$ such that $\|A - \tilde{A}_r\|_F < \tau$ for some tolerance $\tau$.

The optimal approximation can be found by computing and truncating the SVD of $A$, but when $A$ is large this method may be impractically expensive. It is therefore increasingly common to use randomized techniques to find an approximation to the dominant subspace of $A$: that is, to find a matrix $Q \in \mathbb{R}^{m \times r}$ with orthonormal columns so that [12]

(1.1) $A \approx QB$,

where $B$ is an $r \times n$ matrix satisfying

(1.2) $B = Q^T A$.

Two variants on this basic approach are randomized subspace iteration and randomized block Lanczos. Algorithms 1.1 and 1.2 present prototype algorithms for each of these methods for the fixed-rank problem, where $r$ is specified in advance.

Algorithm 1.1
Randomized Subspace Iteration (randQB) [12, Alg. 4.3]
Input: $A \in \mathbb{R}^{m \times n}$, rank $r$, integer $\ell \ge r$, power parameter $p \ge 0$
Output: $Q \in \mathbb{R}^{m \times \ell}$ with orthonormal columns, $B \in \mathbb{R}^{\ell \times n}$
1: Draw a random standard Gaussian matrix $\Omega \in \mathbb{R}^{n \times \ell}$
2: Form $Y = (AA^T)^p A\Omega$
3: Compute the QR factorization $Y = QR$
4: $B = Q^T A$

∗ This research was supported in part by the National Science Foundation through grant DMS-1745654.
† North Carolina State University ([email protected], https://erhallma.math.ncsu.edu/).
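For concreteness, the following is a minimal MATLAB sketch of Algorithm 1.1; it is an illustration under the stated assumptions (Gaussian test matrix, $p$ steps of power iteration) rather than the author's implementation, and in practice the power iteration should be interleaved with reorthogonalization for stability, as noted in [12]. The function name is ours.

% Minimal sketch of Algorithm 1.1 (fixed-rank randQB prototype).
% l is the sampling size (l >= r); p is the power iteration parameter.
function [Q, B] = randQB_proto(A, l, p)
    n = size(A, 2);
    Om = randn(n, l);            % standard Gaussian test matrix
    Y = A*Om;                    % builds Y = (A*A')^p * A*Om iteratively
    for j = 1:p
        Y = A*(A'*Y);
    end
    [Q, ~] = qr(Y, 0);           % economy QR; Q has orthonormal columns
    B = Q'*A;                    % A is approximated by Q*B
end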
Algorithm 1.2
Randomized Block Lanczos [32, Alg. 1]
Input: $A \in \mathbb{R}^{m \times n}$, block size $b \ge 1$, rank $r$, iterations $q$ such that $(q+1)b \ge r$
Output: $Q \in \mathbb{R}^{m \times (q+1)b}$ with orthonormal columns, $B \in \mathbb{R}^{(q+1)b \times n}$
1: Draw a random standard Gaussian matrix $\Omega \in \mathbb{R}^{n \times b}$
2: Form $Y = [A\Omega, (AA^T)A\Omega, \ldots, (AA^T)^q A\Omega]$
3: Compute the QR factorization $Y = QR$
4: $B = Q^T A$

(A MATLAB sketch of this prototype appears at the end of this section.)

Extensions of these algorithms to the fixed-accuracy problem make use of the fact that the columns of $Q$ and rows of $B$ can be computed incrementally rather than all at once. The process can then be terminated once a user-specified error threshold has been reached, assuming the error can be efficiently computed or estimated. Algorithms for the fixed-accuracy problem are proposed in [12, 17], and more recently by Yu, Gu, and Li in [31]. One algorithm by the latter authors, randQB_EI, is currently the foundation for the MATLAB function svdsketch [18].

The algorithms cited above all rely on subspace iteration rather than the block Lanczos method, despite the fact that Krylov subspace methods are "the classical prescription for obtaining a partial SVD" [12], as with svds in MATLAB. One justification for the focus on subspace iteration is that its convergence analysis is more complete. In particular, the block Lanczos method converges slowly when the spectrum of $A$ has a cluster larger than the block size $b$, and the convergence analysis becomes more complicated in this situation. In recent years, however, several works have improved the analysis for randomized block Lanczos. Analyzing Algorithm 1.2 for the case $b \ge r$, Musco and Musco [19] derive bounds on the approximation error that do not depend on the gaps between the singular values of $A$. Yuan, Gu, and Li [32] derive results under the more general condition where $A$ has no singular values with multiplicity greater than $b$. Both papers focus mostly on theoretical results, but the latter authors make the following observation:

"A practical implementation of [Algorithm 1.2] should involve, at the very least, a reorganization of the computation to use the three-term recurrence and bidiagonalization [7], and reorthogonalization of the Lanczos vectors at each step using one of the numerous schemes that have been proposed [7, 21, 24]."

The goal of this paper is to provide a practical implementation of Algorithm 1.2, along with a method for efficiently estimating the Frobenius norm approximation error. Our main contribution is the algorithm randUBV (Algorithm 4.1), which uses the block Lanczos method to solve the fixed-accuracy problem. It is for the most part a straightforward combination of the block Lanczos bidiagonalization process [8] shown in Algorithm 2.2 with a randomized starting matrix $V_1$ obtained by orthonormalizing a Gaussian matrix $\Omega$. As such, it yields a factorization of the form $UBV^T$, where $U$ and $V$ have orthonormal columns in exact arithmetic and $B$ is block bidiagonal. Our secondary contribution is Theorem 4.3, which establishes bounds on the accuracy of the Frobenius norm error estimate (2.6).

Our algorithm has two notable features that make it competitive with methods based on subspace iteration:

• It accepts block sizes smaller than the target rank. Contrary to what an exact arithmetic analysis would suggest, the block Lanczos method can find
the dominant singular values of $A$ even when their multiplicity is greater than the block size $b$. Large clusters in the spectrum of $A$ are inconvenient, but not fatal. We can therefore compare randUBV with adaptive methods such as randQB_EI when the two are run with the same block size. They will have the same cost per iteration when the latter algorithm is run with power parameter $p = 0$, and empirically randUBV converges faster. If randQB_EI instead uses $p = 1$ or $p = 2$, then randUBV empirically requires more iterations to converge, but each iteration costs significantly less.

• It uses one-sided reorthogonalization, wherein $V$ is reorthogonalized but $U$ is not. This technique was recommended in [25] for the single-vector case (i.e., $b = 1$), and leads to considerable cost savings when $A$ is sparse and $m \gg n$. If $m \ll n$, our algorithm should be run on $A^T$ instead. The matrix $U$ may slowly lose orthogonality in practice, but Theorem 4.3 shows that our error estimate (2.6) will still remain accurate. For simplicity, we use full reorthogonalization on $V$ as opposed to more carefully targeted methods such as those discussed in [21, 24].

One other design choice merits discussion: deflation occurs when the blocks produced by the block Lanczos method are nearly rank-deficient, and it results in a reduction in the block size. In the event of deflation, we propose to augment the block Krylov space in order to keep the block column size constant. This will prevent the process from terminating early in extreme cases such as when $A$ is the identity matrix. Numerical experiments on synthetic and real data suggest that randUBV generally compares favorably with randQB and its variants, at least on modestly sized problems.

The paper is organized as follows. In section 2, we review the background of QB algorithms for the fixed-accuracy problem as well as the block Lanczos method. In section 3 we discuss several implementation details including the choice of block size, deflation and augmentation, and one-sided reorthogonalization. We present our main algorithm in section 4 and establish the accuracy of the error indicator. Our numerical experiments are in section 5, and section 6 offers our concluding remarks and some avenues for future exploration.

Notation. Matrices, vectors, integers, and scalars will be respectively denoted by $A$, $a$, $a$, and $\alpha$. We use $\|A\|_F$ and $\|A\|$ for the Frobenius norm and operator norm, respectively, and $I$ for the identity matrix whose dimensions can be inferred from context. We use MATLAB notation for matrix indices: i.e., $A(i,j)$ and $A(:,j)$ respectively represent the $(i,j)$ element and the $j$-th column of $A$.

For the cost analysis of our algorithm we use the same notation as in [17, 31]: $C_{\mathrm{mul}}$ and $C_{\mathrm{qr}}$ will represent constants so that the cost of multiplying two dense matrices of sizes $m \times n$ and $n \times l$ is taken to be $C_{\mathrm{mul}} mnl$, and the cost of computing the QR factorization of an $m \times n$ matrix with $m \ge n$ is taken to be $C_{\mathrm{qr}} mn^2$, or $C_{\mathrm{qrcp}} mn^2$ if column pivoting is used.
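Returning to Algorithm 1.2, a minimal MATLAB sketch of the prototype follows. It forms the block Krylov basis explicitly, which is simple but numerically fragile; in practice the basis should be built with the three-term recurrence and reorthogonalization, exactly as the quoted remark suggests and as the rest of this paper develops. The function name and the blockwise assembly are our own choices.

% Minimal sketch of Algorithm 1.2 (randomized block Lanczos prototype,
% explicit Krylov basis). Numerically fragile as written: the columns of
% Y become nearly linearly dependent as q grows.
function [Q, B] = randBlockKrylov(A, b, q)
    [m, n] = size(A);
    Om = randn(n, b);                 % Gaussian starting block
    Z = A*Om;
    Y = zeros(m, (q+1)*b);
    Y(:, 1:b) = Z;
    for j = 1:q
        Z = A*(A'*Z);                 % next block, (A*A')^j * A*Om
        Y(:, j*b+1:(j+1)*b) = Z;
    end
    [Q, ~] = qr(Y, 0);                % orthonormal basis for the Krylov space
    B = Q'*A;
end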
2. Background.
In this section we review the fixed-accuracy QB factorization algorithm randQB_EI and the block Lanczos bidiagonalization process.
In order to extend Algorithm 1.1 to the fixed-accuracy problem, Yu, Gu, and Li [31] make use of two key ideas. First, for a given block size $b \le \ell$ the matrix $\Omega$ can be generated $b$ columns at a time rather than all at once, allowing the resulting factors $Q$ and $B$ to be generated incrementally.
Second, since $Q$ has orthonormal columns and $B = Q^T A$, it follows [31, Thm. 1] that

(2.1) $\|A - QB\|_F^2 = \|A - QQ^T A\|_F^2 = \|A\|_F^2 - \|QQ^T A\|_F^2 = \|A\|_F^2 - \|B\|_F^2$.

As long as the columns of $Q$ are kept close to orthonormal, the Frobenius norm error can be efficiently estimated at each step simply by updating $\|B\|_F^2$. It is therefore possible to compute the low-rank factorization $QB$ and cheaply estimate its error without ever forming the error matrix $A - QB$ explicitly. Algorithm randQB_EI incorporates both of these ideas, the second of which is particularly useful when $A$ is sparse.

Algorithm 2.1 presents code for randQB_EI, which in exact arithmetic will output the same QB factorization as randQB when run to the same rank. It is noted in [12] that a stable implementation of Algorithm 1.1 should include a reorthogonalization step after each application of $A$ or $A^T$. The reorthogonalization step in Line 10 provides further stability.
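The identity (2.1) is easy to verify numerically. The following MATLAB snippet (ours, with arbitrary test dimensions) compares the directly computed error with the estimate:

% Check of (2.1): if Q has orthonormal columns and B = Q'*A, then
% ||A - Q*B||_F^2 = ||A||_F^2 - ||B||_F^2.
A = randn(300, 200);
[Q, ~] = qr(A*randn(200, 50), 0);      % any orthonormal basis will do
B = Q'*A;
direct   = norm(A - Q*B, 'fro')^2;
estimate = norm(A, 'fro')^2 - norm(B, 'fro')^2;
fprintf('direct: %.6e   estimate: %.6e\n', direct, estimate);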
Algorithm 2.1
Blocked randQB algorithm (randQB_EI) [31, Alg. 2]
Input: $A \in \mathbb{R}^{m \times n}$, block size $b \ge 1$, power parameter $p \ge 0$, tolerance $\tau$
Output: $Q \in \mathbb{R}^{m \times \ell}$, $B \in \mathbb{R}^{\ell \times n}$, such that $\|A - QB\|_F < \tau$
(Approximate costs are shown in brackets.)
1: $Q = [\,]$, $B = [\,]$
2: $E = \|A\|_F^2$
3: for $k = 1, 2, 3, \ldots$ do
4:   Draw a random standard Gaussian matrix $\Omega_k \in \mathbb{R}^{n \times b}$
5:   $Q_k = \mathrm{qr}(A\Omega_k - Q(B\Omega_k))$   [$C_{\mathrm{mul}} mnb + (k-1)C_{\mathrm{mul}}(m+n)b^2 + C_{\mathrm{qr}} mb^2$]
6:   for $j = 1 : p$ do
7:     $\tilde{Q}_k = \mathrm{qr}(A^T Q_k - B^T(Q^T Q_k))$   [same as line 5, with $C_{\mathrm{qr}} nb^2$]
8:     $Q_k = \mathrm{qr}(A\tilde{Q}_k - Q(B\tilde{Q}_k))$   [same as line 5]
9:   end for
10:  $Q_k = \mathrm{qr}(Q_k - Q(Q^T Q_k))$   [$2(k-1)C_{\mathrm{mul}} mb^2 + C_{\mathrm{qr}} mb^2$]
11:  $B_k = Q_k^T A$   [$C_{\mathrm{mul}} mnb$]
12:  $Q = [Q, Q_k]$
13:  $B = [B^T, B_k^T]^T$
14:  $E = E - \|B_k\|_F^2$
15:  if $E < \tau^2$ then stop
16: end for

Suppose that we stop Algorithm 2.1 after $t$ iterations, and set $\ell = tb$. The runtime of randQB_EI can then be approximated as

(2.2) $T_{\mathrm{randQB\_EI}} \approx 2C_{\mathrm{mul}} mn\ell + \tfrac{1}{2}C_{\mathrm{mul}}(3m+n)\ell^2 + \tfrac{2}{t}C_{\mathrm{qr}} m\ell^2 + p\left(2C_{\mathrm{mul}} mn\ell + C_{\mathrm{mul}}(m+n)\ell^2 + \tfrac{1}{t}C_{\mathrm{qr}}(m+n)\ell^2\right)$,

where the cost increases more or less proportionally to $p + 1$. By comparison, the cost of the fixed-rank prototype algorithm randQB can be approximated as

(2.3) $T_{\mathrm{randQB}} \approx 2(p+1)C_{\mathrm{mul}} mn\ell + C_{\mathrm{qr}} m\ell^2$.

Here we describe a block Lanczos method for reducing a matrix to block bidiagonal form. Since this method generalizes the single-vector algorithm by Golub and Kahan [6] commonly known as the Golub-Kahan-Lanczos process, we will abbreviate it as bGKL.

The bGKL process was introduced by Golub, Luk, and Overton [8] to find the largest singular values and associated singular vectors of a large and sparse matrix. Since then, it has been applied to both least squares problems [14, 27] and total least squares problems [2, 13] with multiple right-hand sides. The process takes a matrix $A \in \mathbb{R}^{m \times n}$ and a matrix $V_1 \in \mathbb{R}^{n \times b}$ with orthonormal columns, and after $k$ steps produces the orthonormal bases $U_{(k)} = [U_1, \cdots, U_k]$ and $V_{(k+1)} = [V_1, \cdots, V_{k+1}]$ satisfying

$\mathrm{Span}\{U_{(k)}\} = \mathrm{Span}\{AV_1, A(A^TA)V_1, \ldots, A(A^TA)^{k-1}V_1\}$,
$\mathrm{Span}\{V_{(k+1)}\} = \mathrm{Span}\{V_1, (A^TA)V_1, \ldots, (A^TA)^k V_1\}$.

Furthermore, it produces the $kb \times (k+1)b$ block bidiagonal matrix

(2.4) $B_k = \begin{bmatrix} R_1 & L_2 & & \\ & R_2 & L_3 & \\ & & \ddots & \ddots \\ & & & R_k & L_{k+1} \end{bmatrix}$

so that at each step of the process the relations

(2.5) $AV_{(k)} = U_{(k)} B_k(:, 1\!:\!kb)$ and $A^T U_{(k)} = V_{(k+1)} B_k^T$

are satisfied. Assuming no loss of rank, the blocks $\{R_i\}_{i=1}^k$ and $\{L_i\}_{i=2}^{k+1}$ are respectively $b \times b$ upper and lower triangular.

Algorithm 2.2
Block Lanczos bidiagonalization process (bGKL) [8]
Input: $A \in \mathbb{R}^{m \times n}$, matrix $V_1 \in \mathbb{R}^{n \times b}$ with orthonormal columns
(Approximate costs are shown in brackets.)
1: $U_0 = 0$; $L_1 = 0$
2: for $k = 1, 2, \ldots$ do
3:   $U_k R_k = \mathrm{qr}(AV_k - U_{k-1}L_k)$   [$C_{\mathrm{mul}} mnb + \tfrac{1}{2}C_{\mathrm{mul}} mb^2 + C_{\mathrm{qr}} mb^2$]
4:   $V_{k+1} L_{k+1}^T = \mathrm{qr}(A^T U_k - V_k R_k^T)$   [$C_{\mathrm{mul}} mnb + \tfrac{1}{2}C_{\mathrm{mul}} nb^2 + C_{\mathrm{qr}} nb^2$]
5: end for

The basic outline of the process is given in Algorithm 2.2, where the costs assume no loss of rank in the blocks $\{R_i\}_{i=1}^k$ or $\{L_i\}_{i=2}^{k+1}$. We note that the original algorithm in [8] is organized so that $B_k$ is square at the end of each iteration. Our current presentation more directly mimics the QB factorization, since $U_{(k)} B_k V_{(k+1)}^T = U_{(k)} U_{(k)}^T A$ by the second relation in (2.5). It follows that in exact arithmetic the identity

(2.6) $\|A - U_{(k)} B_k V_{(k+1)}^T\|_F^2 = \|A\|_F^2 - \|B_k\|_F^2$

will hold, and so the bGKL process can be readily adapted to find a fixed-accuracy approximation to $A$.

Suppose that we stop the process after $t$ iterations and set $\ell = tb$. The runtime of the bGKL process can then be approximated as

(2.7) $T_{\mathrm{bGKL}} \approx 2C_{\mathrm{mul}} mn\ell + \tfrac{1}{2t}C_{\mathrm{mul}}(m+n)\ell^2 + \tfrac{1}{t}C_{\mathrm{qr}}(m+n)\ell^2$.

At this point, it is not fair to compare this cost to the cost of (2.2) because we have not yet accounted for the cost of reorthogonalization in bGKL, which is necessary for stability. Nonetheless, it suggests that we may be able to obtain an algorithm based on bGKL that costs no more per iteration than randQB_EI with power parameter $p = 0$.
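A minimal MATLAB sketch of the recurrence in Algorithm 2.2 follows. It runs for a fixed number of steps t, assumes no deflation, and omits the reorthogonalization discussed in section 3; the function name and the dense storage of $B$ are our own choices, not those of the randUBV package.

% Minimal sketch of the bGKL recurrence (Algorithm 2.2), t block steps.
% V1 (n-by-b) is assumed to have orthonormal columns, e.g.
%   [V1, ~] = qr(randn(n, b), 0);   % randomized start
function [U, B, V] = bGKL(A, V1, t)
    b = size(V1, 2);
    Vk = V1; V = V1; U = [];
    B = zeros(t*b, (t+1)*b);       % block bidiagonal factor as in (2.4)
    Uk = []; Lk = [];              % U_0 and L_1 are treated as empty
    for k = 1:t
        Y = A*Vk;
        if k > 1
            Y = Y - Uk*Lk;         % Y = A*V_k - U_{k-1}*L_k
        end
        [Uk, Rk] = qr(Y, 0);       % U_k*R_k = Y (economy QR)
        [Vk1, Lt] = qr(A'*Uk - Vk*Rk', 0);  % V_{k+1}*L_{k+1}' = A'*U_k - V_k*R_k'
        Lk = Lt';                  % L_{k+1}, lower triangular
        rows = (k-1)*b+1 : k*b;
        B(rows, rows) = Rk;        % diagonal block R_k
        B(rows, rows+b) = Lk;      % superdiagonal block L_{k+1}
        U = [U, Uk]; V = [V, Vk1];
        Vk = Vk1;
    end
end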
3. Implementation details.
In this section we discuss how to handle several important issues in the implementation of our fixed-accuracy algorithm. The first concerns the difficulty the Lanczos method encounters when $A$ has large singular value clusters. The second is the matter of ensuring that the columns of $U_{(k)}$ and $V_{(k)}$ remain close to orthonormal, and the third is the use of deflation and augmentation when the blocks $R_k$ or $L_k$ are rank-deficient.

It is known that if $A$ has a singular value with multiplicity greater than the block size $b$, then in exact arithmetic the block Lanczos process will recover at most $b$ of those singular values. More generally, if the spectrum of $A$ has a cluster of size greater than $b$, then the approximate singular vectors recovered by the Lanczos process may converge slowly. This behavior stands in stark contrast to that of blocked subspace iteration methods such as randQB_EI, whose outputs in exact arithmetic do not depend on $b$.

For the first situation (singular values with multiplicity greater than $b$), classical results tend to examine a restricted problem. Saad [23] notes that the Lanczos process would simply behave as though it were being performed on a restricted matrix $A|_S$ whose singular values had multiplicity at most $b$.¹ There is therefore "no loss of generality" in assuming that the singular values of $A$ have multiplicity bounded by $b$ for the purpose of analyzing convergence rates. Other more recent works restrict their attention to the case where the cluster size is bounded by $b$ [16], or where $b$ is greater than or equal to the target rank $r$ [19, 28, 4].

The analysis of Yuan, Gu, and Li [32] makes an important advancement by allowing for cluster sizes (though not multiplicity) greater than $b$, and showing that even within a large cluster the recovered singular values will converge superlinearly in the number of Lanczos iterations. Their numerical experiments on real-world data suggest that smaller block sizes generally lead to faster convergence with respect to the number of flops expended.

As it turns out, even singular values with multiplicity greater than $b$ are not fatal to the Lanczos process. Parlett [21] notes that since "rounding errors introduce components in all directions", even repeated singular vectors will eventually be found. Simon and Zha [25] add the caveat that the singular vectors will not converge in consecutive order: the Lanczos process will likely find several smaller singular values of $A$ before it finds copies of the larger repeated ones. What we should expect in practice is that a singular value of multiplicity greater than $b$ (or a cluster of comparable size) will delay convergence, but not prevent it entirely.

Thus in spite of complications in the analysis of the block Lanczos method, using a smaller block size can be quite effective in practice. Even when $A$ has clusters larger than the block size, we can obtain a good approximation simply by increasing the number of Lanczos iterations. Our numerical experiments support this notion: although we can construct synthetic examples for which randUBV is inferior to methods that use subspace iteration, our algorithm performs quite well on a real-world example with large clusters.
An alternate method for dealing with clusters is offered in [30] and explored further in [1, 33]: instead of keeping the block size constant, we may periodically augment the block Krylov space with new vectors in order to capture any large clusters in $B_k$, and increase the block size $b$ so that it remains larger than the largest cluster in $B_k$. For the sake of keeping the implementation of our algorithm simple, we leave this extension for future exploration.

In exact arithmetic, the matrices $U_{(k)}$ and $V_{(k)}$ will have orthonormal columns. In practice, they will quickly lose orthogonality due to roundoff error, and so we must take additional steps to mitigate this loss of orthogonality. For the single-vector case $b = 1$, Simon and Zha [25] observe that it may suffice to reorthogonalize only one of $U_{(k)}$ or $V_{(k)}$ in order to obtain a good low-rank approximation. They suggest that if the columns of $V_{(k)}$ alone are kept close to orthonormal, then $U_{(k)} B_k V_{(k+1)}^T$ will remain a good approximation to $A$ regardless of the orthogonality of $U_{(k)}$. Separately, experiments by Fong and Saunders [5] in the context of least-squares problems suggest that keeping $V_{(k)}$ orthonormal to machine precision $\epsilon_{\mathrm{mach}}$ might be enough to keep $U_{(k)}$ orthonormal to at least $O(\sqrt{\epsilon_{\mathrm{mach}}})$, at least until the least-squares solver reaches a relative backward error of $\sqrt{\epsilon_{\mathrm{mach}}}$. For the sake of computational efficiency, we therefore choose to explicitly reorthogonalize $V_{(k)}$ but not $U_{(k)}$ (assuming that $m \ge n$).

Reorthogonalization can take up a significant portion of the runtime of our algorithm, particularly if $A$ is sparse. However, it is known for the Lanczos process that orthogonality is lost only in the direction of singular vectors that have already converged [20]. Thus in a high-quality implementation, it should be possible to save time by orthogonalizing each block $V_k$ against a smaller, carefully chosen set of vectors obtained from $V_{(k-1)}$ (see [21, 10, 24] for a few such proposals). In our implementation, we use full reorthogonalization for simplicity. We note that even if $A$ is square, full reorthogonalization will cost no more than the equivalent step in randQB_EI (line 10 of Algorithm 2.1).

In practice, the block Lanczos process may yield blocks $R_k$ or $L_k$ that are rank-deficient or nearly so. Here and with other block Krylov methods, it is typical to reduce the block size $b$ in response so that $R_k$ and $L_k$ retain full row rank and column rank, respectively. This process is known as deflation. For more background, we refer the reader to the survey paper by Gutknecht [11] and the references therein.

In the context of solving systems with multiple right-hand sides, Gutknecht stresses that deflation is highly desirable. Indeed, when solving a system such as $AX = B$, it is precisely the dimension reduction resulting from deflation that gives block methods an advantage over methods that solve each right-hand side separately. In this context, deflation might occur if $B$ is itself rank-deficient, or if $B$ has some notable rank structure in relation to the matrix $A$.

¹ Strictly speaking, Saad's analysis is for block Lanczos tridiagonalization applied to a symmetric matrix, as opposed to Lanczos bidiagonalization applied to a rectangular matrix. Our focus is on bidiagonalization, but the two processes are closely related.
When running block Lanczos with a randomly chosen starting matrix $V_1$ (i.e., $V_1 = \mathrm{qr}(\Omega)$, where $\Omega$ is a standard Gaussian matrix), we do not expect deflation to occur frequently, since $\Omega$ is not likely to have any notable structure with respect to $A$. Nonetheless, a reliable implementation should be prepared for the possibility, and so we examine the details here.

Björck [2] proposes computing the QR factorizations in lines 3-4 of Algorithm 2.2 using Householder reflections without column pivoting. The resulting matrix $B_k$ will be not just block bidiagonal, but a banded matrix whose effective bandwidth begins at $b$ and decreases with each deflation. Hnětynková et al. [13] refer to $B_k$ as a $b$-wedge shaped matrix. If the effective bandwidth decreases to zero, the bidiagonalization process will terminate.
Algorithm 3.1
Deflated QR (deflQR)
Input: $X \in \mathbb{R}^{m \times n}$, deflation tolerance $\delta$
Output: $Q \in \mathbb{R}^{m \times s}$ with orthonormal columns, $R \in \mathbb{R}^{s \times n}$, rank $s$
1: Compute the pivoted QR factorization $X\Pi = \hat{Q}\hat{R}$
2: Find the largest $s$ such that $|\hat{R}(s,s)| \ge \delta$
3: $R = \hat{R}(1\!:\!s, :)\,\Pi^T$
4: $Q = \hat{Q}(:, 1\!:\!s)$

We propose to instead use QR with column pivoting, which is slower and less elegant but simpler to implement in terms of readily available subroutines. The procedure is outlined in Algorithm 3.1, where the deflation tolerance $\delta$ is presumably somewhat larger than $\epsilon_{\mathrm{mach}}\|A\|$. (A MATLAB sketch of deflQR appears at the end of this section.) Lines 3-4 of Algorithm 2.2 would use this modified routine in place of unpivoted QR, and as Björck [2] notes, the recurrence in those lines will still work in the presence of deflation.

When using block Lanczos to solve systems of linear equations, deflation can be highly beneficial. In the context of matrix sketching, it is less desirable. Consider an extreme example where the columns of $V_1$ are right singular vectors of $A$: the Lanczos process will terminate after a single iteration, returning an approximation of the form $A \approx U_1 \Sigma_1 V_1^T$. Termination at this point would yield accurate singular vectors, but the factorization may not approximate $A$ to within the desired error tolerance.

As mentioned before, we do not expect deflation to occur frequently if $V_1$ is chosen randomly. However, if we made no further adjustments for deflation, our algorithm would fail to converge on cases as simple as $A = I$. In order to make our method more robust, we will replace any deflated vectors with new randomly drawn ones in order to keep the block column size constant. Similar augmentation techniques have been proposed to prevent breakdown in the case of the nonsymmetric Lanczos process [29] and GMRES [22].

More specifically, if Algorithm 3.1 returns a factorization $V_{k+1} L_{k+1}^T$ with rank $s$ less than $b$, we generate a standard Gaussian matrix $\Omega_k$ so that $[V_{k+1}, \Omega_k]$ has $b$ columns. We then orthogonalize $\Omega_k$ against $V_{k+1}$ and $V_{(k)}$, obtaining $\hat{V}_{k+1}$. The resulting matrix $[V_{k+1}, \hat{V}_{k+1}]$ is then used in place of $V_{k+1}$ in the next step of the Lanczos process.

In keeping with the spirit of one-sided reorthogonalization, we do not augment $U_k$ if a block $R_k$ is found to be rank-deficient. This will allow us to avoid accessing the matrix $U_{(k-1)}$ while the block Lanczos process is running. As a consequence, the blocks of $B_k$ will each have $b$ columns, but some may have fewer than $b$ rows.

We observe that in the presence of augmentation, the space $\mathrm{Span}\{V_{(k)}\}$ will not be a block Krylov space, but will instead be the sum of multiple block Krylov spaces with different dimensions. As of the time of writing we are not aware of any convergence results for this more general case.
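A sketch of Algorithm 3.1 in MATLAB, built on the built-in column-pivoted QR; the handling of the empty case and the scattering of columns back through the permutation are implementation details of ours.

% Sketch of Algorithm 3.1 (deflQR): pivoted QR with a deflation
% tolerance delta. Returns Q (m-by-s), R (s-by-n), and the rank s.
function [Q, R, s] = deflQR(X, delta)
    [Qh, Rh, p] = qr(X, 0);           % economy pivoted QR: X(:,p) = Qh*Rh
    d = abs(diag(Rh));                % |R(i,i)| is nonincreasing with pivoting
    s = find(d >= delta, 1, 'last');  % largest s with |R(s,s)| >= delta
    if isempty(s), s = 0; end
    R = zeros(s, size(X, 2));
    R(:, p) = Rh(1:s, :);             % undo the column permutation
    Q = Qh(:, 1:s);
end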
4. Fixed-accuracy algorithm.
Algorithm 4.1 presents code for randUBV. Ignoring the augmentation step in line 16, the cost is more or less equal to the cost of bGKL plus the cost of reorthogonalizing $V_{k+1}$ in line 11. Thus if we stop the process after $t$ iterations and set $\ell = tb$, the total cost is approximately

(4.1) $T_{\mathrm{randUBV}} \approx 2C_{\mathrm{mul}} mn\ell + C_{\mathrm{mul}} n\ell^2 + \tfrac{1}{2t}C_{\mathrm{mul}}(m+n)\ell^2 + \tfrac{1}{t}C_{\mathrm{qr}}(m+n)\ell^2$.

In light of (4.1), randUBV requires fewer floating point operations than randQB_EI when run for the same number of iterations, even when the latter is run with power parameter $p = 0$. In particular, the cost of one-sided reorthogonalization is only $O(n\ell^2)$, while the stabilization steps in lines 5 and 10 of randQB_EI cost $O((m+n)\ell^2)$. We can therefore expect that if $A$ is sparse and $m \gg n$, randUBV may run significantly faster.

Algorithm 4.1
Blocked bidiagonalization algorithm (randUBV)
Input: $A \in \mathbb{R}^{m \times n}$, block size $b$, relative error tolerance $\tau$, deflation tolerance $\delta$
Output: $U$, $B$, $V$, such that $\|A - UBV^T\|_F < \tau\|A\|_F$
(Approximate costs are shown in brackets.)
1: $E = \|A\|_F^2$
2: Draw a random standard Gaussian matrix $\Omega \in \mathbb{R}^{n \times b}$
3: $V_1 = \mathrm{qr}(\Omega)$   [$C_{\mathrm{qr}} nb^2$]
4: $U_0 = 0$; $L_1 = 0$
5: $V = V_1$; $U = [\,]$
6: for $k = 1, 2, 3, \ldots$ do
7:   $[U_k, R_k] = \mathrm{deflQR}(AV_k - U_{k-1}L_k, \delta)$   [$C_{\mathrm{mul}} mnb + \tfrac{1}{2}C_{\mathrm{mul}} mb^2 + C_{\mathrm{qrcp}} mb^2$]
8:   $U = [U, U_k]$
9:   $E = E - \|R_k\|_F^2$
10:  $V_{k+1} = A^T U_k - V_k R_k^T$   [$C_{\mathrm{mul}} mnb + \tfrac{1}{2}C_{\mathrm{mul}} nb^2$]
11:  $V_{k+1} = V_{k+1} - V(V^T V_{k+1})$   [$2k\,C_{\mathrm{mul}} nb^2$]
12:  $[V_{k+1}, L_{k+1}^T, s] = \mathrm{deflQR}(V_{k+1}, \delta)$   [$C_{\mathrm{qrcp}} nb^2$]
13:  $V = [V, V_{k+1}]$
14:  if $s < b$ then
15:    Draw a random standard Gaussian matrix $\Omega_k \in \mathbb{R}^{n \times (b-s)}$
16:    $\hat{V}_{k+1} = \mathrm{qr}(\Omega_k - V(V^T \Omega_k))$   [$2k\,C_{\mathrm{mul}} nb(b-s) + C_{\mathrm{qr}} n(b-s)^2$]
17:    $V = [V, \hat{V}_{k+1}]$
18:  end if
19:  $E = E - \|L_{k+1}\|_F^2$
20:  if $E < \tau^2\|A\|_F^2$ then stop
21: end for

Since our focus is on the fixed-accuracy algorithm, however, different algorithms (and for randQB_EI, different power parameters $p$) will converge after different numbers of iterations. We must therefore consider not just the cost per iteration, but how quickly the approximations converge. We discuss this matter further along with the numerical experiments in section 5.

It is noted in [31] that due to cancellation, the computed value of $E = \|A\|_F^2 - \|B\|_F^2$ may be inaccurate when $E$ is very small. In order to estimate the error $E$ to within a relative tolerance of $\gamma$ (say, $\gamma = 1\%$), the authors suggest that the absolute accuracy tolerance $\tau$ for the QB factorization should be set large enough to satisfy

(4.2) $\tau > \sqrt{E} \ge \sqrt{\epsilon_{\mathrm{mach}}/\gamma}\,\|A\|_F$,

where $\epsilon_{\mathrm{mach}}$ is the machine precision. In short, the proposed method of error estimation cannot reliably estimate a relative error below $2\sqrt{\epsilon_{\mathrm{mach}}}$.

We provide a similar analysis in order to account for deflation and loss of orthogonality of $U_{(k)}$. In particular, we show that the error estimate can remain accurate even as $U_{(k)}$ loses orthogonality in practice. To that end, we define the local loss of orthogonality of a matrix as follows:

Definition 4.1.
Given a matrix $U_{(k)} = [U_1, \ldots, U_k]$, the local loss of orthogonality of $U_{(k)}$ is defined as

$\varepsilon_k = \max\left\{\max_{1 \le i \le k} \|U_i^T U_i - I\|,\ \max_{2 \le i \le k} \|U_{i-1}^T U_i\|\right\}$.

The main idea is that we do not require $\|U_{(k)}^T U_{(k)} - I\|$ to be small. Instead, we need only the milder condition that adjacent blocks be close to orthogonal. This idea bears some resemblance to the work [26], which uses local recurrence formulas to show that certain error estimates for the conjugate gradient method remain accurate in a finite precision setting.

Lemma 4.2.
Consider the matrix $U_{(k)} = [U_1, \ldots, U_k]$, and let $\varepsilon_k$ denote the local loss of orthogonality of $U_{(k)}$. Let $B_k$ be a block upper bidiagonal matrix whose blocks are partitioned conformally with those of $U_{(k)}$. Then

$\|U_{(k)} B_k\|_F^2 = (1 + \theta)\|B_k\|_F^2, \qquad |\theta| \le 2\varepsilon_k$.

Proof.
We will find the squared Frobenius norm of $U_{(k)} B_k$ one block column at a time, and use the fact that since $B_k$ is block bidiagonal, each block column in the product uses at most two adjacent blocks of $U_{(k)}$.

Let $\{R_i\}_{i=1}^k$ denote the blocks on the main block diagonal of $B_k$, and let $\{L_i\}_{i=2}^{k+1}$ denote the off-diagonal blocks. Then for $2 \le i \le k$, the squared Frobenius norm of the $i$-th block column of $U_{(k)} B_k$ is given by

(4.3) $\|U_{i-1}L_i + U_i R_i\|_F^2 = \|U_{i-1}L_i\|_F^2 + \|U_i R_i\|_F^2 + 2\,\mathrm{tr}\left(R_i^T U_i^T U_{i-1} L_i\right)$.

Examining the first term, it can be seen that

$\|U_{i-1}L_i\|_F^2 = \mathrm{tr}(L_i^T U_{i-1}^T U_{i-1} L_i) = \mathrm{tr}(L_i^T (U_{i-1}^T U_{i-1} - I) L_i) + \mathrm{tr}(L_i^T L_i) = (1 + \theta_1)\|L_i\|_F^2$,

where $|\theta_1| \le \varepsilon_k$. A similar result applies to the term $\|U_i R_i\|_F^2$. As for the final term, we find that

$2\left|\mathrm{tr}\left(R_i^T U_i^T U_{i-1} L_i\right)\right| \le 2\|U_i^T U_{i-1}\|\,\|R_i\|_F \|L_i\|_F \le 2\varepsilon_k \|R_i\|_F \|L_i\|_F \le \varepsilon_k\left(\|R_i\|_F^2 + \|L_i\|_F^2\right)$.

By adding these expressions back together we arrive at the bound

(4.4) $\|U_{i-1}L_i + U_i R_i\|_F^2 = (1 + \theta)\left(\|R_i\|_F^2 + \|L_i\|_F^2\right), \qquad |\theta| \le 2\varepsilon_k$,

so the desired relative error bound holds for each block column (the first and last columns may be checked separately). The main claim then follows by summing over the block columns.

As long as we reorthogonalize $V_{(k)}$, and in the absence of deflation, the first relation in (2.5) will remain accurate to machine precision regardless of the orthogonality of $U_{(k)}$ (as noted in [25], the second relation will not). In the presence of deflation, the first relation must be amended slightly. We rewrite it as

(4.5) $AV_{(k)} = U_{(k)} B_k + D_k$,

where $B_k$ is here shorthand for $B_k(:, 1\!:\!kb)$ and $D_k$ is a matrix accounting for all deflations in $U_{(k)}$. Assuming the column pivoting in Algorithm 3.1 selects at each step the column with the largest 2-norm, it can be verified that $\|D_k\|_F \le \delta\sqrt{d}$, where $\delta$ is the deflation tolerance and $d$ is the total number of columns that have been removed from $U_{(k)}$ through deflation.

We now show that the error estimate $E = \|A\|_F^2 - \|B_k\|_F^2$ will remain accurate up to terms involving the deflation tolerance and the local loss of orthogonality in $U_{(k)}$. The proof makes the simplifying assumptions that $V_{(k+1)}$ has orthonormal columns and that there is no rounding error term in (4.5), but accounting for both of these effects will change the bound (4.6) by at most $O(\epsilon_{\mathrm{mach}}\|A\|_F^2)$. The proof also ignores the effect of cancellation in the computation of $E$, so as with [31] we cannot expect to reliably estimate a relative error below $\sqrt{\epsilon_{\mathrm{mach}}}$.

Theorem 4.3.
Given a matrix $A$, let $U_{(k+1)}$, $B_{k+1}$, and $V_{(k+1)}$ be as produced by Algorithm 4.1 with deflation tolerance $\delta$. Let $\varepsilon_{k+1}$ denote the local loss of orthogonality of $U_{(k+1)}$. Assume that $V_{(k+1)}$ has orthonormal columns. Assume that (4.5) holds exactly at each iteration, and let $d$ be the number of columns removed from $U_{(k+1)}$ due to deflation. Finally, let $E = \|A\|_F^2 - \|B_k\|_F^2$. Then

(4.6) $\|A - U_{(k)} B_k V_{(k+1)}^T\|_F^2 \le E + 4\varepsilon_{k+1}\|A\|_F^2 + 2\delta\sqrt{d}\,(1 + 2\varepsilon_{k+1})\|A\|_F$.

Proof.
First, by assuming the columns of $V_{(k+1)}$ are orthonormal, we find that

(4.7) $\|A - U_{(k)} B_k V_{(k+1)}^T\|_F^2 = \|A\|_F^2 + \|U_{(k)} B_k\|_F^2 - 2\,\mathrm{tr}(AV_{(k+1)} B_k^T U_{(k)}^T)$.

By assuming that (4.5) holds exactly at each step, we also get the identity

$AV_{(k+1)} = U_{(k+1)} B_{k+1} + D_{k+1} = U_{(k)} B_k + [0,\ U_{k+1}R_{k+1}] + D_{k+1}$,

where $\|D_{k+1}\|_F \le \delta\sqrt{d}$. It follows that

(4.8) $\mathrm{tr}(AV_{(k+1)} B_k^T U_{(k)}^T) = \|U_{(k)} B_k\|_F^2 + \mathrm{tr}(U_k^T U_{k+1} R_{k+1} L_{k+1}^T) + \mathrm{tr}(D_{k+1} B_k^T U_{(k)}^T)$.

From the definition of $\varepsilon_{k+1}$ we have

(4.9) $\left|\mathrm{tr}(U_k^T U_{k+1} R_{k+1} L_{k+1}^T)\right| \le \|U_k^T U_{k+1}\|\,\|R_{k+1}\|_F \|L_{k+1}\|_F \le \varepsilon_{k+1}\|A\|_F^2$,

and since $\|D_{k+1}\|_F \le \delta\sqrt{d}$ we also have

(4.10) $\left|\mathrm{tr}(D_{k+1} B_k^T U_{(k)}^T)\right| \le \|D_{k+1}\|_F \|U_{(k)} B_k\|_F \le \delta\sqrt{d}\,\|U_{(k)} B_k\|_F$.

Lemma 4.2 gives us bounds on $\|U_{(k)} B_k\|_F^2$, so by returning to (4.7) and using (4.8), (4.9), and (4.10), we conclude that

$\|A - U_{(k)} B_k V_{(k+1)}^T\|_F^2 = \|A\|_F^2 + \|U_{(k)} B_k\|_F^2 - 2\,\mathrm{tr}(AV_{(k+1)} B_k^T U_{(k)}^T)$
$\le \|A\|_F^2 - \|U_{(k)} B_k\|_F^2 + 2\varepsilon_{k+1}\|A\|_F^2 + 2\delta\sqrt{d}\,\|U_{(k)} B_k\|_F$
$\le E + 4\varepsilon_{k+1}\|A\|_F^2 + 2\delta\sqrt{d}\,(1 + 2\varepsilon_{k+1})\|A\|_F$.
Thus as long as local orthogonality is maintained for $U_{(k)}$, and as long as the number of deflations is not too large, we can expect $E$ to remain an accurate estimate of the squared Frobenius norm approximation error, at least when the error tolerance is not too small.

Recall that our original goal for the fixed-accuracy problem was not just to find a factorization that satisfies the bound $\|A - UBV^T\|_F < \tau$, but to find the factorization with the smallest rank that does so. In order to accomplish this, we may compute the SVD of $B$ as $B = \hat{U}\Sigma\hat{V}^T$, truncate it to the smallest rank $r$ such that $\|A - (U\hat{U}_r)\Sigma_r(V\hat{V}_r)^T\|_F < \tau$, and then approximate the left and right singular vectors of $A$ by $U\hat{U}_r$ and $V\hat{V}_r$. It should be noted that since $B$ is a block bidiagonal matrix, its SVD can in theory be computed more efficiently than if $B$ were dense. Algorithms for computing the SVD typically first reduce the matrix to bidiagonal form [6], and $B$ can be efficiently reduced to this form using band reduction techniques as in [15].

This postprocessing step takes on additional importance when dealing with the block Lanczos method rather than subspace iteration. Where subspace iteration will yield a matrix $B$ whose singular values are all decent approximations of the top singular values of $A$, the factor $B$ produced by the Lanczos method will contain approximations to the smallest singular values of $A$ as well [9]. It is therefore possible that the matrix $B$ produced by randUBV can be truncated significantly without diminishing the quality of the approximation.

In fact, if one has the goal of obtaining a factorization whose rank is as small as possible, we recommend setting the stopping tolerance $\tau_{\mathrm{stop}}$ slightly smaller than the desired approximation tolerance $\tau_{\mathrm{err}}$ (or similarly, running the algorithm for a few more iterations after the approximation tolerance has already been satisfied). Doing so may significantly reduce the rank $r$ of the truncated SVD, which will in turn pay dividends by reducing the cost of computing $U\hat{U}_r$ and $V\hat{V}_r$.
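The truncation step just described can be sketched in MATLAB as follows. Here U, B, V are the factors returned by randUBV, E2 denotes the squared error estimate tracked by the algorithm, and tau and normA are the relative tolerance and $\|A\|_F$; all variable names are our own, and the error-additivity used on the third line holds exactly only in exact arithmetic.

% Truncate A ~ U*B*V' to the smallest rank r with
% ||A - (U*Uh_r)*S_r*(V*Vh_r)'||_F <= tau*normA. In exact arithmetic,
% discarding singular values of B adds the sum of their squares to the
% squared error, so we combine E2 with a cumulative tail sum.
[Uh, S, Vh] = svd(full(B), 'econ');
s2 = diag(S).^2;
tail2 = cumsum(s2, 'reverse') - s2;   % tail2(r) = sum_{j>r} s2(j)
r = find(E2 + tail2 <= (tau*normA)^2, 1, 'first');
Ur = U*Uh(:, 1:r);                    % approximate left singular vectors
Vr = V*Vh(:, 1:r);                    % approximate right singular vectors
Sr = S(1:r, 1:r);                     % A ~ Ur*Sr*Vr'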
5. Numerical experiments.
Here we report the results of numerical experiments on synthetic and real test cases. We run four sets of experiments in order to examine the following:
1. The rate of convergence by iteration. We use synthetic matrices whose spectra decay at different rates, and compare randUBV with randQB_EI using power iterations $p = 0, 1, 2$.
2. The effect of the matrix dimensions, sparsity, and approximation rank on the runtime.
3. The effect of the block size $b$ on the time and number of iterations required for convergence.
4. The effect of choosing $\tau_{\mathrm{stop}} < \tau_{\mathrm{err}}$ on the quality of the approximation.
All experiments were carried out in MATLAB 2020b on a 4-core Intel Core i7 with 32GB RAM.

Fig. 5.1: Convergence rate by iteration. (a) Left: slow decay. Right: very slow decay. (b) Left: fast decay. Right: singular values have multiplicity greater than the block size. In all cases but the last, randUBV requires fewer iterations for convergence than randQB_EI with $p = 0$ but more than randQB_EI with $p = 1$.

For our first set of test cases we created matrices of size $2000 \times 2000$ of the form $A = U\Sigma V^T$, where $U$ and $V$ were formed by orthogonalizing standard Gaussian matrices and $\Sigma$ was set in the following manner:
• (Matrix 1) Slow decay, in which $\sigma_j = 1/j^2$.
• (Matrix 2) Very slow decay, in which $\sigma_j = 1/j$.
• (Matrix 3) Fast decay, in which $\sigma_j = \exp(-j/20)$.
• (Matrix 4) Step function decay, in which $\sigma_j = 10^{-c(\lceil j/30 \rceil - 1)}$ for a fixed constant $c > 0$, so that each singular value of $A$ (except for the smallest) has multiplicity 30.

In all four cases, we ran the sketching algorithms to a maximum rank $k = 200$ using block size $b = 10$. The deflation tolerance $\delta$ was set to a small multiple of $\sqrt{\|A\|_1 \|A\|_\infty}$ (a cheap upper bound on $\|A\|$), but we did not encounter deflation in any of these cases.

Results are shown in Figure 5.1. In the first three test cases, the approximation error for randUBV was smaller than that of randQB_EI (with power parameter $p = 0$) for every iteration after the first. It lagged somewhat behind randQB_EI with $p = 1$ or $p = 2$, both of which were quite close to optimal. In the final case, where the singular values of $A$ were chosen to have multiplicity larger than the block size, randUBV lagged significantly behind even randQB_EI with $p = 0$. We note that the algorithm randUBV did nonetheless converge, which would not have been possible in exact arithmetic.
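Test matrices of this kind take only a few lines of MATLAB to construct. The sketch below generates a matrix with the step-function spectrum of Matrix 4; the specific value of the step constant c is our own illustrative choice.

% Construct a test matrix A = U*diag(sigma)*V' with prescribed singular
% values; here the step-function spectrum of Matrix 4 (clusters of 30).
n = 2000;
c = 0.6;                                % step constant (illustrative value)
[Uo, ~] = qr(randn(n), 0);              % random orthogonal factor
[Vo, ~] = qr(randn(n), 0);
j = (1:n)';
sigma = 10.^(-c*(ceil(j/30) - 1));      % constant on blocks of 30 indices
A = Uo*diag(sigma)*Vo';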
Finally, we offer a snapshot of the singular values of $B$ after the algorithms have terminated. Results for test cases 1 and 4 are shown in Figure 5.2. We note that the leading singular values returned by randUBV are more accurate than those returned by randQB_EI with $p = 0$, and comparable to the cases $p = 1$ or $p = 2$. The smallest singular values for randUBV are much smaller than their randQB counterparts, which appears to be undesirable but has a bit of a silver lining: it suggests that the rank of $B_k$ can be truncated without losing much approximation accuracy.

Fig. 5.2: Singular values of $B_k$ after termination. Left: slow decay. Right: step function decay.

For our second set of test cases we generated random sparse matrices as
A = sprand(m,n,d) with $n = 4000$ columns and varying numbers of rows $m$ and densities $d$. We then approximated $A$ to a variable rank $k$ using randUBV and randQB_EI with $p = 0$. We tested three different variations:
• Number of rows $m$ varying from 8000 to 40000, with rank $k = 600$ and the density $d$ held fixed.
• Number of rows $m = 24000$, with rank $k$ varying from 200 to 1000 and the density $d$ held fixed.
• Number of rows $m = 24000$ and rank $k = 600$, with the density varying from $d = 0.$
4% to $d = 2\%$.

Fig. 5.3: Effects of sparsity (left) and approximation rank (right) on run time.

Results for the second and third cases are shown in Figure 5.3, and they confirm our general expectations: for a rectangular matrix with $m > n$, if the matrix is sparse or the approximation rank is large, then randUBV will gain a competitive advantage over randQB_EI due to the fact that it uses one-sided reorthogonalization. This effect will be more pronounced the larger $m$ is compared to $n$, although we found that changing $m$ alone did not have much effect on the relative runtimes of the two algorithms.

For our third set of test cases, we examine how the choice of block size affects the time and number of iterations required for convergence. We use one synthetic matrix and two real ones. The synthetic matrix is a $4000 \times 4000$ matrix with the step-function singular value profile $\sigma_j = 10^{-c(\lceil j/30 \rceil - 1)}$ used above; thus each singular value except for the last has multiplicity 30. The first real matrix is a dense grayscale image with 3168 rows, shown in Figure 5.4. The second, lp_cre_b, comes from a linear programming problem in the SuiteSparse collection [3], and is a $9648 \times 77137$ matrix with 260,
785 nonzero elements and at most 9 nonzero elements per column. This second matrix has several sizeable clusters of singular values, visible in the plot of its leading singular values in Figure 5.4. The median relative gap $(\sigma_k - \sigma_{k+1})/\sigma_{k+1}$ among the first 800 singular values is quite small, and the smallest relative gap is smaller still. Prior to running the sketching algorithms, both matrices were transposed in order to have more rows than columns.

Fig. 5.4: Left: image of pinus glabra. Right: leading singular values of lp_cre_b.

We compare randUBV to randQB_EI with power parameter $p = 1$. For both algorithms we approximate the synthetic matrix to a relative error $\tau_{\mathrm{err}} = 0.01$, the grayscale image to a relative error $\tau_{\mathrm{err}} = 0.$
1, and the SuiteSparse matrix to a fixed relative error $\tau_{\mathrm{err}}$ as well. Results are shown in Figure 5.5.

The behavior of randQB_EI was fairly straightforward: using larger block sizes was more efficient, at least up to the point where the block size was large enough to waste computation by computing $Q$ and $B$ to a larger rank than necessary. This makes sense, because larger block sizes offer more opportunities for using BLAS 3 operations and parallelization. Relatedly, we note that MATLAB's svdsketch function adaptively increases the block size in order to accelerate convergence.

The behavior of randUBV was very similar to that of randQB_EI on the grayscale image, but less so in the other two cases. For the synthetic matrix whose singular values were distributed according to a step function, increasing $b$ from just below the cluster size to just above it led to a sharp drop in both the time and number of iterations required. On the matrix lp_cre_b, the optimal block size was near $b = 10$, even though the approximation rank was close to constant over all block sizes tested. We speculate that the reason for this is that lp_cre_b is both sparse and rectangular, so dense QR operations are a significant portion of the cost of the algorithm. Looking back to the cost of randUBV as shown in (4.1), we note that using a smaller block size reduces the cost of performing QR operations on $U$.

Fig. 5.5: Effect of block size on the time and number of iterations required for convergence. (a) Step function decay. (b) Grayscale image. (c) SuiteSparse matrix lp_cre_b.
In our final set of experiments we examined the effect of choosing a stopping tolerance $\tau_{\mathrm{stop}}$ smaller than the desired approximation error tolerance $\tau_{\mathrm{err}}$, with the conjecture that doing so would allow randUBV to attain significantly better compression rates. We used randQB_EI with $p = 0, 1, 2$ for comparison. In the first step, each algorithm was run until its error estimate dropped below $\tau_{\mathrm{stop}}$. In the second step, the SVD of $B$ was then computed and truncated as $B_r = U_r \Sigma_r V_r^T$ to the smallest rank such that $\|A - B_r\|_F \le \tau_{\mathrm{err}}\|A\|_F$, and the singular vectors of $A$ computed as $UU_r$ and $VV_r$ (or as $QU_r$ for randQB_EI). The time required for each of these two stages was recorded using tic and toc.

Method     τ_stop   t_fac   t_svd   t_total   k     r
SVD        –        –       13.52   13.52     –     388
UBV        0.1      0.68    0.08    0.76      520   439
UBV        0.09     0.87    0.11    0.98      600   392
QB(P=0)    0.1      1.22    0.21    1.44      700   663
QB(P=1)    0.1      1.12    0.09    1.22      440   420
QB(P=2)    0.1      1.55    0.08    1.63      420   398

Fig. 5.6: Results for image data with approximation tolerance τ_err = 0.1 (times in seconds).

For the image data, we ran all algorithms to a relative error of $\tau_{\mathrm{stop}} = \tau_{\mathrm{err}} = 0.1$ using block size $b = 20$, and for randUBV additionally considered the stricter stopping tolerance $\tau_{\mathrm{stop}} = 0.09$. In the table, $t_{\mathrm{fac}}$ is the time required for the QB or UBV factorization, $t_{\mathrm{svd}}$ is the time required to compute the SVD of $B$ and the new singular vectors of $A$, and $t_{\mathrm{total}} = t_{\mathrm{fac}} + t_{\mathrm{svd}}$. Finally, $k$ is the rank at which the algorithm was terminated, and $r$ the rank to which $B$ was truncated. The first line represents the time required to directly compute the SVD of $A$ and the optimal truncation rank.

We observe that randUBV ran faster than randQB_EI regardless of the value of the power parameter $p$. Even though it required more iterations to converge than randQB_EI with $p = 1$ or $p = 2$, it required fewer matrix-vector products with $A$ or $A^T$ per iteration. Furthermore, running randUBV to a stopping tolerance that was slightly smaller than the truncation tolerance took somewhat longer but resulted in nearly optimal compression, even superior to subspace iteration with $p = 2$.

For the matrix lp_cre_b from the SuiteSparse collection, we ran two trials. In the first, we ran all algorithms to the rather modest relative error of $\tau_{\mathrm{stop}} = \tau_{\mathrm{err}} = 0.$
5, and for randUBV considered the stricter stopping tolerance $\tau_{\mathrm{stop}} = 0.45$. In the second, we ran the algorithms to the stricter relative error of $\tau_{\mathrm{stop}} = \tau_{\mathrm{err}} = 0.15$, and for randUBV additionally considered $\tau_{\mathrm{stop}} = 0.14$. We used block size $b = 50$ for both trials.
Method     τ_stop   t_fac   t_svd   t_total   k      r
SVD        –        –       –       –         –      608
UBV        0.5      4.69    0.93    5.62      900    747
UBV        0.45     5.68    0.99    6.67      1050   627
QB(P=0)    0.5      8.33    8.32    16.66     1150   1123
QB(P=1)    0.5      5.16    3.69    8.85      700    676
QB(P=2)    0.5      6.54    3.21    9.75      650    627

Fig. 5.7: Results for lp_cre_b with approximation tolerance τ_err = 0.5 (times in seconds).

Method     τ_stop   t_fac   t_svd   t_total   k      r
SVD        –        –       –       –         –      2082
UBV        0.15     21.28   12.15   33.43     2600   2293
UBV        0.14     24.09   13.88   37.98     2700   2150
QB(P=0)    0.15     72.36   63.25   135.61    3600   3505
QB(P=1)    0.15     38.05   22.91   60.97     2150   2147
QB(P=2)    0.15     48.00   21.59   69.59     2100   2100

Fig. 5.8: Results for lp_cre_b with approximation tolerance τ_err = 0.15 (times in seconds).

Due to the size of $A$, we did not attempt to compute its SVD directly but instead found the optimal truncation rank using the precomputed singular values available online [3].

Once again, randUBV ran faster than its subspace-iteration-based counterpart, and using a slightly smaller stopping tolerance $\tau_{\mathrm{stop}}$ improved the compression ratio without significantly increasing the runtime. The iteration $k$ at which randUBV terminated was significantly smaller than it was for randQB_EI with $p = 0$, but significantly larger than for randQB_EI with $p = 1$ or $p = 2$ (perhaps in part due to the singular value clusters).

It should be noted that the matrix $A$ in question is quite sparse, with only about 0.
03% of its entries nonzero, and fairly skinny, with $m \approx 8n$. It is therefore worth exploring whether randQB_EI might save time on reorthogonalization costs if performed on $A^T$ instead. We re-ran the experiment for $\tau_{\mathrm{err}} = 0.15$, and found that while the factorization time $t_{\mathrm{fac}}$ did not change much, the second step $t_{\mathrm{svd}}$ took around twice as long due to the matrix $B$ being $k \times m$ rather than $k \times n$.
6. Conclusions.
We have proposed a randomized algorithm randUBV that takes a matrix $A$ and uses block Lanczos bidiagonalization to find an approximation of the form $UBV^T$, where $U$ and $V$ each have orthonormal columns in exact arithmetic and $B$ is a block bidiagonal matrix. For square matrices it costs approximately the same per iteration as randQB-type methods run with power parameter $p = 0$, while having better convergence properties. On rectangular matrices, it exploits one-sided reorthogonalization to run faster without much degrading the accuracy of the error estimator. Numerical experiments suggest that randUBV is generally competitive with existing randQB-type methods, at least as long as the problem is not so large that it becomes important to minimize the number of passes over $A$.
A few avenues for future exploration are suggested. First and most importantly, roundoff error allows block Lanczos methods to handle repeated singular values, which they would be unable to do in exact arithmetic. This fact has been known for decades, but we are not currently aware of any rigorous convergence bounds that account for finite precision. Second, reinflation (restoring the block size after deflation) or any more general method for adaptively changing the block size $b$ will make the span of $V$ a sum of Krylov spaces of different dimensions. We are not aware of any convergence results that cover this more general setting.

It is also worth exploring just how much the block Lanczos method benefits from oversampling. We have observed that running randUBV for a few more iterations than necessary can result in near-optimal compression, but it would be worthwhile to turn the convergence results of e.g. [32] into practical guidance on how many more iterations are necessary.

Finally, the behavior of $U$ when using one-sided reorthogonalization merits further study. We generally found that when using a larger stopping tolerance $\tau$ the columns of $U$ remained closer to orthonormal. It would be highly desirable to obtain a rigorous result establishing that one-sided reorthogonalization is safe as long as only a rough approximation is required, but we leave this goal for a future work.

MATLAB code is available at https://github.com/erhallma/randUBV, including our main algorithm randUBV as well as code used to reproduce the figures used in this paper.
The author would like to thank Ilse Ipsen and Arvind Saibaba for their helpful comments on an earlier draft of this paper.
REFERENCES

[1] Z. Bai, D. Day, and Q. Ye, ABLE: an adaptive block Lanczos method for non-Hermitian eigenvalue problems, SIAM Journal on Matrix Analysis and Applications, 20 (1999), pp. 1060–1082.
[2] A. Björck, Block bidiagonal decomposition and least squares problems, Perspectives in Numerical Analysis, Helsinki, 2008.
[3] T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, 38 (2011), https://doi.org/10.1145/2049662.2049663.
[4] P. Drineas, I. C. Ipsen, E.-M. Kontopoulou, and M. Magdon-Ismail, Structural convergence results for approximation of dominant subspaces from block Krylov spaces, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 567–586.
[5] D. C.-L. Fong and M. Saunders, LSMR: An iterative algorithm for sparse least-squares problems, SIAM Journal on Scientific Computing, 33 (2011), pp. 2950–2971.
[6] G. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2 (1965), pp. 205–224.
[7] G. Golub, R. Underwood, and J. Wilkinson, The Lanczos algorithm for the symmetric Ax = λBx problem, Tech. Report STAN-CS-72-270, Department of Computer Science, Stanford University, 1972.
[8] G. H. Golub, F. T. Luk, and M. L. Overton, A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix, ACM Transactions on Mathematical Software (TOMS), 7 (1981), pp. 149–169.
[9] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 4th ed., 2013.
[10] J. F. Grcar, Analyses of the Lanczos Algorithm and of the Approximation Problem in Richardson's Method, PhD thesis, University of Illinois at Urbana-Champaign, 1982.
[11] M. H. Gutknecht, Block Krylov space methods for linear systems with multiple right-hand sides: an introduction, 2006.
[12] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217–288.
[13] I. Hnětynková, M. Plešinger, and Z. Strakoš, Band generalization of the Golub–Kahan bidiagonalization, generalized Jacobi matrices, and the core problem, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 417–434.
[14] S. Karimi and F. Toutounian, The block least squares method for solving nonsymmetric linear systems with multiple right-hand sides, Applied Mathematics and Computation, 177 (2006), pp. 852–862.
[15] L. Kaufman, Band reduction algorithms revisited, ACM Transactions on Mathematical Software (TOMS), 26 (2000), pp. 551–567.
[16] R.-C. Li and L.-H. Zhang, Convergence of the block Lanczos method for eigenvalue clusters, Numerische Mathematik, 131 (2015), pp. 83–113.
[17] P.-G. Martinsson and S. Voronin, A randomized blocked algorithm for efficiently computing rank-revealing factorizations of matrices, SIAM Journal on Scientific Computing, 38 (2016), pp. S485–S507.
[18] MATLAB, version 9.9.0 (R2020b), The MathWorks Inc., Natick, Massachusetts, 2020.
[19] C. Musco and C. Musco, Randomized block Krylov methods for stronger and faster approximate singular value decomposition, in Advances in Neural Information Processing Systems, 2015, pp. 1396–1404.
[20] C. C. Paige, The computation of eigenvalues and eigenvectors of very large sparse matrices, PhD thesis, University of London, 1971.
[21] B. N. Parlett and D. S. Scott, The Lanczos algorithm with selective orthogonalization, Mathematics of Computation, 33 (1979), pp. 217–238.
[22] L. Reichel and Q. Ye, Breakdown-free GMRES for singular systems, SIAM Journal on Matrix Analysis and Applications, 26 (2005), pp. 1001–1021.
[23] Y. Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM Journal on Numerical Analysis, 17 (1980), pp. 687–706.
[24] H. D. Simon, The Lanczos algorithm with partial reorthogonalization, Mathematics of Computation, 42 (1984), pp. 115–142.
[25] H. D. Simon and H. Zha, Low-rank matrix approximation using the Lanczos bidiagonalization process with applications, SIAM Journal on Scientific Computing, 21 (2000), pp. 2257–2274.
[26] Z. Strakoš and P. Tichý, On error estimation in the conjugate gradient method and why it works in finite precision computations, Electronic Transactions on Numerical Analysis, 13 (2002), pp. 56–80.
[27] F. Toutounian and M. Mojarrab, The block LSMR method: a novel efficient algorithm for solving non-symmetric linear systems with multiple right-hand sides, Iranian Journal of Science and Technology (Sciences), 39 (2015), pp. 69–78.
[28] S. Wang, Z. Zhang, and T. Zhang, Improved analyses of the randomized power method and block Lanczos method, arXiv preprint arXiv:1508.06429, (2015).
[29] Q. Ye, A breakdown-free variation of the nonsymmetric Lanczos algorithms, Mathematics of Computation, 62 (1994), pp. 179–207.
[30] Q. Ye, An adaptive block Lanczos algorithm, Numerical Algorithms, 12 (1996), pp. 97–110.
[31] W. Yu, Y. Gu, and Y. Li, Efficient randomized algorithms for the fixed-precision low-rank matrix approximation, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1339–1359.
[32] Q. Yuan, M. Gu, and B. Li, Superlinear convergence of randomized block Lanczos algorithm, in 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 1404–1409.
[33]