Computing rank-revealing factorizations of matrices stored out-of-core
Nathan Heavner, Per-Gunnar Martinsson, Gregorio Quintana-Ortí
Abstract:
This paper describes efficient algorithms for computing rank-revealing factorizations of matrices that are too large to fit in RAM and must instead be stored on slow external memory devices such as solid-state or spinning disk hard drives (out-of-core or out-of-memory). Traditional algorithms for computing rank-revealing factorizations, such as the column pivoted QR factorization, or techniques for computing a full singular value decomposition of a matrix, are very communication intensive. They are naturally expressed as a sequence of matrix-vector operations, which become prohibitively expensive when data is not available in main memory. Randomization allows these methods to be reformulated so that large contiguous blocks of the matrix can be processed in bulk. The paper describes two distinct methods. The first is a blocked version of column pivoted Householder QR, organized as a "left-looking" method to minimize the number of write operations (which are more expensive than read operations on a spinning disk drive). The second method results in a so-called UTV factorization which expresses a matrix A as A = U T V*, where U and V are unitary and T is triangular. This method is organized as an algorithm-by-blocks, in which floating point operations overlap read and write operations. The second method incorporates power iterations, and is exceptionally good at revealing the numerical rank; it can often be used as a substitute for a full singular value decomposition. Numerical experiments demonstrate that the new algorithms are almost as fast when processing data stored on a hard drive as traditional algorithms are for data stored in main memory. To be precise, the computational time for fully factorizing an n × n matrix scales as c n^3, with a scaling constant c that is only marginally larger when the matrix is stored out of core.

Keywords:
Randomized numerical linear algebra; rank-revealing factorization; partial rank-revealing factorization; out-of-core computation; randUTV factorization; HQRRP factorization; Householder QR factorization; blocked matrix computations; shared-memory multiprocessors; shared-memory multicore processors.

1. INTRODUCTION
1.1. Problem formulation.
We consider the task of computing a rank-revealing factorization of a matrix that is so large that it must be stored on an external memory device such as a spinning or solid-state disk drive. In this case, the matrix is said to be stored out-of-core. Specifically, given an m × n matrix A, we seek a factorization of the form

(1)   A = U R V*,

where U is an m × m unitary matrix, V is an n × n unitary matrix, and R is an m × n upper triangular matrix. We use the term "rank-revealing" in an informal way, as meaning simply that for any k with 1 ≤ k ≤ min(m, n), the truncation of (1) to the k dominant terms provides a rank-k approximation to A that is almost as good as the theoretically optimal one. To be precise, with A_k = U(:, 1:k) R(1:k, :) V*, we ask that

‖A − A_k‖ ≈ inf { ‖A − B‖ : B has rank k }.

Problems which make use of such a factorization include solving ill-conditioned linear systems, rank-deficient and total least squares problems [25, 12, 43], subset selection [31], and matrix approximation [10, 26], among others.

Department of Applied Mathematics, University of Colorado at Boulder, 526 UCB, Boulder, CO 80309-0526, USA
Department of Mathematics, University of Texas at Austin, Stop C1200, Austin, TX 78712-1202, USA
Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, 12.071 Castellón, Spain
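As a simple illustration of the truncation property, consider the following sketch (plain NumPy; the SVD is used only as a stand-in for a factorization of the form (1), since for the SVD the truncation error is provably optimal):

```python
import numpy as np

m, n, k = 200, 100, 10
A = np.random.standard_normal((m, n))

# A factorization of the form (1); for the SVD, the middle factor is diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncation to the k dominant terms: A_k = U(:, 1:k) R(1:k, :) V*.
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(A - Ak, 2)  # ||A - A_k|| in the spectral norm
opt = s[k]                       # optimal rank-k error (Eckart-Young)
print(err, opt)                  # equal for the SVD; a rank-revealing QR is merely close
```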
1.2. Prior work. There are several well-established options for computing a rank-revealing factorization. If a full factorization is required, the singular value decomposition (SVD) and the QR decomposition with column pivoting (CPQR) are two popular options. The SVD provides theoretically optimal low-rank approximations for many choices of matrix norms, but can be prohibitively expensive to compute. The CPQR usually reveals rank reasonably well (see, e.g., [22] for a notable exception) and requires much less computational work than the SVD. Neither factorization is easily implemented when the matrix is stored out-of-core, however. The traditional algorithms for computing each require many slow matrix-vector operations, which in turn necessitate many read operations (data brought from an external device to main memory) and write operations (data sent from main memory to an external device). See, e.g., [42, 13, 23] for information on standard algorithms for computing the SVD, and [11, 39, 34] for information on the most common CPQR algorithm and its restriction to matrix-vector operations.

Some software efforts in computing factorizations of dense matrices stored on external devices include POOCLAPACK [16, 36, 15], the SOLAR library [41], a runtime to execute algorithms-by-blocks when the matrix is stored on disk [32], as well as an out-of-core extension [7] to ScaLAPACK [2] that does not appear in the current version of the library. Even for these libraries, factorization routines are usually limited to Cholesky, unpivoted QR, and LU decompositions. A truncated SVD algorithm that is effective for out-of-core matrices was introduced in [17]. This technique, however, only yields a partial factorization, and it requires that an upper bound for the target rank k be known in advance. Recent efforts have been able to compute the SVD of large dense matrices stored on an external device. Demchik et al. [8] employed randomization to compute the SVD, but no performance results for dense matrices were shown. Kabir et al. [21] used a two-stage approach to compute the SVD that reduces the dense matrix to band form in a first stage, which allowed them to factorize matrices of very large dimension. To our knowledge, there is no currently available and widely used software for computing a full rank-revealing factorization out-of-core.

1.3. Contributions of present work.
This article describes two different algorithms for computing rank-revealing factorizations of matrices stored out-of-core. The first, HQRRP_left, is based on the HQRRP algorithm [29], which uses randomization techniques to build a fully blocked column pivoted QR factorization. HQRRP relies largely on matrix-matrix operations, reducing the number of reads and writes that must occur between the external device and main memory (RAM). HQRRP_left further reduces the number of write operations required by redesigning HQRRP as a left-looking algorithm. Numerical experiments reveal that this reduction in the writing time is critical to the algorithm's performance, particularly when the data is stored on a spinning disk. (The HQRRP method is closely related to the technique in [9].)

The second contribution of this article is an out-of-core implementation of the randUTV algorithm [28], which uses randomization to build a rank-revealing UTV factorization (see, e.g., [40, 38] for a good introduction to this factorization). The out-of-core implementation, randUTV_AB, modifies randUTV in such a way as to achieve overlap of communication and floating point operations. The result, demonstrated with numerical experiments, is that randUTV_AB suffers only a small extra cost from storing the matrix on an external device with respect to the same factorization of a matrix stored in RAM.

The new techniques presented enable the processing of very large matrices at modest cost. To illustrate, at the time of writing in 2019, a personal desktop with a 2 TB high speed SSD (like the one in the machine "ut" in Section 4.2.1) could be had for a couple of thousand dollars, whereas a workstation with an equivalent amount of RAM would be an order of magnitude more expensive. Since solid state storage technology is rapidly getting both cheaper and faster, we expect demand for out-of-core algorithms to continue to increase.
1.4. Assessing the quality of a rank-revealing factorization.
The term "rank-revealing factorization" has been used with slightly different meanings in the literature [6, 38, 4, 5]; our requirement that the approximation error ‖A − A_k‖ be close to optimal is invariably a part of the requirement, but in more theoretical papers the statement is often made precise by bounding how rapidly the discrepancy is allowed to increase as the matrix dimensions tend to infinity. The focus of the current paper is how to efficiently implement algorithms that have already been published. The precision at which they reveal rank has been carefully studied: the randomized CPQR was analyzed in [9], with additional numerical results presented in [29]. The conclusion of these papers was that the randomized pivoting reveals the numerical rank to roughly the same precision as classical pivoting. The randomized UTV factorization was studied in [28], with additional results reported in [19]. The precision of randUTV depends on a tuning parameter q that controls the number of power iterations that are taken at each step. Higher q results in better precision, but a slower execution time. When no powering is done (q = 0), randUTV is about as good at revealing the rank as CPQR, with randUTV often having a slight edge. As q is increased to one or two, the precision very rapidly improves, and often gets to within striking distance of the optimal precision of the SVD. Moreover, the diagonal entries of the middle factor T often provide excellent approximations to the true singular values of the matrix.

1.5. Outline of paper.
In Section 2, we discuss the first out-of-core factorization, a rank-revealing QR factorization HQRRP which can be stopped early to yield a partial decomposition. Section 3 explores an out-of-core implementation of randUTV, an efficient algorithm for obtaining a rank-revealing orthogonal matrix decomposition. In Section 4 we present numerical experiments which demonstrate the performance of both algorithms. Finally, Section 5 contains the main conclusions of our work.

2. PARTIAL RANK-REVEALING QR FACTORIZATION
In this section we introduce an out-of-core implementation of a rank-revealing column pivoted QR factorization. In subsections 2.1 and 2.2, we review a fully blocked algorithm for computing a column pivoted QR factorization, HQRRP. In subsection 2.3, we discuss modifications of HQRRP which enhance its efficiency when the matrix is stored out-of-core.

2.1. Overview of HQRRP.
Consider an input matrix A ∈ R^{m×n} with m ≥ n. HQRRP [29] is a blocked algorithm for computing a column-pivoted QR factorization

A = Q R P*,

where Q is an m × m orthogonal matrix, R is an m × n upper trapezoidal matrix, and P is an n × n permutation matrix. The bulk of the algorithm's work is executed in a loop with ⌈n/b⌉ iterations, where 1 ≤ b ≤ n is the block size parameter. For notational simplicity, for the remaining discussion it is assumed that n is a multiple of b, so that n = bp for some p ∈ N. At the start of the HQRRP algorithm are the following initializations:

R^(0) = A,   Q^(0) = I,   P^(0) = I.

During the i-th iteration, matrices Q^(i) and P^(i) are constructed such that

R^(i) = (Q^(i))* R^(i−1) P^(i),

where R^(i)(:, 1:ib) is upper trapezoidal. After p steps, R^(p) is upper trapezoidal, and the final factorization can be written as

R = R^(p),   P = P^(0) P^(1) ··· P^(p),   Q = Q^(0) Q^(1) ··· Q^(p).

2.2. Choosing the orthogonal matrices.
At the i-th step of HQRRP, consider the partitioning of R^(i)

R^(i) → [ R11^(i)  R12^(i) ;  0  R22^(i) ],

where R11^(i) is ib × ib. The permutation matrix P^(i) is chosen to find b columns of R22^(i−1) with a large spanning volume, in the sense that the spanning volume is close to maximal. We can find such a selection by projecting the columns of R22^(i−1) into a lower dimensional space and cheaply finding b good pivot columns there. The steps for computing P^(i) are thus as follows:

1. Draw a random matrix G^(i) ∈ R^{b × (m−ib)}, with i.i.d. entries drawn from the standard normal distribution.
2. Compute Y^(i) = G^(i) R22^(i−1).
3. Compute b steps of the traditional CPQR of Y^(i) to obtain Y^(i) P22^(i) = W_trash S_trash.
4. Set P^(i) = [ I  0 ;  0  P22^(i) ].

This method for selecting multiple pivot columns has shown itself to be effective and reliable, consistently producing factorizations that reveal the rank as well as traditional CPQR [29].
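For concreteness, the four steps above can be sketched as follows (a minimal NumPy/SciPy rendering under the assumption that the trailing block R22 fits in memory; the name select_pivot_block is introduced here purely for illustration):

```python
import numpy as np
from scipy.linalg import qr

def select_pivot_block(R22, b, rng=None):
    """Choose b pivot columns of the trailing block R22 (steps 1-4 above)."""
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((b, R22.shape[0]))  # step 1: b x (m - ib) Gaussian matrix
    Y = G @ R22                                 # step 2: sketch the columns of R22
    # Step 3: column-pivoted QR of the small matrix Y; since Y has only b rows,
    # running the full CPQR is cheap, so we do not stop after b steps.
    _, _, piv = qr(Y, pivoting=True)
    return piv[:b]                              # step 4: the leading b pivots define P22
```

The returned indices would then be moved to the front of the trailing block before the unpivoted QR step that builds Q^(i).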
Remark 1.
There is an alternate "downdating" method for computing Y^(i) during each step that reduces the asymptotic flop count of the HQRRP algorithm [9]. With this technique, HQRRP has the same asymptotic flop count as the unpivoted QR factorization and the CPQR; the reader may see [29] for details. However, this downdating method will not be used in this article's primary implementation, as the communication restrictions it imposes make the basic scheme discussed in this section more practical.

Once P^(i) has been computed, Q^(i) is built with well-established steps:

1. Perform an unpivoted QR factorization of the first b columns of R22^(i−1) P22^(i) to obtain (R22^(i−1) P22^(i))(:, 1:b) = Q22^(i) R22^(i)(:, 1:b).
2. Set Q^(i) = [ I  0 ;  0  Q22^(i) ].

2.3. Executing HQRRP out-of-core.
When the matrix to be factorized is large enough that it must be stored out-of-core, communication (I/O) costs become a major concern. While it is desirable to minimize all forms of communication, writing to an external device is typically more expensive than reading from it. In linear algebra, in every row, column, or block iteration, right-looking algorithms factorize the current row, column, or block and then update the rest of the matrix, which usually requires an overall cost of O(n^3) writes. In contrast, left-looking algorithms apply all the previous transformations to the current row, column, or block and then factorize it, which usually requires an overall cost of only O(n^2) writes. When working in main memory, the performance of the two variants differs only slightly, but when working with matrices stored on an external device with slow writes, left-looking algorithms deliver higher performance.

The original HQRRP was a classical right-looking algorithm, and therefore it performed many write operations. We have designed a left-looking variation of the original HQRRP algorithm, so that the number of write operations is much smaller.
2.3.1. Left-looking algorithms.
Several standard matrix factorizations have been re-ordered as left-looking algorithms for out-of-core computations, largely with the goal of reducing certain I/O operations [35, 36, 41, 24].

As our new algorithm employs the QR factorization, a high level description of the left-looking algorithm for computing the QR factorization follows. Let A ∈ R^{n×n} and let b be a block size with 1 ≤ b ≤ n; for simplicity let n = bp, where b, p ∈ N. Then for i = 0, 1, ..., p − 1, the algorithm for computing the QR factorization is described by the following steps (indices start at zero; a code sketch is given after this list):

(1) A_col ← A(:, ib : (i+1)b − 1).
(2) A_col ← Q* A_col.
(3) Compute the unpivoted QR factorization of A_col(ib : n−1, :), yielding Q_i, R_i.
(4) A(0 : ib−1, ib : (i+1)b − 1) ← A_col(0 : ib−1, :).
(5) A(ib : n−1, ib : (i+1)b − 1) ← R_i.
(6) Q ← Q [ I  0 ;  0  Q_i ].

In this algorithm, Q is initialized as the identity matrix I, and after the algorithm has finished, A is overwritten with the upper triangular matrix R. In practice, Q is never formed; instead, the Householder vectors are stored in the lower triangular portion of R as it is computed. This overview omits other computational details as well, but it clearly highlights the fact that only the column block A(:, ib : (i+1)b − 1) is updated in every iteration. In contrast, the traditional right-looking algorithm updates the bottom-right block A(ib : n−1, ib : n−1) during each iteration of the loop, thus requiring much more writing to disk.
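The following sketch is a minimal in-memory NumPy rendering of steps (1)-(6); for simplicity it forms Q explicitly rather than storing Householder vectors, so it illustrates only the data access pattern, not the storage scheme:

```python
import numpy as np

def left_looking_qr(A, b):
    """Left-looking blocked QR. On return, A holds R, and Q is orthogonal
    with Q @ R equal to the original A. Note that each pass writes only
    the current column block of A."""
    n = A.shape[0]
    Q = np.eye(n)
    for i in range(n // b):
        j = i * b
        cols = slice(j, j + b)
        A_col = Q.T @ A[:, cols]                              # step (2)
        Qi, Ri = np.linalg.qr(A_col[j:, :], mode="complete")  # step (3)
        A[:j, cols] = A_col[:j, :]                            # step (4)
        A[j:, cols] = Ri                                      # step (5)
        Q[:, j:] = Q[:, j:] @ Qi                              # step (6)
    return Q

A0 = np.random.standard_normal((8, 8))
A = A0.copy()
Q = left_looking_qr(A, b=2)
assert np.allclose(Q @ A, A0)   # A now holds R, and Q R reproduces the input
```

A right-looking variant would instead update the entire trailing submatrix in every pass, which is what makes it write so much more data when A lives on disk.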
2.3.2. HQRRP as a left-looking algorithm. HQRRP as a left-looking algorithm follows the pattern of the unpivoted factorization discussed in Section 2.3.1, with the added complication of choosing and applying the permutation matrix P. We select the permutations using precisely the same procedure as in Section 2.2. Observe that the projection procedure used in the selection of P requires that the right-most columns of A be updated according to the rotations encoded in Q during each iteration. We must therefore choose whether to write out the updated columns of A to disk or to repeat some of the arithmetic operations during each iteration. Since the main idea of a left-looking algorithm is to avoid writing to the right-most columns of A in every iteration, with current technologies we must accept the repetition of operations as part of HQRRP_left. Numerical experiments indicated that the I/O-saving benefits of HQRRP_left easily outweigh the costs of the extra flops as compared to the original right-looking HQRRP.

3. FULL RANK-REVEALING ORTHOGONAL FACTORIZATION
In this section we introduce an efficient implementation of a rank-revealing orthogonal decomposition for a matrix stored out-of-core. In subsection 3.1, we review an efficient algorithm, randUTV, for building such a decomposition when the matrices are stored in main memory. In subsection 3.2, we discuss some modifications of randUTV that optimize its efficiency in the out-of-core setting.

3.1. Overview of randUTV.
Let A ∈ R^{m×n} with m ≥ n. The randUTV algorithm [28] builds a rank-revealing UTV factorization of A, that is, a decomposition

A = U T V*,

where U is an m × m orthogonal matrix, V is an n × n orthogonal matrix, and T is an m × n upper triangular matrix. The randUTV algorithm is blocked, so it proceeds by first choosing a block size b with 1 ≤ b ≤ n and performing its work inside a loop with ⌈n/b⌉ iterations. For ease of notation, it is assumed that n = bp for b, p ∈ N. At the beginning, we initialize

T^(0) = A,   U^(0) = I,   V^(0) = I.

For the matrix T^(i) (and analogously for the matrices U and V), the following partitioning will be employed:

T^(i) → [ T11^(i)  T12^(i) ;  T21^(i)  T22^(i) ],

where T11^(i) is ib × ib. At the i-th iteration, for i = 1, ..., p, we form matrices T^(i), U^(i), and V^(i) as follows:

1. Construct an orthogonal matrix V̂22^(i) such that its leading b columns span approximately the same subspace as the leading b right singular vectors of T22^(i−1).
2. Compute the unpivoted QR factorization of T22^(i−1) V̂22^(i)(:, 1:b) to obtain T22^(i−1) V̂22^(i)(:, 1:b) = Û22^(i) R.
3. Compute the SVD of (Û22^(i)(:, 1:b))* T22^(i−1) V̂22^(i)(:, 1:b), yielding

   (Û22^(i)(:, 1:b))* T22^(i−1) V̂22^(i)(:, 1:b) = U_SVD D V*_SVD.

4. Calculate V^(i) with

   V^(i) = [ I  0 ;  0  V̂22^(i) ] [ I  0  0 ;  0  V_SVD  0 ;  0  0  I ].

5. Calculate U^(i) with

   U^(i) = [ I  0 ;  0  Û22^(i) ] [ I  0  0 ;  0  U_SVD  0 ;  0  0  I ].

6. Calculate T^(i) with T^(i) = (U^(i))* T^(i−1) V^(i).

Once all ⌈n/b⌉ iterations have completed, we may compute the final factors with

T = T^(p),   U = U^(0) U^(1) ··· U^(p),   V = V^(0) V^(1) ··· V^(p).

This leaves only the question of how the matrix V̂22^(i) is formed in step 1 above. The method, inspired by work in randomized linear algebra including [37, 18, 30], is the following:

1. Draw a random matrix G^(i) ∈ R^{(m−ib) × b}, with i.i.d. entries drawn from the standard normal distribution.
2. Compute the unpivoted QR factorization of ((T22^(i−1))* T22^(i−1))^q (T22^(i−1))* G^(i), where q is some small nonnegative integer, typically less than three. The result is

   ((T22^(i−1))* T22^(i−1))^q (T22^(i−1))* G^(i) = V̂22^(i) R_trash.

This simple algorithm has been demonstrated to consistently provide high-quality approximations to the subspace spanned by the leading b right singular vectors of T22^(i−1). It is largely for this reason that randUTV reveals rank comparably to the SVD. For details on randUTV, see [28].
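To make steps 1-6 concrete, here is a minimal NumPy sketch of a single randUTV iteration acting on the trailing block T22 (in-memory and dense; it illustrates the mathematics, not the out-of-core organization discussed below):

```python
import numpy as np

def randutv_step(T22, b, q=1, rng=None):
    """One randUTV iteration on the trailing block T22. Returns the updated
    block and the orthogonal factors V22 and U22 for this step."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = T22.shape
    Y = T22.T @ rng.standard_normal((m, b))     # sample the row space of T22
    for _ in range(q):                          # optional power iterations
        Y = T22.T @ (T22 @ Y)
    V22, _ = np.linalg.qr(Y, mode="complete")                 # step 1
    U22, _ = np.linalg.qr(T22 @ V22[:, :b], mode="complete")  # step 2
    T22 = U22.T @ T22 @ V22                     # step 6, restricted to the block
    Us, s, Vst = np.linalg.svd(T22[:b, :b])     # step 3: small b x b SVD
    T22[:b, :] = Us.T @ T22[:b, :]              # fold U_SVD into T ...
    T22[:, :b] = T22[:, :b] @ Vst.T             # ... and V_SVD likewise
    U22[:, :b] = U22[:, :b] @ Us                # steps 4-5: update the factors
    V22[:, :b] = V22[:, :b] @ Vst.T
    return T22, U22, V22
```

Applying this routine to successively smaller trailing blocks, and accumulating the returned factors into U and V, yields the full factorization; with q = 1 or q = 2 the diagonal of T typically approximates the singular values closely.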
3.2. Executing randUTV out-of-core: an algorithm-by-blocks. For matrices so large that they do not fit in RAM, randUTV requires significant management of I/O tasks. If the orthogonal matrices U and V are required, then these must be stored out-of-core as well. To implement randUTV efficiently under these constraints, it is helpful to reorganize the algorithm as an algorithm-by-blocks. Like blocked algorithms, algorithms-by-blocks seek to cast most of the flops in a factorization in terms of the highly efficient Level 3 BLAS (Basic Linear Algebra Subprograms). Unlike blocked algorithms, algorithms-by-blocks take maximum advantage of the full main memory by making all the transferred data blocks the same size, which in addition makes it easier to overlap communication and computation. All these advantages make an algorithm-by-blocks more efficient. In the following sections, we present the core technologies behind the design and implementation of randUTV as an algorithm-by-blocks.

3.2.1. Algorithms-by-blocks: an overview.
When working with matrices stored in RAM, blocked algorithms can improve performance by processing multiple columns (or rows) of the input matrix A in each iteration of the main loop. For instance, some classical factorizations drive several columns to upper triangular form in each iteration of the main loop. This design allows most of the operations to be cast in terms of the Level 3 BLAS (matrix-matrix operations), and more specifically in terms of xgemm operations (matrix-matrix products). As vendor-provided and open-source multithreaded implementations of the Level 3 BLAS are highly efficient (with performance close to the peak speed), blocked algorithms usually offer high performance. Processing one column (or one row) at a time would instead require the slower matrix-vector operations (Level 2 BLAS) and much more communication. Thus, a blocked implementation of randUTV relying largely on standard calls to the parallel BLAS was found to be faster than the highly optimized MKL CPQR implementation on a shared memory system, despite randUTV having a much higher flop count than the CPQR algorithm [28].

On the other hand, in blocked algorithms the amount of data processed in each iteration varies greatly. Usually, as the factorization advances, every call to the parallel BLAS must handle an increasing (or decreasing) amount of data. For instance, to factorize an n × n matrix with block size b, the first iteration of a right-looking algorithm usually requires processing the full matrix, whereas the last iteration requires processing only a very small amount of data (in some cases a b × b block). When the data is stored on a slow external device, this extremely high variation in the data being transferred harms performance, since external devices work optimally only for certain transfer sizes. Moreover, this high variation in the data being transferred and processed makes an optimal scheduling of I/O operations and computational operations much more difficult, since the costs of the two kinds of operations diverge (the I/O cost is usually O(n^2), whereas the computational cost is usually O(n^3)). In addition, this high variation also makes poor use of main memory either at the beginning or at the end of the factorization.

We are therefore led to seek a technique other than blocking to obtain higher performance, although we will not abandon the strategy of casting most operations in terms of the Level 3 BLAS. The key lies in changing the method with which we aggregate multiple lower-level BLAS flops into a single Level 3 BLAS operation. Blocked algorithms do this by raising the granularity of the algorithm's main loop. The alternative approach, called algorithms-by-blocks, is to instead raise the granularity of the data. With this method, the algorithm may be designed as if only scalar elements of the input were dealt with at one time. Then, the algorithm is transformed into Level 3 BLAS by conceiving of each scalar as a submatrix or block of size b × b. Each scalar operation turns into a matrix-matrix operation, and the operations in the algorithm will, at the finest level of detail, operate on a few (between one and four, but usually two or three) b × b blocks. Each operation on a few blocks is called a task. This arrangement removes the problems of blocked algorithms, since every task works with a similar amount of data, main memory can be employed as a cache that stores blocks, and overlapping computation and communication becomes much easier and more efficient.

The performance benefits obtained by algorithms-by-blocks with respect to blocked algorithms for linear algebra problems on shared-memory architectures with data stored in RAM are usually significant [27] when more than a few cores are employed, because they remove the thread synchronization points that blocked algorithms insert after every call to the parallel BLAS. On the other hand, when the data is stored on an external device, a runtime [32] to execute algorithms-by-blocks was described, but it was only applied to the following basic factorizations: Cholesky, LU with incremental pivoting, and unpivoted incremental QR. In our work we have built an algorithm-by-blocks for computing the much more complex randUTV factorization on large matrices stored on an external device by extending the mentioned approach and runtime. Despite the original randUTV factorization being much more difficult and complex than those basic factorizations, our approach to revealing the rank of matrices stored on an external device has obtained good performance when compared to high-performance algorithms revealing the rank of matrices stored in RAM, thus making the revealing of the rank of very large matrices feasible.

An algorithm-by-blocks for computing randUTV requires that the QR factorizations performed inside it also work on b × b blocks. In order to design this internal QR factorization process such that each unit of work requires only b × b submatrices, the algorithm-by-blocks for computing the QR factorization must employ an algorithm based on updating an existing QR factorization. We shall refer to this algorithm as QR_AB. We consider only the part of QR_AB that makes the first column of blocks upper triangular, since that is all that is required for randUTV_AB.

Figure 1 shows this process for a 9 × 9 matrix with block size 3. In this figure, the continuous lines surround the 3 × 3 blocks of the matrix involved in the current task, the '•' symbol represents an element not modified by the current task, the '⋆' symbol represents an element modified by the current task, and the '·' symbol represents an element nullified by the current task. The nullified elements are shown because, as is usual in linear algebra factorizations, they store information about the Householder transformations that will later be used to apply these transformations. The first task, called Compute_QR, computes the QR factorization of the leading dense block A11, thus nullifying all the elements below the main diagonal and modifying all the elements on or above the main diagonal. The second task, called Apply_left_Qt_of_dense_QR, applies the Householder transformations obtained in the previous task (and stored in A11) to block A12. The third task performs the same operation onto A13. The fourth task annihilates block A21 and is called Compute_td_QR (where 'td' means triangular-dense, since the upper block A11 is triangular and the lower block A21 is dense). The fifth task, called Apply_left_Qt_of_td_QR, applies the transformations of the previous task to blocks A12 and A22. The sixth task performs the same operation onto A13 and A23. Analogously, the seventh, eighth, and ninth tasks update the first and third rows of blocks by performing the same work as the fourth, fifth, and sixth tasks. By taking advantage of the zeros present in the factorizations at each iteration, a well-implemented QR_AB costs essentially no more flops than the traditional blocked unpivoted QR. The algorithm is described in greater detail in [33, 3].
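For illustration, the two task types just described can be sketched as follows (a minimal NumPy version that forms the orthogonal factor explicitly; the real kernels work with compact Householder representations and never form Q):

```python
import numpy as np

def compute_td_qr(R_top, A_bottom):
    """Compute_td_QR: annihilate the dense block A_bottom sitting below the
    triangular block R_top, i.e. [R_top; A_bottom] = Q [R_new; 0]."""
    b = R_top.shape[0]
    stacked = np.vstack([R_top, A_bottom])   # triangular-dense ('td') pair
    Q, R = np.linalg.qr(stacked, mode="complete")
    return Q, R[:b, :]

def apply_left_qt_of_td_qr(Q, R_right, A_right):
    """Apply_left_Qt_of_td_QR: apply Q^T from the task above to the pair of
    blocks to the right of the annihilated one."""
    b = R_right.shape[0]
    updated = Q.T @ np.vstack([R_right, A_right])
    return updated[:b, :], updated[b:, :]
```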
FIGURE 1. An illustration of the first tasks performed by an algorithm-by-blocks for computing the QR factorization. The '•' symbol represents an element not modified by the current task, '⋆' represents an element modified by the current task, and '·' represents an element nullified by the current task (nullified elements are shown because they store information about the Householder transformations that will later be used to apply them). The continuous lines surround the blocks involved in the current task. [The nine panels of the original figure show a 9 × 9 matrix after each task: (1) Compute_QR(A11); (2) Apply_left_Qt_of_dense_QR(A11, A12); (3) Apply_left_Qt_of_dense_QR(A11, A13); (4) Compute_td_QR(A11, A21); (5) Apply_left_Qt_of_td_QR(A11, A21, A12, A22); (6) Apply_left_Qt_of_td_QR(A11, A21, A13, A23); (7) Compute_td_QR(A11, A31); (8) Apply_left_Qt_of_td_QR(A11, A31, A12, A32); (9) Apply_left_Qt_of_td_QR(A11, A31, A13, A33).]

3.2.2. Algorithms-by-blocks for randUTV. An algorithm-by-blocks for randUTV, which we will call randUTV_AB, performs mostly the same operations as the original, but they are rearranged as many more tasks working on usually square blocks (except possibly the right-most and bottom-most blocks). We will discuss in some detail how this plays out in the first step of the algorithm. First, choose a block size b. (When working on matrices stored in main memory, small block sizes such as b = 128 usually work well. In contrast, when working on very large matrices stored on an external device, much larger block sizes such as b = 10,240 must be employed.) For simplicity, assume b divides both m and n evenly. Recall that at the beginning of randUTV, T is initialized with T := A. Consider the following partitioning of the matrix T:

T → [ T11  T12  ···  T1N ;  T21  T22  ···  T2N ;  ⋮ ;  TM1  TM2  ···  TMN ],

where each submatrix or block Tij is b × b, N = n/b, and M = m/b. Note that the rest of the matrices (G, Y, U, and V) must also be partitioned accordingly. The submatrices Tij (and those of the rest of the matrices) are treated as the fundamental unit of data in the algorithm, so that each operation is expressed only in these terms. For the first step of the algorithm when no orthonormal matrices are built, for instance:

(1) Constructing V^(0): The first step, Y^(0) = (T* T)^q T* G^(0), is broken into several tasks, each one of which computes the product of two blocks (see the sketch after this list). In the simplified case where q = 0, we have M × N products of two blocks. The second step, the QR factorization of Y^(0), uses the QR_AB algorithm previously described. Thus, the decomposition of each Y^(0)_i is computed separately, and the resulting upper triangular factor R^(0) (stored in Y^(0)_1) is updated after each step. Then, the matrix T is updated with the transformations obtained in the factorization of Y^(0), that is, T ← T V^(0). See, e.g., [16, 32, 33] for details on an approach to a full unpivoted QR factorization.
(2) Constructing U^(0): This step requires an unpivoted QR factorization of the first column block of T. An algorithm similar to the previous one, QR_AB, is employed, the main difference being that the transformations are applied from the left in this case. Then, the rest of the matrix T must be updated with these transformations, that is, T ← (U^(0))* T.

(3) Computing the SVD of T11: This step is the same in randUTV and randUTV_AB; in both cases, T11 is interacted with as a single unit. Then, the matrix T must be properly updated with the transformations obtained in this step.
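Each product task in step (1) touches only a few b × b blocks. The following toy sketch (plain NumPy; the helper names are introduced here purely for illustration) shows how a product such as Y = T* G decomposes into the gemm_tn_oz (initialize) and gemm_tn_oo (accumulate) tasks that appear later in Figure 3:

```python
import numpy as np

def as_blocks(A, b):
    """View a matrix as a grid of b x b blocks (each entry is a NumPy view)."""
    M, N = A.shape[0] // b, A.shape[1] // b
    return [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(N)] for i in range(M)]

def gemm_tn_by_blocks(T, G, b):
    """Y = T^T G expressed as independent tasks on b x b blocks."""
    Tb, Gb = as_blocks(T, b), as_blocks(G, b)
    M, N, P = len(Tb), len(Tb[0]), len(Gb[0])
    Y = np.zeros((N * b, P * b))
    Yb = as_blocks(Y, b)   # views into Y, so each task updates Y in place
    for j in range(N):
        for p in range(P):
            for i in range(M):   # each iteration of this loop is one task
                Yb[j][p] += Tb[i][j].T @ Gb[i][p]
    return Y
```

In the out-of-core runtime, each such task becomes one entry in the task queue, with its input and output blocks recorded as in Figure 3.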
3.2.3. The FLAME abstraction for implementing an algorithm-by-blocks. A nontrivial obstacle to implementing an algorithm-by-blocks is the issue of programmability. The FLAME (Formal Linear Algebra Methods Environment) project [14, 20] helps to solve this issue. FLAME is a framework for designing linear algebra algorithms in which the input matrix is viewed as a collection of submatrices, and loops are based on repartitionings of the input data. The FLAME API [1] for the C programming language enables a user to code high-performance implementations of linear algebra algorithms at a high level of abstraction. Moreover, this methodology makes FLAME a natural fit for use with an algorithm-by-blocks. Thus, the actual code for the implementation of randUTV_AB looks very similar to the written version of the algorithm given in Figure 2.
3.2.4. Scheduling and dispatching the operations for an algorithm-by-blocks.
The runtime described in [32] makes it possible to execute algorithms-by-blocks for single factorizations that work on data stored on an external device. We have employed and extended that runtime to compute the randUTV factorization; the resulting algorithm is randUTV_AB. To understand how this out-of-core runtime works, consider the problem of factorizing a matrix of 2 × 2 blocks

A → [ A11  A12 ;  A21  A22 ],

where each block is of size b × b. We will consider the case where the power iteration parameter is q = 0, for simplicity. The execution of the program proceeds in two phases: the analysis stage and the execution stage. In the first stage (analysis), instead of executing the code sequentially, the runtime builds a list or queue of tasks, recording the operands associated with each task or operation. Figure 3 shows an example of the list built up by the runtime for randUTV_AB for the case A ∈ R^{n×n}, b = n/2, and q = 0. The S factors obtained in the QR factorization of the Y blocks are stored in blocks called B, whereas the S factors obtained in the factorization of the blocks of the current column block of A are stored in blocks called C.

In the second stage (execution), the runtime executes (or dispatches) all the tasks in the list. Several methods can be employed to execute these tasks; a Python sketch of the block cache and of the two stages is given after the following list.

• Traditional method.
This is a straightforward implementation. For each task, all the input operands are read from the external device, then the task is executed with its operands stored in RAM, and finally all the output operands are written to the external device. The main advantage is its simplicity, but it obviously performs many I/O operations. Figure 4 illustrates the execution of the first six tasks of randUTV_AB for a matrix A ∈ R^{n×n} with block size b = n/2.

• Method with Cache.
Since the work to be done has been decomposed into many "small" tasks, a fast method to accelerate performance is to store as many blocks as possible in main memory. By efficiently using a part of the main memory as a cache of the blocks stored on the external device, the number of I/O operations can be reduced. Since most blocks are of the same size, the management of the cache is very efficient. The only blocks of a different (smaller) size are those that store the S factors of the unpivoted QR factorizations, but all blocks are treated in the same way to simplify the programming and accelerate the execution. When the cache is full of blocks and a new block must be loaded, the least-recently used (LRU) block is selected for eviction. If that block has been modified while residing in RAM, it must be written to the external device; otherwise, it can simply be discarded and overwritten with the new block. Figure 5 illustrates the execution of the first six tasks of randUTV_AB for a matrix A ∈ R^{n×n} with block size b = n/2 and a cache with 7 blocks. To reduce the length of the example but still show the benefits of this approach, the matrix is small and the cache of blocks is rather large. Note that when the cache is full, a block must be replaced; when a new block of A must be loaded, the least-recently used block is selected, and as it has not been modified, it need not be written to disk.
Algorithm: [U, T, V] := RANDUTV_AB(A, q, nb)
    V := EYE(n(A), n(A));  U := EYE(m(A), m(A))
    Partition A → [ ATL  ATR ; ABL  ABR ], V → ( VL  VR ), U → ( UL  UR )
        where ATL is 0 × 0, VL has 0 columns, UL has 0 columns
    while m(ATL) < m(A) do
        Determine block size b = min(nb, n(ABR))
        Repartition
            [ ATL  ATR ; ABL  ABR ] → [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ],
            ( VL  VR ) → ( V0 V1 V2 ),  ( UL  UR ) → ( U0 U1 U2 )
            where A11 is b × b, V1 has b columns, U1 has b columns
        % Right transform (V)
        G := GENERATE_IID_STDNORM_MATRIX(m(A) − m(A00), nb)
        Y := ( [ A11 A12 ; A21 A22 ]* [ A11 A12 ; A21 A22 ] )^q [ A11 A12 ; A21 A22 ]* G
        [Y, TV] := UNPIVOTED_QR(Y)
        [ A11 A12 ; A21 A22 ] := [ A11 A12 ; A21 A22 ] − [ A11 A12 ; A21 A22 ] WV TV WV*
        ( V1 V2 ) := ( V1 V2 ) − ( V1 V2 ) WV TV WV*
        % Left transform (U)
        [ [ A11 ; A21 ], TU ] := UNPIVOTED_QR( [ A11 ; A21 ] )
        [ A12 ; A22 ] := [ A12 ; A22 ] − WU TU* WU* [ A12 ; A22 ]
        ( U1 U2 ) := ( U1 U2 ) − ( U1 U2 ) WU TU WU*
        % Small SVD
        [A11, USVD, VSVD] := SVD(A11)
        A01 := A01 VSVD
        A12 := USVD* A12
        V1 := V1 VSVD
        U1 := U1 USVD
        Continue with
            [ ATL  ATR ; ABL  ABR ] ← [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ],
            ( VL  VR ) ← ( V0 V1 V2 ),  ( UL  UR ) ← ( U0 U1 U2 )
    endwhile

FIGURE 2. The randUTV algorithm adapted for algorithms-by-blocks, written with the FLAME methodology/notation. In this algorithm, WV and WU are the unit lower trapezoidal matrices stored below the diagonal of Y and [ A11 ; A21 ], respectively.

• Method with Cache + Overlapping of computation and communication.
Though the use of a cache allows the reuse of blocks currently stored in main memory, when a block is not already in main memory, it must be read from the slow external device. An additional problem arises if the cache is full and the block selected for replacement has been modified, because it must be written to the external device before the new block is loaded. During all this time the cores cannot compute, and therefore overall performance drops. One way to avoid this is to use one core to perform all the I/O operations (communications) while the other cores compute. The disadvantage of this method is that the computations will be slower than in the two previous methods, since they employ one fewer core. However, the communications can be made transparent (or at least be reduced), since I/O operations are performed at the same time as the computation.

The I/O thread and the rest of the threads (the computational threads) are completely decoupled. The I/O thread works by bringing data into the cache of blocks in main memory in advance, and sometimes it must remove data that is currently in the cache to make room for newer data. This decoupling makes the programming more difficult, but it accelerates the execution, since the relative speeds of the cores and of the external devices can be very different between computers (or even between different external devices in the same computer).

Figure 6 illustrates the execution with overlapping of computation and communication of the first six tasks of randUTV_AB for a matrix A ∈ R^{n×n} with block size b = n/2 and a cache with 7 blocks. In this case the execution of tasks is not clearly delimited, since the execution of different tasks can overlap: for instance, reading or writing the operands of one task can be executed at the same time as the computation of another task. For ease of notation, the figure shows each I/O operation taking exactly as long as a computational task. In practice, the I/O operations and the computational tasks are decoupled.
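The following is a minimal Python sketch of the analysis and execution stages together with the LRU block cache (hypothetical: the block names, the one-file-per-block .npy storage, and the task-tuple format are introduced here for illustration only; the actual runtime is written in C, stores the smaller S-factor blocks the same way, and uses a dedicated I/O thread for the overlapped variant):

```python
import collections
import numpy as np

class BlockCache:
    """LRU cache of b x b blocks stored out-of-core, one .npy file per block."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = collections.OrderedDict()   # name -> [array, dirty-flag]

    def load(self, name):
        if name in self.blocks:
            self.blocks.move_to_end(name)          # mark as most recently used
        else:
            if len(self.blocks) >= self.capacity:  # evict the least-recently used
                victim, (arr, dirty) = self.blocks.popitem(last=False)
                if dirty:                          # only modified blocks are written
                    np.save(victim + ".npy", arr)
            self.blocks[name] = [np.load(name + ".npy"), False]
        return self.blocks[name][0]

    def store(self, name, arr):
        self.blocks[name] = [arr, True]            # dirty: written back on eviction
        self.blocks.move_to_end(name)

# Toy set-up: write the input blocks of A and G to disk.
b = 256
for name in ("A11", "A21", "G1", "G2"):
    np.save(name + ".npy", np.random.standard_normal((b, b)))

# Analysis stage: record tasks (kind, inputs, output) instead of executing them.
tasks = [("gemm_tn_oz", ("A11", "G1"), "Y1"),   # Y1 := A11^T G1
         ("gemm_tn_oo", ("A21", "G2"), "Y1")]   # Y1 := Y1 + A21^T G2

# Execution stage: dispatch the queued tasks, reading operands through the cache.
cache = BlockCache(capacity=7)
for kind, (a, g), out in tasks:
    A_blk, G_blk = cache.load(a), cache.load(g)
    if kind == "gemm_tn_oz":
        cache.store(out, A_blk.T @ G_blk)
    else:  # "gemm_tn_oo" accumulates onto the existing block
        cache.store(out, cache.load(out) + A_blk.T @ G_blk)
```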
4. NUMERICAL RESULTS

In this section, we present experiments demonstrating the scalability and computational costs of implementations of HQRRP and randUTV_AB for matrices stored out-of-core. In subsection 4.1, we compare several implementations of HQRRP with different strategies for handling the I/O. In subsection 4.2, we examine the computational cost of an implementation of randUTV.

4.1. Partial CPQR factorization with HQRRP. The machine used for these experiments had four DIMM memory modules with 16 GB of DDR4 memory each. The CPU was an Intel® Core™ i7-6700K (4.00 GHz) with four cores. Experiments were run on two different hard drives. One was a Toshiba P300 HDD with 3 TB of storage and 7200 RPM; the other was a Samsung V-NAND SSD 950 Pro with 512 GB of storage. The code was compiled with gcc (version 5.4.0) and linked with the Intel® MKL library (Version 2017.0.3).

The computational cost of three different implementations of HQRRP for matrices stored out-of-core was assessed.

• Algorithm 1 – In place: This implementation of HQRRP did not carry out physical permutations of the columns of A, but instead applied the permutation information during each I/O task. In other words, the routine returns RP* rather than R. As a result, no more data than necessary is transferred to and from the hard drive, but many reads occur from noncontiguous locations on the drive.

• Algorithm 2 – Physical pivoting: Permutations are physically performed during the computation, so that upon completion R is stored on the hard drive. This implementation reads and writes more data than the "in place" version, but the data transfer mostly occurs in contiguous portions of memory.

• Algorithm 3 – Left-looking: An implementation of a left-looking version of HQRRP, outlined in Section 2.3. This version requires only O(n^2) write operations, but the total number of operations (including reading, writing, and flops) is asymptotically higher than in the "in place" and "physical pivoting" implementations.
FIGURE 3. A list of the tasks or operations queued up by the runtime during the analysis stage in the simplified case that the block size is b = n/2. The "In" column specifies the pieces of required input data; the "Out" column specifies the pieces of data that will be altered upon completion of the operation. [The table itself is omitted here; it lists the Generate_normal_random, Gemm_tn_oz (C := A* B), Gemm_tn_oo (C := C + A* B), Comp_dense_QR, Comp_td_QR, Apply_right_Q_of_dense_QR, Apply_right_Q_td_QR, Apply_left_Qt_of_dense_QR, Apply_left_Qt_of_td_QR, Keep_upper_triang, Set_to_zero, Svd_of_block, Gemm_abta (A := B* A), and Gemm_aabt (A := A B*) tasks, each with its input and output blocks.]

Figure 7 shows the times of several factorizations scaled by the square of the matrix dimension. Two key observations can be made of the results depicted in this figure. The first observation is that the number of writes required by an implementation has a dramatic effect on its performance, especially when the matrix is stored on a hard disk drive. Algorithms 2 and 3, while performing reasonably well on the SSD, did not even scale correctly on the HDD, or at the very least did not reach asymptotic behavior as early as in the SSD experiments. Algorithm 3, which requires O(n^2) write operations rather than O(n^3), outperforms the right-looking alternatives on both hard drives for large matrices. The second observation is that the best-performing algorithm takes only a small factor longer than the predicted time for an in-core factorization of a matrix of the same size on the SSD, and a somewhat larger factor on the HDD.

4.2. Full factorization with randUTV. In this subsection, we assess the performance of our new implementations for computing the randUTV factorization when the data are stored on a disk drive, and compare it to the performance of highly optimized methods for computing the SVD, the column pivoted QR (CPQR) factorization, and the randUTV factorization when the data are stored in RAM.

To fairly compare the different implementations included in this study, the usual flop rate or flop count cannot be employed, since the computation of the SVD, the CPQR, and the randUTV factorizations require very different numbers of flops (the dominant n^3 term in the asymptotic flop count is very different). Absolute computational times are not shown either, as they vary greatly because of the huge range of matrix dimensions employed in the experiments. Therefore, scaled computational times (absolute computational times divided by n^3) are employed instead. The lower the scaled computational time, the better the performance. Since all the implementations being assessed have asymptotic complexity O(n^3) when applied to an n × n matrix, these graphs better reveal the computational speed.
FIGURE 4. Execution of the tasks using the traditional approach. The execution of the first six tasks requires 16 I/O operations (6 writes and 10 reads). Double horizontal lines mark the boundaries between consecutive tasks. [The table of the original figure lists, for each task, the reads of its input blocks, the computation, and the writes of its output blocks.]
FIGURE 5. Execution of the tasks using a cache with 7 blocks. The execution of the first six tasks requires only 4 I/O operations (all of them reads). Double horizontal lines mark the boundaries between consecutive tasks. [The table of the original figure lists each task together with the contents of the cache after the task.]
FIGURE 6. Execution of the tasks using a cache with 7 blocks and overlapping of computation and communication. The execution of the first six tasks requires only 4 I/O operations (all of them reads), and all of them are performed at the same time as the computation. No double horizontal lines are employed to mark the boundaries between consecutive tasks, since I/O operations and computations of different tasks can be executed simultaneously. [The table of the original figure lists each computational task, the concurrent I/O task, and the contents of the cache after the task.]
FIGURE 7. A comparison of the computational cost of three different algorithms for computing a partial CPQR when the matrix may be too large to fit in RAM. In the top figure, 500 columns of the input matrix were processed (k = 500); in the bottom figure, 1000 columns were processed (k = 1000). The block size b in each case was 250. Each panel plots time [s] divided by n^2 against the matrix dimension n for the variants: In Core; In place, SSD; In place, HDD; Physical pivoting, SSD; Physical pivoting, HDD; Left-looking, SSD; Left-looking, HDD.

The scaled times are multiplied by a constant to make the figures on the vertical axis more readable; the value of this constant is shown on the vertical axis. For the implementations of randUTV, both in-core and out-of-core, results are shown for 0, 1, and 2 iterations of the power iteration process (q = 0, q = 1, and q = 2, respectively). Recall that the higher q is, the better the approximations to the singular values obtained on the main diagonal of the matrix T.

The following out-of-core implementations were assessed in the experiments of this subsection:
• OOC-randUTV-T: The traditional implementation for computing the randUTV factorization of a matrix stored on disk by using an algorithm-by-blocks.

• OOC-randUTV-V: The implementation for computing the randUTV factorization of a matrix stored on disk by using an algorithm-by-blocks with a cache of blocks and overlapping of computation and I/O. Unlike the previous implementation, this one uses all the cores but one for computation and one core for I/O. Not all of the RAM in the computer is employed in the cache of matrix blocks; a large amount of memory is left available for the operating system, since some initial experiments supported this approach.

• OOC-QR-T: The traditional right-looking implementation for computing the QR factorization of a matrix stored on disk by using an algorithm-by-blocks.

• OOC-QR-V: The right-looking implementation for computing the QR factorization of a matrix stored on disk by using an algorithm-by-blocks with a cache of blocks and overlapping of computation and I/O. One core is employed for I/O and the rest for computations. Not all of the RAM in the computer is employed in the cache of matrix blocks.

Although the two methods for computing the QR factorization included in the experiments do not reveal the rank, they were included as a performance reference for the others. In order to include them in this study, we asked some authors to send us their out-of-core codes for revealing the rank (e.g., the SVD); unfortunately, no codes were made available to us.

Our aim is to factorize very large matrices that do not usually fit in RAM unless a very expensive main memory is available. However, as a performance reference for our out-of-core implementations, we have included the following in-core implementations:

• MKL SVD: The routine dgesvd from MKL's LAPACK was used to compute the singular value decomposition of matrices stored in RAM.

• MKL CPQR: The routine dgeqp3 from MKL's LAPACK was used to compute the column-pivoted QR factorization of matrices stored in RAM.

• randUTV_PBLAS (randUTV with parallel BLAS): The traditional implementation for computing the randUTV factorization of matrices stored in RAM, which relies on the parallel BLAS to take advantage of all the cores in the system. The parallel BLAS library from MKL was employed with these codes for the purpose of a fair comparison.

• randUTV_AB (randUTV with algorithm-by-blocks): The new implementation for computing the randUTV factorization by scheduling all the tasks to be computed in parallel, and then executing them with serial BLAS. The serial BLAS library from MKL was employed with these new codes for the purpose of a fair comparison.

• MKL QR: The routine dgeqrf from MKL's LAPACK was used to compute the QR factorization of matrices stored in RAM. Although this routine does not reveal the rank, it was included in some experiments as a performance reference for the others.

For most of the experiments, two plots are shown. The left plot shows the performance when no orthonormal matrices are computed: just the upper triangular factor R is computed for the CPQR and QR, just the upper triangular factor T is computed for randUTV, and just the singular values are computed for the SVD. In contrast, the right plot shows the performance when all orthonormal matrices are explicitly formed in addition to the previously mentioned factors: the matrix Q is computed for the CPQR and QR, and the matrices U and V are computed for randUTV and for the SVD. The right plot slightly favors the CPQR and QR, since only one orthonormal matrix is formed.

4.2.1. Experimental Setup.
The experiments reported in this section were performed on three computers, briefly described next.

• ua: This HP computer contained two Intel® Xeon® CPU X5560 processors at 2.8 GHz, with 12 cores and 48 GiB of RAM in total. Its OS was GNU/Linux (Version 3.10.0-514.21.1.el7.x86_64). Intel's icc compiler (Version 12.0.0 20101006) was employed. LAPACK and BLAS routines were taken from the Intel® Math Kernel Library (MKL) Version 10.3.0 Product Build 20100927 for Intel® 64 architecture, since this library usually delivers much higher performance than LAPACK and BLAS from the Netlib repository. The hard disk drive (HDD) employed in the experiments to store all the data in the out-of-core implementations was an HP MM0500EANCR (Firmware Revision HPG3). Though its capacity was 500 GB, only about 400 GB were available to users, which was about 8.3 times as large as the main memory. According to the Linux hdparm tool, the read speed of this drive was 91.13 MB/s.

• ut: This Dell computer contained two Intel® Xeon® Gold 6254 processors at 3.10 GHz, with 36 cores and 754 GiB of RAM in total. Its OS was GNU/Linux (Version 5.0.0-37-generic). Intel's icc compiler (Version 19.0.5.281 20190815) was employed. LAPACK and BLAS routines were taken from the Intel® Math Kernel Library (MKL) Version 2019.0.5 Product Build 20190808 for Intel® 64 architecture, for the same reason as before. The disk drive (SSD) employed in the experiments to store all the data in the out-of-core implementations was a Toshiba KXG50PNV2T04 NVMe (Firmware Revision AFDA4105) with a capacity of 2 TiB. According to the Linux hdparm tool, the read speed of this disk was 9,744.21 MB/s on cached reads and 2,410.66 MB/s on buffered disk reads.

• ucm: This Supermicro computer contained two Intel® Xeon® E5-2695 v3 processors at 2.30 GHz, with 28 cores and 125 GiB of RAM in total. On this computer the so-called Turbo Boost mode of the CPU was turned off in our experiments. Its OS was GNU/Linux (Version 2.6.32-504.el6.x86_64). Intel's icc compiler (Version 18.0.1 20171018) was employed. LAPACK and BLAS routines were taken from the Intel® Math Kernel Library (MKL) Version 2018.0.1 Product Build 20171007 for Intel® 64 architecture, for the same reason as before. The disk drive (SSD) employed in the experiments to store all the data in the out-of-core implementations was a Samsung SSD 850 EVO (Firmware Revision EMT02B6Q) with a capacity of 1 TB. According to the Linux hdparm tool, the read speed of this disk was 10,144.81 MB/s on cached reads and 441.48 MB/s on buffered disk reads.

In our out-of-core implementations the block cache employed 16 GiB (of the 48 GiB), 256 GiB (of the 754 GiB), and 32 GiB (of the 128 GiB) on ua, ut, and ucm, respectively. The rest was left for the operating system kernel and buffers, and for the application's code. Unless explicitly stated otherwise, all the experiments employed all the cores in the computer. Notice that the ua computer has a very slow spinning disk as well as low computational power. Notice also that both ut and ucm have SSD disks, but their performance differs: the ucm computer has an SSD with a SATA interface, which is limited to 600 MiB/s, whereas the ut computer has an SSD with an M.2 interface, which does not have that limitation. In all the experiments, double-precision real matrices were processed.
All the matrices used in these experiments were randomly generated, since generation is fast.

4.2.2. Effect of block sizes.
Figure 8 shows the scaled computational times obtained by our implementations for computing both the QR factorization and the randUTV factorization for several block sizes, when large matrices of a fixed dimension are processed on ua. The aim of these two plots is to determine the optimal block sizes. As can be seen, the performance of the QR factorization does not strongly depend on the block size.
[Figure 8: two panels, scaled time versus block size (lower is better), without and with orthonormal (ON) matrices; curves: OOC-QR-T, OOC-QR-V, and OOC-randUTV-T and OOC-randUTV-V with q = 0, 1, 2.]
FIGURE 8. Performances of QR and randUTV implementations versus block sizes on matrices of dimension … × … in ua.

From now on, in ua the block size 10,240 will be employed for both the out-of-core QR and the out-of-core randUTV factorizations, since it returns near-optimal results in most cases. As ut and ucm have larger main memories than ua, larger matrices will be assessed on those two computers. Since in linear algebra the optimal block sizes usually grow with the matrix sizes being tested, in ut and ucm the block size 20,480 will be employed.

4.2.3. Comparison of Out-Of-Core variants.
The plots in all the following figures include a vertical dashed gray line showing the largest theoretical matrix size that can be stored in the RAM of the computer when the randUTV factorization is computed. For instance, in ua this size is about 80,000 when no orthonormal matrices are built, so that the matrices U and V need not be stored. In contrast, this size is about 46,000 when orthonormal matrices are built, so that the matrices U and V must be stored. In practice, the actual threshold must be slightly smaller than those shown in the pictures, since main memory must also be employed for the operating system kernel, the operating system's disk cache and buffers, the application's code, the application's other data, etc.
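These thresholds follow from a simple storage count: with 8-byte double-precision entries, one n × n matrix fits in RAM while 8n² is at most the memory size, and three matrices (A, U, and V) fit while 24n² is. A back-of-the-envelope check (our illustration) reproduces the two numbers for ua and its 48 GiB of RAM:

    # Back-of-the-envelope check (our illustration) of the dashed "RAM" line
    # for ua: one n x n matrix of 8-byte doubles fits while 8 n^2 <= RAM;
    # with U and V also stored, three matrices must fit, i.e. 24 n^2 <= RAM.
    ram = 48 * 2**30              # 48 GiB in bytes
    print(int((ram / 8) ** 0.5))  # ~80265: A only ("about 80,000")
    print(int((ram / 24) ** 0.5)) # ~46340: A, U and V ("about 46,000")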
[Figure 9: two panels, scaled time versus matrix dimension n (lower is better), without and with ON matrices, with the RAM/Disk threshold marked; curves: OOC-QR-T, OOC-QR-V, and OOC-randUTV-T and OOC-randUTV-V with q = 0, 1, 2.]
FIGURE 9. Performances versus matrix dimensions for the different implementations of the QR and randUTV factorizations in ua.

Figure 9 shows the performances of the different implementations of the QR and randUTV factorizations on several large matrix sizes. As can be seen, overlapping computation and I/O in the QR factorization increases performance slightly: the speedup is between about 3% and 20% when no orthonormal matrices are built, and between about 8% and 22% when orthonormal matrices are built. In contrast, overlapping computation and I/O in randUTV clearly improves performance: the speedup is between about 25% and 37% when no orthonormal matrices are built, and between about 18% and 42% when orthonormal matrices are built.

                                          ut       ucm
    Matrix dimension n                 348,…     143,…
    Computational time             229,627.6  34,165.5
    I/O time                        41,312.9  38,253.5
    Computational time + I/O time  270,940.5  72,419.0
    Real time                      229,910.0  43,326.0
    Real time / Computational time     1.001     1.268
    Computational time / I/O time      5.558     0.893

TABLE 1. Decomposed times (in seconds) of the randUTV factorization of large matrices of dimension n × n on the ut and ucm platforms.

To check the benefits of overlapping computation and I/O on the other platforms, Table 1 reports the total and decomposed times of the computation of the randUTV factorization of large matrices that do not fit in main memory in both ut and ucm, when no orthonormal matrices are built and q = 0 (i.e., no steps of the power iteration process). To build this table, each operation in the factorization (both computational and I/O) was timed and recorded. As can be seen, in ut the performances are very encouraging, because the I/O time is much smaller than the computational time (about 18% of it), thus making the application computation-bound. In ucm, however, the results are different, since the disk of ut is 5.4 times as fast as the disk of ucm. The slower disk of ucm means that the overall computational time and the overall I/O time are similar, the latter being slightly larger. On the other hand, in ut the overlapping of computation and I/O is essentially perfect, since the real time is only 0.12% larger than the computational time. In ucm the overlapping of computation and I/O is almost perfect, since the real time is only 27% larger than the computational time, and 13% larger than the I/O time (which in this case exceeds the computational time).
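The ratios in Table 1 can be read through a simple overlap model: with perfect overlap the real time approaches max(T_comp, T_io), whereas with no overlap at all it would approach T_comp + T_io. A small sketch (ours) checks the table's entries against these two bounds:

    # Reading Table 1 through a simple overlap model (our sketch): with
    # perfect overlap of computation and I/O the real time approaches
    # max(T_comp, T_io); with no overlap it would be T_comp + T_io.
    for name, t_comp, t_io, t_real in [
        ("ut",  229_627.6, 41_312.9, 229_910.0),
        ("ucm",  34_165.5, 38_253.5,  43_326.0),
    ]:
        lower, upper = max(t_comp, t_io), t_comp + t_io
        print(f"{name}: max {lower:,.1f}  sum {upper:,.1f}  real {t_real:,.1f}")
    # ut : real is within 0.2% of the max bound -> essentially perfect overlap
    # ucm: real is 13% above the max bound      -> near-perfect overlap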
[Figure 10: two task-scheduling traces, each with an I/O track and a computation track; I/O tasks are shown as red (write) and green (read) rectangles, and computational tasks such as Appl_r_TD appear on the bottom track.]

FIGURE 10. Example of the scheduling of the tasks in the execution of the randUTV factorization of a … × … matrix in ut in two different periods. The left plot shows the scheduling during one hour; the right plot is a zoom of the first ten minutes of the left plot.

Figure 10 shows a real example of the scheduling of the different tasks during part of the randUTV factorization of a … × … matrix in ut, when no orthonormal matrices are built and q = 0. The left plot shows the scheduling during one hour of the experiment; the right plot is a zoom of the first ten minutes of the left plot. The top part of the two plots shows the I/O tasks performed by the application: a red rectangle is a write operation, whereas a green rectangle is a read operation. The bottom part of the two plots shows the computation performed by the application. The names of the tasks are shown only when there is enough room. As can be seen, the overlapping of computation and I/O is almost perfect: the computer performs I/O operations and computation at the same time. Another interesting observation from this plot is that the I/O operations are usually shorter than the computational operations, despite the high processing power of the computer (36 cores), which makes the application still computation-bound in this case (ut). This means that higher computational power could further reduce the processing time of the factorizations.
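As a hedged illustration of the mechanism behind such schedules, the following sketch shows the classic double-buffering pattern: a background thread prefetches the next block from disk while the current block is being computed on. It is a simplification of, not a substitute for, the paper's algorithm-by-blocks runtime; the file layout matches the earlier generation sketch.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    # Double-buffering sketch (ours, not the paper's runtime): a background
    # thread reads block i+1 from the on-disk matrix while block i is being
    # processed, so that I/O hides behind computation.
    def read_block(A, i, b):
        return np.array(A[i:i + b, :])  # copy one block of rows into RAM

    def process(block):
        return np.linalg.norm(block)    # stand-in for a real block task

    n, b = 40_000, 10_240
    A = np.memmap("A.bin", dtype=np.float64, mode="r", shape=(n, n))
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(read_block, A, 0, b)  # prefetch the first block
        for i in range(0, n, b):
            blk = nxt.result()                # wait for the prefetched block
            if i + b < n:
                nxt = io.submit(read_block, A, i + b, b)  # start next read...
            process(blk)                      # ...while this block is computed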
4.2.4. Performances of Out-Of-Core codes versus In-Core codes.

Figure 11 shows the performances of the best out-of-core implementations as a function of the matrix dimensions. This plot also shows the performances of in-core factorizations, so that the speeds of both types of implementations can be compared. Notice that when no orthonormal matrices are built (left plots), the in-core MKL SVD is much faster on ut and ucm than on ua. In those cases, the MKL SVD is even much faster than the MKL CPQR, despite having a much higher computational cost. These high performances are due to a new implementation of the SVD in MKL that replaces the traditional one on parallel architectures for matrices with dimensions larger than 4,000. We think that some of the techniques employed by Intel in this new implementation of the SVD in MKL could also be applied to our randUTV to make it faster when no orthonormal matrices are built.

The top row of plots in Figure 11 shows the results obtained in ua. In these results, the best out-of-core QR factorization is about 28% slower than the in-core QR factorization when no orthonormal matrices are built, and about 29% slower when orthonormal matrices are built. On the other hand, the out-of-core randUTV is about 51% (q = 0), 53% (q = 1), and 51% (q = 2) slower than the in-core randUTV factorization when no orthonormal matrices are built, and about 41% (q = 0), 36% (q = 1), and 33% (q = 2) slower when orthonormal matrices are built. In all these cases the comparison considers the best performance of the out-of-core implementations against the best performance of the in-core factorizations (which is usually obtained on very large matrices).

The center row of plots in Figure 11 shows the results obtained in ut. In these results, the out-of-core QR factorization is about 2.58 times as slow as the in-core QR factorization when no orthonormal matrices are built, and about 2.10 times as slow when orthonormal matrices are built. On the other hand, the out-of-core randUTV is 1.89 (q = 0), 1.70 (q = 1), and 1.48 (q = 2) times as slow as the in-core randUTV factorization when no orthonormal matrices are built, and about 2.07 (q = 0), 1.83 (q = 1), and 1.80 (q = 2) times as slow when orthonormal matrices are built.

The bottom row of plots in Figure 11 shows the results obtained in ucm. In these results, the out-of-core QR factorization is about 1.76 times as slow as the in-core QR factorization when no orthonormal matrices are built, and about 2.00 times as slow when orthonormal matrices are built. On the other hand, the out-of-core randUTV is about 2.15 (q = 0), 2.29 (q = 1), and 2.36 (q = 2) times as slow as the in-core randUTV factorization when no orthonormal matrices are built, and about 2.10 (q = 0), 2.04 (q = 1), and 2.03 (q = 2) times as slow when orthonormal matrices are built.

Therefore, as can be seen, in the worst cases revealing the rank of matrices stored on the slow external devices is only about two times as slow as revealing the rank of matrices already stored in main memory. In conclusion, our out-of-core implementations of the randUTV factorization are able to efficiently process very large data sets that do not fit in RAM and must be stored on the disk drive. Despite the slow speed of the hard drive relative to the main memory, these methods process such matrices with only a minor loss of performance.
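Throughout these comparisons, q denotes the number of power iteration steps used in the randomized sampling inside randUTV. The following sketch shows the generic randomized range finder with q power iteration steps (the textbook scheme, shown here without the intermediate re-orthonormalizations that robust implementations add, and not the paper's blocked out-of-core code):

    import numpy as np

    # Generic randomized range finder with q power iteration steps, in the
    # spirit of randUTV's sampling stage.  b is the number of sample vectors.
    def sample_range(A, b, q, rng):
        Y = A @ rng.standard_normal((A.shape[1], b))  # b random samples
        for _ in range(q):                            # q = 0, 1, 2 above
            Y = A @ (A.T @ Y)                         # one power iteration
        Q, _ = np.linalg.qr(Y)                        # orthonormal basis
        return Q

    rng = np.random.default_rng(0)
    A = rng.standard_normal((2_000, 2_000))
    Q = sample_range(A, 128, q=2, rng=rng)  # larger q -> sharper rank revelation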
[Figure 11: six panels (rows: ua, ut, ucm; columns: without and with ON matrices), scaled time versus matrix dimension n (lower is better), with the RAM/Disk threshold marked; curves: MKL SVD, MKL CPQR, MKL QR, randUTV with q = 0, 1, 2, OOC-QR-V, and OOC-randUTV-V with q = 0, 1, 2.]
FIGURE 11. Performances versus matrix dimensions for the best implementations of the QR and randUTV factorizations. In-core implementations are included as a reference; recall that the QR factorization does not reveal the rank and serves only as a baseline. The top row shows results in ua, the center row shows results in ut, and the bottom row shows results in ucm.

5. CONCLUSIONS
This paper describes a set of algorithms for computing rank-revealing factorizations of matrices that are stored on external memory devices such as solid-state or spinning disk hard drives. Standard techniques for computing rank-revealing factorizations perform very poorly in this environment, as they inherently consist of a sequence of matrix-vector operations that necessitate a large number of read and write operations to the disk. We use randomization to reorganize the computation to operate on large blocks of the matrix, thereby dramatically reducing the amount of communication required.

Numerical experiments demonstrate that the speed of the new randomized methods operating on data stored on external devices is comparable to that of traditional methods operating on data stored in main memory. This enables the processing of very large matrices in a cost-efficient way. As solid-state drives are rapidly improving in both speed and capacity, the methods described here are likely to gain even more of a competitive advantage in coming years.

ACKNOWLEDGEMENTS
P.G. Martinsson acknowledges support from the Office of Naval Research (award N00014-18-1-2354), from the National Science Foundation (award DMS-1620472), and from Nvidia Corp. G. Quintana-Ortí was supported by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-098156-B-C54, co-financed with FEDER funds. The authors would like to thank Francisco D. Igual (Universidad Complutense de Madrid) and Javier Navarrete (Universitat d'Alacant) for granting access to the ucm and ua servers, respectively.