A Survey of Singular Value Decomposition Methods for Distributed Tall/Skinny Data
Drew Schmidt
Oak Ridge National Laboratory
Oak Ridge, TN, USA
[email protected]
Abstract—The Singular Value Decomposition (SVD) is one of the most important matrix factorizations, enjoying a wide variety of applications across numerous domains. In statistics and data analysis, the common applications of SVD include Principal Components Analysis (PCA) and linear regression. Usually these applications arise on data that has far more rows than columns, so-called "tall/skinny" matrices. In the big data analytics context, this may take the form of hundreds of millions to billions of rows with only a few hundred columns. There is a need, therefore, for fast, accurate, and scalable tall/skinny SVD implementations which can fully utilize modern computing resources. To that end, we present a survey of three different algorithms for computing the SVD for these kinds of tall/skinny data layouts using MPI for communication. We contextualize these with common big data analytics techniques, principally PCA. Finally, we present both CPU and GPU timing results from the Summit supercomputer, and discuss possible alternative approaches.
Index Terms—SVD, PCA, Big Data, MPI, CUDA
I. INTRODUCTION
To call the Singular Value Decomposition (SVD) useful is an extraordinary understatement. It is difficult to think of another matrix factorization that has enjoyed more applications of more importance than the SVD. Perhaps this is because many of the things one may wish to do numerically with matrices can be done via SVD. To name a few of the more important ones, one can compute matrix inverses, solve a system of equations, calculate determinants, compute least squares solutions to overdetermined systems, compute condition numbers, norms, column rank, and on and on. It should be no surprise then that many of the applications of matrices to real world problems are either solved outright by SVD or are simpler to solve because of it.

Data analytics and statistics are not immune from the dominance of this important factorization. One useful application of SVD is in fitting linear and generalized linear models (LM and GLM). The standard "linear regression" formulation is essentially just linear least squares, so this is often solved by a QR factorization alone. However, there are some advantages to using SVD instead. Indeed, SVD is a rank-revealing factorization, and some popular QR algorithms are not. This is important for statisticians because, in the context of statistical experimental design, their so-called model matrices are sometimes rank-degenerate by construction. As for GLM, a good LM fitter can be used to fit a GLM using iteratively re-weighted least squares [1].

Another popular application of SVD in the statistical and data sciences is Principal Components Analysis or PCA [2]. PCA is an exploratory data analysis tool which is often used to visualize a high-dimensional dataset in two or three dimensions while maximizing the amount of "information" (in this case, variability) retained. The dataset consists of a matrix whose columns are variables and whose rows are observations. In this setting, the data is first mean-centered column-by-column, and then projected onto the right singular vectors (alternatively, the right singular vectors are "removed"). This rotation orders the newly constructed variables in decreasing order of variability retained from the original dataset. So plotting only the first few gives a good tradeoff between preserving all of the variability of the original dataset and not being able to visualize it at all.

For data analytics and statistics, datasets are typically of the tall/skinny variety. It is not uncommon, particularly in big data applications, to have only a few hundred or a few thousand columns, but hundreds of millions to billions of rows. In fact, while tall/not-so-skinny applications do arise, square or nearly square matrices are essentially unheard of. Some domains such as bioinformatics have the transposed problem of short/wide matrix shapes. Much of the mathematics of tall/skinny data works out with appropriate transpose arithmetic and symbol-pushing. So for the sake of brevity, we only focus on the tall/skinny case.

In the sections that follow, we will present several algorithms for computing SVD. The algorithms we describe are not new to this article. However, we provide some implementation details which are often left as an exercise to the reader. In particular, we stress the issues that arise in big data contexts, where data is distributed across multiple processes. Our implementations use MPI [3] for managing communication. Of course, much of the big data analytics world has focused on using the MapReduce algorithm [4] instead.
For example, [5] takes an in-depth look at developing a tall/skinny QR using MapReduce, which in many ways is very similar to what we present in Section III-B.

So why bother with MPI? For as often maligned as it is, we argue that it has several serious advantages over MapReduce, even for big data analytics. One disadvantage of the MapReduce approach is that it uses languages and programming models alien to many in the statistical and mathematical sciences. For example, forcing all computation through the key/value pair mapper/reducer model is usually cumbersome and unnatural for matrix computations. One may challenge this as a matter of taste. However, we would point out that no one programs in this way among the large variety of programming languages and analysis packages that implement matrix calculations (Matlab, R, NumPy, Julia, ...). One programs in this way with MapReduce only because they must. In contrast, the Single Program/Multiple Data or SPMD model [6] common to MPI programs is a natural extension of serial programming. It is fair to say that MapReduce makes parallelism easy, in that it is never explicit. But this comes at the cost of making everything else harder.

Perhaps more importantly, MPI allows us to implement our matrix computations of interest with significantly less communication overhead inherently. Indeed, each iteration of MapReduce includes a shuffle operation that is equivalent to an MPI_Alltoall call [7], [8]. And any given implementation may require multiple iterations, meaning multiple shuffles. However, for our SVD computations, we are generally able to use a single MPI_Allreduce call.
Finally, most MapReduce users today use the Apache Spark framework [9]. This is a very heavy dependency which can have problems integrating with shared resources like clusters and supercomputers. And the performance of Spark is known to be poor compared to high performance computing solutions [10].

For all of these reasons, we will only proceed with an MPI-based communication strategy. But before we can properly begin, we outline some background information.

II. BACKGROUND
A. Basic Definitions
Let A be a real-valued matrix with m rows and n columns. Then we take the SVD of A to be:

$$A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{n \times r}$$

where r is usually the minimum of m and n, although it may be taken to be the rank of A, which is no greater than the minimum of m and n. This is sometimes called the "compact SVD". Additionally, Σ is a diagonal matrix, and U and V are orthogonal. To say that a matrix is orthogonal means that its transpose is its inverse. For rectangular matrices (as is often the case for SVD), when we say that a matrix is orthogonal, we only mean so along the short dimension. So if U is as above with m > n ≥ r, then U^T U = I_{r×r}. However, U U^T ≠ I_{m×m}, since the rows of U cannot be linearly independent. For the remainder, we will always assume that m > n.

B. Truncated SVD
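To make the compact factorization concrete, the following minimal sketch computes the thin SVD of a small tall/skinny matrix with LAPACK's gesdd through the LAPACKE C interface. The matrix entries are arbitrary placeholders, and the example is illustrative only, not part of the surveyed implementations.

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
      const int m = 6, n = 2;                 /* tall/skinny: m > n */
      /* column-major storage, as LAPACK expects */
      double A[6*2] = {1,2,3,4,5,6,  2,1,2,1,2,1};
      double s[2];                            /* singular values */
      double U[6*2];                          /* left singular vectors (m x n) */
      double Vt[2*2];                         /* right singular vectors, transposed */

      /* jobz='S' requests the compact ("thin") SVD; A is overwritten */
      int info = LAPACKE_dgesdd(LAPACK_COL_MAJOR, 'S', m, n, A, m, s, U, m, Vt, n);
      if (info != 0) return 1;

      /* here U^T U = I_n holds, but U U^T cannot equal I_m since m > n */
      printf("sigma = %f %f\n", s[0], s[1]);
      return 0;
    }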
Sometimes only a few singular values/vectors are required for a given application, as is often the case with PCA. However, because of limitations inherent to the algorithms that compute them, if you want one, you get them all. This can add significant runtime and memory overhead to calculate data that will simply be thrown away.

Although it is nearly 100 years old, the so-called power method is still used today for truncated problems. We do not include the details of the method here, but there are many good surveys on the topic, for example [11]. Another common technique for truncated SVD is the Lanczos method [12], and its many variants. Both of these classes of methods are iterative solvers which are in actual fact techniques for spectral decomposition, not SVD. That is, these methods provide approximations for the largest eigenvalues and their corresponding eigenvectors. There is a close relationship between SVD and the eigenproblem via the so-called "normal equations matrix," which we discuss in depth in Section III-A. But there is another relationship that is exploited specifically for Lanczos methods. For a rectangular matrix A of dimension m × n with m > n, you must first "square up" the matrix:

$$H := \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$$

Then H is a square, symmetric matrix of order m + n, so you can perform the Lanczos method k times to approximate the eigenvalues of H. Note that for full convergence, we need to run the iteration m + n times, not n (which would be a very expensive way of computing them). The approximations to the singular values of A are given by the non-zero eigenvalues of H, and the singular vectors (left and right) are found in $Y_k := \sqrt{2}\, Q_k S_k$ (Theorem 3.3.4 of [13]). Specifically, the first m rows of Y_k are the left singular vectors, and the remaining n rows are the right singular vectors.

For the purposes of computing SVD, the Lanczos method primarily relies on a series of matrix-vector products involving H. Because of this and the block anti-diagonal structure of H, one does not ever need to explicitly form it in computer memory. The reliance of this technique on nothing more complicated than matrix-vector products becomes particularly valuable if A is itself sparse.

One common application of the Lanczos method for data science is Latent Semantic Indexing [14], which comes from text analysis. It involves computing the truncated SVD of a generally very sparse matrix called the term frequency-inverse document frequency (tf-idf) matrix. The Lanczos method and the power iteration are sometimes used in implementations of the PageRank algorithm [15], which is a technique to find influential nodes in a graph, although this example is about eigendecomposition, not strictly SVD. It is worth noting, however, that these are both sparse matrix problems.

More recently, there is the method of computing truncated SVD via random projections [16]. This technique works well for dense matrices and can be easily implemented for distributed data problems, and as such is the subject of Section III-C.

C. Computing Assumptions
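To illustrate how these iterative methods can use H without ever forming it, the sketch below applies H to a vector of length m + n using only products with A and its transpose (dense here, via CBLAS gemv; a sparse A would use a sparse matrix-vector product instead). The routine name and dense storage are assumptions made purely for illustration.

    #include <cblas.h>

    /* Compute y = H x where H = [0 A; A^T 0], without forming H.
     * A is m x n (column-major); x and y both have length m + n.
     * Blocks: x = [x1; x2] with x1 of length m and x2 of length n.
     * Then y1 = A x2 (length m) and y2 = A^T x1 (length n). */
    void matvec_H(int m, int n, const double *A, const double *x, double *y)
    {
      /* y1 = A * x2 */
      cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, 1.0, A, m,
                  x + m, 1, 0.0, y, 1);
      /* y2 = A^T * x1 */
      cblas_dgemv(CblasColMajor, CblasTrans, m, n, 1.0, A, m,
                  x, 1, 0.0, y + m, 1);
    }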
In the following section, our implementation emphasis is on distributed data. We use the terms "distributed" and "distribution" in the computing sense, not the statistical sense. Specifically, the data is a matrix split across processes in a one-dimensional (1-d) fashion, so that processes in the MPI communicator own contiguous blocks of rows. If a process owns part of a row, it owns the entire row. If we have p processes in the MPI communicator, we conceptually split a matrix A as:

$$A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \end{bmatrix} \qquad (1)$$

where A_1 is the block of rows on MPI rank 0, A_2 on rank 1, and so on.

We assume that any time a matrix is "large", it is distributed, and otherwise it is common to all processes. So for example, if A is a distributed matrix with many rows and few columns and if we have A = UΣV^T, then U is distributed in identical fashion to A, but Σ and V are both small and so every process owns a copy of each.

Matrix multiplications always take the form of a distributed matrix times a non-distributed matrix resulting in a distributed matrix, or the transpose of a distributed matrix times a distributed matrix, resulting in a non-distributed matrix. The former is an entirely local computation, and the latter is a local matrix product followed by an MPI_Allreduce.

We have noted that communication will be handled by MPI. As for the local computations, we rely on high-quality implementations of the de facto standards, the BLAS [17] and LAPACK [18]. For example, in our benchmarks in Section IV, we use OpenBLAS [19] for computations on CPU. For GPUs, we use the CUDA environment [20], including their BLAS and LAPACK semi-clones cuBLAS and cuSOLVER, although we discuss alternate strategies in Section V.

When referring to some specific local computations which are to be performed by an LAPACK implementation, we will refer to the function name without the type character. So for example, when referring to the symmetric eigensolver, we use syevr, with the understanding that the "s" or "d" variant will be used appropriately. Likewise for a QR, geqrf, and for a local SVD, gesdd. We will also make regular use of the shorthand qr_Q and qr_R to refer to functions which compute the Q or R matrices of a QR factorization, respectively. In algorithmic descriptions, the implementation details of these functions will be noted parenthetically if at all ambiguous. A local matrix product is always assumed to be computed by a BLAS gemm operation.
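The two multiplication patterns above translate directly into code. The following sketch uses hypothetical names and plain gemm calls (a syrk would exploit the symmetry of the second product); it shows a purely local product with a small replicated matrix, and a crossproduct formed by a local gemm followed by an MPI_Allreduce.

    #include <mpi.h>
    #include <cblas.h>

    /* A_local is m_local x n (column-major), owned by this rank.
     * V is n x k and replicated on every rank. */

    /* Pattern 1: distributed times non-distributed -- entirely local.
     * Y_local = A_local * V gives the local rows of the distributed Y. */
    void dist_times_small(int m_local, int n, int k,
                          const double *A_local, const double *V, double *Y_local)
    {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m_local, k, n,
                  1.0, A_local, m_local, V, n, 0.0, Y_local, m_local);
    }

    /* Pattern 2: transpose of distributed times distributed -- a local gemm
     * followed by a global sum, yielding the replicated n x n matrix N. */
    void crossprod(int m_local, int n, const double *A_local, double *N, MPI_Comm comm)
    {
      cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, m_local,
                  1.0, A_local, m_local, A_local, m_local, 0.0, N, n);
      MPI_Allreduce(MPI_IN_PLACE, N, n*n, MPI_DOUBLE, MPI_SUM, comm);
    }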
III. ALGORITHMS

A. Method 1: The Normal Equations
Given an over-determined system of equations Ax = b, the "normal equation" is

$$A^T A x = A^T b$$

For this reason, A^T A is sometimes referred to as "the normal equations matrix", or even sometimes quite incorrectly as "the normal equations." In statistics, this is sometimes called the "crossproducts matrix". In fact, in the popular statistics and data programming language R [21], this operation is performed by the crossprod function.

Because A^T A is symmetric, we know that it must have an eigendecomposition A^T A = V Λ V^T, where Λ is the diagonal matrix of eigenvalues and V is an orthogonal matrix of eigenvectors. But taking the SVD of A we have

$$A^T A = \left(U \Sigma V^T\right)^T \left(U \Sigma V^T\right) = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$$

So the eigenvalues of A^T A are the squares of the singular values of A, and the right singular vectors of A form a set of eigenvectors of A^T A. If we know Σ and V then we can recover U easily since U = A V Σ^{-1}.

If A is distributed as in Formula (1), then each MPI rank i can compute the local N_i = A_i^T A_i. Then we merely need to add up all the entries of all the local N_i matrices across all the ranks using a call to MPI_Allreduce. We summarize this procedure in Algorithm 1.

Algorithm 1: SVD Via Normal Equations Matrix
  Data: A with m > n
  Result: Σ, optionally U and V
  Compute N_local = A_local^T A_local;
  Compute N = MPI_Allreduce(N_local);
  Compute the eigenvalues Λ (optionally the eigenvectors V) of N locally via syevr;
  Let σ_i = √λ_i for i = 1, ..., n;
  if compute U then U = A V Σ^{-1}; end

This approach has several advantages for the tall/skinny case. For one, it is extremely simple to implement. Additionally, because it is dominated by a matrix-matrix multiplication, it is extremely fast. The downside is that it may have numerical issues. Indeed, explicitly forming the normal equations matrix squares the condition number of A. For applications like linear regression, this may be a serious problem. For PCA, it probably does not matter.

If the data resides on a GPU and the MPI implementation supports GPUDirect, then the MPI_Allreduce computation can be done as-is. Otherwise, a CPU buffer will be required.
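A minimal sketch of the local portion of Algorithm 1 follows, assuming the replicated matrix N = A^T A has already been formed as in the crossproduct pattern of Section II-C; the function name is hypothetical. Note that syevr returns eigenvalues in ascending order, so the square roots are reversed to give singular values in the usual descending order (the columns of V would need the same reordering).

    #include <stdlib.h>
    #include <math.h>
    #include <lapacke.h>

    /* Given the replicated crossproducts matrix N (n x n, column-major),
     * fill sigma with the n singular values of A in descending order and
     * V (n x n) with eigenvectors of N, ordered by ascending eigenvalue.
     * N is destroyed.  Returns 0 on success. */
    int normal_eq_sv(int n, double *N, double *sigma, double *V)
    {
      double *lambda = malloc(n * sizeof(*lambda));
      int *isuppz = malloc(2 * n * sizeof(*isuppz));
      int nfound;

      int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L', n, N, n,
                                0.0, 0.0, 0, 0, 0.0, &nfound, lambda,
                                V, n, isuppz);
      if (info == 0) {
        /* eigenvalues of A^T A are the squared singular values of A;
         * syevr returns them ascending, so reverse while taking sqrt */
        for (int i = 0; i < n; i++) {
          double l = lambda[n - 1 - i];
          sigma[i] = sqrt(l > 0.0 ? l : 0.0);
        }
      }

      free(lambda); free(isuppz);
      return info;
    }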
B. Method 2: QR
We can factor

$$A_{m \times n} = Q_{m \times n} R_{n \times n}$$

where R is upper triangular and Q is orthogonal. Strictly speaking, this is the "thin" QR factorization of A [22]. However, we will never need or make additional reference to the full factorization.
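A small helper in the spirit of the qr_R shorthand from Section II-C might look as follows. It is a sketch using the LAPACKE interface; the name and the in-place destruction of the input are assumptions, and it simply extracts the upper triangle left behind by geqrf.

    #include <stdlib.h>
    #include <lapacke.h>

    /* qr_R: compute the n x n upper-triangular factor R of the thin QR of the
     * m x n matrix A (column-major, m >= n).  A is destroyed; R is n x n. */
    int qr_R(int m, int n, double *A, double *R)
    {
      double *tau = malloc(n * sizeof(*tau));
      int info = LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, m, tau);
      if (info == 0) {
        /* R is the upper triangle of the factored A; zero out the rest */
        for (int j = 0; j < n; j++)
          for (int i = 0; i < n; i++)
            R[i + j*n] = (i <= j) ? A[i + j*m] : 0.0;
      }
      free(tau);
      return info;
    }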
Algorithm 2: QR Reducer
  Data: Real matrices R_in and R_out of order n
  Let T := [R_in; R_out] (R_in stacked on top of R_out);
  Set R_out = qr_R(T).

Algorithm 3: SVD via CAQR
  Data: A with m > n
  Result: Σ, optionally U and V
  Compute R_local = qr_R(A_local) via geqrf;
  Compute R = QR_allreduce(R_local);
  Compute Σ and optionally V via gesdd; recover U = A V Σ^{-1} if desired.

With A = QR, if we next take the SVD of R then we have

$$A = QR = Q \left(U_R \Sigma_R V_R^T\right) = \left(Q U_R\right) \Sigma_R V_R^T$$

Then UΣV^T is a singular value decomposition of A, where U = Q U_R (which is orthogonal, being a product of orthogonal matrices), Σ = Σ_R, and V = V_R. So if we can compute the right factor R of A, then we can compute its singular values and right singular vectors. As before, we can recover the left singular vectors if desired from the identity U = A V Σ^{-1}.

The key idea behind computing R is that, given a matrix split by rows as in Formula (1), the R of A can be computed by taking the R matrices of the pieces, stacking them two at a time, and then computing the R matrix of each stack. This process continues until all of the pieces have contributed. This is known as a communication-avoiding QR, or CAQR [23], [24], and it is a particular kind of tall/skinny QR, or TSQR.

More formally, if we split the rows of A into A_1 and A_2 so that A = [A_1; A_2], then we can factor A_i = Q_i R_i (i = 1, 2). Then

$$A = \begin{bmatrix} Q_1 & \\ & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ R_2 \end{bmatrix}$$

This is not yet a QR factorization of A, because the right factor is not upper triangular. Denote the left factor as Q̃ and let Q̂R be a QR factorization of the right factor. Then A = Q̃Q̂R. And because Q̃ and Q̂ are orthogonal, so is their product. And hence this is a QR factorization of A.

We can cast this as a binary operation which accepts two R matrices, stacks them, and computes and returns their R matrix. This binary operation is always associative, and under some circumstances it may be commutative [24]. This allows us to create a custom MPI reduce operation, which can take advantage of MPI's optimizations like recursive doubling to minimize communication [25]. We summarize this in Algorithm 2.

For floats, a C-like implementation may look like:

    void qr_reducer(void *Rin, void *Rout, int *len, MPI_Datatype *dtype);

    MPI_Datatype qrtype;
    MPI_Type_contiguous(n*n, MPI_FLOAT, &qrtype);
    MPI_Type_commit(&qrtype);

    MPI_Op qrop;
    MPI_Op_create((MPI_User_function*) qr_reducer, 1, &qrop);

The reducer can then be called as follows:
    MPI_Allreduce(Rin, Rout, 1, qrtype, qrop, comm);
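The body of the reducer itself is only declared above. A minimal sketch, under the assumptions that the reduction is called with a count of one (as above) and that the global order n of the R matrices is visible to the callback, might look like the following: it stacks the two inputs and emits the R factor of the stack, exactly the binary operation of Algorithm 2.

    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>
    #include <lapacke.h>

    /* Order of the R matrices; carried in a global here only for the sketch. */
    static int n;

    /* Custom MPI reduction: Rout = qr_R([Rin; Rout]), both n x n column-major.
     * len is ignored because the Allreduce above uses a count of 1. */
    void qr_reducer(void *Rin_, void *Rout_, int *len, MPI_Datatype *dtype)
    {
      (void)len; (void)dtype;
      float *Rin = (float*) Rin_, *Rout = (float*) Rout_;
      float *T   = malloc((size_t)2*n * n * sizeof(*T));   /* stacked 2n x n */
      float *tau = malloc(n * sizeof(*tau));

      /* stack Rin on top of Rout, column by column */
      for (int j = 0; j < n; j++) {
        memcpy(T + j*2*n,     Rin  + j*n, n * sizeof(*T));
        memcpy(T + j*2*n + n, Rout + j*n, n * sizeof(*T));
      }

      /* R factor of the stacked matrix, emitted into Rout */
      LAPACKE_sgeqrf(LAPACK_COL_MAJOR, 2*n, n, T, 2*n, tau);
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
          Rout[i + j*n] = (i <= j) ? T[i + j*2*n] : 0.0f;

      free(T); free(tau);
    }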
We summarize the use of the QR reducer to implement the CAQR in Algorithm 3. As a final note, there are some subtleties here if the data is a GPU device pointer. Specifically, at this time on the compute platform we run our experiments on, GPUDirect does not support custom operations. So we must use CPU buffers for transferring the data. This creates some unavoidable overhead in copying the two input R matrices from host to device, and then the emitted R matrix from device to host.

C. Method 3: Truncated SVD via Random Projections
Finally, we present a truncated SVD algorithm from the famous "Finding Structure with Randomness" [16]. The details are more involved than the above, so we do not repeat them here. We note that the key ideas come from the Prototype for Randomized SVD and Algorithm 4.4 pieces of the paper.

Since this is a truncated SVD, we specify an additional parameter k indicating the number of singular values/vectors to compute. There is also a parameter q which is an exponent, but which we treat as a hyper-parameter. Finally, there is a random matrix Ω used internally to the algorithm for initialization (the random projection). The original authors use data generated from a standard normal distribution. But we find experimentally that random uniform works just as well, and with the advantage of being slightly faster to generate.

The computations involved are a series of matrix products and computations of Q matrices from QR factorizations. Some of the matrix products are communication-free (see the discussion in Section II-C for details), and some of the Q computations are also purely local. For the Q computations which operate on distributed data, a TSQR should be used. Naturally, we can use the CAQR from Algorithm 3 of Section III-B. We summarize the details in Algorithm 4.

Finally, this technique has been successfully employed for truncated PCA on large datasets [26] using ScaLAPACK [27]. We will compare our TSQR via CAQR approach to one using ScaLAPACK for the intermediate calculations in Section IV.

Algorithm 4: Randomized SVD
  Data: A with m > n, integers k < n and q
  Result: Σ, optionally U and V
  Generate Ω_{n×k} random uniform or standard normal;
  Let Y_{m×k} = A Ω;
  Compute Q_Y = qr_Q(Y) (CAQR);
  for i ← 1 to q do
    Z_{n×k} = A^T Q_Y;
    Q_Z = qr_Q(Z) (local);
    Y = A Q_Z;
    Q_Y = qr_Q(Y) (CAQR);
  end
  Let B_{k×n} = Q_Y^T A;
  Local SVD of B gives approximations for Σ, and optionally V^T; recover U = Q_Y U_B if desired.
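A condensed sketch of the subspace-iteration portion of Algorithm 4 follows. It relies on two hypothetical helpers that are declared but not shown: caqr_Q, returning the local rows of the Q factor of a distributed tall/skinny matrix (via the CAQR of Section III-B), and local_qr_Q, computing the Q factor of a small replicated matrix locally. The matrix products follow the patterns of Section II-C.

    #include <stdlib.h>
    #include <mpi.h>
    #include <cblas.h>

    /* Assumed helpers (not shown):
     *   caqr_Q     -- Q factor of a distributed tall/skinny matrix (local rows out)
     *   local_qr_Q -- Q factor of a small replicated matrix, computed locally */
    void caqr_Q(int m_local, int k, const double *X_local, double *Q_local, MPI_Comm comm);
    void local_qr_Q(int rows, int cols, const double *X, double *Q);

    /* Subspace iteration of the randomized SVD (Algorithm 4).
     * A_local: m_local x n local rows; Omega: n x k replicated random matrix.
     * On return, QY_local holds the local rows of Q_Y (m_local x k). */
    void rsvd_subspace(int m_local, int n, int k, int q,
                       const double *A_local, const double *Omega,
                       double *QY_local, MPI_Comm comm)
    {
      double *Y_local = malloc((size_t)m_local * k * sizeof(*Y_local));
      double *Z  = malloc((size_t)n * k * sizeof(*Z));
      double *QZ = malloc((size_t)n * k * sizeof(*QZ));

      /* Y = A * Omega (local), then Q_Y = qr_Q(Y) (CAQR) */
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m_local, k, n,
                  1.0, A_local, m_local, Omega, n, 0.0, Y_local, m_local);
      caqr_Q(m_local, k, Y_local, QY_local, comm);

      for (int i = 0; i < q; i++) {
        /* Z = A^T * Q_Y : local gemm followed by a global sum */
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, k, m_local,
                    1.0, A_local, m_local, QY_local, m_local, 0.0, Z, n);
        MPI_Allreduce(MPI_IN_PLACE, Z, n*k, MPI_DOUBLE, MPI_SUM, comm);
        local_qr_Q(n, k, Z, QZ);            /* Q_Z = qr_Q(Z), purely local */

        /* Y = A * Q_Z (local), then Q_Y = qr_Q(Y) (CAQR) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m_local, k, n,
                    1.0, A_local, m_local, QZ, n, 0.0, Y_local, m_local);
        caqr_Q(m_local, k, Y_local, QY_local, comm);
      }

      free(Y_local); free(Z); free(QZ);
      /* B = Q_Y^T A (local gemm + Allreduce) and a local SVD of B then give
       * the approximate singular values, as in the last step of Algorithm 4. */
    }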
IV. EXPERIMENTS AND DISCUSSION

Below we present the results of several experiments. We run all of our benchmarks on Summit, which at the time of writing is ranked second on the Top 500 list [28], [29]. Summit is an IBM system managed by the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory. It has 4608 nodes, each equipped with two 22-core IBM POWER9 CPUs and six NVIDIA Tesla V100 GPUs. The GPUs are connected by NVIDIA's high-speed NVLink interconnect, and nodes are connected by a high-performance Mellanox InfiniBand network. The CUDA-aware MPI from IBM (Spectrum MPI) allows for some GPUDirect communication, which we exploit whenever possible.

For the benchmarks, we present both CPU and GPU implementations of the three algorithms presented in Section III. We will refer to them as cpsvd for crossproducts-svd, tssvd for tall/skinny CAQR-based SVD, and rsvd for randomized SVD, being implementations of the algorithms from Sections III-A, III-B, and III-C, respectively. For each, we measure only the time to compute the singular values (or their approximations), since that is the only portion requiring communication. We would expect some separation of the CPU and GPU timings in favor of GPUs if additionally computing the singular vectors.

As we evaluate the various implementations throughout, we vary the sizes of the inputs, and will describe them in detail as we describe each particular result. In each case, we use random data from a standard normal distribution. We also present results for both 32-bit and 64-bit floating point numbers (C++ float and double). In each case, we operate on the same size matrix in bytes, using half the number of rows for the 64-bit data as for the 32-bit data. All matrices are chosen to have 250 columns. This is somewhat arbitrary, but it gives a tall/skinny matrix, and in data analysis, the number of columns often numbers no more than a few hundred. Another constant throughout all experiments is that any time we present a result for the randomized SVD, we use k = 2, q = 2, and we use random uniform data for the projection.

We begin by examining the performance of the three implementations on a single node, with results presented in Figure 1. This is a strong scaling benchmark, where we fix a data size and increase the number of resources.

Fig. 1: SVD benchmarks on one Summit node. The x-axis shows the number of resources; for CPU this is all cores (no hyperthreads) on each physical CPU. Summit has two physical CPUs and six GPUs per node.

Each Summit node has two physical CPUs and six GPUs, hence the disparity in the number of timing results for each.

Let us begin by examining the float vs double performance. Recall that the data size for each test is the same (twice as many rows for float as for double), so we would expect the two to be roughly the same. And indeed, this holds true for all CPU timings. However, there are some interesting differences for GPU. The rsvd on one GPU completes in roughly half the time for float than for double. We are unable to dismiss this as mere sampling variation, as re-running the experiment multiple times on different nodes produces similar results. However, the cpsvd results may shed some light on what is occurring. Notice here the large discrepancy between the GPU run times across the two fundamental types.
Examination of the benchmark by the NVIDIA Nsight profiler [30] shows that 95% of the total run time is dominated by the random generator (run prior to launching the cpsvd kernel) together with a matrix multiplication, carried out by cuBLAS. The striking difference in performance suggests that perhaps the sgemm variant is using the tensor cores of the GPU. We evaluated the benchmark with various program-level and environment-level settings, but were unable to affect a different result in the timings. We contacted an NVIDIA support engineer with deep familiarity with Summit for clarification; but if the root of the issue was discovered, it was not shared with the author. Given the relative consistency for the tssvd, which has very few matrix multiplications, compared to the rsvd, which has several, and finally to the cpsvd, which is essentially only a matrix multiplication, it seems reasonable to conclude that the performance difference we see here is due to acceleration by the tensor cores. Why we are unable to disable this is still unclear, however. For a detailed list of our software environment, see Section A.

Next, let us compare the CPU vs GPU performance. The discussion above colors the evaluation of cpsvd and rsvd. As for tssvd, the run times are fairly close to each other for CPU and GPU. This is likely because of the additional overhead required moving data back and forth between host and device memory during the QR_allreduce computation, as discussed at the end of Section III-B. Also, the local problem at each step of the reduction operation amounts to stacking the two n × n matrices on top of each other, then emitting the R matrix computed from the QR factorization of the 2n × n matrix. With our fixed n = 250, this problem size is fairly small, so we never get to take full advantage of the GPU flops. Specifically, each local QR operates on only 1 MB in double precision and 0.5 MB in single. This likely explains why the double precision GPU version performs slightly better than its single precision variant. One final thing worth noting is that the CPU cores on Summit are extremely fast, making them much more competitive in general.

Fig. 2: SVD benchmarks on 20 and 200 Summit nodes, on problems of size 1 TB and 10 TB, respectively. Results here are node-level performance. For example, 20 nodes with a CPU backend is 40 total physical CPUs (all cores, no hyperthreads), and 20 nodes with a GPU backend is 120 total GPUs.

Next we examine how each of the implementations scales in the weak sense. The results are presented in Figure 2, and they show the timings from 1 TB and 10 TB total problems at 20 and 200 nodes, respectively. This works out to a per-GPU local problem size close to that in our first experiment (here 8.333 GB vs 8.5 GB formerly). At 20 nodes, each matrix has 5 · 10^8 rows for double precision data and 10^9 for single, and at 200 nodes each matrix has 5 · 10^9 rows for double precision data and 10^10 rows for single. One notable observation before proceeding is that for the smaller data size, each index is 32-bit addressable, while the number of rows is not 32-bit addressable for the larger matrices. This is an issue we will return to in the next experiment.

Fig. 3: SVD benchmarks compared to ScaLAPACK on 20 Summit nodes. The CPU backend refers to our custom implementation, while the ScaLAPACK backend refers to an implementation following the same algorithm but with ScaLAPACK functions.

Proceeding as before, if we first examine the plot for a comparison of float vs double performance, we find that the analysis above appears to still hold.
So for the sake of brevity we do not repeat it here. Comparing CPU vs GPU performance, we first note that this is a node-level comparison, compared to the above which was resource-level. For the CPU times, this accounts for 40 physical CPUs at the smaller size and 400 at the larger one. Likewise, this is 120 GPUs at the smaller size and 1200 at the larger. And indeed, what we see is a roughly three-fold performance difference over what we saw above in Figure 1. The only observation of note is that the GPU run times are all largely flat, which is ideal in this case, while the CPU variants have worse scaling. We do not immediately understand why this is so. It is possible that because Summit is essentially a GPU machine, the vendor MPI library may not be well-optimized for intra-node CPU communication. The total number of MPI ranks is 840 and 8400 for CPU, but only 120 and 1200 for GPU. Again we note that a full list of our software environment is provided in Section A.

In our final experiment, we present the results of comparing our approach to one using ScaLAPACK functions. In all cases, we use a process grid with one column, and each matrix has a × blocking factor. The timings from the experiment are shown in Figure 3. For the 1 TB problem size matrices, the pdgesvd and psgesvd ScaLAPACK functions both require an amount of workspace that overflows a 32-bit integer, making it impossible to use them. Although this was not done intentionally on our part, it is a good demonstration of the need for optimized tall/skinny routines. We discuss some possible future strategies for existing ScaLAPACK codebases in Section V.

So instead, we use a custom routine which first reduces the matrix using pdgeqrf or psgeqrf. Contrary to our tall/skinny SVD using the CAQR, these functions use Householder transformations [31]. For simplicity in presenting and discussing the data, we also refer to this routine as tssvd. For this particular routine, the ScaLAPACK version does quite well, being noticeably faster than the alternative. Although this too suffers from an indexing issue preventing yet larger experiments, the performance of this old library is quite impressive. For the other two approaches, our implementation is faster, with the best performance shown in the single precision data case. Although in absolute terms, the run times for each are small. This shows that so long as matrix sizes conform to those which ScaLAPACK is capable of handling and one does not need use of or have access to GPUs, using ScaLAPACK is still a very viable choice for many applications nearly 25 years later.

V. CONCLUSIONS AND FUTURE WORK
We have presented a survey of methods for computing the SVD or its approximation for distributed tall/skinny data, with particular attention given to the details of implementing these using MPI for communication. We compared implementations of these algorithms for both CPU and GPU, single and double precision, and presented the results of several experiments from running these on the Summit supercomputer.

For local matrix factorizations on GPU, such as computing the eigendecomposition in Section III-A or the various QR factorizations, we use the vendor library cuSOLVER. Although we have not yet conducted serious experiments with it, it would be interesting to use MAGMA [32] instead. MAGMA provides many advantages over cuSOLVER in terms of functionality, and its API is much more similar to LAPACK, making porting legacy CPU codes to GPUs simpler than using the vendor alternative. As far as performance, for some factorizations unrelated to those of interest here, the performance of MAGMA seems quite good [33]; a quick scan of the literature did not produce more immediately relevant results. Finally, although cuSOLVER contains only a subset of MAGMA functionality, swapping cuSOLVER for MAGMA is unfortunately far from a trivial process.

In Section IV, we compared some of our implementations to those using ScaLAPACK functions for the linear algebra operations. ScaLAPACK is a very well-written library, but it is showing its age in many ways, notably its exclusively 32-bit indexing and inability to utilize GPUs. There is a modern replacement for ScaLAPACK called SLATE [34] which alleviates these and other issues. Conveniently, SLATE works as an
LD_PRELOAD replacement for ScaLAPACK, hijacking symbols at runtime so that supported kernels are replaced by high-performance, GPU-enabled variants. This would make experimentation for us very simple. However, SLATE is still a young and developing project. Through private correspondence with one of the developers some time ago, the author learned that SLATE does not include optimizations for tall/skinny data. Using this without caution could potentially result in an experiment which successfully runs and utilizes GPUs, but with deceptively poor performance. As SLATE matures, it will be interesting to revisit the experiments above.

ACKNOWLEDGEMENTS
We wish to thank George Ostrouchov and Wei-Chen Chen, conversations with whom over the past decade have helped shape the author's thinking on this topic.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

APPENDIX A: SOFTWARE ENVIRONMENT
Our implementation is written in C++, and makes extensive use of standard interfaces (e.g. BLAS, LAPACK, etc.) and vendor libraries (e.g. cuBLAS, cuSOLVER, etc.). All Summit experiments reported above used the following software and versions:

• gcc 8.1.1
• IBM Spectrum MPI 10.3.1.2
• NVIDIA CUDA 10.1.243
• OpenBLAS 0.3.9
• Netlib ScaLAPACK 2.0.2
• CXXFLAGS = -O2 -std=c++17 -mcpu=native -mtune=native
• OpenMP (for omp simd)
• OMP_NUM_THREADS=1
REFERENCES

[1] P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. London, UK: Chapman and Hall, 1989.
[2] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, "Singular value decomposition and principal component analysis," in A Practical Approach to Microarray Data Analysis. Springer, 2003, pp. 91–109.
[3] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA, USA: MIT Press Scientific and Engineering Computation Series, 1994.
[4] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[5] P. G. Constantine and D. F. Gleich, "Tall and skinny QR factorizations in MapReduce architectures," in Proceedings of the Second International Workshop on MapReduce and its Applications, 2011, pp. 43–50.
[6] F. Darema, "The SPMD model: Past, present and future," in European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. Springer, 2001, pp. 1–1.
[7] T. Tu, C. A. Rendleman, D. W. Borhani, R. O. Dror, J. Gullingsrud, M. O. Jensen, J. L. Klepeis, P. Maragakis, P. Miller, K. A. Stafford et al., "A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories," in SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, 2008, pp. 1–12.
[8] S. J. Plimpton and K. D. Devine, "MapReduce in MPI for large-scale graph algorithms," Parallel Computing, vol. 37, no. 9, pp. 610–632, 2011.
[9] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin et al., "Apache Spark: a unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[10] P. Xenopoulos, J. Daniel, M. Matheson, and S. Sukumar, "Big data analytics on HPC architectures: Performance and cost," in 2016 IEEE International Conference on Big Data. IEEE, 2016, pp. 2286–2295.
[11] T. E. Booth, "Power iteration method for the several largest eigenvalues and eigenfunctions," Nuclear Science and Engineering, vol. 154, no. 1, pp. 48–62, 2006.
[12] C. Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators. United States Governm. Press Office, Los Angeles, CA, 1950.
[13] J. W. Demmel, Applied Numerical Linear Algebra. SIAM, 1997.
[14] M. W. Berry, S. T. Dumais, and G. W. O'Brien, "Using linear algebra for intelligent information retrieval," SIAM Review, vol. 37, no. 4, pp. 573–595, 1995.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford InfoLab, Tech. Rep., 1999.
[16] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[17] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, "Basic linear algebra subprograms for Fortran usage," ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp. 308–323, 1979.
[18] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1999.
[19] Z. Xianyi, W. Qian, and Z. Chothia, "OpenBLAS," URL: http://xianyi.github.io/OpenBLAS, p. 88, 2012.
[20] D. Kirk et al., "NVIDIA CUDA software and GPU parallel computing architecture," in ISMM, vol. 7, 2007, pp. 103–104.
[21] R Core Team, "R language definition," Vienna, Austria: R Foundation for Statistical Computing, 2000.
[22] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: The Johns Hopkins University Press, 1996.
[23] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, "Communication-avoiding parallel and sequential QR factorizations," CoRR abs/0806.2159, 2008.
[24] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou, "QR factorization of tall and skinny matrices in a grid computing environment," in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 2010, pp. 1–11.
[25] R. Rabenseifner, "Optimization of collective reduction operations," in International Conference on Computational Science. Springer, 2004, pp. 1–9.
[26] D. Schmidt, W.-C. Chen, M. A. Matheson, and G. Ostrouchov, "Programming with big data in R: Scaling analytics from one to thousands of nodes," Big Data Research, vol. 8, pp. 1–11, 2017.
[27] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1997. [Online]. Available: http://netlib.org/scalapack/slug/scalapack_slug.html/
[28] J. J. Dongarra, H. W. Meuer, E. Strohmaier et al., "Top500 supercomputer sites," Supercomputer, vol. 13, pp. 89–111, 1997.
[29] "TOP500 June 2020," https://top500.org/lists/top500/2020/06/, accessed: 2020-08-18.
[30] NVIDIA Nsight Visual Studio Edition, "3.0 User Guide," NVIDIA Corporation, 2013.
[31] E. D'Azevedo and J. Dongarra, "The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines," Concurrency: Practice and Experience, vol. 12, no. 15, pp. 1481–1493, 2000.
[32] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects," in Journal of Physics: Conference Series, vol. 180, no. 1. IOP Publishing, 2009, p. 012037.
[33] A. Haidar, S. Tomov, J. Dongarra et al., "Analysis and design techniques towards high-performance and energy-efficient dense linear solvers on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 12, pp. 2700–2712, 2018.
[34] M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, "SLATE: design of a modern distributed and accelerated linear algebra library," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19), 2019.