Compressing rank-structured matrices via randomized sampling
CCompressing rank-structured matrices via randomized sampling
Per-Gunnar Martinsson, Dept. of Applied Math., Univ. of Colorado Boulder
March 22, 2015
Abstract:
Randomized sampling has recently been proven a highly efficient tech-nique for computing approximate factorizations of matrices that have low numeri-cal rank. This paper describes an extension of such techniques to a wider class ofmatrices that are not themselves rank-deficient, but have off-diagonal blocks thatare; specifically, the classes of so called
Hierarchically Off-Diagonal Low Rank(HODLR) matrices and
Hierarchically Block Separable (HBS) matrices (a.k.a. “Hi-erarchically Semi-Separable (HSS)” matrices). Such matrices arise frequently innumerical analysis and signal processing, in particular in the construction of fastmethods for solving differential and integral equations numerically. These struc-tures admit algebraic operations (matrix-vector multiplications, matrix factoriza-tions, matrix inversion, etc.) to be performed very rapidly; but only once a data-sparse representation of the matrix has been constructed. How to rapidly computethis representation in the first place is much less well understood. The present paperdemonstrates that if an N × N matrix can be applied to a vector in O ( N ) time, andif the ranks of the off-diagonal blocks are bounded by an integer k , then the costfor constructing a HODLR representation is O ( k N (log N ) ) , and the cost forconstructing an HBS representation is O ( k N log N ) (assuming of course, thatthe matrix is compressible in the respective format). The point is that when legacycodes (based on, e.g., the Fast Multipole Method) can be used for the fast matrix-vector multiply, the proposed algorithm can be used to obtain the data-sparse rep-resentation of the matrix, and then well-established techniques for HODLR/HBSmatrices can be used to invert or factor the matrix. The proposed scheme is alsouseful in simplifying the implementation of certain operations on rank-structuredmatrices such as the matrix-matrix multiplication, low-rank update, addition, etc.1. I NTRODUCTION
A ubiquitous task in computational science is to rapidly perform linear algebraic operations involving verylarge matrices. Such operations typically exploit special “structure” in the matrix since the costs of standardtechniques tend to scale prohibitively fast with matrix size; for a general N × N matrix, it costs O ( N ) operations to perform a matrix-vector multiplication, O ( N ) operations to perform Gaussian elimination or toinvert the matrix, etc. A well-known form of “structure” in a matrix is sparsity. When at most a few entriesin each row of the matrix are non-zero (as is the case, e.g., for matrices arising upon the discretization ofdifferential equations, or representing the link structure of the World Wide Web) matrix-vector multiplicationscan be performed in O ( N ) operations instead of O ( N ) . The description “data-sparse” applies to a matrix thatmay be dense, but that shares the key characteristic of a sparse matrix that some linear algebraic operations,typically the matrix-vector multiplication, can to high precision be executed in fewer than O ( N ) operations(often in close to linear time).Several different formats for rank-structured matrices have been proposed in the literature. In this manuscript,we rely on the so called Hierarchically Off-Diagonal Low Rank (HODLR) format. This name was minted in [1],but this class of matrices has a long history. It is a special case of the H -matrix format introduced by Hackbuschand co-workers [19, 3], and was used explicitly in [30, Sec. 4]. The HODLR format is very easy to describe andeasy to use, but can lead to less than optimal performance due to the fact that the basis matrices used to representlarge blocks are stored explicitly, leading to a O ( k N log( N )) storage requirement for a HODLR matrix whose a r X i v : . [ m a t h . NA ] M a r ff-diagonal blocks have rank at most k . To attain linear storage requirements and arithmetic operations, onecan switch to a format that expresses all basis matrices hierarchically ; in other words, the basis matrices usedon one level are expressed implicitly in terms of the basis matrices on the next finer level. We sometimes saythat we use nested basis matrices. To be precise, we use the Hierarchically Block Separable (HBS) format thatwas described in [28, 15]. This format is closely related to the
Hierarchically Semi-Separable (HSS) [7, 34]format, and is also related to the H -matrix format [21, 4].The most straight-forward technique for computing a data-sparse representation of a rank-structured N × N matrix A is to explicitly form all matrix elements, and then to compress the off-diagonal blocks using, e.g., theSVD. This approach can be executed stably [35, 20], but it is often prohibitively expensive, with an O ( k N ) asymptotic cost, where k is the rank of the off-diagonal blocks (in the HSS-sense). Fortunately, there exist forspecific applications much faster methods for constructing HSS representations. When the matrix A approxi-mates a boundary integral operator in the plane, the technique of [28] computes a representation in O ( k N ) time by exploiting representation results from potential theory. In other environments, it is possible to useknown regularity properties of the off-diagonal blocks in conjunction with interpolation techniques to obtainrough initial factorizations, and then recompress these to obtain factorizations with close to optimal ranks[4, 29]. A particularly popular version of the “regularity + recompression” method is the so called AdaptiveCross Approximation technique which was initially proposed for H -matrices [2, 5, 23] but has recently beenmodified to obtain a representation of a matrix in a format similar to the HSS [12].The purpose of the present paper is to describe a fast and simple randomized technique for computing a datasparse representation of a rank-structured matrix which can rapidly be applied to a vector. The existence ofsuch a technique means that the advantages of the HODLR and HBS formats — fast inversion and factorizationalgorithms in particular — become available for any matrix that can currently be applied via the FMM, viaan H -matrix calculation, or by any other existing data-sparse format (provided of course that the matrix is inprinciple rank-structured). In order to describe the cost of the algorithm precisely, we introduce some notation:We let A be an N × N matrix whose off-diagonal blocks have maximal rank k , we let T mult denote the timerequired to perform a matrix-vector multiplication x (cid:55)→ A x or x (cid:55)→ A ∗ x , we let T rand denote the cost ofconstructing a pseudo random number from a normalized Gaussian distribution, and T flop denote the cost of afloating point operation. The computational cost T total of the algorithm for the HBS format then satisfies T total ∼ T mult × k log( N ) + T rand × ( k + p ) N log( N ) + T flop × k N log( N ) , (1)where p is a tuning parameter that balances computational cost against the probability of not meeting therequested accuracy. Setting p = 10 is often a good choice which leads to a “failure probability” of less that − , see Remark 1. In particular, if T mult is O ( N ) , then the method presented here has overall complexity O ( k N log( N )) . For the HODLR format, an additional factor of log N arises, cf. (13).The work presented is directly inspired by [27] (which is based on a 2008 preprint [26]), which presented asimilar algorithm with O ( k N ) complexity for the compression of an HBS matrix. This is better by a factorof log( N ) compared to the present work, but the algorithm of [27] has a serious limitation in that it requiresthe ability to evaluate O ( k N ) entries of the matrix to be compressed. 
This algorithm was later refined by Xia[33] and applied to the task of accelerated the “nested dissection” direct solver [13, 11] for elliptic PDEs to O ( N ) complexity. In 2011, by L. Lin, J. Lu, and L. Ying presented an alternative algorithm [25] that interactswith the matrix only via the matrix-vector multiplication, which makes the randomized compression idea muchmore broadly applicable than the algorithm in [27]. However, this came at the cost of requiring O ( k log( N )) matrix-vector multiplies (just like the present work). The algorithm proposed here is an evolution of [25] thatdoes away with a step of least-square fitting that led to a magnification of the sampling error resulting from therandomized approximation. Moreover, we here present a new strategy for enforcing the “nestedness” of thebasis matrices that is required in the HBS format. Remark 1.
The technique described in this paper utilizes a method for computing approximate low-rank factor-izations of matrices that is based on randomized sampling [31, 32, 22] . As a consequence, there is in principle non-zero risk that the method may fail to produce full accuracy in any given realization of the algorithm. Thisrisk can be controlled by the user via the choice of the tuning parameter p in (1), for details see Section 2.5.Moreover, unlike some better known randomized algorithms such as Monte Carlo, the accuracy of the output ofthe algorithms under discussion here is typically very high; in the environment described in the present paper,approximation errors of less than − are entirely typical.
2. P
RELIMINARIES
Notation.
Throughout the paper, we measure vectors in R n using their Euclidean norm. The default normfor matrices will be the corresponding operator norm (cid:107) A (cid:107) = sup (cid:107) x (cid:107) =1 (cid:107) Ax (cid:107) , although we will sometimes alsouse the Frobenius norm (cid:107) A (cid:107) Fro = (cid:16)(cid:80) i,j | A ( i, j ) | (cid:17) / .We use the notation of Golub and Van Loan [16] to specify submatrices. In other words, if B is an m × n matrixwith entries b ij , and I = [ i , i , . . . , i k ] and J = [ j , j , . . . , j (cid:96) ] are two index vectors, then we let B ( I, J ) denote the k × (cid:96) matrix B ( I, J ) = b i j b i j · · · b i j (cid:96) b i j b i j · · · b i j (cid:96) ... ... ... b i k j b i k j · · · b i k j (cid:96) . We let B ( I, :) denote the matrix B ( I, [1 , , . . . , n ]) , and define B (: , J ) analogously.The transpose of B is denoted B ∗ , and we say that a matrix U is orthonormal if its columns form an orthonormalset, so that U ∗ U = I .2.2. The QR factorization.
Any m × n matrix A admits a QR factorization of the form
A P = Q R ,m × n n × n m × r r × n (2)where r = min( m, n ) , Q is orthonormal, R is upper triangular, and P is a permutation matrix. The permutationmatrix P can more efficiently be represented via a vector J c ∈ Z n + of column indices such that P = I (: , J c ) where I is the n × n identity matrix. As a result, the factorization can be written as: A ( : , J c ) = Q R ,m × n m × r r × n The QR-factorization is often built incrementally via a greedy algorithm such as column pivoted Gram-Schmidt.This allows one to stop after the first k terms have been computed to obtain a “partial QR-factorization of A ”: A ( : , J c ) = (cid:2) Q (1) Q (2) (cid:3) (cid:20) R (1) R (2) (cid:21) ,m × n m × r r × n That is, taking the first k columns of Q and the first k rows of R , we can obtain the approximation: A ( : , J c ) ≈ Q k R k (3)We note that the partial factors Q k and R k can be obtained after k steps of the pivoted QR algorithm, withouthaving to compute the full matrices Q and R ..3. The singular value decomposition (SVD).
Let A denote an m × n matrix, and set r = min( m, n ) . Then A admits a factorization A = U Σ V ∗ ,m × n m × r r × r r × n (4)where the matrices U and V are orthonormal, and Σ is diagonal. We let { u i } ri =1 and { v i } ri =1 denote thecolumns of U and V , respectively. These vectors are the left and right singular vectors of A . The diagonalelements { σ j } rj =1 of Σ are the singular values of A . We order these so that σ ≥ σ ≥ · · · ≥ σ r ≥ . We let A k denote the truncation of the SVD to its first k terms, so that A k = (cid:80) kj =1 σ j u j v ∗ j . It is easily verified that (cid:107) A − A k (cid:107) spectral = σ k +1 , and that (cid:107) A − A k (cid:107) Fro = min( m,n ) (cid:88) j = k +1 σ j / , where (cid:107) · (cid:107) spectral denotes the operator norm and (cid:107) · (cid:107) Fro denotes the Frobenius norm. Moreover, the Eckart-Young theorem states that these errors are the smallest possible errors that can be incurred when approximating A by a matrix of rank k .2.4. The interpolatory decomposition (ID).
Let A denote an m × n matrix of rank k . Then A admits thefactorization A = A (: , J ) X ,m × n m × k k × n where J is a vector of indices marking k of the columns of A , and the k × n matrix X has the k × k identitymatrix as a submatrix and has the property that all its entries are bounded by in magnitude. In other words,the interpolative decomposition picks k columns of A as a basis for the column space of A and expresses theremaining columns in terms of the chosen ones. The ID can be viewed as a modification to so the called Rank-Revealing QR factorization [6]. It can be computed in a stable and accurate manner using the techniques of[18], as described in [8]. (Practical algorithms for computing the interpolative decomposition produce a matrix X whose elements slightly exceed in magnitude.)2.5. Randomized compression.
Let A be a given m × n matrix that can accurately be approximated by amatrix of rank k , and suppose that we seek to determine a matrix Q with orthonormal columns (as few aspossible) such that || A − Q Q ∗ A || is small. In other words, we seek a matrix Q whose columns form an approximate orthornomal basis (ON-basis)for the column space of A . This task can efficiently be solved via the following randomized procedure:(1) Pick a small integer p representing how much “over-sampling” we do. ( p = 10 is often good.)(2) Form an n × ( k + p ) matrix Ω whose entries are iid normalized Gaussian random numbers.(3) Form the product Y = A Ω .(4) Construct a matrix Q whose columns form an ON-basis for the columns of Y .Note that each column of the “sample” matrix Y is a random linear combination of the columns of A . Wewould therefore expect the algorithm described to have a high probability of producing an accurate result when p is a large number. It is perhaps less obvious that this probability depends only on p (not on m or n , or anyother properties of A ), and that it approaches extremely rapidly as p increases. In fact, one can show that thebasis Q determined by the scheme above satisfies (cid:107) A − Q Q ∗ A (cid:107) ≤ (cid:104) (cid:112) k + p · (cid:112) min { m, n } (cid:105) σ k +1 , (5)with probability at least − · p − p , see [22, Sec. 1.5]. The error bound (5) indicates that the error producedby the randomized sampling procedure can be larger than the theoretically minimal error σ k +1 by a factor of + 11 √ k + p · (cid:112) min { m, n } . This crude bound is typically very pessimistic; for specific situations sharperbounds have been proved, see [22]. Definition 1.
Let A be an m × n matrix, and let ε be a positive number. We say that an m × (cid:96) matrix Y an ε -spanning matrix for A , if (cid:107) A − YY † A (cid:107) ≤ ε . Informally, this means that the columns of Y span the columnspace A to within precision ε . Furthermore, we say that an m × (cid:96) matrix Q is an ε -basis matrix for A if itscolumns are orthonormal, and if || A − QQ ∗ A || ≤ ε . Functions for low-rank factorizations.
For future reference, we introduce two functions “ qr ” and“ svd ” that can operate in three difference modes. In the first mode, they produce the full (“economy size”)factorizations described in section 2.2 and 2.3, respectively, [ Q , R , J ] = qr ( A ) , [ U , D , V ] = svd ( A ) , and [ X , J ] = id ( A ) . In practice, we execute these factorizations using standard LAPACK library functions. In the second mode, weprovide an integer k and obtain partial factorizations of rank k , [ Q , R , J ] = qr ( A , k ) , [ U , D , V ] = svd ( A , k ) , and [ X , J ] = id ( A , k ) . Then the matrices Q , U , D , V have precisely k columns, and R has precisely k rows. In the third mode, weprovide a real number ε ∈ (0 , and obtain partial factorizations [ Q , R , J ] = qr ( A , ε ) , [ U , D , V ] = svd ( A , ε ) , and [ X , J ] = id ( A , ε ) , such that (cid:107) A ( : , J ) − QR (cid:107) ≤ ε (cid:107) A − UDV ∗ (cid:107) ≤ ε, and (cid:107) A − A (: , J ) X (cid:107) ≤ ε. In practice, for a small input matrix A , we execute mode 2 and mode 3 by calling the LAPACK routine for a full factorization (the ID can be obtained from the full QR), and then simply truncating the result. If A is large,then we use the randomized sampling technique of Section 2.5. Remark 2.
The differentiation between modes 2 and 3 for qr and svd is communicated by whether the secondargument is an integer (mode 2) or a real number ε ∈ (0 , (mode 3). This is slightly questionable notation,but it keeps the formulas clean, and hopefully does not cause confusion. A binary tree structure.
Both the HODLR and the HBS representations of an M × M matrix A arebased on a partition of the index vector I = [1 , , . . . , M ] into a binary tree structure. We let I form the rootof the tree, and give it the index , I = I . We next split the root into two roughly equi-sized vectors I and I so that I = I ∪ I . The full tree is then formed by continuing to subdivide any interval that holds more thansome preset fixed number m of indices. We use the integers (cid:96) = 0 , , . . . , L to label the different levels, with denoting the coarsest level. A leaf is a node corresponding to a vector that never got split. For a non-leafnode τ , its children are the two boxes σ and σ such that I τ = I σ ∪ I σ , and τ is then the parent of σ and σ . Two boxes with the same parent are called siblings . These definitions are illustrated in Figure 1. For anynode τ , let n τ denote the number of indices in I τ .2.8. The HODLR data sparse matrix format.
The
Hierarchically Off-Diagonal Low Rank (HODLR) prop-erty is, as the name implies, a condition that the off-diagonal blocks of a matrix A should have low (numerical)rank. To be precise, given a hierarchical partitioning of the index vector, cf. Section 2.7, a computationaltolerance ε , and a bound on the rank k , we require that for any sibling pair { α, β } , the corresponding block A α,β = A ( I α , I β ) should have ε -rank at most k . The tessellation resulting from the tree in Figure 1 is shown in Figure 2. We thenrepresent each off-diagonal block via a rank- k factorization A α,β = U α ˜ A α,β U β ,n α × n β n α × k k × k k × n β Level Level Level Level I = [1 , , . . . , I = [1 , , . . . , , I = [201 , , . . . , I = [1 , , . . . , , I = [101 , , . . . , , . . . I = [1 , , . . . , , I = [51 , , . . . , , . . . F IGURE
1. Numbering of nodes in a fully populated binary tree with L = 3 levels. The rootis the original index vector I = I = [1 , , . . . , . A = A , A , A , A , A , A , D A , A , D D A , A , D D A , A , D D A , A , D F IGURE
2. A HODLR matrix A tesselated in accordance with the tree in Figure 1. Everyoff-diagonal block A α,β that is marked in the figure should have ε -rank at most k .where U α and B β are orthonormal matrices. It is easily verified that if we required each leaf node to haveat most O ( k ) nodes, then it takes O ( k N log N ) storage to store all factors required to represent A , and amatrix-vector multiplication can be executed using O ( k N log N ) flops.For future reference, we define for a given HODLR matrix A a “level-truncated” matrix A ( (cid:96) ) as the matrixobtained by zeroing out any block associated with levels finer than (cid:96) . In other words, A (1) = , A , , A (2) = , A , A , , , A , , etc. Remark 3.
In our description of rank-structured matrices in sections 2.8 and 2.9, we generally assume thatthe numerical rank is the same number k for every off-diagonal block. In practice, we typically estimate the ε -rank for any specific off-diagonal block adaptively to save both storage and flop counts. .9. The HBS data sparse matrix format.
The HODLR format is simple to describe and to use, but is slightlyinefficient in that it requires the user to store for every node τ , the basis matrices U τ and V τ , which can be quitelong. The Hierarchically Block Separable (HBS) format is designed to overcome this problem by expressingthese matrices hierarchically.
To be precise, suppose that τ is a node which children { α, β } , and that we canfind a short matrix U τ such that U τ = (cid:20) U α U β (cid:21) U τ .n τ × k n τ × k k × k (6)The point is that if we have the long basis matrices U α and U β available for the children, then all we need tostore in order to be able to apply U τ is the short matrix U τ . This process can now be continued recursively.For instance, if { γ, δ } are the children of α , and { ν, µ } are the children of β , we assume there exist matrices U α and U β such that U α = (cid:20) U γ U δ (cid:21) U α ,n α × k n α × k k × k and U β = (cid:20) U µ U ν (cid:21) U β .n β × k n β × k k × k Then U τ can be expressed as U τ = U γ U δ U µ
00 0 0 U ν (cid:20) U α
00 U β (cid:21) U τ .n τ × k n τ × k k × k k × k By continuing this process down to the leaves, it becomes clear that we only need to store the “long” basismatrices for a leaf node (and they are not in fact long for a leaf node!); for every other node, it is sufficient tostore the small matrix U τ . The process for storing the long basis matrices V τ via small matrices V τ of size k × k is of course exactly analogous.For a general HODLR matrix, there is no guarantee that a relationship such as (6) need hold. We need toimpose an additional condition on the long basis matrices U τ and V τ . To this end, given a node τ , let usdefine a neutered row block as the off-diagonal block A ( I τ , I c τ ) , where I c τ is the complement of I τ within thevector [1 , , , . . . , N ] , cf. Figure 3. We then require that the columns of the long basis matrix U τ must spanthe columns of A ( I τ , I c τ ) . Observe that for a node τ with sibling σ , the sibling matrix A ( I τ , I σ ) is a submatrixof the neutered row block A ( I τ , I c τ ) since I σ ⊆ I c τ . This means that the new requirement of the basis matricesis more restrictive, and that typically the ranks required will be larger for any given precision. However, oncethe long basis matrices satisfy the more restrictive requirement, it is necessarily the case that (6) holds for somesmall matrix U τ .We analogously define the neutered column block for τ as the matrix A ( I c τ , I τ ) and require that the columns of V τ span the rows of A ( I c τ , I τ ) . Definition 2.
We say that a HODLR matrix A is an HBS matrix if, for every parent node τ with children { α, β } ,there exist “small” basis matrices U τ and V τ such that U τ = (cid:20) U α U β (cid:21) U τ ,n τ × k n τ × k k × k and V τ = (cid:20) V α V β (cid:21) V τ .n τ × k n τ × k k × k While standard practice is to require all basis matrices U τ and V τ to be orthonormal , we have found thatit is highly convenient to use the interpolatory decomposition (ID) to represent the off-diagonal blocks. Thekey advantage is that then the sibling interaction matrices which be submatrices of the original matrix. Thisimproves interpretability, and also slightly reduces storage requirements. I I (a) I I I (b)F IGURE
3. Illustration of the neutered row blocks for the nodes and , with parent . (a) Theblock A ( I , I c4 ) is marked in grey. Observe that A ( I , I c4 ) = [ A ( I , I ) , A ( I , I )] . (b) Theblock A ( I , I c5 ) is marked in grey. Observe that A ( I , I c5 ) = [ A ( I , I ) , A ( I , I )] . Definition 3.
We say that a HBS matrix A is in HBS-ID format if every basis matrix U τ and V τ contains a k × k identity matrix, and every siblin interaction matrix ˜ A α,β is a submatrix of A . In other words, there existsome index sets ˜ I α and ˆ I β such that ˜ A α,β = A ( ˜ I α , ˆ I β ) . The index sets ˜ I τ and ˆ I τ are called the row skeleton and column skeleton of box τ , respectively. We enforcethat these are “nested,” which is to say that if the children of τ are { α, β } , then ˜ I τ ⊆ ˜ I α ∪ ˜ I β and ˆ I τ ⊆ ˆ I α ∪ ˆ I β . Remark 4.
The straight-forward way to build a basis matrix U τ for a node τ is to explicitly form the cor-responding neutered row block A ( I τ , I c τ ) and then compress it (perform column pivtoed Gram-Schmidt onits columns to form an ON basis U τ , or perform column pivoted Gram-Schmidt on its rows to form the ID).However, suppose that we can somehow construct a smaller matrix Y τ of size n τ × (cid:96) with the property thatthe columns of Y τ span the columns of A ( I τ , I c τ ) . Then it would be sufficient to process the columns of Y τ to build a basis for A ( I τ , I c τ ) . For instance, if we orthonormalize the columns of Y τ to form a basis matrix U τ = qr ( Y τ ) , then the columns of U τ will necessarily form an orthonormal basis for the columns of A ( I τ , I c τ ) .The key point here is that one can often find such a matrix Y τ with a small number (cid:96) of columns. In [28] we usea representation theorem from potential theory to find such a matrix Y τ when A comes from the discretizationof a boundary integral equation of mathematical physics. In [27] we do this via randomized sampling, so that Y τ = A ( I τ , I c τ ) Ω for some Gaussian random matrix Ω . The main point of [27] is that this can be done byapplying all of A to a single random matrix with N × (cid:96) columns, where (cid:96) ≈ k . In the current manuscript, weuse a similar strategy, but we now require the application of A to a set of O ( k ) random vector for each level.
3. A
N ALGORITHM FOR COMPRESSING A
HODLR
MATRIX
The algorithm consists of a single sweep through the levels in the tree of nodes, starting from the root (theentire domain), and processing each level of successively smaller blocks at a time.In presenting the algorithm, we let G τ denote a random matrix of size n τ × r drawn from a Gaussian distribution.Blocked matrices are drawn with the blocks to be processed set in red type. To minimize clutter, blocks of thematrix that will not play a part in the current step are marked with a star (“*”) and may or may not be zero. rocessing level 0 (the root of the tree): Let { α, β } denote the children of the root node (in our standardordering, α = 2 and β = 3 ). Our objective is now to find low-rank factorizations of the off-diagonal blocks A = (cid:20) ∗ A αβ A αβ ∗ (cid:21) . To this end, we build two random matrix, each of size N × r , and defined by Ω = (cid:20) G α (cid:21) and Ω = (cid:20) β (cid:21) . Then construct the matrices of samples Y = AΩ = (cid:20) A αβ G β ∗ (cid:21) and Y = AΩ = (cid:20) ∗ A βα G α (cid:21) . Supported by the results on randomized sampling described in Section 2.5, we now know that it is almost certainthat the columns in the top block of Y will span the columns of A αβ . By orthonormalizing the columns of Y ( I α , :) , we therefore obtain an ON-basis for the column space of A αβ . In other words, we set U α = qr ( Y ( I α , :)) , and U β = qr ( Y ( I β , :)) , and then we know that U α and U β will serve as the relevant basis matrices in the HODLR representation of A .To compute V α , V β , B αβ and B βα we now need to form the matrices U ∗ α A αβ and U ∗ β A βα . To do this throughour “black-box” matrix-matrix multiplier, we form the new test matrices Ω = (cid:20) U α (cid:21) and Ω = (cid:20) U β (cid:21) . Then construct the matrices of samples Z = A ∗ Ω = (cid:20) A ∗ βα U β ∗ (cid:21) and Z = A ∗ Ω = (cid:20) ∗ A ∗ αβ U α (cid:21) . All that remains is now to compute QR factorizations [ V α , B βα ] = qr ( Z ( I α , :)) , and [ V β , B αβ ] = qr ( Z ( I β , :)) . (7) Remark 5.
At a slight increase in cost, one can obtain diagonal sibling interaction matrices B αβ and B βα . Wewould then replace the QR factorization in (7) by a full SVD [ V α , B βα , ˆ U β ] = svd ( Z ( I α , :)) , and [ V β , B αβ , ˆ U α ] = svd ( Z ( I β , :)) . (8) This step then requires an update to the long basis matrices for the column space: U α ← U α ˆ U α and U β ← U β ˆ U β . (9) Processing level 1:
Now that all off-diagonal blocks on level 1 have been computed, we can use this informationto compress the blocks on level 2. We let { α, β, γ, δ } denote the boxes on level 2 (in standard ordering, α = 4 , β = 5 , γ = 6 , δ = 7 ). Our objective is now to construct low-rank approximations to the blocks marked in red: A = ∗ A αβ * A βα ∗ * ∗ A γδ A δγ ∗ First, observe that A − A (1) = ∗ A αβ βα ∗ ∗ A γδ A δγ ∗ e then define two random test matrices, each of size N × r , via Ω = G α γ and Ω = β δ We compute the sample matrices via Y = AΩ − A (1) Ω = A αβ G β ∗ A γδ G δ ∗ and Y = AΩ − A (1) Ω = ∗ A βα G α ∗ A δγ G γ . (10)In evaluating Y and Y , we use the black-box multiplier to form AΩ and AΩ , and the compressed repre-sentation of A (1) obtained on the previous level to evaluate A (1) Ω and A (1) Ω . We now obtain orthonormalbases for the column spaces of the four sibling interaction matrices by orthonormalizing the pertinent blocks of Y and Y : U α = qr ( Y ( I α , :)) , U β = qr ( Y ( I β , :)) , U γ = qr ( Y ( I γ , :)) , U δ = qr ( Y ( I δ , :)) . It remains to construct the ON-bases for the corresponding row spaces and the compressed sibling interactionmatrices. To this end, we form two new test matrices, both of size N × r , via Ω = U α U γ and Ω = U β U δ . Then the sample matrices are computed via Z = A ∗ Ω − (cid:0) A (1) (cid:1) ∗ Ω = A ∗ βα U β ∗ A ∗ δγ U δ ∗ and Z = A ∗ Ω − (cid:0) A (1) (cid:1) ∗ Ω = ∗ A ∗ αβ U α ∗ A ∗ γδ U γ . We obtain diagonal compressed sibling interaction matrices by taking a sequence of dense SVDs of the relevantsub-blocks, cf. (8), [ V α , B βα , ˆ U β ] = svd ( Z ( I α , :)) , [ V β , B αβ , ˆ U α ] = svd ( Z ( I β , :)) , [ V γ , B δγ , ˆ U δ ] = svd ( Z ( I γ , :)) , [ V δ , B γδ , ˆ U γ ] = svd ( Z ( I δ , :)) . Finally update the bases for the column-spaces, cf (9), U α ← U α ˆ U α , U β ← U β ˆ U β , U γ ← U γ ˆ U γ , U δ ← U δ ˆ U δ . Processing levels 2 through L − : The processing of every level proceeds in a manner completely analogousto the processing of level . The relevant formulas are given in Figure 4. Processing the leaves:
Once all L levels have been traversed using the procedure described, compressed rep-resentations of all off-diagonal blocks will have been computed. At this point, all that remains is to extract theiagonal blocks. We illustrate the process for a simplistic example of a tree with only L = 2 levels (beyond theroot). Letting { α, β, γ, δ } denote the leaf nodes, our task is then to extract the blocks marked in red: A = D α ∗ * ∗ D β * D γ ∗∗ D δ = D α β D γ
00 D δ + A (2) . Since the diagonal blocks are not rank-deficient, we will extract them directly, without using randomized sam-pling. To describe the process, we assume at first (for simplicity) that every diagonal block has the same size, m × m . We then choose a test matrix of size N × m Ω = I m I m I m I m , and trivially extract the diagonal blocks via the sampling Y = AΩ − A (2) Ω = D α D β D γ D δ . The diagonal blocks can then be read off directly from Y .For the general case where the leaves may be of different sizes, we form a test matrix Ω of size N × m , where m = max { n τ : τ is a leaf } , such that Ω ( I τ , :) = (cid:2) I n τ zeros ( n τ , m − n τ ) (cid:3) . In other words, we simply pad a few zero columns at the end.The entire algorithm is summarized in Figure 4.3.1.
Asymptotic complexity.
Let L denote the number of levels in the tree. We find that L ∼ log N . Let T flop denote the time required for a flop, let T mult denote the time required to apply A or A ∗ to a vector, and let T A ( (cid:96) ) denote the time required to apply A ( (cid:96) ) to a vector. Then the cost of processing level (cid:96) is T (cid:96) ∼ T mult × k + T flop × (cid:96) k N (cid:96) + T A ( (cid:96) ) × k, (11)since on level (cid:96) there are (cid:96) blocks to be processed, and each “long” matrix at this level has height − (cid:96) N andwidth O ( k ) . Further, we find that the cost of applying A ( (cid:96) ) to a single vector is T A ( (cid:96) ) ∼ T flop × (cid:96) (cid:88) j =0 j k N j ∼ T flop × (cid:96) k N. (12)Combining (11) and (12) and summing from (cid:96) = 0 to (cid:96) = L , we find (using that L ∼ log N ) T compress ∼ T mult × k log N + T flop × k N (cid:0) log N (cid:1) . (13) uild compressed representations of all off-diagonal blocks. loop over levels (cid:96) = 0 : ( L − Build the random matrices Ω and Ω . Ω = zeros ( n, r ) Ω = zeros ( n, r ) loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . Ω ( I α , :) = randn ( n α , r ) Ω ( I β , :) = randn ( n β , r ) end loop Apply A to build the samples for the incoming basis matrices. Y = AΩ − A ( (cid:96) ) Ω Y = AΩ − A ( (cid:96) ) Ω Orthonormalize the sample matrices to build the incoming basis matrices. loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . U α = qr ( Y ( I α , :)) . U β = qr ( Y ( I β , :)) . Ω ( I α , :) = U α Ω ( I β , :) = U β end loop Apply A ∗ to build the samples for the outgoing basis matrices. Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Take local SVDs to build incoming basis matrices and sibling interaction matrices.We determine the actual rank, and update the U ∗ basis matrices accordingly. loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . [ V α , B βα , ˆ U β ] = svd ( Z ( I α , :) , ε ) . [ V β , B αβ , ˆ V α ] = svd ( Z ( I β , :) , ε ) . U β ← U β ˆ U β . U α ← U α ˆ U α . end loopend loop Extract the diagonal matrices. n max = max { n τ : τ is a leaf } Ω = zeros ( N, n max ) loop over leaf boxes τ Ω ( I τ , n τ ) = eye ( n τ ) . end loop Y = AΩ − A ( L ) Ω loop over leaf boxes τ D τ = Y ( I τ , n τ ) . end loop F IGURE
4. Randomized compression of a HODLR matrix.. A
N ALGORITHM FOR COMPRESSING AN
HBS
MATRIX
The algorithm for computing a compressed representation of an HBS matrix is a slight variation of the algorithmfor a HODLR matrix described in Section 3. The key difference is that the long basis matrices U α and V α associated with any node now must satisfy a more rigorous requirement. We will accomplish this objective byconstructing for every node α , two new “long” sampling matrices Y α and Z α , each of size n α × r that helptransmit information from the higher levels to the node α . The presentation will start in Section 4.1 with adescription of the modification to the scheme of 3 required to enforce the more rigorous requirement. In thisinitial description, we will assume that all four “long” matrices associated with a node ( U τ , V τ , Y τ , Z τ ) arestored explicitly, resulting in an O ( k N log N ) memory requirement, just like for the HODLR algorithm. InSection 4.2 we show that while these “long” matrices do need to be temporarily built and processed, they can bestored implicitly, which will allow the algorithm to use only O ( k N ) memory. Finally, Section 4.3 will describehow to construct a representation using interpolatory decompositions in all low-rank approximations.4.1. A basic scheme for compressing an HBS matrix.
Throughout this section, let α denote a node with aparent τ that is not the root, and with a sibling β . We will first describe the process for building the long basismatrices { U τ } τ . To this end, recall that the difference in requirements on the long basis matrices in the twoframeworks is as follows:HODLR framework: The columns of U α need to span the columns of A ( I α , I β ) HBS framework: The columns of U α need to span the columns of A ( I α , I c α ) . The assertion that the requirements on a basis in the HBS framework is more stringent follows from the factthat I β ⊆ I c α , and that I β is in general much smaller than I c α . Now note that A ( I α , I c α ) = (cid:2) A ( I α , I β ) A ( I α , I c τ ) (cid:3) P . (14)In (14), the matrix P is a permutation matrix whose effect is to reorder the columns. For purposes of construct-ing a basis for the column space, the matrix P can be ignored. The idea is now to introduce a new samplingmatrix Y α of size n α × r that encodes all the information that needs to be transmitted from the parent τ .Specifically, we ask that:The columns of Y α span the columns of A ( I α , I c τ ) (to within precision ε ).Then, when processing box α , we will sample A ( I α , I β ) using a Gaussian matrix G β of size n β × r just as inthe HODLR algorithm. In the end, we will build U α by combining the two sets of samples [ U α , D α , ∼ ] = svd (cid:0)(cid:2) A ( I α , I β ) G β , Y α (cid:3) , r (cid:1) . In other words, we take a matrix (cid:2) A ( I α , I β ) G β , Y α (cid:3) of size n α × r and extract its leading r singular com-ponents (the trailing r components are discarded). All that remains is to build the sample matrices Y γ and Y δ that transmit information to the children { γ, δ } of α . To be precise, let J γ and J δ denote the local relative indexvectors, so that I γ = I α ( J γ ) and I δ = I α ( J δ ) . Then set Y γ = U ( J γ , :) D α and Y δ = U ( J δ , :) D α . The process for building the long basis matrices { V α } α is entirely analogous to the process described forbuilding the { U α } α matrices. We first recall that the difference between then HODLR and the HBS frameworksis as follows: HODLR framework: The columns of V α need to span the rows of A ( I β , I α ) HBS framework: The columns of V α need to span the rows of A ( I c α , I α ) . oop over levels (cid:96) = 0 : ( L − Build the random matrices Ω and Ω . Ω = zeros ( n, r ) Ω = zeros ( n, r ) loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . Ω ( I α , :) = randn ( n α , r ) Ω ( I β , :) = randn ( n β , r ) end loop Apply A to build the samplesfor the incoming basis matrices. Y = AΩ − A ( (cid:96) ) Ω Y = AΩ − A ( (cid:96) ) Ω Orthonormalize the sample matrices tobuild the incoming basis matrices. loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . if ( τ is the root) Y loc1 = Y ( I α , :) Y loc2 = Y ( I β , :) else Y loc1 = [ Y ( I α , :) , Y τ ( J α , :)] Y loc2 = [ Y ( I β , :) , Y τ ( J β , :)] end if [ U α , y α , ∼ ] = svd ( Y loc1 , r ) . [ U β , y β , ∼ ] = svd ( Y loc2 , r ) . Y α = U α diag ( y α ) . Y β = U β diag ( y β ) . Ω ( I α , :) = U α Ω ( I β , :) = U β U τ = (cid:20) U ∗ α U τ ( J α , :) U ∗ β U τ ( J β , :) (cid:21) . end loop Apply A ∗ to build the samplesfor the outgoing basis matrices. Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Take local SVDs to build incoming basismatrices and sibling interaction matrices. 
loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . if ( τ is the root) Z loc1 = Z ( I α , : ) Z loc2 = Z ( I β , : ) else Z loc1 = [ Z ( I α , : ) , Z τ ( J α , :)] Z loc2 = [ Z ( I β , : ) , Z τ ( J β , :)] end if [ V α , b , X ] = svd ( Z loc1 , r ) . [ V β , b , X ] = svd ( Z loc2 , r ) . Z α = V α diag ( b ) Z β = V β diag ( b ) B αβ = X (1 : r, :) diag ( b ) B βα = X (1 : r, :) diag ( b ) V τ = (cid:20) V ∗ α V τ ( J α , :) V ∗ β V τ ( J β , :) (cid:21) . end loopend loop Extract the diagonal matrices. n max = max { n τ : τ is a leaf } Ω = zeros ( N, n max ) loop over leaf boxes τ Ω ( I τ , n τ ) = eye ( n τ ) . end loop Y = AΩ − A ( L ) Ω loop over leaf boxes τ D τ = Y ( I τ , n τ ) . end loop F IGURE
5. A basic scheme for compressing an HBS matrix.With P again denoting a permutation matrix, we have A ( I c α , I α ) = P (cid:20) A ( I β , I α ) A ( I c τ , I α ) (cid:21) . It follows that the role that was played by Y α in the construction of U α is now played by a sampling matrix Z α of size n α × r that satisfiesThe columns of Z α span the rows of A ( I c τ , I α ) .The algorithm for computing the HBS representation of a matrix is now given in Figure 5.4.2. A storage efficient scheme for compressing an HBS matrix.
The scheme described in Section 4.1assumes that all “long” basis and spanning matrices ( U τ , V τ , Y τ , Z τ ) are stored explicitly, resulting in an O ( k N log N ) storage requirement. We will now demonstrate that in fact, only O ( k N ) is required.irst, observe that in the HBS framework, we only need to keep at hand the long basis matrices U τ and V τ fornodes τ on the level (cid:96) that is currently being processed. In the HODLR compression algorithm in Figure 4, weneeded the long basis matrices associated with nodes on coarser levels in order to apply A ( (cid:96) ) , but in the HBSframework, all we need in order to apply A ( (cid:96) ) is the long basis matrices { U τ , V τ } on the level currently beingprocessed , and then only the short basis matrices U τ and V τ for any box τ on a level coarser than (cid:96) , cf. thealgorithm in Figure 15.Next, observe that the long “spanning” matrices Y τ and Z τ that were introduced in Section 4.1 do not need tobe stored explicitly either. The reason is that these matrices can be expressed in terms of the long basis matrices U τ and V τ . In fact, in the algorithm in Figure 4, we compute Y τ and Z τ via the relations Y τ = U τ diag ( y τ ) and Z τ = V τ diag ( z τ ) . Since the long basis matrices U τ and V τ are available during the processing of level (cid:96) , we only need to storethe short vectors y τ and z τ , and can then construct Y τ and Z τ when they are actually needed.The memory efficient algorithm resulting from exploiting the observations described in this section is summa-rized in Figure 6.4.3. Adaptive rank determination and conversion to the HBS-ID format.
The schemes presented in Sec-tions 4.1 and 4.2 do not adaptively determine the ranks of the off-diagonal blocks being compressed. Instead,every block is factored using a preset uniform rank (cid:96) that must be picked to be larger than any actual numericalrank encountered. It is possible to incorporate adaptive rank determination in to the scheme, but we found iteasier to perform this step in a second sweep that travels through the tree in the opposite direction: from smallerboxes to larger. In this second sweep, we also convert the standard HBS format to the HBS-ID format, whichleads to a slight improvement in storage requirements, and improves interpretability of the sibling interactionmatrices, as discussed in Section 2.9 and Definition 3.The conversion to the HBS-ID is a “post-processing” step, so in what follows, we assume that the compressionalgorithm in Figure 6 has already been executed so that both the HBS basis matrices U τ and V τ , and the samplematrix Y τ and Z τ are available for every node (these are stored implicitly in terms of the short basis matrices U τ and V τ , as described in Section 4.2).The first step is to sweep over all leaves τ in the tree. For each leaf, we now have available spanning matrices Y τ and Z τ whose columns span the columns of A ( I τ , I c τ ) and A ( I c τ , I τ ) ∗ , respectively. In order to find a setof spanning rows ˜ I in τ of A ( I τ , I c τ ) and a set of spanning columns ˜ I out τ of A ( I c τ , I τ ) ∗ , all we need to do is tocompute interpolatory decompositions (IDs) of the small matrices Y τ and Z τ , cf. Remark 4, [ T in , J in ] = id ( Y ∗ τ , ε ) , and [ T out , J out ] = id ( Z ∗ τ , ε ) . (15)In equation (15), we give the computational tolerance ε as an input parameter. This reveals the “true” ε -ranks k in and k out . Then the skeleton index vectors for τ are given by ˜ I in τ = I τ ( J in (1 : k in )) , and ˜ I out τ = I τ ( J out (1 : k out )) . Now define the subsampled basis matrices U samp τ and V samp τ via U samp τ = U ( J in (1 : k in ) , :) , and V samp τ = V ( J out (1 : k out ) , :) . (16)Once all leaves have been processed in this manner, we can determine the sibling interaction matrices in theHBS-ID representation. Let { α, β } denote a sibling pair consisting of two leaves. First observe that, bydefinition, B skel α,β = A ( ˜ I in α , ˜ I out β ) . (17)Next, recall that A ( I α , I β ) = U α B α,β V ∗ β . (18) oop over levels (cid:96) = 0 : ( L − Build the random matrices Ω and Ω . Ω = zeros ( n, r ) Ω = zeros ( n, r ) loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . Ω ( I α , :) = randn ( n α , r ) Ω ( I β , :) = randn ( n β , r ) end loop Apply A to build the samplesfor the incoming basis matrices. Y = AΩ − A ( (cid:96) ) Ω Y = AΩ − A ( (cid:96) ) Ω Orthonormalize the sample matrices tobuild the incoming basis matrices. loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . if ( τ is the root) Y loc1 = Y ( I α , :) Y loc2 = Y ( I β , :) else Y loc1 = [ Y ( I α , :) , U τ ( J α , :) diag ( y τ )] Y loc2 = [ Y ( I β , :) , U τ ( J β , :) diag ( y τ )] end if [ U α , y α , ∼ ] = svd ( Y loc1 , r ) . [ U β , y β , ∼ ] = svd ( Y loc2 , r ) . Ω ( I α , :) = U α Ω ( I β , :) = U β U τ = (cid:20) U ∗ α U τ ( J α , :) U ∗ β U τ ( J β , :) (cid:21) .Delete U τ . end loop Apply A ∗ to build the samplesfor the outgoing basis matrices. 
Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Z = A ∗ Ω − (cid:0) A ( (cid:96) ) (cid:1) ∗ Ω Take local SVDs to build incoming basismatrices and sibling interaction matrices. loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . if ( τ is the root) Z loc1 = Z ( I α , : ) Z loc2 = Z ( I β , : ) else Z loc1 = [ Z ( I α , : ) , V τ ( J α , :) diag ( z τ )] Z loc2 = [ Z ( I β , : ) , V τ ( J β , :) diag ( z τ )] end if [ V α , b , X ] = svd ( Z loc1 , r ) . [ V β , b , X ] = svd ( Z loc2 , r ) . z α = b z β = b B αβ = X (1 : r, :) diag ( b ) B βα = X (1 : r, :) diag ( b ) V τ = (cid:20) V ∗ α V τ ( J α , :) V ∗ β V τ ( J β , :) (cid:21) .Delete V τ . end loopend loop Extract the diagonal matrices. n max = max { n τ : τ is a leaf } Ω = zeros ( N, n max ) loop over leaf boxes τ Ω ( I τ , n τ ) = eye ( n τ ) . end loop Y = AΩ − A ( L ) Ω loop over leaf boxes τ D τ = Y ( I τ , n τ ) . end loop F IGURE
6. A storage efficient algorithm for compressing an HBS matrix.Combining (16), (17), and (18), we find that B skel α,β = U samp α B α,β ( V samp β ) ∗ . Once all leaves have been processed, we next proceed to the parent nodes. We do this by transversing the treein the other direction, going from smaller to larger boxes. When a box τ is processed, its children { α, β } havealready been processed. The key observation is now that that if we set ˆ I in τ = ˜ I in α ∪ ˜ I in β and ˆ I out τ = ˜ I out α ∪ ˜ I out β ,then these index vectors form skeletons for τ . These skeletons are inefficient, but by simply compressing thecorresponding rows of Y τ and Z τ , we can build the skeletons and the interpolation matrices associated with τ . Due to the self-similarity between levels in the HBS representation, the compression is entirely analogousto the compression of a leaf, with the only difference that the role played by the basis matrices U τ and V τ fora leaf, are now played by the sub-sampled matrices U tmp and V tmp which represent the restriction of U τ and xecute the compression algorithm described in Figure 6. loop over leaves τ Y tmp = U τ diag ( y τ ) Z tmp = V τ diag ( z τ )[ T in , J in ] = id ( Y ∗ tmp , ε )[ T out , J out ] = id ( Z ∗ tmp , ε ) U samp τ = U τ ( J in (1 : k in ) , :) . V samp τ = V τ ( J out (1 : k out ) , :) . end loop loop over levels (cid:96) = ( L −
1) : ( −
1) : 1 loop over boxes τ on level (cid:96) Let { α, β } denote the children of box τ . U tmp = (cid:20) U samp α
00 U samp β (cid:21) U τ . V tmp = (cid:20) V samp α
00 V samp β (cid:21) V τ . Y tmp = U tmp diag ( y τ ) Z tmp = V tmp diag ( z τ )[ T in , J in ] = id ( Y ∗ tmp , ε )[ T out , J out ] = id ( Z ∗ tmp , ε ) U samp τ = U tmp ( J in (1 : k in ) , :) . V samp τ = V tmp ( J out (1 : k out ) , :) . B skel α,β = U samp α B α,β ( V samp β ) ∗ . B skel β,α = U samp β B β,α ( V samp α ) ∗ . end loopend loop Let { α, β } denote the children of the root. B skel α,β = U samp α B α,β ( V samp β ) ∗ . B skel β,α = U samp β B β,α ( V samp α ) ∗ . F IGURE
7. An algorithm for computing the HBS-ID representation of a given matrix A . Thisalgorithm adaptively determines the ranks of the off-diagonal blocks of A . V τ to the index rows and columns indicated by the index vectors ˆ I in τ and ˆ I out τ , respectively. The entire processis summarized in Figure 7.4.4. Asymptotic complexity.
The asymptotic complexity for the HBSID algorithm is very similar to that forthe HODLR algorithm. The key difference is that since A ( (cid:96) ) is now applied using nested basis, we find T A ( (cid:96) ) ∼ T flop × (cid:96) k N (cid:96) + (cid:96) − (cid:88) j =0 j k ∼ T flop × k N. In other words, the asymptotic complexity of applying T A ( (cid:96) ) is now less by a factor of O ( (cid:96) ) . In consequence, T compress ∼ T mult × k log N + T flop × k N log N. (19)5. N UMERICAL EXPERIMENTS
In this section, we present results from numerical experiments that substantiate claims on asymptotic complex-ity made in sections 3.1 and 4.4, and demonstrate that the practical execution time is very competitive (in otherwords, that the scaling constants suppressed in the asymptotic analysis are moderate). We investigate four dif-ferent test problems: In Section 5.1 we apply the randomized compression schemes to a discretized boundaryintegral operator for which other compression techniques are already available. This allows us to benchmarkthe new algorithms and verify their accuracy. In Section 5.2 we demonstrate how the proposed scheme can beused to form a compressed representation of a product of two compressed matrices, thus demonstrated a way ofcircumventing the need for a complex and time-consuming structured matrix-matrix multiplication. In Section5.3 we apply the schemes to a potential evaluation problem where we use the Fast Multipole Method (FMM) asthe “black-box” matrix-vector multiplication scheme (note that the data sparse format implicit in the FMM ismuch more cumbersome to invert than the HODLR and HBSID formats). In Section 5.4 we apply the schemeto compress large dense matrices that arise in the classical “nested dissection” or “multifrontal” direct solversor the sparse matrices arising from finite element or finite difference discretization of elliptic PDEs. The abilityto efficiently manipulate such matrices allows for the construction of O ( N ) complexity direct solvers for theassociated sparse linear systems.For each of the four test problems, we compare two different techniques for computing a data-sparse repre-sentation of A : (1) The randomized technique for computing an HODLR-representation in Figure 4. (2) Therandomized technique for computing an HBSID-representation in Figure 7. For each technique, we report thefollowing quantities: N The number of DOFs (so that A is of size N × N ). (cid:96) The number of random vectors used at each level ( (cid:96) must be larger than the maximal rank). k The largest rank encountered in the compression. N matvec The number of applications of A required (so that N matvec = ( L + 1) × (cid:96) ∼ log( N ) × (cid:96) ). T compress The time required for compression (in seconds). T net The time required for compression, excluding time to apply A and A ∗ (in seconds). T app The time required for applying the compressed matrix to a vector (in seconds). M The amount of memory required to store A (in MB).The reason that we report the time T net (that does not count time spent in the black-box matrix-vector multiplier)is to validate our claims (13) and (19) regarding the asymptotic complexity of the method. To summarize, ourpredictions are, for the HODLR algorithm T net ∼ N (log N ) , T app ∼ N log N, M ∼ N log( N ) , and for the HBSID algorithm T net ∼ N log N, T app ∼ N, M ∼ N. In addition to the timings, we computed a randomized estimate E of the compression error, computed asfollows: We drew ten vectors { ω i } i =1 of unit length from a uniform distribution on the unit sphere in R N .Then E is defined via E = max ≤ i ≤ || A ω i − A compressed ω i |||| A ω i || . Compressing a Boundary Integral Equation.
Our first numerical example concerns compression of adiscretized version of the Boundary Integral Equation (BIE) q ( x ) + (cid:90) Γ ( x − y ) · n ( y )4 π | x − y | q ( y ) ds ( y ) = f ( x ) , x ∈ Γ , (20)where Γ is the simple curve shown in Figure 8, and where n ( y ) is the outwards pointing unit normal of Γ at y . The BIE (20) is a standard integral equation formulation of the Laplace equation with boundary condition f on the domain interior to Γ . We discretize the BIE (20) using the Nystr¨om method on N equispaced points on Γ , with the Trapezoidal rule as the quadrature. Note that the kernel in (20) is smooth, so the Trapezoidal rulehas exponential convergence. This problem is slightly artificial in that only about 200 points are needed toattain full double precision accuracy in the discretization. We include it for bench-marking purposes to verifythe scaling of the proposed method.To be precise, the matrix A used in this numerical experiment is itself an HBS representation of the matrixresulting from discretization of (20), computed using the technique of [28]. To minimize the risk of spuriouseffects due to both the “exact” and the computed A being HBS representations, we used a much higher precisionin computing the “exact” A , and also a shifted tree structure to avoid having the compressed blocks of ourreference A align with the compressed blocks constructed by the randomzed sampling algorithms. Verify! Check.
IGURE
8. Contour Γ on which the BIE (20) in Section 5.1 is defined. N N matvec T compress T net T app M M/N E k (sec) (sec) (sec) (MB) (reals)400 3 x 35 0.065 0.035 0.001 0.5 157.7 4.68e-11 28800 4 x 35 0.107 0.033 0.001 1.0 169.9 3.67e-11 281600 5 x 35 0.267 0.081 0.003 2.2 179.1 3.09e-11 283200 6 x 35 0.617 0.188 0.006 4.5 186.2 2.57e-11 286400 7 x 35 1.458 0.456 0.011 9.5 194.0 1.99e-11 2912800 8 x 35 3.453 1.198 0.024 19.6 200.5 1.80e-11 3025600 9 x 35 8.065 2.998 0.047 39.7 203.1 1.74e-11 2951200 10 x 35 18.362 7.140 0.099 82.1 210.3 1.14e-11 30102400 11 x 35 41.985 17.366 0.204 166.6 213.2 1.24e-11 30T
ABLE
1. Compression to the HODLR format using Algorithm 4 of the double layer integralequation described in Section 5.1. Here ε = 10 − and (cid:96) = 35 . N N matvec T compress T net T app M M/N E k (sec) (sec) (sec) (MB) (reals)400 3 x 35 0.098 0.068 0.001 0.4 127.7 1.40e-09 25800 4 x 35 0.163 0.090 0.002 0.7 113.6 1.30e-09 251600 5 x 35 0.381 0.193 0.005 1.3 104.3 1.42e-09 243200 6 x 35 0.840 0.411 0.009 2.4 98.3 1.32e-09 246400 7 x 35 1.881 0.878 0.019 4.6 94.2 1.56e-09 2412800 8 x 35 4.258 1.969 0.037 8.9 91.3 2.04e-09 2325600 9 x 35 9.262 4.205 0.076 17.6 90.3 1.62e-09 2351200 10 x 35 20.431 9.238 0.153 34.4 87.9 1.44e-09 23102400 11 x 35 45.732 21.039 0.310 68.3 87.5 1.61e-09 23T
ABLE
2. Compression to the HBSID format using Algorithm 7 of the double layer integralequation described in Section 5.1. Here ε = 10 − and (cid:96) = 35 .For this experiment, we bench-mark the new compression algorithms by comparing their speed, accuracy, andmemory requirements to the compression technique based on potential theory described in [28], run at thesame precision as the randomized compression scheme. The results are presented in tables 1, 2, and 3, andsummarized in Figure 9.5.2. Operator multiplication.
We next apply the proposed scheme to compute the Neumann-to-Dirichletoperator for the boundary value problem (cid:40) − ∆ u ( x ) = 0 x ∈ Ω ,∂ n u ( x ) = g ( x ) x ∈ Γ , (21) T compress t app M M/N E k (sec) (sec) (MB) (reals)400 0.053 0.001 0.4 128.3 1.51e-09 38800 0.104 0.002 0.7 119.5 2.57e-09 371600 0.194 0.005 1.4 113.6 3.57e-09 353200 0.352 0.009 2.7 109.9 4.58e-09 346400 0.674 0.019 5.3 108.2 7.42e-09 3312800 1.368 0.041 10.5 107.1 2.02e-08 3125600 2.664 0.079 20.8 106.7 1.89e-08 2951200 5.308 0.162 41.5 106.3 2.51e-08 28102400 10.758 0.327 82.9 106.1 3.95e-08 25T
ABLE
3. Compression to the HBSID format of the double layer integral equation describedin Section 5.1, using the technique based on potential theory of [28] with ε = 10 − .where Γ is again the contour shown in Figure 8, where Ω is the domain exterior to Γ , and where n is the unitnormal vector pointing in the outwards direction from Γ . With u the solution of (21), let f denote the restrictionof u to Γ , and let T denote the linear operator T : g (cid:55)→ f, known as the Neumann-to-Dirichlet (NtD) operator .It is well-known (see Remark 6) that T can be built explicitly as the product T = S (cid:18) I + D ∗ (cid:19) − , (22)where S is the single-layer operator [ Sq ]( x ) = (cid:82) Γ − π log | x − y | q ( y ) ds ( y ) , and where D ∗ is the adjoint ofthe double-layer operator [ D ∗ q ]( x ) = (cid:82) Γ n ( x ) · ( x − y )2 π | x − y | q ( y ) ds ( y ) . We form discrete approximations S and D ∗ to S and D ∗ , and compute ( I + D ∗ ) − using the techniques in [15], with th order Kapur-Rokhlin quadratureused to discretize the singular integral operator S . Then we can evaluate a discrete approximation T to T via T = S (cid:0) I + D ∗ (cid:1) − . For this example, we evaluated an additional error metric by testing the computed NtD operator to an exactsolution u exact to (21). The function u exact is given as the potential from a collection of five randomly placedcharges inside Γ , and then f exact and g exact are simply the evaluations of u exact and its normal derivative on thequadrature nodes on Γ . Then the new error measure is given by E pot = (cid:107) f exact − T approx g exact (cid:107) max (cid:107) f exact (cid:107) max , where (cid:107) · (cid:107) max is the maximum norm, and where T approx is the compressed representation of T determined bythe randomized sampling scheme proposed.The numerical results are presented in tables 4 and 5, and summarized in Figure 10. Remark 6.
Remark 6. The formula (22) for the NtD operator is derived as follows: we first look for a solution to (21) of the form u = Sq. It can then be shown that q must satisfy ((1/2) I + D^∗) q = g. Solving for q and using f = Sq, we obtain (22).
FIGURE 9. Visualization of the results presented in Tables 1, 2, and 3, pertaining to the example in Section 5.1. (Log–log plots of compression time T_compress, apply time T_apply, memory per point, and error E against N, comparing HODLR, HODLR (net), HBSID, HBSID (net), and GREEN, with a linear-scaling reference.)

    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          E_pot      k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     2 x 60     0.069        0.030    0.001    0.6      200.3     2.06e-11   3.87e-07   32
    800     3 x 60     0.123        0.034    0.001    1.5      237.7     4.35e-11   8.22e-09   35
    1600    4 x 60     0.323        0.088    0.002    3.5      283.1     4.00e-11   3.06e-09   40
    3200    5 x 60     0.839        0.236    0.005    8.0      327.7     7.18e-11   9.74e-09   44
    6400    6 x 60     2.093        0.657    0.011    18.1     370.0     1.08e-10   8.37e-09   45
    12800   7 x 60     5.108        1.734    0.024    40.7     417.0     1.61e-10   1.76e-08   49
    25600   8 x 60     12.302       4.516    0.055    89.3     457.2     3.09e-10   3.41e-08   52
    51200   9 x 60     29.380       11.636   0.120    195.8    501.2     5.44e-10   7.22e-08   54
    102400  10 x 60    67.879       29.072   0.284    426.0    545.3     1.11e-09   3.39e-07   55

TABLE 4. Compression to the HODLR format using Algorithm 4 of the NtD operator described in Section 5.2. Here ε = 10^− and ℓ = 60.

    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          E_pot      k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     2 x 60     0.093        0.062    0.001    0.6      211.7     3.48e-10   3.86e-07   29
    800     3 x 60     0.190        0.103    0.001    1.2      198.7     2.70e-10   8.25e-09   31
    1600    4 x 60     0.482        0.245    0.003    2.4      192.7     5.32e-10   5.42e-09   33
    3200    5 x 60     1.154        0.557    0.006    4.6      189.9     1.57e-09   1.01e-08   36
    6400    6 x 60     2.708        1.269    0.012    9.1      186.5     2.02e-09   1.35e-08   37
    12800   7 x 60     6.310        2.906    0.023    18.1     185.4     3.98e-09   2.51e-08   40
    25600   8 x 60     14.495       6.511    0.048    35.6     182.3     6.80e-09   4.72e-08   41
    51200   9 x 60     32.906       15.033   0.094    70.8     181.4     1.47e-08   8.04e-08   42
    102400  10 x 60    82.238       37.772   0.189    139.8    178.9     2.51e-08   3.46e-07   44

TABLE 5. Compression to the HBSID format using Algorithm 7 of the NtD operator described in Section 5.2. Here ε = 10^− and ℓ = 60.

5.3. Dimensional reduction in the Fast Multipole Method. In this section, we investigate a numerical example where A is a potential evaluation map for a set of electric charges in the plane. To be precise, we let {x_i}_{i=1}^N denote a set of points located as shown in Figure 11. Then A is the N × N matrix with entries

    A(i, j) = −(1/(2π)) log|x_i − x_j|  when i ≠ j,        A(i, i) = 0.

We apply A rapidly using the classical Fast Multipole Method [17] with 30th-order expansions, which ensures that the error in the black-box code is far smaller than our requested precision of ε = 10^−. Our FMM is implemented in Matlab as described in [27]. This implementation is quite inefficient, but it has the ability to apply A to an N × ℓ matrix rather than to a single vector.

For this example, we use an additional error measure E_pot that compares the compressed matrix A_compress to the exact matrix A, evaluated at a subset of the target points. To be precise, we picked at random a subset of 10 target locations, marked by the index vector I ⊂ {1, 2, …, N}. We then drew a sequence of ten vectors {ω_i}_{i=1}^{10} in which each entry is drawn at random from a uniform distribution on the interval [0, 1] (observe that every charge is positive). Then E_pot is defined via

    E_pot = max_{1 ≤ i ≤ 10} ‖A(I, :) ω_i − A_compressed(I, :) ω_i‖ / ‖A(I, :) ω_i‖.

The numerical results are presented in Tables 6 and 7, and summarized in Figure 12.
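A sketch of this error estimator in Matlab, assuming the kernel defined above; the names (xs for the 2 x N array of point locations, applyAc for a handle applying the compressed matrix) are illustrative, not the paper's code:

    % Estimate the sampled error E_pot for the potential evaluation map.
    function Epot = sample_Epot(xs, applyAc)
      N = size(xs, 2);
      nsamp = 10;
      I = randperm(N, nsamp).';               % random subset of targets
      AI = zeros(nsamp, N);                   % the exact rows A(I,:)
      for r = 1:nsamp
        d = hypot(xs(1,:) - xs(1,I(r)), xs(2,:) - xs(2,I(r)));
        AI(r,:) = -log(d) / (2*pi);           % kernel -(1/(2 pi)) log|x_i - x_j|
        AI(r,I(r)) = 0;                       % zero diagonal
      end
      Epot = 0;
      for i = 1:10
        w  = rand(N, 1);                      % positive random charges
        v1 = AI * w;                          % exact potentials at the targets
        v2 = applyAc(w);  v2 = v2(I);         % compressed result, same targets
        Epot = max(Epot, norm(v1 - v2) / norm(v1));
      end
    end

Forming the ten exact rows A(I, :) directly is cheap, which is what makes this spot-check affordable even when the full dense A is not.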
5.4. Compression of frontal matrices in nested dissection. Our final example applies the proposed compression schemes to the problem of constructing O(N) direct solvers for the sparse linear systems arising upon the discretization of elliptic PDEs via finite difference or finite element methods. The idea is to build a solver on the classical "nested dissection" scheme of George [13, 11, 10]. In standard implementations, the drawback of this direct solver is that it requires the inversion or LU-factorization of a set of successively larger dense matrices. However, it has recently been demonstrated that while these matrices are dense, they have internal structure that allows linear or close-to-linear time matrix algebra to be executed [24, 34, 30, 14]. In this manuscript, we test the proposed compression scheme on a set of matrices whose behavior is directly analogous to the matrices encountered in the algorithms of [24, 34, 30, 14]. To be precise, let B denote the sparse coefficient matrix associated with a grid conduction problem on the grid shown in Figure 13. Each bar has a conductivity that is drawn at random from a uniform distribution. Let I_1, I_2, I_3 denote three index vectors that mark the three regions shown in Figure 13, set B_{ij} = B(I_i, I_j) for i, j = 1, 2, 3, and then define the N × N matrix A via

    A = B_{33} − B_{31} B_{11}^{−1} B_{13} − B_{32} B_{22}^{−1} B_{23}.        (23)

The relevance of the matrix A is discussed in some detail in Remark 7.

FIGURE 10. Visualization of the results presented in Tables 4 and 5, pertaining to the example in Section 5.2. (Log–log plots of compression time, apply time, memory per point, and the errors E and E_pot against N, comparing the HODLR and HBSID results, with a linear-scaling reference.)

In our numerical experiments, the black-box application of A, as defined by (23), was executed using the sparse matrix built-in routines in Matlab, which rely on UMFPACK [9] for the sparse solves implicit in the application of B_{11}^{−1} and B_{22}^{−1}. The numerical results are presented in Tables 8 and 9, and summarized in Figure 14.
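A minimal sketch of this black-box apply in Matlab; the sparse blocks B11, B13, etc. mirror (23), and the backslash solves are where UMFPACK is invoked. The function name and argument list are illustrative, not the paper's code:

    % Apply A = B33 - B31 B11^{-1} B13 - B32 B22^{-1} B23 to a block of
    % test vectors X without ever forming A explicitly.
    function Y = applyA(B11, B22, B33, B13, B23, B31, B32, X)
      Y = B33*X - B31*(B11 \ (B13*X)) - B32*(B22 \ (B23*X));
    end

Since the compression schemes apply A to many blocks of test vectors, in practice one would factor B11 and B22 once (e.g., with lu) and reuse the factors in every application.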
Remark 7. To illustrate the connection between the matrix A, as defined by (23), and the LU-factorization of a matrix such as B, observe first that the blocks B_{12} and B_{21} are zero, so that (up to a permutation of the rows and columns)

    B = [ B_{11}   0        B_{13} ]
        [ 0        B_{22}   B_{23} ]
        [ B_{31}   B_{32}   B_{33} ].

FIGURE 11. Geometry of the potential evaluation map in Section 5.3. The top figure shows the entire geometry, and the lower figure shows a magnification of a small part.
    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          E_pot      k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     3 x 45     0.101        0.021    0.001    0.5      173.9     1.57e-10   1.16e-10   21
    800     4 x 45     0.265        0.055    0.001    1.3      215.9     3.69e-10   3.52e-10   23
    1600    5 x 45     0.812        0.133    0.003    3.1      257.8     1.09e-10   1.46e-10   24
    3200    6 x 45     1.925        0.321    0.007    7.4      304.9     6.17e-11   8.77e-11   26
    6400    7 x 45     4.509        0.783    0.015    17.2     351.5     4.22e-11   5.61e-11   28
    12800   8 x 45     10.980       1.941    0.032    39.1     400.6     3.11e-11   3.63e-11   30
    25600   9 x 45     26.271       4.788    0.068    87.1     446.0     2.56e-11   3.38e-11   30
    51200   10 x 45    62.119       11.938   0.149    192.5    492.7     2.26e-11   2.67e-11   32
    102400  11 x 45    158.617      30.621   0.362    419.5    537.0     1.96e-11   2.19e-11   32

TABLE 6. Compression to the HODLR format using Algorithm 4 of the potential evaluation matrix described in Section 5.3. Here ε = 10^− and ℓ = 45.

    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          E_pot      k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     3 x 45     0.140        0.062    0.001    0.5      164.1     3.84e-09   2.96e-09   31
    800     4 x 45     0.358        0.149    0.002    1.0      164.8     2.29e-08   2.10e-08   33
    1600    5 x 45     1.007        0.329    0.005    2.0      166.1     1.12e-08   1.28e-08   37
    3200    6 x 45     2.306        0.713    0.010    4.1      166.5     1.29e-08   1.08e-08   39
    6400    7 x 45     5.334        1.554    0.020    8.0      164.7     8.47e-09   5.13e-09   40
    12800   8 x 45     12.366       3.350    0.040    15.8     161.7     7.75e-09   4.66e-09   42
    25600   9 x 45     28.813       7.170    0.081    30.9     158.1     7.28e-09   7.56e-09   43
    51200   10 x 45    64.838       15.572   0.166    60.1     154.0     1.47e-08   9.91e-09   44
    102400  11 x 45    164.750      36.686   0.335    116.7    149.4     1.23e-08   1.03e-08   43

TABLE 7. Compression to the HBSID format using Algorithm 7 of the potential evaluation matrix described in Section 5.3. Here ε = 10^− and ℓ = 45.

FIGURE 12. Visualization of the results presented in Tables 6 and 7, pertaining to the example in Section 5.3. (Log–log plots of compression time, apply time, memory per point, and the errors E and E_pot against N, comparing the HODLR and HBSID results, with a linear-scaling reference.)
Next suppose that we can somehow determine LU-factorizations of B_{11} and B_{22},

    B_{11} = L_{11} U_{11}   and   B_{22} = L_{22} U_{22}.

Then the LU-factorization of B is given by

    B = [ L_{11}               0                    0      ] [ U_{11}   0        L_{11}^{−1} B_{13} ]
        [ 0                    L_{22}               0      ] [ 0        U_{22}   L_{22}^{−1} B_{23} ]
        [ B_{31} U_{11}^{−1}   B_{32} U_{22}^{−1}   L_{33} ] [ 0        0        U_{33}             ],

where L_{33} and U_{33} are defined as the LU factors of

    L_{33} U_{33} = B_{33} − B_{31} U_{11}^{−1} L_{11}^{−1} B_{13} − B_{32} U_{22}^{−1} L_{22}^{−1} B_{23} =: A.        (24)
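The agreement between the Schur complement (23) and the right-hand side of (24) is easy to sanity-check numerically; a small self-contained Matlab sketch with random, well-conditioned blocks (all sizes arbitrary):

    % Verify that formula (23) matches the product L33*U33 defined in (24).
    n1 = 50; n2 = 60; n3 = 40;
    B11 = randn(n1) + n1*eye(n1);     % shifted to be safely invertible
    B22 = randn(n2) + n2*eye(n2);
    B33 = randn(n3);
    B13 = randn(n1, n3); B23 = randn(n2, n3);
    B31 = randn(n3, n1); B32 = randn(n3, n2);
    A   = B33 - B31*(B11\B13) - B32*(B22\B23);              % formula (23)
    [L11, U11] = lu(B11);   % two-output lu folds the row permutation into
    [L22, U22] = lu(B22);   % L11, so L11*U11 = B11 still holds exactly
    A24 = B33 - (B31/U11)*(L11\B13) - (B32/U22)*(L22\B23);  % RHS of (24)
    relErr = norm(A - A24) / norm(A)   % should be near machine precision

The check works because U_{11}^{−1} L_{11}^{−1} = B_{11}^{−1} (and likewise for the second block), so (24) reproduces (23) identically.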
FIGURE 13. Geometry of the problem described in Section 5.4. We consider a grid conduction problem on the grid shown. As N is increased, the width of the grid is fixed at 41 nodes, while the height of the grid equals N.

    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     3 x 25     0.348        0.012    0.001    0.4      115.0     1.83e-14   9
    800     4 x 25     0.849        0.025    0.001    0.8      133.4     1.47e-14   9
    1600    5 x 25     2.143        0.064    0.002    1.9      151.6     1.40e-14   9
    3200    6 x 25     5.588        0.141    0.005    4.1      169.7     1.43e-14   9
    6400    7 x 25     13.926       0.336    0.011    9.2      187.8     1.37e-14   9
    12800   8 x 25     38.558       0.903    0.023    20.1     205.8     1.36e-14   9
    25600   9 x 25     91.382       2.322    0.048    43.7     223.8     1.29e-14   9
    51200   10 x 25    208.704      5.874    0.098    94.5     241.8     1.32e-14   9
    102400  11 x 25    468.981      14.203   0.204    203.0    259.8     1.30e-14   9
TABLE 8. Compression to the HODLR format using Algorithm 4 of a simulated "frontal matrix" in the nested dissection technique, as described in Section 5.4. Here ε = 10^− and ℓ = 25.

    N       N_matvec   T_compress   T_net    T_app    M        M/N       E          k
                       (sec)        (sec)    (sec)    (MB)     (reals)
    400     3 x 25     0.361        0.032    0.001    0.4      121.0     3.80e-14   18
    800     4 x 25     0.867        0.066    0.002    0.8      124.9     4.75e-14   18
    1600    5 x 25     2.180        0.150    0.005    1.6      127.7     4.34e-14   18
    3200    6 x 25     5.575        0.322    0.010    3.2      129.5     4.27e-14   18
    6400    7 x 25     14.470       0.708    0.019    6.4      130.6     4.30e-14   18
    12800   8 x 25     38.945       1.525    0.040    12.8     131.3     4.19e-14   18
    25600   9 x 25     91.563       3.253    0.081    25.7     131.7     4.11e-14   18
    51200   10 x 25    208.145      7.440    0.157    51.5     131.9     4.10e-14   18
    102400  11 x 25    465.309      15.759   0.314    103.1    132.0     4.07e-14   18

TABLE 9. Compression to the HBSID format using Algorithm 7 of a simulated "frontal matrix" in the nested dissection technique, as described in Section 5.4. Here ε = 10^− and ℓ = 25.

Observe that since the matrices L_{11}, U_{11}, L_{22}, U_{22} are all triangular, their inverses are inexpensive to apply. To summarize: if one can cheaply evaluate the LU factorization (24), then the task of LU-factoring B reduces to the task of LU-factoring the two matrices B_{11} and B_{22}, which both involve about half as many variables as B. The classical nested dissection idea is to apply this observation recursively to factor B_{11} and B_{22}. The difficulty with this scheme has traditionally been that evaluating L_{33} and U_{33} in (24) requires factorizing the dense matrix A.
FIGURE 14. Visualization of the results presented in Tables 8 and 9, pertaining to the example in Section 5.4. (Log–log plots of compression time, apply time, memory per point, and error E against N, comparing HODLR, HODLR (net), HBSID, and HBSID (net), with a linear-scaling reference.)
5.5. Summary of observations from numerical experiments. To close this section, we make some observations and conjectures:

• In all examples examined, numerical evidence supports the claims on asymptotic scaling made in Sections 3.1 and 4.4.
• Excellent approximation errors are obtained in every case. Aggregation of errors over levels is a very minor problem.
• The computational time is in all cases dominated by the time spent in the external black-box multiplier. As a consequence, the primary route by which the proposed algorithm could be improved would be to lessen the number of matrix-vector multiplications required.
• For modest problem sizes, the HODLR algorithm is very fast and easy to use. However, as problems grow large, the memory requirements of the HODLR format become slightly problematic, and the HBSID algorithm gains a more pronounced advantage.

Build outgoing expansions on level m.
loop over all nodes τ on level m
    q̂_τ = V_τ^∗ q(I_τ)
end loop

Build outgoing expansions on levels coarser than m (upwards pass).
loop over levels ℓ = (m − 1) : (−1) : 1
    loop over boxes τ on level ℓ
        Let {α, β} denote the children of τ.
        q̂_τ = V_τ^∗ [q̂_α; q̂_β]
    end loop
end loop

Build incoming expansions for the children of the root.
Let {α, β} denote the children of the root node.
û_α = B_{α,β} q̂_β
û_β = B_{β,α} q̂_α

Build incoming expansions on the finer levels (downwards pass).
loop over levels ℓ = 1 : (m − 1)
    loop over boxes τ on level ℓ
        Let {α, β} denote the children of τ.
        û_α = B_{α,β} q̂_β + U_τ(J_α, :) û_τ
        û_β = B_{β,α} q̂_α + U_τ(J_β, :) û_τ
    end loop
end loop

Evaluate the potentials on level m.
loop over boxes τ on level m
    u(I_τ) = U_τ û_τ
end loop

FIGURE 15. Application of A^(m) in the HBS framework. Given a vector q of charges, build the vector u = A^(m) q of potentials.
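A compact Matlab translation of the procedure in Figure 15, under an assumed (hypothetical) data layout: node 1 is the root, T.lvl(i) gives the level of node i (root on level 0, leaves on level m), T.ch(i,:) its two children, T.I{i} the index vector of a leaf, T.V{i} and T.U{i} the long bases at leaves and the small transfer matrices at interior nodes, and T.B{i,j} the sibling interaction matrices:

    function u = apply_Am(T, q)
    % Apply A^(m) as in Figure 15: upward pass, sibling interaction at the
    % top, downward pass, then evaluation at the leaves.
    m = max(T.lvl);
    nboxes = numel(T.lvl);
    qhat = cell(nboxes, 1);  uhat = cell(nboxes, 1);
    for i = find(T.lvl == m).'            % outgoing expansions on level m
      qhat{i} = T.V{i}' * q(T.I{i});
    end
    for l = (m-1):-1:1                    % upwards pass
      for i = find(T.lvl == l).'
        c = T.ch(i,:);
        qhat{i} = T.V{i}' * [qhat{c(1)}; qhat{c(2)}];
      end
    end
    c = T.ch(1,:);                        % children of the root
    uhat{c(1)} = T.B{c(1),c(2)} * qhat{c(2)};
    uhat{c(2)} = T.B{c(2),c(1)} * qhat{c(1)};
    for l = 1:(m-1)                       % downwards pass
      for i = find(T.lvl == l).'
        c = T.ch(i,:);
        ka = size(T.U{c(1)}, 2);          % number of rows of U_tau in J_alpha
        uhat{c(1)} = T.B{c(1),c(2)}*qhat{c(2)} + T.U{i}(1:ka, :)     * uhat{i};
        uhat{c(2)} = T.B{c(2),c(1)}*qhat{c(1)} + T.U{i}(ka+1:end, :) * uhat{i};
      end
    end
    u = zeros(size(q));
    for i = find(T.lvl == m).'            % evaluate potentials at the leaves
      u(T.I{i}) = T.U{i} * uhat{i};
    end
    end

The cost is dominated by the small dense products at each node, which is what gives the linear complexity of the HBS apply once the factors are available.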
REFERENCES

[1] S. Ambikasaran and E. Darve, An O(N log N) fast direct solver for partial hierarchically semi-separable matrices, Journal of Scientific Computing, 57 (2013), pp. 477–501.
[2] M. Bebendorf, Approximation of boundary element matrices, Numerische Mathematik, 86 (2000), pp. 565–589.
[3] M. Bebendorf, Hierarchical matrices, vol. 63 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2008. A means to efficiently solve elliptic boundary value problems.
[4] S. Börm, Efficient numerical methods for non-local operators, vol. 14 of EMS Tracts in Mathematics, European Mathematical Society (EMS), Zürich, 2010. H^2-matrix compression, algorithms and analysis.
[5] S. Börm and L. Grasedyck, Hybrid cross approximation of integral operators, Numerische Mathematik, 101 (2005), pp. 221–249.
[6] T. F. Chan, Rank revealing QR factorizations, Linear Algebra and its Applications, 88–89 (1987), pp. 67–82.
[7] S. Chandrasekaran, M. Gu, and W. Lyons, A fast adaptive solver for hierarchically semiseparable representations, Calcolo, 42 (2005), pp. 171–185.
[8] H. Cheng, Z. Gimbutas, P. Martinsson, and V. Rokhlin, On the compression of low rank matrices, SIAM Journal on Scientific Computing, 26 (2005), pp. 1389–1404.
[9] T. A. Davis, Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method, ACM Transactions on Mathematical Software (TOMS), 30 (2004), pp. 196–199.
[10] T. A. Davis, Direct methods for sparse linear systems, vol. 2, SIAM, 2006.
[11] I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices, Oxford, 1989.
[12] K. Frederix and M. Van Barel, Solving a large dense linear system by adaptive cross approximation, Journal of Computational and Applied Mathematics, 234 (2010), pp. 3181–3195.
[13] A. George, Nested dissection of a regular finite element mesh, SIAM J. on Numerical Analysis, 10 (1973), pp. 345–363.
[14] A. Gillman and P. Martinsson, A direct solver with O(N) complexity for variable coefficient elliptic PDEs discretized via a high-order composite spectral collocation method, SIAM Journal on Scientific Computing, 36 (2014), pp. A2023–A2046.
[15] A. Gillman, P. Young, and P.-G. Martinsson, A direct solver with O(N) complexity for integral equations on one-dimensional domains, Frontiers of Mathematics in China, 7 (2012), pp. 217–247. 10.1007/s11464-012-0188-3.
[16] G. H. Golub and C. F. Van Loan, Matrix computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, third ed., 1996.
[17] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys., 73 (1987), pp. 325–348.
[18] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput., 17 (1996), pp. 848–869.
[19] W. Hackbusch, A sparse matrix arithmetic based on H-matrices; Part I: Introduction to H-matrices, Computing, 62 (1999), pp. 89–108.
[20] W. Hackbusch and S. Börm, Data-sparse approximation by adaptive H^2-matrices, Computing, 69 (2002), pp. 1–35.
[21] W. Hackbusch, B. Khoromskij, and S. Sauter, On H^2-matrices, in Lectures on Applied Mathematics, Springer, Berlin, 2002, pp. 9–29.
[22] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217–288.
[23] D. Huybrechs, Multiscale and hybrid methods for the solution of oscillatory integral equations, PhD thesis, Katholieke Universiteit Leuven, 2006.
[24] S. Le Borne, L. Grasedyck, and R. Kriemann, Domain-decomposition based H-LU preconditioners, in Domain Decomposition Methods in Science and Engineering XVI, vol. 55 of Lect. Notes Comput. Sci. Eng., Springer, Berlin, 2007, pp. 667–674.
[25] L. Lin, J. Lu, and L. Ying, Fast construction of hierarchical matrix representation from matrix–vector multiplication, Journal of Computational Physics, 230 (2011), pp. 4071–4087.
[26] P. Martinsson, Rapid factorization of structured matrices via randomized sampling, 2008. arXiv:0806.2339.
[27] P. Martinsson, A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix, SIAM Journal on Matrix Analysis and Applications, 32 (2011), pp. 1251–1274.
[28] P. Martinsson and V. Rokhlin, A fast direct solver for boundary integral equations in two dimensions, J. Comp. Phys., 205 (2005), pp. 1–23.
[29] P. Martinsson and V. Rokhlin, An accelerated kernel independent fast multipole method in one dimension, SIAM Journal on Scientific Computing, 29 (2007), pp. 1160–1178.
[30] P.-G. Martinsson, A fast direct solver for a class of elliptic partial differential equations, J. Sci. Comput., 38 (2009), pp. 316–330.
[31] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the approximation of matrices, Tech. Report YALEU/DCS/RR-1361, Yale University Computer Science Department, 2006.
[32] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the decomposition of matrices, Appl. Comput. Harmon. Anal., 30 (2011), pp. 47–68.
[33] J. Xia, Randomized sparse direct solvers, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 197–227.
[34] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li, Superfast multifrontal method for large structured linear systems of equations, SIAM J. Matrix Anal. Appl., 31 (2009), pp. 1382–1411.
[35] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li,