Hybrid matrix compression for high-frequency problems
Steffen Börm and Christina Börst

October 28, 2019
Boundary element methods for the Helmholtz equation lead to large dense matrices that can only be handled if efficient compression techniques are used. Directional compression techniques can reach good compression rates even for high-frequency problems. Currently there are two approaches to directional compression: analytic methods approximate the kernel function, while algebraic methods approximate submatrices. Analytic methods are quite fast and proven to be robust, while algebraic methods yield significantly better compression rates. We present a hybrid method that combines the speed and reliability of analytic methods with the good compression rates of algebraic methods.
1 Introduction

We consider the Helmholtz single layer potential operator
$$\mathcal{G}[u](x) := \int_\Omega g(x,y)\, u(y)\, dy,$$
where $\Omega \subseteq \mathbb{R}^3$ is a surface and
$$g(x,y) = \frac{\exp(i\kappa \|x-y\|)}{4\pi \|x-y\|} \qquad (1)$$
denotes the Helmholtz kernel function with the wave number $\kappa \in \mathbb{R}_{\geq 0}$. Applying a standard Galerkin discretization scheme with a finite element basis $(\varphi_i)_{i\in\mathcal{I}}$ leads to the stiffness matrix $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$ given by
$$g_{ij} = \int_\Omega \varphi_i(x) \int_\Omega g(x,y)\, \varphi_j(y)\, dy\, dx \qquad \text{for all } i,j \in \mathcal{I}, \qquad (2)$$
where we assume that the basis functions are sufficiently smooth to ensure that the integrals are well-defined even if the supports overlap, e.g., for $i = j$. Due to $g(x,y) \neq 0$ for all $x \neq y$, the matrix $G$ is not sparse and therefore requires special handling if we want to construct an efficient algorithm.

Standard techniques like fast multipole expansions [18, 12], panel clustering [15, 20], or hierarchical matrices [13, 14, 10] rely on local low-rank approximations of the matrix. In the case of the high-frequency Helmholtz equation, e.g., if the product of the wave number κ and the mesh width h is bounded, but not particularly small, these techniques can no longer be applied, since the local ranks become too large. This situation frequently appears in engineering applications.

The fast multipole method can be generalized to handle this problem by employing a special expansion that leads to operators that can be diagonalized and therefore evaluated efficiently [19, 11].

The butterfly method (also known as multi-level matrix decomposition algorithm, MLMDA) [17] achieves a similar goal by using permutations and block-diagonal transformations in a pattern closely related to the fast Fourier transform algorithm.

Directional methods [5, 7, 16, 1] take advantage of the fact that the Helmholtz kernel (1) can be written as a product of a plane wave and a function that is smooth inside a conical domain. Replacing this smooth function by a suitable approximation results in fast summation schemes.

We will focus on directional methods, since they can be applied in a more general setting than the fast multipole expansions based on special functions, and since they offer the chance of achieving better compression with $O(n \log n)$ coefficients compared to the $O(n \log^2 n)$ required by the butterfly scheme [17].

In particular, we will work with directional H²-matrices (abbreviated DH²-matrices), the algebraic counterparts of the directional approximation schemes used in [5, 7, 16]. Our starting point is the directional Chebyshev approximation scheme introduced in [16] and analyzed in [4]. While this approach is fast and proven to be reliable, the resulting ranks are quite large, and this leads to unattractive storage requirements.

We can solve this problem by applying an algebraic recompression algorithm that starts with the DH²-matrix constructed by interpolation and uses nested orthogonal projections and singular value decompositions (SVD) to significantly reduce the rank. This algorithm is based on the general DH²-matrix compression algorithm introduced in [3], but takes advantage of the previous approximation in order to significantly reduce the computational work to $O(nk^3 \log n)$ in the high-frequency case, cf. Theorem 13 and Remark 14.

Compared to the closely related algorithm presented in [16], our algorithm compresses the entire DH²-matrix structure instead of just the coupling matrices, and the orthogonal projections applied in the recompression algorithm allow us to obtain straightforward estimates for the compression error.

Compared to the algorithm presented in [1], our approach has better stability properties, owing to the results of [4] for the interpolation scheme and the orthogonal projections employed in [3] for the recompression, and it can be expected to yield better compression rates, since it uses an H²-matrix representation for low-frequency clusters, while the algorithm of [1] relies on the slightly less efficient H-matrices.

2 Directional H²-matrices

Hierarchical matrix methods are based on decompositions of the matrix $G$ into submatrices that can be approximated by factorized low-rank matrices. In our case, we follow the directional interpolation technique described in [16] and translate the resulting compressed representation into an algebraic definition that can be applied in more general situations.

In order to describe the decomposition into submatrices, we first introduce a hierarchy of subsets of the index set $\mathcal{I}$ corresponding to the box trees used, e.g., in fast multipole methods.
Definition 1 (Cluster tree). Let $\mathcal{T}$ be a labeled tree such that the label $\hat t$ of each node $t \in \mathcal{T}$ is a subset of the index set $\mathcal{I}$. We call $\mathcal{T}$ a cluster tree for $\mathcal{I}$ if

• the root $r \in \mathcal{T}$ is labeled $\hat r = \mathcal{I}$,

• the index sets of siblings are disjoint, i.e., $t_1 \neq t_2 \Rightarrow \hat t_1 \cap \hat t_2 = \emptyset$ for all $t \in \mathcal{T}$, $t_1, t_2 \in \mathrm{chil}(t)$, and

• the index sets of a cluster's children are a partition of their parent's index set, i.e., $\hat t = \bigcup_{t' \in \mathrm{chil}(t)} \hat t'$ for all $t \in \mathcal{T}$ with $\mathrm{chil}(t) \neq \emptyset$.

A cluster tree for $\mathcal{I}$ is usually denoted by $\mathcal{T}_\mathcal{I}$. Its nodes are called clusters. We denote the set of leaves of $\mathcal{T}_\mathcal{I}$ by $\mathcal{L}_\mathcal{I} := \{ t \in \mathcal{T}_\mathcal{I} : \mathrm{chil}(t) = \emptyset \}$.

A cluster tree $\mathcal{T}_\mathcal{I}$ can be split into levels: we let $\mathcal{T}^{(0)}_\mathcal{I}$ be the set containing only the root of $\mathcal{T}_\mathcal{I}$ and define
$$\mathcal{T}^{(\ell)}_\mathcal{I} := \{ t' \in \mathcal{T}_\mathcal{I} : t' \in \mathrm{chil}(t) \text{ for a } t \in \mathcal{T}^{(\ell-1)}_\mathcal{I} \} \qquad \text{for all } \ell \in \mathbb{N}.$$
For each cluster $t \in \mathcal{T}_\mathcal{I}$, there is exactly one $\ell$ such that $t \in \mathcal{T}^{(\ell)}_\mathcal{I}$. We call it the level number of $t$ and denote it by $\mathrm{level}(t) = \ell$. The maximal level $p_\mathcal{I} := \max\{ \mathrm{level}(t) : t \in \mathcal{T}_\mathcal{I} \}$ is called the depth of the cluster tree.

Pairs of clusters $(t,s)$ correspond to subsets $\hat t \times \hat s$ of $\mathcal{I} \times \mathcal{I}$, i.e., to submatrices of $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$. These pairs inherit the hierarchical structure provided by the cluster tree.

In order to approximate $G|_{\hat t \times \hat s}$, the directional interpolation approach uses axis-parallel bounding boxes $B_t, B_s \subseteq \mathbb{R}^3$ such that
$$\mathrm{supp}(\varphi_i) \subseteq B_t, \qquad \mathrm{supp}(\varphi_j) \subseteq B_s \qquad \text{for all } i \in \hat t,\ j \in \hat s,$$
and constructs an approximation $\tilde g_{ts}$ of $g|_{B_t \times B_s}$. Discretizing $\tilde g_{ts}$ then gives rise to an approximation of the submatrix $G|_{\hat t \times \hat s}$.

For large wave numbers κ, the function $g|_{B_t \times B_s}$ cannot be expected to be smooth, so we cannot apply interpolation directly. This problem can be solved by directional interpolation [5, 7, 16]: we choose a direction $c \in \mathbb{R}^3$ and split the function $g$ into a plane wave and a remainder term, i.e., we use
$$g(x,y) = \exp(i\kappa\langle x-y, c\rangle)\, \frac{\exp(i\kappa(\|x-y\| - \langle x-y,c\rangle))}{4\pi\|x-y\|} = \exp(i\kappa\langle x-y,c\rangle)\, g_c(x,y),$$
where the remainder is defined by
$$g_c(x,y) = \frac{\exp(i\kappa(\|x-y\| - \langle x-y, c\rangle))}{4\pi\|x-y\|}.$$
This function is smooth [4] and can therefore be interpolated by polynomials if the following admissibility conditions hold:
$$\kappa \left\| \frac{m_t - m_s}{\|m_t - m_s\|} - c \right\| \leq \frac{\eta_1}{\max\{\mathrm{diam}(B_t), \mathrm{diam}(B_s)\}}, \qquad (3a)$$
$$\max\{\mathrm{diam}(B_t), \mathrm{diam}(B_s)\} \leq \eta_2\, \mathrm{dist}(B_t, B_s), \qquad (3b)$$
$$\kappa \max\{\mathrm{diam}(B_t), \mathrm{diam}(B_s)\}^2 \leq \eta_2\, \mathrm{dist}(B_t, B_s), \qquad (3c)$$
where $m_t \in B_t$ and $m_s \in B_s$ denote the midpoints of the boxes and $\eta_1, \eta_2 \in \mathbb{R}_{>0}$ are chosen to strike a balance between fast convergence (if both parameters are small) and low computational cost (if both parameters are large). Condition (3a) ensures that $c$ is sufficiently close to the direction from the midpoint $m_s$ to the midpoint $m_t$, condition (3c) allows us to extend this property to directions from any point $y \in B_s$ to any point $x \in B_t$ [4, Lemma 3.9], while condition (3b) is required to keep admissible blocks sufficiently far away from the singularity at $x = y$.

Due to [4, Corollary 3.14], the interpolating polynomial
$$\tilde g_{c,ts}(x,y) = \sum_{\nu,\mu=1}^{k} L_{t,\nu}(x)\, g_c(\xi_{t,\nu}, \xi_{s,\mu})\, L_{s,\mu}(y)$$
converges exponentially to $g_c$ in $B_t \times B_s$, and the error is bounded independently of the wave number κ. Here $(\xi_{t,\nu})_{\nu=1}^k$ and $(\xi_{s,\mu})_{\mu=1}^k$ are families of tensor interpolation points in $B_t$ and $B_s$, while $(L_{t,\nu})_{\nu=1}^k$ and $(L_{s,\mu})_{\mu=1}^k$ are the corresponding families of tensor Lagrange polynomials.
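Since the conditions (3) are central to everything that follows, a small sketch may help. The following Python fragment checks (3a)–(3c), in the reconstructed form with parameters η₁ and η₂, for a pair of axis-parallel boxes. All function names and the example geometry are our own illustration, not part of any published implementation.

```python
import numpy as np

def box_dist(lo_t, hi_t, lo_s, hi_s):
    """Euclidean distance between two axis-parallel boxes."""
    gap = np.maximum(np.maximum(lo_t - hi_s, lo_s - hi_t), 0.0)
    return np.linalg.norm(gap)

def box_diam(lo, hi):
    return np.linalg.norm(hi - lo)

def is_admissible(lo_t, hi_t, lo_s, hi_s, c, kappa, eta1, eta2):
    """Check the directional admissibility conditions (3a)-(3c)."""
    dmax = max(box_diam(lo_t, hi_t), box_diam(lo_s, hi_s))
    d = box_dist(lo_t, hi_t, lo_s, hi_s)
    m = 0.5 * (lo_t + hi_t) - 0.5 * (lo_s + hi_s)   # m_t - m_s
    if np.linalg.norm(m) == 0.0 or d == 0.0:
        return False
    cond_a = kappa * np.linalg.norm(m / np.linalg.norm(m) - c) <= eta1 / dmax
    cond_b = dmax <= eta2 * d                        # standard condition (3b)
    cond_c = kappa * dmax**2 <= eta2 * d             # parabolic condition (3c)
    return cond_a and cond_b and cond_c

# example: two unit boxes, far apart along the x-axis
lo_t, hi_t = np.array([8.0, 0.0, 0.0]), np.array([9.0, 1.0, 1.0])
lo_s, hi_s = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])
c = np.array([1.0, 0.0, 0.0])
print(is_admissible(lo_t, hi_t, lo_s, hi_s, c, kappa=2.0, eta1=10.0, eta2=1.0))
```

Note that (3c) is the binding constraint for growing κ: a block pair that is admissible at low frequency may have to be subdivided further once κ max{diam}² exceeds η₂ dist.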
Multiplying by the plane wave, we obtain
$$g(x,y) = \exp(i\kappa\langle x-y,c\rangle)\, g_c(x,y) \approx \exp(i\kappa\langle x-y,c\rangle) \sum_{\nu,\mu=1}^{k} L_{t,\nu}(x)\, g_c(\xi_{t,\nu},\xi_{s,\mu})\, L_{s,\mu}(y)$$
$$= \sum_{\nu,\mu=1}^{k} \exp(i\kappa\langle x,c\rangle) L_{t,\nu}(x)\, g_c(\xi_{t,\nu},\xi_{s,\mu})\, \overline{\exp(i\kappa\langle y,c\rangle) L_{s,\mu}(y)} = \sum_{\nu,\mu=1}^{k} L_{tc,\nu}(x)\, g_c(\xi_{t,\nu},\xi_{s,\mu})\, \overline{L_{sc,\mu}(y)} =: \tilde g_b(x,y)$$
for all $x \in B_t$, $y \in B_s$, with the modified Lagrange polynomials
$$L_{tc,\nu}(x) = \exp(i\kappa\langle x,c\rangle) L_{t,\nu}(x), \qquad L_{sc,\mu}(y) = \exp(i\kappa\langle y,c\rangle) L_{s,\mu}(y).$$
Replacing $g$ by $\tilde g_b$ in (2) yields
$$g_{ij} \approx \int_\Omega \varphi_i(x) \int_\Omega \tilde g_b(x,y)\, \varphi_j(y)\, dy\, dx = \sum_{\nu=1}^{k}\sum_{\mu=1}^{k} \underbrace{g_c(\xi_{t,\nu},\xi_{s,\mu})}_{=:\, s_{b,\nu\mu}}\ \underbrace{\int_\Omega \varphi_i(x) L_{tc,\nu}(x)\, dx}_{=:\, v_{tc,i\nu}}\ \overline{\underbrace{\int_\Omega \varphi_j(y) L_{sc,\mu}(y)\, dy}_{=:\, v_{sc,j\mu}}} = (V_{tc} S_b V_{sc}^*)_{ij}$$
for all $i \in \hat t$, $j \in \hat s$, with matrices $V_{tc} \in \mathbb{C}^{\hat t\times k}$, $V_{sc} \in \mathbb{C}^{\hat s\times k}$, and $S_b \in \mathbb{C}^{k\times k}$. This is a factorized low-rank approximation
$$G|_{\hat t\times\hat s} \approx V_{tc} S_b V_{sc}^* \qquad (4)$$
of the submatrix $G|_{\hat t\times\hat s}$.

Since the matrix $G$ itself does not satisfy the conditions (3), we have to split it into submatrices, and experiments show that the number of submatrices grows rapidly as the problem size increases. Storing the matrices $(V_{tc})_{t\in\mathcal{T}_\mathcal{I}}$ for all clusters $t \in \mathcal{T}_\mathcal{I}$ would lead to quadratic complexity and is therefore unattractive. Fortunately, we can take advantage of the hierarchical structure of the cluster tree if we organize the directions $c$ accordingly.
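To illustrate the factorization (4), the following self-contained Python sketch builds the coupling matrix $S_b$ for one admissible box pair from tensor Chebyshev points and compares the resulting separable approximation with the kernel at a random point pair. The interpolation order, wave number, and geometry are ad-hoc choices of ours, not taken from the paper's experiments.

```python
import itertools
import numpy as np

kappa, m = 2.0, 5                       # wave number and interpolation order

def cheb(a, b):
    """Chebyshev points of order m, transformed to [a, b]."""
    j = np.arange(m)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * j + 1) * np.pi / (2 * m))

def lagrange(xi, j, x):
    """j-th Lagrange polynomial for the points xi, evaluated at x."""
    others = np.delete(xi, j)
    return np.prod((x - others) / (xi[j] - others))

def tensor_setup(lo, hi):
    grids = [cheb(lo[d], hi[d]) for d in range(3)]
    points = np.array(list(itertools.product(*grids)))
    return grids, points                # m**3 tensor interpolation points

def tensor_lagrange(grids, x):
    """All m**3 tensor Lagrange polynomials, evaluated at the point x."""
    v = [np.array([lagrange(grids[d], j, x[d]) for j in range(m)])
         for d in range(3)]
    return np.einsum('i,j,k->ijk', v[0], v[1], v[2]).ravel()

def g(x, y):                            # Helmholtz kernel (1)
    r = np.linalg.norm(x - y)
    return np.exp(1j * kappa * r) / (4.0 * np.pi * r)

# admissible boxes B_t, B_s and the direction c from m_s to m_t
lo_t, hi_t = np.array([8.0, 0.0, 0.0]), np.array([9.0, 1.0, 1.0])
lo_s, hi_s = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])
mt, ms = 0.5 * (lo_t + hi_t), 0.5 * (lo_s + hi_s)
c = (mt - ms) / np.linalg.norm(mt - ms)

def g_c(x, y):                          # smooth remainder after the split
    return np.exp(-1j * kappa * np.dot(x - y, c)) * g(x, y)

grids_t, pts_t = tensor_setup(lo_t, hi_t)
grids_s, pts_s = tensor_setup(lo_s, hi_s)
S = np.array([[g_c(p, q) for q in pts_s] for p in pts_t])   # coupling matrix

# compare the separable approximation with the kernel at random test points
rng = np.random.default_rng(1)
x = lo_t + rng.random(3) * (hi_t - lo_t)
y = lo_s + rng.random(3) * (hi_s - lo_s)
row_V = np.exp(1j * kappa * np.dot(x, c)) * tensor_lagrange(grids_t, x)
row_W = np.exp(1j * kappa * np.dot(y, c)) * tensor_lagrange(grids_s, y)
approx = row_V @ S @ row_W.conj()
print(abs(approx - g(x, y)) / abs(g(x, y)))   # small relative error
```

The rows `row_V` and `row_W` play the roles of single rows of $V_{tc}$ and $V_{sc}$ for point evaluations instead of Galerkin integrals; the exponential convergence in the order m is exactly the statement of [4, Corollary 3.14].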
Definition 2 (Directions). Let $(D_\ell)_{\ell=0}^{p_\mathcal{I}}$ be a family of finite subsets of $\mathbb{R}^3$. It is called a family of directions if
$$\|c\| = 1 \ \vee\ c = 0 \qquad \text{for all } c \in D_\ell,\ \ell \in [0 : p_\mathcal{I}].$$

Here the special case $c = 0$ is included to allow us to treat the low-frequency case that does not require us to split off a plane wave.

Given a family of directions, we fix a family $(\mathrm{sd}_\ell)_{\ell=0}^{p_\mathcal{I}-1}$ of mappings $\mathrm{sd}_\ell : D_\ell \to D_{\ell+1}$ such that
$$\|c - \mathrm{sd}_\ell(c)\| \leq \|c - \tilde c\| \qquad \text{for all } c \in D_\ell,\ \tilde c \in D_{\ell+1},\ \ell \in [0 : p_\mathcal{I}-1].$$
For a cluster $t \in \mathcal{T}_\mathcal{I}$, we write
$$D_t := D_{\mathrm{level}(t)}, \qquad \mathrm{sd}_t(c) := \mathrm{sd}_{\mathrm{level}(t)}(c) \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ c \in D_{\mathrm{level}(t)}.$$

In order to satisfy the admissibility condition (3a), we use sets of directions that are sufficiently large to approximate any direction, i.e., we require
$$\min\{\|y - c\| : c \in D_t\} \leq \frac{\eta_1}{\kappa\, \mathrm{diam}(B_t)} \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ y \in \mathbb{R}^3 \text{ with } \|y\| = 1. \qquad (5)$$
Since the size of clusters decreases as the level grows, this requirement means that large clusters will require more directions than small clusters.
Remark 3 (Construction of directions). In our implementation, we construct sets of directions satisfying (5) as follows: for a level $\ell \in [0:p_\mathcal{I}]$, we compute the maximal diameter $d_\ell$ of all bounding boxes $B_t$ associated with clusters $t \in \mathcal{T}^{(\ell)}_\mathcal{I}$ on this level. If $\kappa d_\ell \leq \eta_1$ holds, we let $D_\ell = \{0\}$, i.e., we use no directional approximation in the low-frequency case.

Otherwise, i.e., if $\kappa d_\ell > \eta_1$, we let $m := \lceil \sqrt{2}\, \kappa d_\ell / \eta_1 \rceil$ and split each side of the cube $[-1,1]^3$ into $m^2$ squares of width $2/m$ and diameter $2\sqrt{2}/m$. Since the cube has six sides, we have a total of $6m^2 \in O(\kappa^2 d_\ell^2)$ such squares. We denote the centers of the squares by $\tilde c_\iota$, and their projections to the unit sphere by $c_\iota := \tilde c_\iota / \|\tilde c_\iota\|$ for $\iota \in [1 : 6m^2]$. We let $D_\ell := \{ c_\iota : \iota \in [1:6m^2] \}$. For every unit vector $y \in \mathbb{R}^3$, there is a point $\tilde y$ on the surface of the cube with $y = \tilde y / \|\tilde y\|$. Since the surface grid is sufficiently fine, we can find $\iota \in [1:6m^2]$ with $\|\tilde y - \tilde c_\iota\| \leq \sqrt{2}/m$, and the projection ensures
$$\|y - c_\iota\| \leq \|\tilde y - \tilde c_\iota\| \leq \frac{\sqrt{2}}{m} \leq \frac{\eta_1}{\kappa d_\ell} \leq \frac{\eta_1}{\kappa\, \mathrm{diam}(B_t)} \qquad \text{for all } t \in \mathcal{T}^{(\ell)}_\mathcal{I},$$
i.e., (5) holds. By this construction, the set $D_\ell$ is sufficiently large to contain approximations for any unit vector, but still small enough to satisfy the assumption (18) required for our complexity analysis.
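A minimal Python sketch of this construction, with our own function names and ad-hoc parameters, follows; the final line verifies the resolution requirement (5) for a random unit vector.

```python
import numpy as np

def build_directions(kappa, d_level, eta1):
    """Directions for one level, following Remark 3 (our own sketch)."""
    if kappa * d_level <= eta1:
        return np.zeros((1, 3))               # low-frequency case: D_l = {0}
    m = int(np.ceil(np.sqrt(2.0) * kappa * d_level / eta1))
    centers = (np.arange(m) + 0.5) * 2.0 / m - 1.0  # square centers per face
    dirs = []
    for axis in range(3):                     # six faces of the cube [-1,1]^3
        for side in (-1.0, 1.0):
            for u in centers:
                for v in centers:
                    p = np.empty(3)
                    p[axis] = side
                    p[(axis + 1) % 3] = u
                    p[(axis + 2) % 3] = v
                    dirs.append(p / np.linalg.norm(p))   # project to sphere
    return np.array(dirs)                     # 6 m^2 unit vectors

# resolution test (5): every unit vector is close to one of the directions
kappa, d, eta1 = 8.0, 1.5, 1.0
D = build_directions(kappa, d, eta1)
rng = np.random.default_rng(2)
y = rng.normal(size=3)
y /= np.linalg.norm(y)
print(len(D), np.min(np.linalg.norm(D - y, axis=1)) <= eta1 / (kappa * d))
```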
Due to (5), we only have to satisfy the conditions (3b) and (3c) and can then find a direction $c_{ts} \in D_t = D_s$ that satisfies the first condition (3a): for $t,s \in \mathcal{T}^{(\ell)}_\mathcal{I}$, we let $c_{ts} \in D_\ell$ be a best approximation of the direction from the midpoint $m_s$ of the source box $B_s$ to the midpoint $m_t$ of the target box $B_t$, i.e.,
$$\left\| \frac{m_t - m_s}{\|m_t-m_s\|} - c_{ts} \right\| \leq \left\| \frac{m_t-m_s}{\|m_t-m_s\|} - \tilde c \right\| \qquad \text{for all } \tilde c \in D_\ell.$$
This leaves us with the task of splitting the matrix $G$ into submatrices $G|_{\hat t\times\hat s}$ such that $B_t$ and $B_s$ satisfy the admissibility conditions (3b) and (3c). A decomposition with the minimal necessary number of submatrices can be constructed by a recursive procedure that again gives rise to a tree structure and inductively ensures $\mathrm{level}(t) = \mathrm{level}(s)$, so that the directions $c_{ts}$ are well-defined.

Definition 4 (Block tree). Let $\mathcal{T}_\mathcal{I}$ be a cluster tree for the index set $\mathcal{I}$ with root $r_\mathcal{I}$, and let $(D_\ell)_{\ell=0}^{p_\mathcal{I}}$ be a family of directions. A tree $\mathcal{T}$ is called a block tree for $\mathcal{T}_\mathcal{I}$ if

• for each $b \in \mathcal{T}$ there are $t, s \in \mathcal{T}_\mathcal{I}$ such that $b = (t,s,c_{ts})$,

• the root $r \in \mathcal{T}$ satisfies $r = (r_\mathcal{I}, r_\mathcal{I}, c_{r_\mathcal{I} r_\mathcal{I}})$,

• for each $b = (t,s,c_{ts}) \in \mathcal{T}$ we have
$$\mathrm{chil}(b) \neq \emptyset \ \Rightarrow\ \mathrm{chil}(b) = \{ (t',s',c_{t's'}) : t' \in \mathrm{chil}(t),\ s' \in \mathrm{chil}(s) \}. \qquad (6)$$

A block tree for $\mathcal{T}_\mathcal{I}$ is usually denoted by $\mathcal{T}_{\mathcal{I}\times\mathcal{I}}$. Its nodes are called blocks. We denote the set of leaves of $\mathcal{T}_{\mathcal{I}\times\mathcal{I}}$ by $\mathcal{L}_{\mathcal{I}\times\mathcal{I}} := \{ b \in \mathcal{T}_{\mathcal{I}\times\mathcal{I}} : \mathrm{chil}(b) = \emptyset \}$.

The leaves of a block tree define a disjoint partition of the index set $\mathcal{I}\times\mathcal{I}$, i.e., a decomposition of the matrix $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$ into submatrices $G|_{\hat t\times\hat s}$ with $(t,s,c) \in \mathcal{L}_{\mathcal{I}\times\mathcal{I}}$. We can construct a block tree with the minimal number of blocks by a simple recursion: starting with the root, we check whether a block is admissible. If it is, we make it an admissible leaf and represent the corresponding submatrix in the factorized form (4). Otherwise, we consider its children given by (6). If there are no children, i.e., if $\mathrm{chil}(t)$ or $\mathrm{chil}(s)$ are empty, we have found an inadmissible leaf and store the submatrix directly, i.e., as a two-dimensional array.

While the approximation (4) reduces the amount of storage required for one block to $k^2$, we still have to store the matrices $(V_{tc})_{t\in\mathcal{T}_\mathcal{I}, c\in D_t}$, and in the high-frequency case the storage requirements for these matrices grow at least quadratically with the problem size: if we have $\kappa^2 \sim n$, doubling the matrix dimension means multiplying κ by a factor of $\sqrt 2$. Constructing directions as in Remark 3, the splitting parameter $m$ also grows by a factor of approximately $\sqrt 2$, and the number of directions is therefore doubled, as well. On a given level $\ell \in [0:p_\mathcal{I}]$, storing $V_{tc}$ for all $t \in \mathcal{T}_\mathcal{I}$ with a fixed direction $c$ requires $O(nk)$ coefficients, and since we have $O(n)$ directions, we even end up with $O(n^2 k)$ coefficients per level.

In order to solve this problem, we take advantage of the requirement (5): given a cluster $t \in \mathcal{T}_\mathcal{I}$, a direction $c \in D_t$, and one of its children $t' \in \mathrm{chil}(t)$, we can find a direction $c' := \mathrm{sd}_t(c) \in D_{t'}$ that approximates $c$ reasonably well. This property allows us to reduce the storage requirements for the matrices $(V_{tc})_{t\in\mathcal{T}_\mathcal{I}, c\in D_t}$ as follows: since $\|c - c'\|$ is small, the function
$$x \mapsto \exp(-i\kappa\langle x, c'\rangle)\, L_{tc,\nu}(x) = \exp(i\kappa\langle x, c-c'\rangle)\, L_{t,\nu}(x)$$
is smooth and can therefore be interpolated in $B_{t'}$. We find
$$L_{tc,\nu}(x) = \exp(i\kappa\langle x,c\rangle) L_{t,\nu}(x) = \exp(i\kappa\langle x,c'\rangle) \exp(i\kappa\langle x, c-c'\rangle) L_{t,\nu}(x)$$
$$\approx \exp(i\kappa\langle x,c'\rangle) \sum_{\nu'=1}^k \underbrace{\exp(i\kappa\langle \xi_{t',\nu'}, c-c'\rangle)\, L_{t,\nu}(\xi_{t',\nu'})}_{=:\, e_{t'c,\nu'\nu}}\, L_{t',\nu'}(x) = \sum_{\nu'=1}^k e_{t'c,\nu'\nu}\, L_{t'c',\nu'}(x).$$
This approach immediately yields
$$v_{tc,i\nu} = \int_\Omega \varphi_i(x) L_{tc,\nu}(x)\, dx \approx \sum_{\nu'=1}^k e_{t'c,\nu'\nu} \int_\Omega \varphi_i(x) L_{t'c',\nu'}(x)\, dx = (V_{t'c'} E_{t'c})_{i\nu}$$
for all $i \in \hat t'$ and $\nu \in [1:k]$, which is equivalent to
$$V_{tc}|_{\hat t'\times k} \approx V_{t'c'} E_{t'c}. \qquad (7)$$
Instead of storing $V_{tc}$, we can just store small $k\times k$ matrices $E_{t'c}$ for all $t' \in \mathrm{chil}(t)$, thus reducing the storage requirements from $|\hat t|\, k$ to $O(k^2)$. This approach implies using the right-hand side of (7) to define the left-hand side.
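The following Python sketch illustrates (7) in one spatial dimension, which keeps the formulas short while showing the same mechanism: the transfer matrix E is filled with modulated Lagrange values, and the modulated parent polynomial is then recovered from the child basis. All names and parameters are illustrative choices of ours.

```python
import numpy as np

kappa, m = 8.0, 8
c, c_child = 1.0, 0.9          # parent and child directions, |c - c'| small

def cheb(a, b):
    j = np.arange(m)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * j + 1) * np.pi / (2 * m))

def lagrange_all(xi, x):
    """All Lagrange polynomials for the points xi, evaluated at x."""
    return np.array([np.prod((x - np.delete(xi, j)) / (xi[j] - np.delete(xi, j)))
                     for j in range(m)])

xi_t = cheb(0.0, 1.0)          # parent box B_t = [0, 1]
xi_c = cheb(0.0, 0.5)          # child box B_t' = [0, 1/2]

# E[nu', nu] = exp(i kappa xi_c[nu'] (c - c')) * L_{t,nu}(xi_c[nu'])
E = np.array([np.exp(1j * kappa * p * (c - c_child)) * lagrange_all(xi_t, p)
              for p in xi_c])

# check: L_{tc,nu}(x) ~ sum_nu' E[nu', nu] L_{t'c',nu'}(x) on the child box
x = 0.3141
lhs = np.exp(1j * kappa * x * c) * lagrange_all(xi_t, x)
rhs = (np.exp(1j * kappa * x * c_child) * lagrange_all(xi_c, x)) @ E
print(np.max(np.abs(lhs - rhs)))   # small, since kappa*|c-c'|*diam(B_t') is
```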
Definition 5 (Directional cluster basis). Let $k \in \mathbb{N}$, and let $V = (V_{tc})_{t\in\mathcal{T}_\mathcal{I}, c\in D_t}$ be a family of matrices. We call it a directional cluster basis if

• $V_{tc} \in \mathbb{C}^{\hat t\times k}$ for all $t \in \mathcal{T}_\mathcal{I}$ and $c \in D_t$, and

• there is a family $E = (E_{t'c})_{t\in\mathcal{T}_\mathcal{I}, t'\in\mathrm{chil}(t), c\in D_t}$ such that
$$V_{tc}|_{\hat t'\times k} = V_{t'c'} E_{t'c} \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ t' \in \mathrm{chil}(t),\ c \in D_t,\ c' = \mathrm{sd}_t(c). \qquad (8)$$

The elements of the family E are called transfer matrices for the directional cluster basis V, and k is called its rank.
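The point of the nested representation (8) is that quantities of the form $V_{tc}^* x$ can be accumulated from the leaves upward using only the leaf matrices and the $k\times k$ transfer matrices. The toy example below demonstrates this for a tree with one parent and two leaves; the dictionary-based tree encoding and all names are our own ad-hoc convention, not the paper's data structures.

```python
import numpy as np

def forward(t, c, x, xhat):
    """Accumulate xhat[(name, c)] = V_tc^* x restricted to the cluster."""
    if not t['children']:
        xhat[(t['name'], c)] = t['V'][c].conj().T @ x[t['idx']]
        return
    acc = 0
    for child in t['children']:
        cc = t['sd'][c]                        # child direction c' = sd_t(c)
        if (child['name'], cc) not in xhat:
            forward(child, cc, x, xhat)
        acc = acc + child['E'][c].conj().T @ xhat[(child['name'], cc)]
    xhat[(t['name'], c)] = acc                 # V_tc^* x via (8)

rng = np.random.default_rng(3)
k = 4
V1 = rng.normal(size=(6, k)) + 1j * rng.normal(size=(6, k))
V2 = rng.normal(size=(6, k)) + 1j * rng.normal(size=(6, k))
E1 = rng.normal(size=(k, k)) + 1j * rng.normal(size=(k, k))
E2 = rng.normal(size=(k, k)) + 1j * rng.normal(size=(k, k))

t1 = {'name': 't1', 'children': [], 'idx': slice(0, 6), 'V': {0: V1}, 'E': {0: E1}}
t2 = {'name': 't2', 'children': [], 'idx': slice(6, 12), 'V': {0: V2}, 'E': {0: E2}}
root = {'name': 't', 'children': [t1, t2], 'sd': {0: 0}}

x = rng.normal(size=12) + 1j * rng.normal(size=12)
xhat = {}
forward(root, 0, x, xhat)

V_root = np.vstack([V1 @ E1, V2 @ E2])   # V_tc assembled explicitly via (8)
print(np.linalg.norm(xhat[('t', 0)] - V_root.conj().T @ x))   # ~ 1e-15
```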
Remark 6 (Notation). The notation "$E_{t'c}$" for the transfer matrices (instead of something like "$E_{t'tc'c}$" listing all parameters) is justified since the parent $t \in \mathcal{T}_\mathcal{I}$ is uniquely determined by $t' \in \mathcal{T}_\mathcal{I}$ due to the tree structure, and the direction $c' = \mathrm{sd}_t(c)$ is uniquely determined by $c \in D_t$ due to Definition 2.

We can now define the class of matrices that is the subject of this article: since the leaves $\mathcal{L}_{\mathcal{I}\times\mathcal{I}}$ of the block tree correspond to a partition of the matrix $G$, we have to represent each of the submatrices $G|_{\hat t\times\hat s}$ for $b = (t,s,c) \in \mathcal{L}_{\mathcal{I}\times\mathcal{I}}$. Those blocks that satisfy the admissibility conditions (3) can be approximated in the form (4). These blocks are called admissible and collected in a subset
$$\mathcal{L}^+_{\mathcal{I}\times\mathcal{I}} := \{ b \in \mathcal{L}_{\mathcal{I}\times\mathcal{I}} : b \text{ is admissible} \}.$$
The remaining blocks are called inadmissible and collected in the set $\mathcal{L}^-_{\mathcal{I}\times\mathcal{I}} := \mathcal{L}_{\mathcal{I}\times\mathcal{I}} \setminus \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}$. The corresponding submatrices are stored as simple two-dimensional arrays without any compression.

Definition 7 (Directional H²-matrix). Let V and W be directional cluster bases for $\mathcal{T}_\mathcal{I}$, and let $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$ be a matrix. We call it a directional H²-matrix (or just a DH²-matrix) if there is a family $S = (S_b)_{b\in\mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}}$ such that
$$G|_{\hat t\times\hat s} = V_{tc} S_b W_{sc}^* \qquad \text{for all } b = (t,s,c) \in \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}. \qquad (9)$$
The elements of the family S are called coupling matrices. V is called the row cluster basis and W is called the column cluster basis.

A DH²-matrix representation of a DH²-matrix G consists of V, W, S, and the family $(G|_{\hat b})_{b\in\mathcal{L}^-_{\mathcal{I}\times\mathcal{I}}}$ of nearfield matrices corresponding to the inadmissible leaves of $\mathcal{T}_{\mathcal{I}\times\mathcal{I}}$.

Under typical assumptions, including that k is fixed independently of κ, since the interpolation error does not depend on κ, it is possible to prove that a DH²-matrix requires only $O(nk + \kappa^2 k^2 \log(n))$ units of storage [3, Section 5].

3 Compression

3.1 Compression of general matrices

Before we address the recompression of a DH²-matrix, we briefly recall the compression algorithm for general matrices described in [3].

Let $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$. We want to approximate the matrix by an orthogonal projection, since this guarantees optimal stability and best-approximation properties with respect to certain norms.

We call a matrix $X \in \mathbb{C}^{\mathcal{I}\times\mathcal{K}}$ isometric if $X^* X = I$ holds. If X is isometric, $XX^*$ is the orthogonal projection onto the range of X, i.e., it maps a vector $y \in \mathbb{C}^\mathcal{I}$ onto its best approximation $\tilde y := XX^* y$ in this space, and the stability estimate $\|\tilde y\| \leq \|y\|$ holds.

We call the cluster bases $(V_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ and $(W_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ orthogonal if all matrices are isometric, i.e., if
$$V_{tc}^* V_{tc} = I, \qquad W_{tc}^* W_{tc} = I \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ c \in D_t.$$
In this case, the optimal coupling matrices with respect to the Frobenius norm (and almost optimal with respect to the spectral norm) can be computed by orthogonal projection using
$$G|_{\hat t\times\hat s} \approx V_{tc} V_{tc}^*\, G|_{\hat t\times\hat s}\, W_{sc} W_{sc}^* = V_{tc} S_b W_{sc}^* \qquad \text{with } S_b := V_{tc}^*\, G|_{\hat t\times\hat s}\, W_{sc}.$$
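A small numerical illustration of this projection, with randomly generated isometric stand-ins for the bases, could look as follows.

```python
import numpy as np

# Hedged sketch: for isometric V, W, the coupling matrix of the orthogonal
# projection is S_b = V^* G W, the Frobenius-optimal choice for these bases.
rng = np.random.default_rng(4)
nt, ns, k = 40, 30, 6
V, _ = np.linalg.qr(rng.normal(size=(nt, k)) + 1j * rng.normal(size=(nt, k)))
W, _ = np.linalg.qr(rng.normal(size=(ns, k)) + 1j * rng.normal(size=(ns, k)))
# a matrix that is nearly in the span of V and W, plus small noise
G = V @ (rng.normal(size=(k, k)) @ W.conj().T) + 1e-6 * rng.normal(size=(nt, ns))

S = V.conj().T @ G @ W                   # optimal coupling matrix
print(np.linalg.norm(G - V @ S @ W.conj().T))   # of the order of the noise
```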
Due to
$$\|G|_{\hat t\times\hat s} - V_{tc} S_b W_{sc}^*\|_F^2 = \|G|_{\hat t\times\hat s} - V_{tc} V_{tc}^* G|_{\hat t\times\hat s}\|_F^2 + \|V_{tc} V_{tc}^* (G|_{\hat t\times\hat s} - G|_{\hat t\times\hat s} W_{sc} W_{sc}^*)\|_F^2$$
$$\leq \|G|_{\hat t\times\hat s} - V_{tc} V_{tc}^* G|_{\hat t\times\hat s}\|_F^2 + \|G|_{\hat t\times\hat s}^* - W_{sc} W_{sc}^* G|_{\hat t\times\hat s}^*\|_F^2,$$
we can focus on the construction of a good row cluster basis, since a good column cluster basis can be obtained by applying the same procedure to the adjoint matrix.

By Definition 7, the matrix $V_{tc}$ has to be able to approximate the ranges of all matrices $G|_{\hat t\times\hat s}$ with $(t,s,c) \in \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}$. We collect the corresponding column clusters in the set
$$\mathrm{row}(t,c) := \{ s \in \mathcal{T}_\mathcal{I} : (t,s,c) \in \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}} \}.$$

We also have to take the nested structure of the cluster basis into account. Let $t \in \mathcal{T}_\mathcal{I}$ with $\mathrm{chil}(t) \neq \emptyset$. For the sake of simplicity, we focus on the case $|\mathrm{chil}(t)| = 2$ and $\mathrm{chil}(t) = \{t_1, t_2\}$. Assume that isometric matrices $V_{t_1 c_1}$ and $V_{t_2 c_2}$ with $c_1 = \mathrm{sd}_t(c)$ and $c_2 = \mathrm{sd}_t(c)$ have already been computed. Due to (8), we have
$$V_{tc} = \begin{pmatrix} V_{t_1 c_1} & \\ & V_{t_2 c_2} \end{pmatrix} \widehat V_{tc} \qquad \text{with } \widehat V_{tc} := \begin{pmatrix} E_{t_1 c} \\ E_{t_2 c} \end{pmatrix}. \qquad (10)$$
The error of the orthogonal projection takes the form
$$\|G|_{\hat t\times\hat s} - V_{tc} V_{tc}^* G|_{\hat t\times\hat s}\|_F^2 = \left\| G|_{\hat t\times\hat s} - \begin{pmatrix} V_{t_1c_1} & \\ & V_{t_2c_2} \end{pmatrix} \widehat V_{tc} \widehat V_{tc}^* \begin{pmatrix} V_{t_1c_1}^* & \\ & V_{t_2c_2}^* \end{pmatrix} G|_{\hat t\times\hat s} \right\|_F^2.$$
Since $V_{t_1c_1}$ and $V_{t_2c_2}$ are assumed to be isometric, Pythagoras' identity yields
$$\|G|_{\hat t\times\hat s} - V_{tc}V_{tc}^* G|_{\hat t\times\hat s}\|_F^2 = \|G|_{\hat t_1\times\hat s} - V_{t_1c_1}V_{t_1c_1}^* G|_{\hat t_1\times\hat s}\|_F^2 + \|G|_{\hat t_2\times\hat s} - V_{t_2c_2}V_{t_2c_2}^* G|_{\hat t_2\times\hat s}\|_F^2 + \left\| (I - \widehat V_{tc}\widehat V_{tc}^*) \begin{pmatrix} V_{t_1c_1}^* G|_{\hat t_1\times\hat s} \\ V_{t_2c_2}^* G|_{\hat t_2\times\hat s} \end{pmatrix} \right\|_F^2. \qquad (11)$$
We can see that the projection error for the cluster t depends on the projection errors for its children $t_1$ and $t_2$. Using a straightforward induction, we find that all descendants of t contribute to the error.

This means that our algorithm has to take all ancestors of a cluster t into account when it constructs $V_{tc}$. We collect these ancestors and the corresponding directions in the sets
$$\mathrm{anc}(t,c) := \begin{cases} \{(t,c)\} & \text{if } t \text{ is the root of } \mathcal{T}_\mathcal{I}, \\ \{(t,c)\} \cup \bigcup_{c^+ \in \mathrm{sd}_{t^+}^{-1}(\{c\})} \mathrm{anc}(t^+, c^+) & \text{if } t \text{ is a child of } t^+ \in \mathcal{T}_\mathcal{I} \end{cases} \qquad (12)$$
for all $t \in \mathcal{T}_\mathcal{I}$ and $c \in D_t$. The mapping $\mathrm{sd}_{t^+}$ is not invertible, since the parent cluster frequently is associated with more directions than its child. This is the reason there can be multiple $c^+ \in D_{t^+}$ with $\mathrm{sd}_{t^+}(c^+) = c$ in (12).

We have to find $V_{tc}$ such that
$$G|_{\hat t\times\hat s} \approx V_{tc} V_{tc}^* G|_{\hat t\times\hat s} \qquad \text{for all } s \in \mathrm{row}(\tilde t, \tilde c),\ (\tilde t, \tilde c) \in \mathrm{anc}(t,c). \qquad (13)$$
Due to Definition 4, the index sets of the clusters in
$$\mathrm{row}^+(t,c) := \bigcup_{(\tilde t,\tilde c)\in\mathrm{anc}(t,c)} \mathrm{row}(\tilde t,\tilde c)$$
are disjoint, and we can introduce
$$\mathcal{R}_{tc} := \bigcup_{s\in\mathrm{row}^+(t,c)} \hat s, \qquad G_{tc} := G|_{\hat t\times\mathcal{R}_{tc}} \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ c \in D_t$$
in order to rewrite (13) in the form $G_{tc} \approx V_{tc} V_{tc}^* G_{tc}$.
If t is a leaf cluster, we can directly find the optimal approximation by computing the SVD of $G_{tc}$ and using the first k left singular vectors as the columns of the matrix $V_{tc}$: the SVD yields an orthonormal basis $(v_i)_{i=1}^\tau$ of left singular vectors, an orthonormal basis $(u_i)_{i=1}^\tau$ of right singular vectors, and ordered singular values $\sigma_1 \geq \cdots \geq \sigma_\tau \geq 0$ with
$$G_{tc} = \sum_{i=1}^\tau v_i \sigma_i u_i^*,$$
where $\tau = |\hat t|$. Using this notation, it is an easy task to find the lowest rank $k \in [0:\tau]$ such that the approximation
$$\widetilde G_{tc} := \sum_{i=1}^k v_i \sigma_i u_i^* = V_{tc} V_{tc}^* G_{tc}, \qquad V_{tc} := \begin{pmatrix} v_1 & \cdots & v_k \end{pmatrix},$$
still ensures the desired error bound [2, Lemma 5.19]. For the sake of simplicity, we consider only the Frobenius norm case; spectral norm and relative error bounds are available with slight adaptations in the choice of k [9, Theorem 2.5.3].

If t is not a leaf cluster, (11) indicates that we have to look for $\widehat V_{tc}$ such that $\widehat G_{tc} \approx \widehat V_{tc} \widehat V_{tc}^* \widehat G_{tc}$ with the matrix
$$\widehat G_{tc} := \begin{pmatrix} V_{t_1c_1}^* & \\ & V_{t_2c_2}^* \end{pmatrix} G_{tc} \qquad (14)$$
containing the coefficients for the approximation of $G_{tc}$ in the children's bases. This task can again be solved by computing the SVD of $\widehat G_{tc}$, which takes only $O(k^2 |\mathcal{R}_{tc}|)$ operations, and the transfer matrices $E_{t_1c}$ and $E_{t_2c}$ can be obtained from $\widehat V_{tc}$ by definition (10).
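A hedged sketch of the truncation step, choosing the smallest rank that keeps the Frobenius-norm error of the discarded singular values below an absolute tolerance, could look as follows; this simple tolerance strategy is one possibility in the spirit of [2, Lemma 5.19].

```python
import numpy as np

def truncate(G, eps):
    """Smallest rank k with sqrt(sum of discarded sigma_i^2) <= eps."""
    U, sigma, Vh = np.linalg.svd(G, full_matrices=False)
    # tail[i] = Frobenius error when keeping the first i singular vectors
    tail = np.sqrt(np.cumsum(sigma[::-1] ** 2)[::-1])
    ranks = np.nonzero(tail <= eps)[0]
    k = ranks[0] if ranks.size else sigma.size
    return U[:, :k]                      # new (isometric) basis matrix

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 70))   # rank-8 matrix
Q = truncate(A, 1e-10)
print(Q.shape[1], np.linalg.norm(A - Q @ (Q.conj().T @ A)))   # 8, tiny error
```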
3.2 DH²-recompression

The algorithm presented in the previous section has quadratic complexity if we assume that the numerical ranks k are uniformly bounded, cf. [3, Theorem 17], since it starts with a dense matrix G represented explicitly by $n^2$ coefficients. This means that the algorithm is only of theoretical interest, i.e., for investigating whether a given matrix can be approximated at all, but not attractive for real applications with large numbers of degrees of freedom.

In the case of the Helmholtz equation, it has already been proven [4] that directional interpolation provides us with a DH²-matrix approximation, although the rank of this approximation may be larger than necessary. Our task is therefore only to recompress an already compressed DH²-matrix; we do not have to start from scratch. If we can arrange the algorithm in a way that avoids creating the entire original approximation, we can obtain nearly optimal storage requirements without the need of excessive storage for intermediate results.

Our first step is to take advantage of the DH²-matrix structure to reduce the complexity of our algorithm. We assume that the original matrix is described by cluster bases $(V_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$, $(W_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ and coupling matrices $(S_b)_{b\in\mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}}$ such that
$$G|_{\hat t\times\hat s} = V_{tc} S_b W_{sc}^* \qquad \text{for all } b = (t,s,c) \in \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}.$$
We denote the transfer matrices of the cluster basis $(V_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ by $E_{t'c} \in \mathbb{C}^{k\times k}$ for $t \in \mathcal{T}_\mathcal{I}$, $t' \in \mathrm{chil}(t)$, $c \in D_t$, and $c' = \mathrm{sd}_t(c)$.

Our goal is to find improved row and column cluster bases for the matrix G by the algorithm outlined in Section 3.1, but to take advantage of the DH²-matrix structure of G to reduce the computational work.

We have already seen that we only have to describe an algorithm for computing row bases, since applying this algorithm to the adjoint matrix $G^*$ will yield a column basis, as well. We call the improved row basis $(Q_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$, the adaptively chosen rank of $Q_{tc}$ is called $k_{tc}$, and the transfer matrices are called $(F_{t'c})_{t\in\mathcal{T}_\mathcal{I},t'\in\mathrm{chil}(t),c\in D_t}$.

Unlike in Section 3.1, the matrices $V_{tc}$ and $W_{tc}$ are no longer isometric, and ensuring reliable error control requires the use of suitable weight matrices. Since we assume that $V_{tc}$ and $W_{tc}$ result from directional interpolation of constant order, we know that all of these matrices have a fixed number k of columns.

Our approach to speeding up the algorithm of Section 3.1 is to obtain a factorized low-rank representation of the matrices $G_{tc}$ required by the compression algorithm that allows us to efficiently compute an improved basis. In particular, we will prove that there are $k\times k$ matrices $\widehat Z_{tc}$ for all $t \in \mathcal{T}_\mathcal{I}$, $c \in D_t$ such that
$$G_{tc} = V_{tc} \widehat Z_{tc}^* P_{tc}^* \qquad (15)$$
holds with an isometric matrix $P_{tc} \in \mathbb{C}^{\mathcal{R}_{tc}\times k}$. Since $P_{tc}$ is isometric, it does not influence the left singular vectors or the non-zero singular values, so we can replace $G_{tc}$ by the skinny matrix $V_{tc} \widehat Z_{tc}^*$ in the compression algorithm of Section 3.1 and construct $Q_{tc}$ from the left singular vectors of this smaller matrix without changing the result.

Let $t \in \mathcal{T}_\mathcal{I}$, $c \in D_t$. For the moment, we assume that t is not the root of the cluster tree, i.e., that it has a parent $t^+ \in \mathcal{T}_\mathcal{I}$ with $t \in \mathrm{chil}(t^+)$. We assume that the matrices $\widehat Z_{t^+c^+}$ have already been computed for all directions $c^+ \in D_{t^+}$ with $\mathrm{sd}_{t^+}(c^+) = c$, i.e., for all $c^+ \in \mathrm{sd}_{t^+}^{-1}(\{c\})$. Let $\gamma := |\mathrm{sd}_{t^+}^{-1}(\{c\})|$ denote the number of directions in $D_{t^+}$ that get mapped to c, and enumerate these directions as $\mathrm{sd}_{t^+}^{-1}(\{c\}) = \{c_1^+, \ldots, c_\gamma^+\}$. Due to definition (12), we have
$$\mathrm{anc}(t,c) = \{(t,c)\} \cup \bigcup_{\iota=1}^\gamma \mathrm{anc}(t^+, c_\iota^+).$$
We let $\sigma := |\mathrm{row}(t,c)|$ and $\mathrm{row}(t,c) = \{s_1, \ldots, s_\sigma\}$.

Let now $s \in \mathrm{row}^+(t,c)$. By definition, we can either have $s \in \mathrm{row}(t,c)$, or there is a $\iota \in [1:\gamma]$ such that $s \in \mathrm{row}^+(t^+, c_\iota^+)$. In the first case, we have $G|_{\hat t\times\hat s} = V_{tc} S_b W_{sc}^*$, and we collect all of these matrices in an auxiliary matrix
$$H_{tc} := \begin{pmatrix} V_{tc} S_{ts_1c} W_{s_1c}^* & \cdots & V_{tc} S_{ts_\sigma c} W_{s_\sigma c}^* \end{pmatrix} = V_{tc} \underbrace{\begin{pmatrix} S_{ts_1c} W_{s_1c}^* & \cdots & S_{ts_\sigma c} W_{s_\sigma c}^* \end{pmatrix}}_{=:\, Y_{tc}}.$$
The matrix $Y_{tc}$ has too many columns for a practical algorithm, so we use the orthogonalization algorithm [2, Algorithm 16], with a straightforward generalization, to find $k\times k$ matrices $R_{W,s_ic}$ and isometric matrices $P_{W,s_ic}$ with
$$W_{s_ic} = P_{W,s_ic} R_{W,s_ic} \qquad \text{for all } i \in [1:\sigma] \qquad (16)$$
and obtain
$$Y_{tc} = \underbrace{\begin{pmatrix} S_{ts_1c} R_{W,s_1c}^* & \cdots & S_{ts_\sigma c} R_{W,s_\sigma c}^* \end{pmatrix}}_{=:\, \widehat Y_{tc}} \underbrace{\begin{pmatrix} P_{W,s_1c} & & \\ & \ddots & \\ & & P_{W,s_\sigma c} \end{pmatrix}^*}_{=:\, P_{Y,tc}^*} = \widehat Y_{tc} P_{Y,tc}^*.$$
The matrix $\widehat Y_{tc}$ is now sufficiently small, and the isometric matrix $P_{Y,tc}$ can later be subsumed in $P_{tc}$.

In the second case, i.e., if $s \in \mathrm{row}^+(t^+, c_\iota^+)$, we have $G|_{\hat t\times\hat s} = G_{t^+c_\iota^+}|_{\hat t\times\hat s}$.
Combining both cases yields
$$G_{tc} = \begin{pmatrix} H_{tc} & G_{t^+c_1^+}|_{\hat t\times\mathcal{R}_{t^+c_1^+}} & \cdots & G_{t^+c_\gamma^+}|_{\hat t\times\mathcal{R}_{t^+c_\gamma^+}} \end{pmatrix}.$$
Using (15) for the parent matrices $G_{t^+c_1^+}, \ldots, G_{t^+c_\gamma^+}$ and applying (8) yields
$$G_{tc} = \begin{pmatrix} H_{tc} & V_{t^+c_1^+}|_{\hat t\times k} \widehat Z_{t^+c_1^+}^* P_{t^+c_1^+}^* & \cdots & V_{t^+c_\gamma^+}|_{\hat t\times k} \widehat Z_{t^+c_\gamma^+}^* P_{t^+c_\gamma^+}^* \end{pmatrix}$$
$$= V_{tc} \begin{pmatrix} \widehat Y_{tc} P_{Y,tc}^* & E_{tc_1^+} \widehat Z_{t^+c_1^+}^* P_{t^+c_1^+}^* & \cdots & E_{tc_\gamma^+} \widehat Z_{t^+c_\gamma^+}^* P_{t^+c_\gamma^+}^* \end{pmatrix}$$
$$= V_{tc} \underbrace{\begin{pmatrix} \widehat Y_{tc} & E_{tc_1^+} \widehat Z_{t^+c_1^+}^* & \cdots & E_{tc_\gamma^+} \widehat Z_{t^+c_\gamma^+}^* \end{pmatrix}}_{=:\, Z_{tc}} \begin{pmatrix} P_{Y,tc} & & & \\ & P_{t^+c_1^+} & & \\ & & \ddots & \\ & & & P_{t^+c_\gamma^+} \end{pmatrix}^*.$$
We compute a skinny QR factorization $\widehat P_{tc} \widehat Z_{tc} = Z_{tc}^*$ and find
$$G_{tc} = V_{tc} \widehat Z_{tc}^* P_{tc}^* \qquad \text{with } P_{tc} := \begin{pmatrix} P_{Y,tc} & & & \\ & P_{t^+c_1^+} & & \\ & & \ddots & \\ & & & P_{t^+c_\gamma^+} \end{pmatrix} \widehat P_{tc}.$$
As a product of two isometric matrices, $P_{tc}$ is again isometric, and since $Z_{tc}$ has only k rows, $\widehat Z_{tc}$ is a $k\times k$ matrix. It is important to note that we do not need the matrices $P_{t^+c_\iota^+}$ to compute $\widehat Z_{tc}$: we can carry out the entire algorithm without storing any of the isometric matrices.

If t is the root cluster, it has no parent $t^+$, but we can still proceed as before by setting $\gamma = 0$, i.e., without contributions inherited from the ancestors.

Once we have the total weight matrices $\widehat Z_{tc} \in \mathbb{C}^{k\times k}$ at our disposal, we can consider the construction of the basis. Since $V_{tc}$ is already the name of the original basis, we use $Q_{tc}$ for the new one. The transfer matrices for $Q_{tc}$ are denoted by $F_{t'c}$.

If t is a leaf, we have to compute the left singular vectors and singular values of the matrix $G_{tc} = V_{tc} \widehat Z_{tc}^* P_{tc}^*$, and this is equivalent to computing these quantities only for the thin matrix $V_{tc} \widehat Z_{tc}^*$. We choose a rank $k_{tc}$ for the new basis and use the first $k_{tc}$ left singular vectors as columns of the new basis matrix $Q_{tc} \in \mathbb{C}^{\hat t\times k_{tc}}$.

If t is not a leaf, we assume again $\mathrm{chil}(t) = \{t_1, t_2\}$, let $c_1 := \mathrm{sd}_t(c)$, $c_2 := \mathrm{sd}_t(c)$, and have to compute the left singular vectors and singular values of the matrix
$$\widehat G_{tc} = \begin{pmatrix} Q_{t_1c_1}^* & \\ & Q_{t_2c_2}^* \end{pmatrix} G_{tc} = \begin{pmatrix} Q_{t_1c_1}^* & \\ & Q_{t_2c_2}^* \end{pmatrix} \begin{pmatrix} V_{t_1c_1} E_{t_1c} \\ V_{t_2c_2} E_{t_2c} \end{pmatrix} \widehat Z_{tc}^* P_{tc}^* = \begin{pmatrix} Q_{t_1c_1}^* V_{t_1c_1} E_{t_1c} \\ Q_{t_2c_2}^* V_{t_2c_2} E_{t_2c} \end{pmatrix} \widehat Z_{tc}^* P_{tc}^*.$$
In order to prepare this matrix efficiently, we introduce the matrices
$$C_{tc} := Q_{tc}^* V_{tc} \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ c \in D_t, \qquad (17)$$
that describe the change of basis from $V_{tc}$ to $Q_{tc}$. With these matrices, we have
$$\widehat G_{tc} = \underbrace{\begin{pmatrix} C_{t_1c_1} E_{t_1c} \\ C_{t_2c_2} E_{t_2c} \end{pmatrix}}_{=:\, \widehat V_{tc}} \widehat Z_{tc}^* P_{tc}^*$$
and only have to compute the SVD of $\widehat V_{tc} \widehat Z_{tc}^*$, choose a rank $k_{tc}$, and use the first $k_{tc}$ left singular vectors as columns of the matrix $\widehat Q_{tc}$ that can be split into
$$\widehat Q_{tc} = \begin{pmatrix} F_{t_1c} \\ F_{t_2c} \end{pmatrix}$$
to obtain the transfer matrices for the new cluster basis. In this case, we can use $C_{tc} = \widehat Q_{tc}^* \widehat V_{tc}$ to compute the basis-change matrix efficiently.
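The following Python sketch condenses the weight computation for a single cluster and direction: the local part $\widehat Y_{tc}$ and the inherited parts $E_{tc^+}\widehat Z_{t^+c^+}^*$ are concatenated and compressed by a skinny QR factorization, mirroring the derivation above. All inputs are random $k\times k$ stand-ins, and, as noted, the isometric factors are never needed explicitly.

```python
import numpy as np

rng = np.random.default_rng(6)
k = 5
sigma, gamma = 3, 2                     # |row(t,c)| and |sd^{-1}({c})|

S_list = [rng.normal(size=(k, k)) for _ in range(sigma)]   # coupling S_{t s_i c}
R_list = [rng.normal(size=(k, k)) for _ in range(sigma)]   # weights R_{W,s_i c}
E_list = [rng.normal(size=(k, k)) for _ in range(gamma)]   # transfer E_{t c_i^+}
Zp_list = [rng.normal(size=(k, k)) for _ in range(gamma)]  # parent weights

# Z_tc = ( hat{Y}_tc  E Zp^* ... ), a k x k(sigma+gamma) matrix
Yhat = np.hstack([S @ R.conj().T for S, R in zip(S_list, R_list)])
Z = np.hstack([Yhat] + [E @ Zp.conj().T for E, Zp in zip(E_list, Zp_list)])

# skinny QR of Z^*: hat{P} hat{Z} = Z^*, with hat{Z} of size k x k
Phat, Zhat = np.linalg.qr(Z.conj().T)
# Z = Zhat^* Phat^*, so Zhat is the k-by-k total weight of this cluster
print(Zhat.shape, np.linalg.norm(Z - Zhat.conj().T @ Phat.conj().T))
```

The cost per cluster and direction is dominated by a QR factorization of a matrix with k columns, which is what drives the complexity bounds of the following section.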
In order to analyze the complexity of the new algorithms, we follow the approach of [3, Section 5]: for the sake of simplicity, we assume that all bounding boxes on the same level are identical up to translation. We also assume that the cluster tree is geometrically regular and that the surface Ω is two-dimensional, i.e., that there are constants $C_{cp}, C_{nc}, C_{cu}, C_{cl}, C_{ov}, C_{lf}, C_{rs}, C_{un} \in \mathbb{R}_{>0}$ such that
$$\mathrm{diam}(B_t) \leq C_{cp}\, \mathrm{diam}(B_{t'}) \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},\ t' \in \mathrm{chil}(t),$$
$$|\mathrm{chil}(t)| \leq C_{nc}, \quad |\mathrm{chil}(t)| \neq 1 \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},$$
$$|\Omega \cap B(x,r)| \leq C_{cu}\, r^2 \qquad \text{for all } x \in \mathbb{R}^3,\ r \in \mathbb{R}_{\geq 0},$$
$$\mathrm{diam}^2(B_t) \leq C_{cl}\, |B_t \cap \Omega| \qquad \text{for all } t \in \mathcal{T}_\mathcal{I},$$
$$|\{ t \in \mathcal{T}^{(\ell)}_\mathcal{I} : x \in B_t \}| \leq C_{ov} \qquad \text{for all } x \in \Omega,\ \ell \in [0:p_\mathcal{I}],$$
$$C_{lf}\, \kappa\, \mathrm{diam}(B_t) \leq 1 \qquad \text{for all } t \in \mathcal{L}_\mathcal{I},$$
$$C_{rs}^{-1} k \leq |\hat t| \leq C_{rs} k \qquad \text{for all leaves } t \in \mathcal{L}_\mathcal{I},$$
$$\eta_2\, \mathrm{dist}(B_t, B_s) < \mathrm{diam}(B_t) \ \Rightarrow\ |\hat s| \leq C_{un} |\hat t| \qquad \text{for all } t \in \mathcal{L}_\mathcal{I},\ s \in \mathcal{T}_\mathcal{I} \text{ with } \mathrm{level}(t) = \mathrm{level}(s).$$
The constant $C_{cp}$ ensures that child clusters do not decrease in size too quickly, while $C_{nc}$ provides an upper bound for the number of children. $C_{cu}$ and $C_{cl}$ measure how "convoluted" the surface Ω is, and $C_{ov}$ describes the overlap of clusters. $C_{lf}$ and $C_{rs}$ ensure that the leaves are small enough compared to the wavelength, and $C_{un}$ can be interpreted as a quasi-uniformity condition for neighbouring leaf clusters. Additionally, we assume that the number of directions associated with a cluster is bounded, i.e., that there is a constant $C_{di} \in \mathbb{R}_{>0}$ with
$$|D_t| \leq C_{di} (1 + \kappa^2\, \mathrm{diam}^2(B_t)) \qquad \text{for all } t \in \mathcal{T}_\mathcal{I}. \qquad (18)$$
If the directions are constructed as in Remark 3, this condition is satisfied.

According to [3, Lemma 8], there is a sparsity constant $C_{sp} \in \mathbb{R}_{>0}$ such that
$$\sum_{c\in D_t} |\mathrm{row}(t,c)| \leq \begin{cases} C_{sp} & \text{if } \kappa\, \mathrm{diam}(B_t) < 1, \\ C_{sp}\, \kappa^2\, \mathrm{diam}^2(B_t) & \text{otherwise} \end{cases} \qquad (19)$$
holds for all $t \in \mathcal{T}_\mathcal{I}$. We introduce the short notation
$$C_{sp,t} := \begin{cases} C_{sp} & \text{if } \kappa\, \mathrm{diam}(B_t) < 1, \\ C_{sp}\, \kappa^2\, \mathrm{diam}^2(B_t) & \text{otherwise} \end{cases} \qquad \text{for all } t \in \mathcal{T}_\mathcal{I}.$$
According to [3, Lemma 9], there is a constant $C_{lv} \in \mathbb{R}_{>0}$ such that
$$|\mathcal{T}^{(\ell)}_\mathcal{I}| \leq C_{lv} \frac{|\Omega|}{\mathrm{diam}^2(B_t)} \qquad \text{for all } \ell \in [0:p_\mathcal{I}],\ t \in \mathcal{T}^{(\ell)}_\mathcal{I}, \qquad (20a)$$
$$|\mathcal{T}_\mathcal{I}| \leq C_{lv} \frac{|\mathcal{I}|}{k}. \qquad (20b)$$
The estimates (19) and (20) give rise to the following fundamental result.
Lemma 8 (Block and cluster sums). There are constants $C_{bs}, C_{cs} \in \mathbb{R}_{>0}$ with
$$\sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} |\mathrm{row}(t,c)| \leq C_{bs} \left( |\mathcal{T}_\mathcal{I}| + C_{lv} (p_\mathcal{I}+1) \kappa^2 \right), \qquad (21a)$$
$$\sum_{t\in\mathcal{T}_\mathcal{I}} |D_t| \leq C_{cs} \left( |\mathcal{T}_\mathcal{I}| + C_{lv} (p_\mathcal{I}+1) \kappa^2 \right). \qquad (21b)$$

Proof. Combining (19) and (20a) yields
$$\sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} |\mathrm{row}(t,c)| \leq \sum_{\substack{t\in\mathcal{T}_\mathcal{I} \\ \kappa\,\mathrm{diam}(B_t)<1}} C_{sp} + \sum_{\substack{t\in\mathcal{T}_\mathcal{I} \\ \kappa\,\mathrm{diam}(B_t)\geq 1}} C_{sp}\, \kappa^2\, \mathrm{diam}^2(B_t)$$
$$\leq C_{sp} |\mathcal{T}_\mathcal{I}| + \sum_{\ell=0}^{p_\mathcal{I}} \sum_{t\in\mathcal{T}^{(\ell)}_\mathcal{I}} C_{sp}\, \kappa^2\, \mathrm{diam}^2(B_t) \leq C_{sp} |\mathcal{T}_\mathcal{I}| + \sum_{\ell=0}^{p_\mathcal{I}} C_{lv} \frac{|\Omega|}{\mathrm{diam}^2(B_t)}\, C_{sp}\, \kappa^2\, \mathrm{diam}^2(B_t)$$
$$\leq C_{sp} |\mathcal{T}_\mathcal{I}| + C_{lv} C_{sp} |\Omega| (p_\mathcal{I}+1) \kappa^2,$$
and we obtain (21a) by choosing $C_{bs} := \max\{C_{sp}, C_{sp}|\Omega|\}$.

For the second estimate, we combine (18) with (20a) to find
$$\sum_{t\in\mathcal{T}_\mathcal{I}} |D_t| \leq C_{di} \sum_{t\in\mathcal{T}_\mathcal{I}} (1 + \kappa^2 \mathrm{diam}^2(B_t)) = C_{di} |\mathcal{T}_\mathcal{I}| + C_{di} \sum_{\ell=0}^{p_\mathcal{I}} \sum_{t\in\mathcal{T}^{(\ell)}_\mathcal{I}} \kappa^2 \mathrm{diam}^2(B_t)$$
$$\leq C_{di} |\mathcal{T}_\mathcal{I}| + C_{di} \sum_{\ell=0}^{p_\mathcal{I}} C_{lv} \frac{|\Omega|}{\mathrm{diam}^2(B_t)}\, \kappa^2\, \mathrm{diam}^2(B_t) = C_{di} |\mathcal{T}_\mathcal{I}| + C_{di} (p_\mathcal{I}+1) C_{lv} |\Omega| \kappa^2,$$
and we can obtain (21b) by choosing $C_{cs} := \max\{C_{di}, C_{di}|\Omega|\}$. □

To establish an estimate for the complexity, we need to bound the work for the QR factorization as well as for the SVD. We assume that there are constants $C_{qr}, C_{svd}$ such that the work for computing the QR factorization and the SVD of a matrix $A \in \mathbb{C}^{m\times n}$ up to machine accuracy is bounded by
$$C_{qr}\, mn \min\{m,n\}, \qquad (22a)$$
$$C_{svd}\, mn \min\{m,n\}, \qquad (22b)$$
respectively.

Now we can consider the complexity of the different phases of the recompression algorithm. We first have to compute the basis weight matrices $R_{W,tc}$ for the original cluster basis $(W_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$.
Lemma 9 (Basis weights). There is a constant $C_{bw} \in \mathbb{R}_{>0}$ such that computing the basis weights $(R_{W,tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$, cf. (16), requires not more than
$$C_{bw}\, k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right)$$
operations.

Proof. Using [2, Algorithm 16], adapted for multiple directions per cluster, this task takes $(C_{qr}+2)k^3$ operations per cluster and direction, and Lemma 8 together with (20b) yields
$$\sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} (C_{qr}+2) k^3 \leq (C_{qr}+2) k^3\, C_{cs} (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2) \leq C_{cs} C_{lv} (C_{qr}+2) k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right).$$
We let $C_{bw} := C_{cs} C_{lv} (C_{qr}+2)$ to complete the proof. □

The second step is to compute the total weight matrices $\widehat Z_{tc}$ for the original cluster basis $(V_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$.

Lemma 10 (Total weights). There is a constant $C_{we} \in \mathbb{R}_{>0}$ such that computing the total weights $(\widehat Z_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ requires not more than
$$C_{we}\, k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right)$$
operations.

Proof. Let $t \in \mathcal{T}_\mathcal{I}$ and $c \in D_t$. We have to set up the matrix $Z_{tc}$. For all $s \in \mathrm{row}(t,c)$, this means computing the product $S_{tsc} R_{W,sc}^*$, which takes not more than $2k^3$ operations. If there is a corresponding parent cluster $t^+$, we also have to compute the product of the transfer matrix $E_{tc^+}$ and the parent's weight $\widehat Z_{t^+c^+}$ for all $c^+ \in \mathrm{sd}_{t^+}^{-1}(\{c\})$, which takes not more than $2k^3$ operations per product. We denote the number of columns of $Z_{tc}$ by $m := k(|\mathrm{sd}_{t^+}^{-1}(\{c\})| + |\mathrm{row}(t,c)|)$ and have shown that $2mk^2$ operations are needed to set up this matrix. Now follows a QR factorization of the matrix $Z_{tc}^* \in \mathbb{C}^{m\times k}$, which requires not more than $C_{qr}\, mk \min\{m,k\} \leq C_{qr}\, mk^2$ operations. In consequence, the complexity for the whole cluster tree is bounded by
$$\sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} (C_{qr}+2)\, mk^2 = (C_{qr}+2)\, k^3 \sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} \left( |\mathrm{sd}_{t^+}^{-1}(\{c\})| + |\mathrm{row}(t,c)| \right).$$
Since $\mathrm{sd}_{t^+}$ maps every direction $c^+ \in D_{t^+}$ to a direction $c = \mathrm{sd}_{t^+}(c^+) \in D_t$, we have
$$D_{t^+} = \bigcup_{c\in D_t} \mathrm{sd}_{t^+}^{-1}(\{c\}), \qquad |D_{t^+}| = \sum_{c\in D_t} |\mathrm{sd}_{t^+}^{-1}(\{c\})|,$$
and can use Lemma 8 (and the convention $D_{t^+} = \emptyset$ if t is the root) to find the bound
$$(C_{qr}+2) k^3 \sum_{t\in\mathcal{T}_\mathcal{I}} \left( |D_{t^+}| + \sum_{c\in D_t} |\mathrm{row}(t,c)| \right) \leq (C_{qr}+2) k^3 \sum_{t^+\in\mathcal{T}_\mathcal{I}} C_{nc} |D_{t^+}| + (C_{qr}+2) k^3 \sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} |\mathrm{row}(t,c)|$$
$$\leq (C_{qr}+2)\, C_{nc}\, k^3\, C_{cs} (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2) + (C_{qr}+2)\, k^3\, C_{bs} (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2)$$
$$= (C_{qr}+2)(C_{nc} C_{cs} + C_{bs})\, k^3\, (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2).$$
We can use (20b) to complete the proof with $C_{we} := (C_{qr}+2)(C_{nc}C_{cs}+C_{bs})\, C_{lv}$. □

Now we can address the construction of the improved cluster basis.
Lemma 11 (Truncation). There is a constant $C_{tr} \in \mathbb{R}_{>0}$ such that computing the improved cluster basis $(Q_{tc})_{t\in\mathcal{T}_\mathcal{I},c\in D_t}$ requires not more than
$$C_{tr}\, k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right)$$
operations.

Proof. Let $t \in \mathcal{T}_\mathcal{I}$ and $c \in D_t$. If t is a leaf, $V_{tc} \in \mathbb{C}^{m\times k}$ with $m := |\hat t|$ is used directly. We compute the product $V_{tc}\widehat Z_{tc}^* \in \mathbb{C}^{m\times k}$ in not more than $2mk^2$ operations, its SVD in not more than $C_{svd}\, mk\min\{m,k\} \leq C_{svd}\, mk^2$ operations, and the basis-change matrix $C_{tc}$ in not more than $2mk^2$ operations. Due to our assumptions, we have $m = |\hat t| \leq C_{rs} k$, and the number of operations for leaf clusters is bounded by $(C_{svd}+4)\, C_{rs}\, k^3$.

If t is not a leaf, we compute the product of the transfer matrix $E_{t'c}$ and the already calculated basis-change matrix $C_{t'c'}$ for every child $t' \in \mathrm{chil}(t)$ and $c' = \mathrm{sd}_t(c)$, and the resulting matrix $\widehat V_{tc}$ has $m := \sum_{t'\in\mathrm{chil}(t)} k_{t'c'}$ rows and k columns. Computing all products takes not more than
$$\sum_{t'\in\mathrm{chil}(t)} 2\, k_{t'c'}\, k^2 = 2mk^2$$
operations. Now the matrix $\widehat V_{tc}\widehat Z_{tc}^*$ has to be computed, which takes not more than $2mk^2$ operations. Due to (22b), its SVD can be computed in $C_{svd}\, mk\min\{m,k\} \leq C_{svd}\, mk^2$ operations. Finally, the basis-change matrix $C_{tc}$ can be computed in not more than $2mk^2$ operations. Due to our assumptions, $|\mathrm{chil}(t)| \leq C_{nc}$ holds and we have $m \leq C_{nc} k$, so the number of operations for non-leaf clusters is bounded by $(C_{svd}+6)\, C_{nc}\, k^3$.

Finding the correct ranks $k_{tc}$ requires the inspection of m singular values and can be accomplished in $O(k)$ operations, so we can conclude that there is a constant C such that not more than $Ck^3$ operations are required per cluster $t \in \mathcal{T}_\mathcal{I}$ and direction $c \in D_t$. The total number of operations is bounded by
$$\sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} Ck^3 = Ck^3 \sum_{t\in\mathcal{T}_\mathcal{I}} |D_t| \leq C\, C_{cs}\, k^3\, (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2)$$
due to (21b), and (20b) completes the proof. □

The only thing left is the calculation of the new coupling matrices, but this is a simple matrix multiplication of the old coupling matrices with the basis-change matrices (17).
Lemma 12 (Projections). There is a constant $C_{pr} \in \mathbb{R}_{>0}$ such that computing the new coupling matrices $(\widetilde S_b)_{b\in\mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}}$ requires not more than
$$C_{pr}\, k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right)$$
operations.

Proof. Computing the products $T_b := C_{tc} S_b$ and $\widetilde S_b := T_b C_{sc}^*$ requires not more than $4k^3$ operations for each block $b \in \mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}$. Due to (21a), the total number of operations is bounded by
$$\sum_{b\in\mathcal{L}^+_{\mathcal{I}\times\mathcal{I}}} 4k^3 \leq 4k^3 \sum_{t\in\mathcal{T}_\mathcal{I}} \sum_{c\in D_t} |\mathrm{row}(t,c)| \leq 4\, C_{bs}\, k^3\, (|\mathcal{T}_\mathcal{I}| + C_{lv}(p_\mathcal{I}+1)\kappa^2),$$
and we can use (20b) to obtain our estimate with $C_{pr} := 4\, C_{bs}\, C_{lv}$. □

For the complete recompression, we have to compute the basis and total weights for the row and the column cluster basis, we have to truncate both bases, and we have to apply the projection to obtain the improved DH²-matrix representation.
Theorem 13 (Complexity). Let a DH²-matrix be given. The entire recompression algorithm requires not more than
$$(2C_{bw} + 2C_{we} + 2C_{tr} + C_{pr})\, k^3 \left( \frac{|\mathcal{I}|}{k} + (p_\mathcal{I}+1)\kappa^2 \right)$$
operations.

Proof. The proof follows by simply adding the estimates provided by the previous lemmas. □
Remark 14 (Complexity). Let $n := |\mathcal{I}|$ denote the matrix dimension. Since $C_{rs}$ should be bounded independently of n, we have to expect $p_\mathcal{I} \sim \log n$. If the wave number κ is constant, the first term in the estimate of Theorem 13 is dominant and the recompression algorithm requires $O(nk^2)$ operations. In the high-frequency case, we have $\kappa^2 \sim n$, the second term is dominant, and the recompression algorithm requires $O(nk^3 \log n)$ operations.

4 Numerical experiments

As an example, we use the three-dimensional unit sphere and cube. For the cube, we simply represent each face by two triangles that are then regularly refined. For the sphere, we start with the double pyramid
$$P = \{ x \in \mathbb{R}^3 : |x_1| + |x_2| + |x_3| = 1 \},$$
refine every one of its eight faces regularly, and shift the resulting vertices to the unit sphere. For constructing the Galerkin stiffness matrix $G \in \mathbb{C}^{\mathcal{I}\times\mathcal{I}}$, we use piecewise constant basis functions and Sauter-Erichsen-Schwab quadrature of order $n_q = 5$ [21, 8] for triangles that share a vertex, an edge, or are identical, and otherwise Gauß quadrature with Duffy transformation of order $n_q = 3$ [6].

As clustering strategy, standard binary space partitioning is applied until clusters contain not more than 32 elements. We use η₁ = 10 for creating the directions (3a) and η₂ = 1 for the standard (3b) and the parabolic admissibility condition (3c). The initial DH²-matrix approximation is constructed by directional interpolation of order m = 4, so the initial rank is $k = 4^3 = 64$. We choose the wave number κ in such a way that the product κh with the mesh width h stays approximately constant, and we prescribe a fixed tolerance ε for the block-wise relative Frobenius norm.

We have parallelized the algorithm using the OpenMP standard in order to take advantage of multicore shared-memory systems: since the basis weights (16) are computed by a bottom-up recursion, all weights within one level of the tree can be computed in parallel. Since the total weights (15) are computed by a top-down recursion, we can also perform all operations within the same level in parallel. Finally, the construction of the adaptive cluster bases is again a bottom-up procedure and therefore accessible to the same approach. On a server with two Intel® Xeon® Platinum 8160 processors with a total of 48 cores, directional interpolation for 131 072 triangles takes 203 seconds with the parallel implementation and 5 543 seconds without, a speedup of 27, while recompression takes 1 475 seconds with the parallel version and 38 676 seconds without, a speedup of 26. We have observed that the speedup improves as the problem size grows.

Table 1 shows our results for the single layer potential on the unit cube. The first column gives the number n of degrees of freedom, the second the wave number κ. The third and fourth columns show the storage per degree of freedom in KB (1 KB is 1 024 bytes) for the original cluster basis and the original DH²-matrix; the fifth, sixth, and seventh columns give the rank, the storage for the basis, and the storage for the matrix after the recompression. The last two columns give the absolute and relative errors between the DH²-matrix and its recompressed version, measured in the Frobenius norm.

Table 1: Single layer potential operator on the cube (Frobenius norm). [Numeric entries are not legible in this copy; n ranges from 2 352 to 307 200 with κ from 3 to 40.]
Similar results are obtained for the double layer potential; they are presented with the same structure in Table 2.

Table 2: Double layer potential operator on the cube (Frobenius norm). [Numeric entries are not legible in this copy; n ranges from 2 352 to 307 200 with κ from 3 to 40.]

To outline the results, Figure 1 shows the memory requirements per degree of freedom as a function of n for the single layer (c) and the double layer potential (a). In both cases, the recompressed version needs less storage from the beginning of our experiment, and the memory advantage improves with n. The panels (b) and (d) in Figure 1 show the runtime per degree of freedom for the recompression of the DH²-matrix obtained by interpolation. Since we use a logarithmic scale for n and a linear scale for the time divided by n, the experiment confirms our expectation of $O(n \log n)$ complexity.

Figure 1: Memory and time per degree of freedom for the cube. [Panels: (a) memory per degree of freedom, DLP, interpolation vs. recompression; (b) setup time per degree of freedom, DLP; (c) memory per degree of freedom, SLP, interpolation vs. recompression; (d) setup time per degree of freedom, SLP.]

Even for higher wave numbers and other norms, the algorithm keeps this behavior:
Table 3 shows results for doubled wave numbers on the unit sphere, where the error control strategy for the recompression uses the spectral norm instead of the Frobenius norm. The last entry could not be computed due to storage limitations.

Table 3: Single layer potential operator on the sphere (spectral norm). [Numeric entries are not legible in this copy; n ranges from 2 048 to 131 072 with κ from 8 to 64.]
Next we consider a more realistic geometry: a mesh of an airplane, more precisely a Boeing 747, comprised of 549 836 triangles and 274 920 vertices, provided by courtesy of Boris Dilba. We use the wave number κ = 3.15, corresponding to a wavelength of approximately 2.
The maximal extent of the airplane is approximately 62, i.e., approximately 31 wavelengths. A picture of our object of study is shown in Figure 2.

We have modified our recompression algorithm such that it can be applied during the set-up process to reduce intermediate storage requirements: the cluster basis is orthogonalized immediately, and the coupling matrices are constructed on the fly when needed during the recompression algorithm. With the modified algorithm, we are able to set up the DH²-matrix with linear basis functions for the airplane mesh and obtain the results given in Table 4, where we have varied both the recompression tolerance ε and the interpolation order m. We also report run-times (in hours) measured on our shared-memory system.
Table 4: Boeing 747 with single layer potential (direct recompression). [Columns: ε, m, time, rank, basis and matrix storage, absolute and relative error; numeric entries are not legible in this copy.]

We have measured the relative error between the dense, i.e., uncompressed, matrix and the recompressed approximation in the Frobenius norm. To put these results in perspective, the dense matrix takes about 4 296 KB per degree of freedom. Compared to interpolation, our recompression algorithm reduces the storage requirements for the cluster basis from 194 KB to below 2 KB for m = 4 and from 2 137 KB to below 5 KB for m = 6. For the coupling matrices, we save similar amounts of storage: we go from 2 322 KB to about 83 KB for m = 4 and from 25 833 KB to about 138 KB for m = 6. The measured relative Frobenius error is always well below the prescribed tolerance.

We can see that DH²-recompression is absolutely crucial in order to turn the initial approximation constructed by directional interpolation into a practically useful representation, saving approximately 99% of storage while keeping the measured error below the prescribed tolerance.

Figure 2: Mesh of a Boeing 747.

References

[1] M. Bebendorf, C. Kuske, and R. Venn. Wideband nested cross approximation for Helmholtz problems. Numer. Math., 130(1):1–34, 2015.
[2] S. Börm. Efficient Numerical Methods for Non-local Operators: H²-Matrix Compression, Algorithms and Analysis, volume 14 of EMS Tracts in Mathematics. EMS, 2010.

[3] S. Börm. Directional H²-matrix compression for high-frequency problems. Numer. Lin. Alg. Appl., 24(6), 2017. Available online at http://dx.doi.org/10.1002/nla.2112.

[4] S. Börm and J. M. Melenk. Approximation of the high-frequency Helmholtz kernel by nested directional interpolation: error analysis. Numer. Math., 137(1):1–34, 2017. Available online at http://dx.doi.org/10.1007/s00211-017-0873-y.

[5] A. Brandt. Multilevel computations of integral transforms and particle interactions with oscillatory kernels. Comp. Phys. Comm., 65(1–3):24–38, 1991.

[6] M. G. Duffy. Quadrature over a pyramid or cube of integrands with a singularity at a vertex. SIAM J. Numer. Anal., 19(6):1260–1262, 1982.

[7] B. Engquist and L. Ying. Fast directional multilevel algorithms for oscillatory kernels. SIAM J. Sci. Comput., 29(4):1710–1737, 2007.

[8] S. Erichsen and S. A. Sauter. Efficient automatic quadrature in 3-d Galerkin BEM. Comput. Meth. Appl. Mech. Eng., 157:215–224, 1998.

[9] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, London, 1996.

[10] L. Grasedyck and W. Hackbusch. Construction and arithmetics of H-matrices. Computing, 70:295–334, 2003.

[11] L. Greengard, J. Huang, V. Rokhlin, and S. Wandzura. Accelerating fast multipole methods for the Helmholtz equation at low frequencies. IEEE Comp. Sci. Eng., 5(3):32–38, 1998.

[12] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comp. Phys., 73:325–348, 1987.

[13] W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62(2):89–108, 1999.

[14] W. Hackbusch and B. N. Khoromskij. A sparse matrix arithmetic based on H-matrices. Part II: Application to multi-dimensional problems. Computing, 64:21–47, 2000.

[15] W. Hackbusch and Z. P. Nowak. On the fast matrix multiplication in the boundary element method by panel clustering. Numer. Math., 54(4):463–491, 1989.

[16] M. Messner, M. Schanz, and E. Darve. Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation. J. Comp. Phys., 231(4):1175–1196, 2012.

[17] E. Michielssen and A. Boag. A multilevel matrix decomposition algorithm for analyzing scattering from large structures. IEEE Trans. Antennas and Propagation, AP-44:1086–1093, 1996.

[18] V. Rokhlin. Rapid solution of integral equations of classical potential theory. J. Comp. Phys., 60:187–207, 1985.

[19] V. Rokhlin. Diagonal forms of translation operators for the Helmholtz equation in three dimensions. Appl. Comp. Harm. Anal., 1:82–93, 1993.

[20] S. A. Sauter. Variable order panel clustering. Computing, 64:223–261, 2000.

[21] S. A. Sauter and C. Schwab. Boundary Element Methods, volume 39 of Springer Series in Computational Mathematics. Springer, Berlin, 2011.