A direct solver with O(N) complexity for variable coefficient elliptic PDEs discretized via a high-order composite spectral collocation method
A. Gillman (Department of Mathematics, Dartmouth College) and P.G. Martinsson (Department of Applied Mathematics, University of Colorado at Boulder)
Abstract:
A numerical method for solving elliptic PDEs with variable coefficients on two-dimensional domains is presented. The method is based on high-order composite spectral approximations and is designed for problems with smooth solutions. The resulting system of linear equations is solved using a direct (as opposed to iterative) solver that has optimal O(N) complexity for all stages of the computation when applied to problems with non-oscillatory solutions such as the Laplace and the Stokes equations. Numerical examples demonstrate that the scheme is capable of computing solutions with relative accuracy of 10^{-10} or better, even for challenging problems such as highly oscillatory Helmholtz problems and convection-dominated convection-diffusion equations. In terms of speed, it is demonstrated that a problem with a non-oscillatory solution that was discretized using 10^8 nodes was solved in 115 minutes on a personal work-station with two quad-core 3.3 GHz CPUs. Since the solver is direct, and the “solution operator” fits in RAM, any solves beyond the first are very fast; in the example with 10^8 unknowns, each additional solve requires only 30 seconds.

1. Introduction
1.1. Problem formulation.
The paper describes a numerical method with optimal O(N) complexity for solving boundary value problems of the form

(1)  A u(x) = 0,  x ∈ Ω;   u(x) = f(x),  x ∈ Γ,

where Ω is a rectangle in the plane with boundary Γ, and where A is a coercive elliptic partial differential operator

(2)  [A u](x) = −c_{11}(x)[∂_1^2 u](x) − 2 c_{12}(x)[∂_1 ∂_2 u](x) − c_{22}(x)[∂_2^2 u](x) + c_1(x)[∂_1 u](x) + c_2(x)[∂_2 u](x) + c(x) u(x).

The methodology is based on a high-order composite spectral discretization and can be modified to handle a range of different domains, including curved ones. For problems with smooth solutions, we demonstrate that the method can easily produce answers with ten or more correct digits.

The proposed method is based on a direct solver which in a single sweep constructs an approximation to the solution operator of (1). This gives the solver several advantages over established linear complexity methods based on iterative solvers (e.g. GMRES or multigrid); perhaps most importantly, the new method can solve problems for which iterative methods converge slowly or not at all. The direct solver has O(N) complexity for all stages of the computation. A key feature is that once the solution operator has been built, solves can be executed extremely rapidly, making the scheme excel when solving a sequence of equations with the same operator but different boundary data.

1.2. Outline of solution procedure.
The method in this paper is comprised of three steps:

(1) The domain is first tessellated into a hierarchical tree of rectangular patches. For each patch on the finest level, a local “solution operator” is built using a dense brute force calculation. The “solution operator” will be defined in Section 1.3; for now we simply note that it encodes all information about the patch that is required to evaluate interactions with other patches.

(2) The larger patches are processed in an upwards pass through the tree, where each parent can be processed once its children have been processed. The processing of a parent node consists of forming its solution operator by “gluing together” the solution operators of its children.

(3) Once the solution operators for all patches have been computed, a solution to the PDE can be computed via a downwards pass through the tree. This step is typically very fast.

1.3. Local solution operators.
The “local solution operators” introduced in Section 1.2 take the form of discrete approximations to the Dirichlet-to-Neumann, or “DtN,” maps. To explain what these maps do, first observe that for a given boundary function f, the BVP (1) has a unique solution u (recall that we assume A to be coercive). For x ∈ Γ, let g(x) = u_n(x) denote the normal derivative in the outwards direction of u at x. The process for constructing the function g from f is linear; we write it as

g = T f.
Or, equivalently, T : u|_Γ ↦ u_n|_Γ, where u satisfies Au = 0 in Ω. From a mathematical perspective, the map T is a slightly unpleasant object; it is a hyper-singular integral operator whose kernel exhibits complicated behavior near the corners of Γ. A key observation is that in the present context, these difficulties can be ignored since we limit attention to functions that are smooth. In a sense, we only need to accurately represent the projection of the “true” operator T onto a space of smooth functions (that in particular do not have any corner singularities). Concretely, given a square box Ω^τ we represent a boundary potential u|_Γ and a boundary flux u_n|_Γ via tabulation at a set of r tabulation points on each side. (For a leaf box, we use r Gaussian nodes.) The DtN operator T^τ is then represented simply as a dense matrix T^τ of size 4r × 4r that maps tabulated boundary potentials to the corresponding tabulated boundary fluxes.

1.4. Computational complexity.
A straight-forward implementation of the direct solver outlined in Sections 1.2 and 1.3, in which all solution operators T^τ are treated as general dense matrices, has asymptotic complexity O(N^{1.5}) for the “build stage” where the solution operators are constructed, and O(N log N) complexity for the “solve stage” where the solution operator is applied for a given set of boundary data [16]. This paper demonstrates that by exploiting internal structure in these operators, they can be stored and manipulated efficiently, resulting in optimal O(N) overall complexity.

To be precise, the internal structure exploited is that the off-diagonal blocks of the dense solution operators can to high accuracy be approximated by matrices of low rank. This property is a result of the fact that for a patch Ω^τ, the matrix T^τ is a discrete approximation of the continuum DtN operator T^τ, which is an integral operator whose kernel is smooth away from the diagonal.

Remark 1.1.
The proposed solver can with slight modifications be applied to non-coercive problems such as the Helmholtz equation. If the equation is kept fixed while N is increased, O(N) complexity is retained. However, in the context of elliptic problems with oscillatory solutions, it is common to scale N to the wave-length so that the number of discretization points per wave-length is fixed as N increases. Our accelerated technique will in this situation lead to a practical speed-up, but will have the same O(N^{1.5}) asymptotic scaling as the basic method that does not use fast operator algebra.

1.5. Prior work.
The direct solver outlined in Section 1.2 is an evolution of a sequence of direct solvers for integral equations dating back to [17] and later [9, 8, 10, 4, 3]. The common idea is to build a global solution operator by splitting the domain into a hierarchical tree of patches, build a local solution operator for each “leaf” patch, and then build solution operators for larger patches via a hierarchical merge procedure in a sweep over the tree from smaller to larger patches. In the context of integral equations, the “solution operator” is a type of scattering matrix, while in the present context, the solution operator is a DtN operator.
The direct solvers [17, 9, 8, 10, 4], designed for dense systems, are conceptually related to earlier work on direct solvers for sparse systems arising from finite difference and finite element discretizations of elliptic PDEs, such as the classical nested dissection method of George [6, 11] and the multifrontal methods by Duff and others [5]. These techniques typically require O(N^{1.5}) operations to construct the LU-factorization of a sparse coefficient matrix arising from the discretization of an elliptic PDE on a planar domain, with the dominant cost being the formation of Schur complements and LU-factorizations of dense matrices of size up to O(N^{0.5}) × O(N^{0.5}). It was demonstrated in the last several years that these dense matrices have internal structure that allows the direct solver to be accelerated to linear or close to linear complexity, see, e.g., [22, 7, 13, 14, 19]. These accelerated nested dissection methods are closely related to the fast direct solver presented in this manuscript. An important difference is that the method in the present paper allows high order discretizations to be used without increasing the cost of the direct solver. To be technical, the solvers in [22, 7, 13, 14, 19] are based on an underlying finite difference or finite element discretization. High order discretization in this context leads to large frontal matrices (since the “dividers” that partition the grid have to be wide), and consequently very high cost of the LU-factorization.

Our discretization scheme is related to earlier work on spectral collocation methods on composite (“multi-domain”) grids, such as, e.g., [12, 23], and in particular Pfeiffer et al [18]. For a detailed review of the similarities and differences, see [16].

An O(N^{1.5}) complexity version of the direct solver described in this paper was presented in [16], which in turn is based on [15].
In addition to the improvement in complexity, this paper describes a new representation of the local solution operators that leads to cleaner implementation of the direct solvers and allows greater flexibility in executing the leaf computation, see Remark 3.1.

1.6. Outline of paper.
Section 2 introduces the mesh of Gaussian nodes that forms our basic computational grid. Sections 3, 4, and 5 describe a relatively simple direct solver with O(N^{1.5}) complexity. Sections 6, 7, and 8 describe how to improve the asymptotic complexity of the direct solver from O(N^{1.5}) to O(N) by exploiting internal structure in certain dense matrices. Section 9 describes numerical examples and Section 10 summarizes the key findings.

2. Discretization
Partition the domain Ω into a collection of square (or possibly rectangular) boxes, called leaf boxes. On the edges of each leaf, place q Gaussian interpolation points. The size of the leaf boxes, and the parameter q, should be chosen so that any potential solution u of (1), as well as its first and second derivatives, can be accurately interpolated from their values at these points (q = 21 is often a good choice). Let {x_k}_{k=1}^N denote the collection of interpolation points on all boundaries.

Next construct a binary tree on the collection of leaf boxes by hierarchically merging them, making sure that all boxes on the same level are roughly of the same size, cf. Figure 1. The boxes should be ordered so that if τ is a parent of a box σ, then τ < σ. We also assume that the root of the tree (i.e. the full box Ω) has index τ = 1. We let Ω^τ denote the domain associated with box τ.

With each box τ, we define two index vectors I^τ_i and I^τ_e as follows:

I^τ_e  A list of all exterior nodes of τ. In other words, k ∈ I^τ_e iff x_k lies on the boundary of Ω^τ.

I^τ_i  For a parent τ, I^τ_i is a list of all its interior nodes that are not interior nodes of its children. For a leaf τ, I^τ_i is empty.

Let u ∈ R^N denote a vector holding approximations to the values of the solution u of (1), in other words,

u(k) ≈ u(x_k).
Figure 1. The square domain Ω is split into 4 × 4 leaf boxes, which are then gathered into a binary tree of successively larger boxes; if box τ is the parent of box σ, then τ < σ.

Finally, let v ∈ R^N denote a vector holding approximations to the boundary fluxes of the solution u of (1), in other words,

v(k) ≈ ∂_2 u(x_k) when x_k lies on a horizontal edge, and v(k) ≈ ∂_1 u(x_k) when x_k lies on a vertical edge.

Note that v(k) represents an outgoing flux on certain boxes and an incoming flux on others. This is a deliberate choice to avoid problems with signs when matching fluxes of touching boxes.

3. Constructing the Dirichlet-to-Neumann map for a leaf
This section describes a spectral method for computing a discrete approximation to the DtN map T^τ associated with a leaf box Ω^τ. In other words, if u is a solution of (1), we seek a matrix T^τ of size 4q × 4q such that

(3)  v(I^τ_e) ≈ T^τ u(I^τ_e).

Conceptually, we proceed as follows: Given a vector u(I^τ_e) of potential values tabulated on the boundary of Ω^τ, form for each side the unique polynomial of degree at most q − 1 that interpolates the q specified values of u. This yields Dirichlet boundary data on Ω^τ in the form of four polynomials. Solve the restriction of (1) to Ω^τ for the specified boundary data using a spectral method on a local tensor product grid of q × q Chebyshev nodes.
The vector v(I^τ_e) is obtained by spectral differentiation of the local solution, and then re-tabulating the boundary fluxes to the Gaussian nodes in {x_k}_{k ∈ I^τ_e}. We give details of the construction in Section 3.2, but as a preliminary step, we first review a classical spectral collocation method for the local solve in Section 3.1.

Remark 3.1.
Chebyshev nodes are ideal for the leaf computations, and it is in principle also possible to use Chebyshev nodes to represent all boundary-to-boundary “solution operators” such as, e.g., T^τ (indeed, this was the approach taken in the first implementation of the proposed method [16]). However, there are at least two substantial benefits to using Gaussian nodes that justify the trouble to retabulate the operators. First, the procedure for merging boundary operators defined for neighboring boxes is much cleaner and involves less bookkeeping since the Gaussian nodes do not include the corner nodes. (Contrast Section 4 of [16] with Section 4 of the present paper.) Second, and more importantly, the use of the Gaussian nodes allows for interpolation between different discretizations. Thus the method can easily be extended to have local refinement when necessary, see Remark 5.2.

3.1. Spectral discretization.
Let Ω^τ denote a rectangular subset of Ω with boundary Γ^τ, and consider the local Dirichlet problem

(4)  [A u](x) = 0,  x ∈ Ω^τ,
(5)  u(x) = h(x),  x ∈ Γ^τ,

where the elliptic operator A is defined by (2).

Figure 2. Notation for the leaf computation in Section 3. (a) A leaf before elimination of interior (white) nodes. (b) A leaf after elimination of interior nodes.

We will construct an approximate solution to (4) using a classical spectral collocation method described in, e.g., Trefethen [21]: First, pick a small integer p and let {z_k}_{k=1}^{p^2} denote the nodes in a tensor product grid of p × p Chebyshev nodes on Ω^τ. Let D^{(1)} and D^{(2)} denote spectral differentiation matrices corresponding to the operators ∂/∂x_1 and ∂/∂x_2, respectively. The operator (2) is then locally approximated via the p^2 × p^2 matrix

(6)  A = −C_{11} (D^{(1)})^2 − 2 C_{12} D^{(1)} D^{(2)} − C_{22} (D^{(2)})^2 + C_1 D^{(1)} + C_2 D^{(2)} + C,

where C_{11} is the diagonal matrix with diagonal entries {c_{11}(z_k)}_{k=1}^{p^2}, and the other matrices C_{12}, C_{22}, C_1, C_2, C are defined analogously.

Let w ∈ R^{p^2} denote a vector holding the desired approximate solution of (4). We populate all entries corresponding to boundary nodes with the Dirichlet data from h, and then enforce a spectral collocation condition at the interior nodes. To formalize, partition the index set {1, 2, ..., p^2} = J_e ∪ J_i in such a way that J_e contains the 4(p − 1) nodes on the boundary of Ω^τ, and J_i denotes the set of (p − 2)^2 interior nodes, see Figure 2(a). Then partition the vector w into two parts corresponding to interior and exterior nodes via

w_i = w(J_i),   w_e = w(J_e).

Analogously, partition A into four parts via

A_{i,i} = A(J_i, J_i),  A_{i,e} = A(J_i, J_e),  A_{e,i} = A(J_e, J_i),  A_{e,e} = A(J_e, J_e).

The potential at the exterior nodes is given directly by the boundary condition: w_e = [h(z_k)]_{k ∈ J_e}. For the interior nodes, we enforce the PDE (4) via direct collocation:

(7)  A_{i,i} w_i + A_{i,e} w_e = 0.

Solving (7) for w_i, we find

(8)  w_i = −A_{i,i}^{-1} A_{i,e} w_e.
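To make the local solve concrete, the following is a minimal numpy sketch of (6)-(8) for the Laplace special case of (2) (c_{11} = c_{22} = 1, all other coefficients zero) on the reference box [−1, 1]^2, using the standard Chebyshev differentiation matrix. The grid size p = 12 and the harmonic test function are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def cheb(p):
    # 1D Chebyshev differentiation matrix and nodes on [-1, 1]
    # (classical construction, cf. Trefethen's "Spectral Methods in MATLAB").
    n = p - 1
    x = np.cos(np.pi * np.arange(p) / n)
    c = np.ones(p)
    c[0] = c[-1] = 2.0
    c = c * (-1.0) ** np.arange(p)
    D = np.outer(c, 1.0 / c) / (x[:, None] - x[None, :] + np.eye(p))
    D = D - np.diag(D.sum(axis=1))   # negative-sum trick for the diagonal
    return D, x

p = 12
D, x = cheb(p)
E = np.eye(p)
D1 = np.kron(D, E)   # differentiation in x1 (first tensor index)
D2 = np.kron(E, D)   # differentiation in x2 (second tensor index)

# Laplace case of (6): c11 = c22 = 1, all other coefficients zero.
A = -(D1 @ D1) - (D2 @ D2)

# Partition the p^2 grid indices into boundary (J_e) and interior (J_i) sets.
i1, i2 = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
on_bdry = (i1.ravel() % (p - 1) == 0) | (i2.ravel() % (p - 1) == 0)
Je = np.where(on_bdry)[0]
Ji = np.where(~on_bdry)[0]

# Test on a harmonic polynomial: u = x1^2 - x2^2.
X1, X2 = np.meshgrid(x, x, indexing="ij")
u = (X1**2 - X2**2).ravel()
w_e = u[Je]                                   # Dirichlet data on the boundary
# eq. (8): w_i = -A_{i,i}^{-1} A_{i,e} w_e
w_i = -np.linalg.solve(A[np.ix_(Ji, Ji)], A[np.ix_(Ji, Je)] @ w_e)
```

Since the test function is a low-degree polynomial, the spectral solve reproduces it at the interior nodes to essentially machine precision.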
3.2. Constructing the approximate DtN. Now that we know how to approximately solve the local Dirichlet problem (4) via a local spectral method, we can build a matrix T^τ such that (3) holds to high accuracy. The starting point is a vector u(I^τ_e) ∈ R^{4q} of tabulated potential values on the boundary of Ω^τ. We will construct the vector v(I^τ_e) ∈ R^{4q} via four linear maps. The combination of these maps is the matrix T^τ. We henceforth assume that the spectral order of the local Chebyshev grid matches the order of the tabulation on the leaf boundaries, so that p = q.

Step 1 — re-tabulation from Gaussian nodes to Chebyshev nodes:
For each side of Ω^τ, form the unique interpolating polynomial of degree at most q − 1 that interpolates the q potential values on that side specified by u(I^τ_e). Now evaluate these polynomials at the boundary nodes of a q × q Chebyshev grid on Ω^τ. Observe that for a corner node, we may in the general case get conflicts. For instance, the potential at the south-west corner may get one value from extrapolation of potential values on the south border, and one value from extrapolation of the potential values on the west border. We resolve such conflicts by assigning the corner node the average of the two possibly different values. (In practice, essentially no error occurs since we know that the vector u(I^τ_e) tabulates an underlying function that is continuous at the corner.)

Step 2 — spectral solve:
Step 1 populates the boundary nodes of the q × q Chebyshev grid with Dirichlet data. Now determine the potential at all interior points on the Chebyshev grid by executing a local spectral solve, cf. equation (8).
Step 3 — spectral differentiation:
After Step 2, the potential is known at all nodes on the local Chebyshev grid. Now perform spectral differentiation to evaluate approximations to ∂u/∂x_2 for the Chebyshev nodes on the two horizontal sides, and ∂u/∂x_1 for the Chebyshev nodes on the two vertical sides.

Step 4 — re-tabulation from the Chebyshev nodes back to Gaussian nodes:
After Step 3, the boundary fluxes on ∂Ω^τ are specified by four polynomials of degree q − 1 (one per side). Re-tabulate these polynomials at the Gaussian nodes on each side to obtain the vector v(I^τ_e).

Putting everything together, we find that the matrix T^τ is given as a product of four matrices,

T^τ = L_4 L_3 L_2 L_1,

where L_i is the linear transform corresponding to “Step i” above. Observe that many of these transforms are far from dense; for instance, L_1 and L_4 act independently on each of the four sides and are therefore block diagonal.
Remark 3.2. The grid of Chebyshev nodes {z_k}_{k=1}^{p^2} introduced in Section 3.1 is only used for the local computation. In the final solver, there is no need to store potential values at these grid points; they are used merely for constructing the matrix T^τ.

4. Merging two DtN maps
Let τ denote a box in the tree with children α and β. In this section, we demonstrate that if the DtN matrices T^α and T^β for the children are known, then the DtN matrix T^τ can be constructed via a purely local computation which we refer to as a “merge” operation.

We start by introducing some notation: Let Ω^τ denote a box with children Ω^α and Ω^β. For concreteness, let us assume that Ω^α and Ω^β share a vertical edge as shown in Figure 3, so that Ω^τ = Ω^α ∪ Ω^β. We partition the points on ∂Ω^α and ∂Ω^β into three sets:
Figure 3. Notation for the merge operation described in Section 4. The rectangular domain Ω^τ is formed by two squares Ω^α and Ω^β. The sets J_1 and J_2 form the exterior nodes (black), while J_3 consists of the interior nodes (white).

J_1  Boundary nodes of Ω^α that are not boundary nodes of Ω^β.
J_2  Boundary nodes of Ω^β that are not boundary nodes of Ω^α.
J_3  Boundary nodes of both Ω^α and Ω^β that are not boundary nodes of the union box Ω^τ.

Figure 3 illustrates the definitions of the J_k’s. Let u denote a solution to (1), with tabulated potential values u and boundary fluxes v, as described in Section 2. Set

(9)  u_i = u_3,   and   u_e = [u_1; u_2].

Recall that T^α and T^β denote the operators that map values of the potential u on the boundary to values of ∂_n u on the boundaries of the boxes Ω^α and Ω^β, as described in Section 3. The operators can be partitioned according to the numbering of nodes in Figure 3, resulting in the equations

(10)  [v_1; v_3] = [T^α_{1,1}, T^α_{1,3}; T^α_{3,1}, T^α_{3,3}] [u_1; u_3],   and   [v_2; v_3] = [T^β_{2,2}, T^β_{2,3}; T^β_{3,2}, T^β_{3,3}] [u_2; u_3].

Our objective is now to construct a solution operator S^τ and a DtN matrix T^τ such that

(11)  u_3 = S^τ [u_1; u_2],
(12)  [v_1; v_2] = T^τ [u_1; u_2].

To this end, we write (10) as a single equation:

(13)  [T^α_{1,1}, 0, T^α_{1,3}; 0, T^β_{2,2}, T^β_{2,3}; T^α_{3,1}, −T^β_{3,2}, T^α_{3,3} − T^β_{3,3}] [u_1; u_2; u_3] = [v_1; v_2; 0].

The last equation directly tells us that (11) holds with

(14)  S^τ = (T^α_{3,3} − T^β_{3,3})^{-1} [−T^α_{3,1} | T^β_{3,2}].

By eliminating u_3 from (13) by forming a Schur complement, we also find that (12) holds with

(15)  T^τ = [T^α_{1,1}, 0; 0, T^β_{2,2}] + [T^α_{1,3}; T^β_{2,3}] (T^α_{3,3} − T^β_{3,3})^{-1} [−T^α_{3,1} | T^β_{3,2}].
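The merge formulas (14) and (15) involve nothing but dense linear algebra on the blocks of T^α and T^β. The following numpy sketch checks their internal consistency on randomly generated stand-in blocks (no actual PDE is involved; the block sizes n1, n2, n3 are arbitrary): for given boundary data (u_1, u_2), the interior values u_3 = S^τ [u_1; u_2] make both children predict the same flux v_3 on the shared edge, and T^τ reproduces (v_1, v_2).

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 5, 6, 4   # sizes of the index sets J_1, J_2, J_3 (illustrative)

# Random partitioned stand-ins for the child DtN matrices T^alpha and T^beta.
Ta = rng.standard_normal((n1 + n3, n1 + n3))
Tb = rng.standard_normal((n2 + n3, n2 + n3))
Ta11, Ta13 = Ta[:n1, :n1], Ta[:n1, n1:]
Ta31, Ta33 = Ta[n1:, :n1], Ta[n1:, n1:]
Tb22, Tb23 = Tb[:n2, :n2], Tb[:n2, n2:]
Tb32, Tb33 = Tb[n2:, :n2], Tb[n2:, n2:]

# Merge, eqs. (14)-(15).
S = np.linalg.solve(Ta33 - Tb33, np.hstack([-Ta31, Tb32]))
T = np.block([[Ta11, np.zeros((n1, n2))],
              [np.zeros((n2, n1)), Tb22]]) + np.vstack([Ta13, Tb23]) @ S

# Consistency check: u3 = S [u1; u2] equalizes the fluxes v3 predicted
# by the two children, per the third row of (13).
u1, u2 = rng.standard_normal(n1), rng.standard_normal(n2)
u3 = S @ np.concatenate([u1, u2])
v3_alpha = Ta31 @ u1 + Ta33 @ u3
v3_beta = Tb32 @ u2 + Tb33 @ u3
# Fluxes on the exterior nodes, per (10).
v1 = Ta11 @ u1 + Ta13 @ u3
v2 = Tb22 @ u2 + Tb23 @ u3
```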
5. The full hierarchical scheme

At this point, we know how to construct the DtN operator for a leaf (Section 3), and how to merge two such operators of neighboring patches to form the DtN operator of their union (Section 4). We are ready to describe the full hierarchical scheme for solving the Dirichlet problem (1). This scheme takes the Dirichlet boundary data f, and constructs an approximation to the solution u. The output is a vector u that tabulates approximations to u at the Gaussian nodes {x_k}_{k=1}^N on all interior edges that were defined in Section 2. To find u at an arbitrary set of target points in Ω, a post-processing step described in Section 5.3 can be used.

5.1. The algorithm.
Partition the domain into a hierarchical tree as described in Section 2. Then execute a “build stage” in which we construct for each box τ the following two matrices:

S^τ  For a parent box τ, S^τ is a solution operator that maps values of u on ∂Ω^τ to values of u at the interior nodes. In other words, u(I^τ_i) = S^τ u(I^τ_e). (For a leaf τ, S^τ is not defined.)

T^τ  The matrix that maps u(I^τ_e) (tabulating values of u on ∂Ω^τ) to v(I^τ_e) (tabulating values of du/dn). In other words, v(I^τ_e) = T^τ u(I^τ_e).

(Recall that the index vectors I^τ_e and I^τ_i were defined in Section 2.) The build stage consists of a single sweep over all nodes in the tree. Any bottom-up ordering in which any parent box is processed after its children can be used. For each leaf box τ, an approximation to the local DtN map T^τ is constructed using the procedure described in Section 3. For a parent box τ with children σ_1 and σ_2, the matrices S^τ and T^τ are formed from the DtN operators T^{σ_1} and T^{σ_2} via the process described in Section 4. Algorithm 1 summarizes the build stage.

Once all the matrices {S^τ}_τ have been formed, a vector u holding approximations to the solution u of (1) can be constructed for all discretization points by starting at the root box Ω and moving down the tree toward the leaf boxes. The values of u for the points on the boundary of Ω can be obtained by tabulating the boundary function f. When any box τ is processed, the value of u is known for all nodes on its boundary (i.e. those listed in I^τ_e). The matrix S^τ directly maps these values to the values of u on the nodes in the interior of τ (i.e. those listed in I^τ_i). When all boxes have been processed, approximations to u have been constructed for all tabulation nodes on interior edges. Algorithm 2 summarizes the solve stage.

Remark 5.1.
The merge stage is exact when performed in exact arithmetic. The only approximation involved is the approximation of the solution u on a leaf by its interpolating polynomial.

Remark 5.2.
To keep the presentation simple, we consider in this paper only the case of a uniform computational grid. Such grids are obviously not well suited to situations where the regularity of the solution changes across the domain. The method described can in principle be modified to handle locally refined grids quite easily. A complication is that the tabulation nodes for two touching boxes will typically not coincide, which requires the introduction of specialized interpolation operators. Efficient refinement strategies also require the development of error indicators that identify the regions where the grid needs to be refined. This is work in progress, and will be reported at a later date. We observe that our introduction of Gaussian nodes on the internal boundaries (as opposed to the Chebyshev nodes used in [16]) makes re-interpolation much easier.
5.2. Asymptotic complexity.
In this section, we determine the asymptotic complexity of the direct solver. Let N_leaf = 4q denote the number of Gaussian nodes on the boundary of a leaf box, and let q^2 denote the number of Chebyshev nodes used in the leaf computation. Let L denote the number of levels in the tree, so that there are 4^L leaf boxes and the total number of discretization nodes N scales as 4^L q.

Algorithm 1 (build solution operators)
This algorithm builds the global Dirichlet-to-Neumann operator for (1).
It also builds all matrices S^τ required for constructing u at any interior point.
It is assumed that if node τ is a parent of node σ, then τ < σ.

(1) for τ = N_boxes, N_boxes − 1, N_boxes − 2, ..., 1
(2)   if (τ is a leaf)
(3)     Construct T^τ via the process described in Section 3.
(4)   else
(5)     Let σ_1 and σ_2 be the children of τ.
(6)     Split I^{σ_1}_e and I^{σ_2}_e into vectors I_1, I_2, and I_3 as shown in Figure 3.
(7)     S^τ = (T^{σ_1}_{3,3} − T^{σ_2}_{3,3})^{-1} [−T^{σ_1}_{3,1} | T^{σ_2}_{3,2}]
(8)     T^τ = [T^{σ_1}_{1,1}, 0; 0, T^{σ_2}_{2,2}] + [T^{σ_1}_{1,3}; T^{σ_2}_{2,3}] S^τ
(9)     Delete T^{σ_1} and T^{σ_2}.
(10)  end if
(11) end for

Algorithm 2 (solve BVP once solution operator has been built)
This program constructs an approximation u to the solution u of (1).
It assumes that all matrices S^τ have already been constructed in a pre-computation.
It is assumed that if node τ is a parent of node σ, then τ < σ.

(1) u(k) = f(x_k) for all k ∈ I^1_e.
(2) for τ = 1, 2, 3, ..., N_boxes
(3)   u(I^τ_i) = S^τ u(I^τ_e).
(4) end for

Remark: This algorithm outputs the solution on the Gaussian nodes on box boundaries. To get the solution at other points, use the method described in Section 5.3.
The cost to process one leaf is dominated by inverting the matrix A_{i,i} of size roughly q^2 × q^2, and is therefore O(q^6). Since the number of leaf boxes scales as N/q, the total cost of pre-computing approximate DtN operators for the bottom level scales as (N/q) × q^6 ∼ N q^5.

Next, consider the cost of constructing the DtN map on level ℓ via the merge operation described in Section 4. For each box on level ℓ, the operators T^τ and S^τ are constructed via (14) and (15). These operations involve matrices of size roughly 2^{-ℓ} N^{0.5} × 2^{-ℓ} N^{0.5}. Since there are 4^ℓ boxes per level, the cost on level ℓ of the merges is

4^ℓ × (2^{-ℓ} N^{0.5})^3 ∼ 2^{-ℓ} N^{1.5}.

The total cost for all the merge procedures therefore has complexity

Σ_{ℓ=1}^{L} 2^{-ℓ} N^{1.5} ∼ N^{1.5}.

Finally, consider the cost of the downwards sweep which solves for the interior unknowns. For any non-leaf box τ on level ℓ, the matrix S^τ is of size roughly 2^{-ℓ} N^{0.5} × 2^{-ℓ} N^{0.5}, so the cost of applying S^τ is roughly (2^{-ℓ} N^{0.5})^2 = 4^{-ℓ} N. The total cost of the solve step therefore has complexity

Σ_{ℓ=0}^{L−1} 4^ℓ × 4^{-ℓ} N ∼ L N ∼ N log N.

In Section 8, we explain how to exploit structure in the matrices T^τ and S^τ to improve the computational cost of both the precomputation and the solve steps.

5.3. Post-processing.
The direct solver in Algorithm 1 constructs approximations to the solution u of (1) at the tabulation nodes on all interior edges. Once these are available, it is easy to construct an approximation to u at an arbitrary point. To illustrate the process, suppose that we seek an approximation to u(y), where y is a point located in a leaf Ω^τ. We have values of u tabulated at the Gaussian nodes on ∂Ω^τ. These can easily be re-interpolated to the Chebyshev nodes on ∂Ω^τ. Then u can be reconstructed at the interior Chebyshev nodes via the formula (8); observe that the local solution operator −A_{i,i}^{-1} A_{i,e} was built when the leaf was originally processed and can simply be retrieved from memory (assuming enough memory is available). Once u is tabulated on the Chebyshev grid on Ω^τ, it is trivial to interpolate it to y or any other point.
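The re-interpolation steps above reduce to polynomial interpolation from Chebyshev grids, for which the barycentric formula is the standard stable tool. A small self-contained numpy sketch (the sample function, grid size p = 12, and target point are arbitrary illustrations):

```python
import numpy as np

def bary_interp(x, vals, t):
    # Barycentric interpolation at the point t from values on the
    # Chebyshev-Lobatto nodes x (weights (-1)^j, halved at the endpoints).
    w = (-1.0) ** np.arange(len(x))
    w[0] /= 2
    w[-1] /= 2
    d = t - x
    if np.any(d == 0):
        return vals[d == 0][0]   # t coincides with a node
    c = w / d
    return (c @ vals) / c.sum()

p = 12
x = np.cos(np.pi * np.arange(p) / (p - 1))   # Chebyshev nodes on [-1, 1]
X1, X2 = np.meshgrid(x, x, indexing="ij")
U = np.exp(X1) * np.cos(X2)                  # sample "solution" on the grid

def interp2(y1, y2):
    # Tensor-product interpolation: first along x2 in every row, then along x1.
    rowvals = np.array([bary_interp(x, U[i, :], y2) for i in range(p)])
    return bary_interp(x, rowvals, y1)

value = interp2(0.3, -0.7)
```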
6. Compressible matrices

The cost of the direct solver given as Algorithm 1 is dominated by the work done at the very top levels; the matrix operations on lines (7) and (8) involve dense matrices of size O(N^{0.5}) × O(N^{0.5}), where N is the total number of discretization nodes, resulting in O(N^{1.5}) overall cost. It turns out that these dense matrices have internal structure that can be exploited to greatly accelerate the matrix algebra. Specifically, the off-diagonal blocks of these matrices are to high precision rank deficient, and the matrices can be represented efficiently using a hierarchical “data-sparse” format known as Hierarchically Block Separable (HBS) (and sometimes Hierarchically Semi-Separable (HSS) [20, 1]) matrices. This section briefly describes the HBS property; for details see [8].

6.1. Block separable.
Let H be an mp × mp matrix that is blocked into p × p blocks, each of size m × m. We say that H is “block separable” with “block-rank” k if for τ = 1, 2, ..., p, there exist m × k matrices U_τ and V_τ such that each off-diagonal block H_{σ,τ} of H admits the factorization

(16)  H_{σ,τ} = U_σ H̃_{σ,τ} V*_τ,   σ, τ ∈ {1, 2, ..., p},  σ ≠ τ,

where H̃_{σ,τ} is of size k × k. Observe that the columns of U_σ must form a basis for the columns of all off-diagonal blocks in row σ, and analogously, the columns of V_τ must form a basis for the rows in all the off-diagonal blocks in column τ. When (16) holds, the matrix H admits a block factorization

(17)  H = U H̃ V* + D,

where U = diag(U_1, U_2, ..., U_p), V = diag(V_1, V_2, ..., V_p), and D = diag(D_1, D_2, ..., D_p) are block diagonal (of sizes mp × kp, mp × kp, and mp × mp, respectively), and where H̃ is the kp × kp matrix whose diagonal blocks are zero and whose off-diagonal blocks are the blocks H̃_{σ,τ}.
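The single-level factorization (17) is straightforward to verify numerically. In the numpy sketch below, a block separable matrix is assembled block by block from random factors per (16) and then reassembled in the factored form (17); the sizes p = 4, m = 8, k = 2 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, k = 4, 8, 2   # p x p blocks of size m x m, block-rank k

# Per-block-row column bases U_s, per-block-column row bases V_t,
# k x k middle factors Ht[s][t], and diagonal blocks Dblk[s].
U = [np.linalg.qr(rng.standard_normal((m, k)))[0] for _ in range(p)]
V = [np.linalg.qr(rng.standard_normal((m, k)))[0] for _ in range(p)]
Ht = [[np.zeros((k, k)) if s == t else rng.standard_normal((k, k))
       for t in range(p)] for s in range(p)]
Dblk = [rng.standard_normal((m, m)) for _ in range(p)]

# Assemble H block by block, per eq. (16).
H = np.block([[Dblk[s] if s == t else U[s] @ Ht[s][t] @ V[t].T
               for t in range(p)] for s in range(p)])

def blkdiag(mats):
    # Simple block-diagonal assembly (rectangular blocks allowed).
    r = sum(A.shape[0] for A in mats)
    c = sum(A.shape[1] for A in mats)
    out = np.zeros((r, c))
    i = j = 0
    for A in mats:
        out[i:i + A.shape[0], j:j + A.shape[1]] = A
        i += A.shape[0]
        j += A.shape[1]
    return out

# Factored form (17): H = U_big Htilde_big V_big^* + D_big.
U_big, V_big, D_big = blkdiag(U), blkdiag(V), blkdiag(Dblk)
Ht_big = np.block(Ht)
```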
Figure 4. Numbering of nodes in a fully populated binary tree with L = 3 levels. The root is the original index vector I = I_1.

6.2. Hierarchically Block-Separable.
Informally speaking, a matrix H is Hierarchically Block-Separable (HBS) if it is amenable to a telescoping block factorization. In other words, in addition to the matrix H being block separable, so is H̃ once it has been reblocked to form a matrix with p/2 × p/2 blocks; the result of reblocking that matrix will again be block separable, etc. In this section, we describe properties and the factored representation of HBS matrices. Details on constructing the factorization are provided in [8].

6.3. A binary tree structure.
The HBS representation of an M × M matrix H is based on a partition of the index vector I = [1, 2, ..., M] into a binary tree structure. We let I form the root of the tree, and give it the index 1, I_1 = I. We next split the root into two roughly equi-sized vectors I_2 and I_3 so that I_1 = I_2 ∪ I_3. The full tree is then formed by continuing to subdivide any interval that holds more than some preset fixed number m of indices. We use the integers ℓ = 0, 1, ..., L to label the different levels, with 0 denoting the coarsest level. A leaf is a node corresponding to a vector that never got split. For a non-leaf node τ, its children are the two boxes σ_1 and σ_2 such that I_τ = I_{σ_1} ∪ I_{σ_2}, and τ is then the parent of σ_1 and σ_2. Two boxes with the same parent are called siblings. These definitions are illustrated in Figure 4.
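The index-vector tree can be built with a short recursion. A Python sketch (node τ gets children 2τ and 2τ + 1, consistent with the numbering in Figure 4; the values M = 16 and m = 4 are arbitrary illustrations):

```python
def build_tree(I, m, nodes=None, tau=1):
    # Recursively split the index vector I until no node holds more
    # than m indices. Node tau stores its index vector in `nodes`;
    # its children (if any) get the indices 2*tau and 2*tau + 1.
    if nodes is None:
        nodes = {}
    nodes[tau] = I
    if len(I) > m:
        half = len(I) // 2
        build_tree(I[:half], m, nodes, 2 * tau)
        build_tree(I[half:], m, nodes, 2 * tau + 1)
    return nodes

# I = [1, ..., 16], leaves hold at most m = 4 indices.
nodes = build_tree(list(range(1, 17)), 4)
```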
We now define what it means for an M × M matrix H to be hierarchically block separable with respect to a given binary tree T that partitions the indexvector J = [1 , , . . . , M ]. For simplicity, we suppose that for every leaf node τ the index vector I τ holds precisely m points, so that M = m L . Then H is HBS with block rank k if the following twoconditions hold: (1) Assumption on ranks of off-diagonal blocks at the finest level: For any two distinct leaf nodes τ and τ ′ , define the m × m matrix(18) H τ,τ ′ = H ( I τ , I τ ′ ) . Then there must exist matrices U τ , V τ ′ , and ˜ H τ,τ ′ such that(19) H τ,τ ′ = U τ ˜ H τ,τ ′ V ∗ τ ′ .m × m m × k k × k k × m (2) Assumption on ranks of off-diagonal blocks on level ℓ = L − , L − , . . . , : The rank assumptionat level ℓ is defined in terms of the blocks constructed on the next finer level ℓ + 1: For any distinctnodes τ and τ ′ on level ℓ with children σ , σ and σ ′ , σ ′ , respectively, define(20) H τ,τ ′ = " ˜ H σ ,σ ′ ˜ H σ ,σ ′ ˜ H σ ,σ ′ ˜ H σ ,σ ′ . Name: Size: Function:For each leaf D τ m × m The diagonal block H ( I τ , I τ ).node τ : U τ m × k Basis for the columns in the blocks in row τ . V τ m × k Basis for the rows in the blocks in column τ .For each parent B τ k × k Interactions between the children of τ .node τ : U τ k × k Basis for the columns in the (reduced) blocks in row τ . V τ k × k Basis for the rows in the (reduced) blocks in column τ . Figure 5.
An HBS matrix H associated with a tree T is fully specified if the factors listed above are provided.

Then there must exist matrices U_τ, V_τ′, and H̃_τ,τ′ such that

(21) H_τ,τ′ = U_τ H̃_τ,τ′ V*_τ′,
     (2k × 2k) = (2k × k)(k × k)(k × 2k).

An HBS matrix is now fully described if the basis matrices U_τ and V_τ are provided for each node τ, and in addition, we are for each leaf τ given the m × m matrix

(22) D_τ = H(I_τ, I_τ),

and for each parent node τ with children σ_1 and σ_2 we are given the 2k × 2k matrix

(23) B_τ = [ 0  H̃_σ1,σ2 ; H̃_σ2,σ1  0 ].

Observe in particular that the matrices H̃_σ1,σ2 are only required when {σ_1, σ_2} forms a sibling pair. Figure 5 summarizes the required matrices.

6.5. Telescoping factorization.
Given the matrices defined in the previous section, we define the following block diagonal factors:

(24) D^(ℓ) = diag(D_τ : τ is a box on level ℓ), ℓ = 0, 1, ..., L,
(25) U^(ℓ) = diag(U_τ : τ is a box on level ℓ), ℓ = 1, 2, ..., L,
(26) V^(ℓ) = diag(V_τ : τ is a box on level ℓ), ℓ = 1, 2, ..., L,
(27) B^(ℓ) = diag(B_τ : τ is a box on level ℓ), ℓ = 0, 1, ..., L−1.

Furthermore, we let H̃^(ℓ) denote the block matrix whose diagonal blocks are zero, and whose off-diagonal blocks are the blocks H̃_τ,τ′ for all distinct τ, τ′ on level ℓ. With these definitions,

(28) H = U^(L) H̃^(L) (V^(L))* + D^(L),
     (m 2^L × m 2^L) = (m 2^L × k 2^L)(k 2^L × k 2^L)(k 2^L × m 2^L) + (m 2^L × m 2^L);

for ℓ = L−1, L−2, ..., 1,

(29) H̃^(ℓ+1) = U^(ℓ) H̃^(ℓ) (V^(ℓ))* + B^(ℓ),
     (k 2^{ℓ+1} × k 2^{ℓ+1}) = (k 2^{ℓ+1} × k 2^ℓ)(k 2^ℓ × k 2^ℓ)(k 2^ℓ × k 2^{ℓ+1}) + (k 2^{ℓ+1} × k 2^{ℓ+1});

and finally

(30) H̃^(1) = B^(0).

7. Fast arithmetic operations on HBS matrices
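The starting point for the fast algebra developed in this section is the telescoping factorization just described. As a sanity check, it can be assembled explicitly for a tiny two-level example; the sketch below (our own illustrative code, with all names our own) builds an HBS matrix from random factors via (28)-(30) and confirms that its off-diagonal blocks have rank at most k.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3                             # leaf size and HBS rank; L = 2 levels

def blkdiag(blocks):
    r = sum(b.shape[0] for b in blocks); c = sum(b.shape[1] for b in blocks)
    out = np.zeros((r, c)); i = j = 0
    for b in blocks:
        out[i:i+b.shape[0], j:j+b.shape[1]] = b
        i += b.shape[0]; j += b.shape[1]
    return out

def orth(rows, cols):                   # random matrix with orthonormal columns
    return np.linalg.qr(rng.standard_normal((rows, cols)))[0]

def B_block():                          # sibling interaction block, cf. (23)
    Z = np.zeros((k, k))
    return np.block([[Z, rng.standard_normal((k, k))],
                     [rng.standard_normal((k, k)), Z]])

# level 2 has four leaf boxes; level 1 has two boxes
U2 = blkdiag([orth(m, k) for _ in range(4)])            # U^(2)
V2 = blkdiag([orth(m, k) for _ in range(4)])            # V^(2)
U1 = blkdiag([orth(2 * k, k) for _ in range(2)])        # U^(1)
V1 = blkdiag([orth(2 * k, k) for _ in range(2)])        # V^(1)
D2 = blkdiag([rng.standard_normal((m, m)) for _ in range(4)])  # D^(2)

Ht1 = B_block()                                         # (30): H~^(1) = B^(0)
Ht2 = U1 @ Ht1 @ V1.T + blkdiag([B_block(), B_block()]) # (29) with l = 1
H = U2 @ Ht2 @ V2.T + D2                                # (28)

n = 2 * m
# the block coupling the two level-1 boxes has rank at most k
assert np.linalg.matrix_rank(H[:n, n:]) <= k
```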
Arithmetic operations involving dense HBS matrices of size M × M can often be executed in O(M) operations. This fast matrix algebra is vital for achieving linear complexity in our direct solver. This section provides a brief introduction to the HBS matrix algebra. We describe the operations we need (inversion, addition, and low-rank update) in some detail for the single-level "block separable" format. The generalization to the multi-level "hierarchically block separable" format is briefly described for the case of matrix inversion. A full description of all algorithms required is given in [7], which is related to the earlier work [2].

Before we start, we recall that a block separable matrix H consisting of p × p blocks, each of size m × m, and with "HBS-rank" k < m, admits the factorization

(31) H = U H̃ V* + D,
     (mp × mp) = (mp × kp)(kp × kp)(kp × mp) + (mp × mp),

where U, V, and D are block diagonal and H̃ has zero diagonal blocks.

7.1. Inversion of a block separable matrix.
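The inversion formula at the heart of this subsection (formulas (32)-(36) below) is easy to verify numerically. The sketch below is our own illustrative NumPy code (the paper's implementation is in Matlab): it builds a small block separable matrix from random block-diagonal factors U, V, D and a coupling matrix H̃ with zero diagonal blocks, then checks the variant of the Woodbury formula against a dense inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, k = 4, 10, 3                      # p diagonal blocks of size m, rank k

def blkdiag(blocks):
    r = sum(b.shape[0] for b in blocks); c = sum(b.shape[1] for b in blocks)
    out = np.zeros((r, c)); i = j = 0
    for b in blocks:
        out[i:i+b.shape[0], j:j+b.shape[1]] = b
        i += b.shape[0]; j += b.shape[1]
    return out

# block separable factors: U, V, D block diagonal, H~ with zero diagonal blocks
U = blkdiag([np.linalg.qr(rng.standard_normal((m, k)))[0] for _ in range(p)])
V = blkdiag([np.linalg.qr(rng.standard_normal((m, k)))[0] for _ in range(p)])
D = blkdiag([rng.standard_normal((m, m)) + m * np.eye(m) for _ in range(p)])
Ht = rng.standard_normal((k * p, k * p))
for t in range(p):                      # zero out the diagonal blocks of H~
    Ht[k*t:k*(t+1), k*t:k*(t+1)] = 0.0

H = U @ Ht @ V.T + D                    # (31)

# formulas (32)-(36)
Dinv = np.linalg.inv(D)
Dhat = np.linalg.inv(V.T @ Dinv @ U)            # (33)
E = Dinv @ U @ Dhat                             # (34)
Fstar = Dhat @ V.T @ Dinv                       # (35): F* = Dhat V* D^{-1}
G = Dinv - Dinv @ U @ Dhat @ V.T @ Dinv         # (36)
Hinv = E @ np.linalg.inv(Ht + Dhat) @ Fstar + G # (32)

assert np.allclose(Hinv @ H, np.eye(m * p))
```

Note that only the small kp × kp matrix H̃ + D̂ is inverted densely; every other factor is block diagonal.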
The decomposition (31) represents H as a sum of one term U H̃ V* that is "low rank," and one term D that is easily invertible (since it is block diagonal). By modifying the classical Woodbury formula for inversion of a matrix perturbed by the addition of a low-rank term, it can be shown (see Lemma 3.1 of [8]) that

(32) H^{-1} = E (H̃ + D̂)^{-1} F* + G,

where

(33) D̂ = (V* D^{-1} U)^{-1},
(34) E = D^{-1} U D̂,
(35) F = (D̂ V* D^{-1})*,
(36) G = D^{-1} − D^{-1} U D̂ V* D^{-1},

assuming the inverses in formulas (32)–(36) all exist. Now observe that the matrices D̂, E, F, and G can all easily be computed since the formulas defining them involve only block-diagonal matrices. In consequence, (32) reduces the task of inverting the big (size mp × mp) matrix H to the task of inverting the small (size kp × kp) matrix H̃ + D̂.

When H is not only "block separable," but "hierarchically block separable," the process can be repeated recursively by exploiting that H̃ + D̂ is itself amenable to accelerated inversion, etc. The resulting process is somewhat tricky to analyze, but leads to very clean codes. To illustrate, we include Algorithm 3, which shows the multi-level O(M) inversion algorithm for an HBS matrix H. The algorithm takes as input the factors {U_τ, V_τ, D_τ, B_τ}_τ representing H (cf. Figure 5), and outputs an analogous set of factors {E_τ, F_τ, G_τ}_τ representing H^{-1}. With these factors, the matrix-vector multiplication y = H^{-1} x can be executed via the procedure described in Algorithm 4.

7.2. Addition of two block separable matrices.
Let H^A and H^B be block separable matrices with factorizations

H^A = U^A H̃^A (V^A)* + D^A and H^B = U^B H̃^B (V^B)* + D^B.

Then H = H^A + H^B can be written in block separable form via

(37) H = H^A + H^B = [U^A U^B] [ H̃^A 0 ; 0 H̃^B ] [V^A V^B]* + (D^A + D^B).

To restore (37) to block separable form, permute the rows and columns of [U^A U^B] and [V^A V^B] to attain block diagonal form, then re-orthogonalize the diagonal blocks. This process in principle results in a matrix H whose HBS-rank is the sum of the HBS-ranks of H^A and H^B. In practice, this rank increase can be combated by numerically recompressing the basis matrices, and updating the middle factor as needed. For details, as well as the extension to a multi-level scheme, see [2, 7].

Algorithm 3 (inversion of an HBS matrix)

Given factors {U_τ, V_τ, D_τ, B_τ}_τ representing an HBS matrix H, this algorithm constructs factors {E_τ, F_τ, G_τ}_τ representing H^{-1}.

loop over all levels, finer to coarser, ℓ = L, L−1, ..., 1
  loop over all boxes τ on level ℓ
    if τ is a leaf node
      D̃_τ = D_τ
    else
      Let σ_1 and σ_2 denote the children of τ.
      D̃_τ = [ D̂_σ1  B_σ1,σ2 ; B_σ2,σ1  D̂_σ2 ]
    end if
    D̂_τ = (V*_τ D̃_τ^{-1} U_τ)^{-1}
    E_τ = D̃_τ^{-1} U_τ D̂_τ
    F*_τ = D̂_τ V*_τ D̃_τ^{-1}
    G_τ = D̃_τ^{-1} − D̃_τ^{-1} U_τ D̂_τ V*_τ D̃_τ^{-1}
  end loop
end loop
G_1 = [ D̂_2  B_2,3 ; B_3,2  D̂_3 ]^{-1}

Algorithm 4 (application of the inverse of an HBS matrix)

Given x, compute y = H^{-1} x using the factors {E_τ, F_τ, G_τ}_τ resulting from Algorithm 3.

loop over all leaf boxes τ
  x̂_τ = F*_τ x(I_τ)
end loop
loop over all levels, finer to coarser, ℓ = L−1, L−2, ..., 1
  loop over all parent boxes τ on level ℓ
    Let σ_1 and σ_2 denote the children of τ.
    x̂_τ = F*_τ [ x̂_σ1 ; x̂_σ2 ]
  end loop
end loop
[ ŷ_2 ; ŷ_3 ] = G_1 [ x̂_2 ; x̂_3 ]
loop over all levels, coarser to finer, ℓ = 1, 2, ..., L−1
  loop over all parent boxes τ on level ℓ
    Let σ_1 and σ_2 denote the children of τ.
    [ ŷ_σ1 ; ŷ_σ2 ] = E_τ ŷ_τ + G_τ [ x̂_σ1 ; x̂_σ2 ]
  end loop
end loop
loop over all leaf boxes τ
  y(I_τ) = E_τ ŷ_τ + G_τ x(I_τ)
end loop

7.3. Addition of a block separable matrix with a low-rank matrix.
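The conversion named in this subsection's title, from a rank-k product QR to block separable form, follows the recipe detailed in the text below. The following sketch is our own illustrative NumPy code (not the paper's Matlab implementation); it carries out the conversion and verifies that the resulting factors reproduce QR.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, k = 4, 10, 3                               # p blocks of size m, rank k

def blkdiag(blocks):
    r = sum(b.shape[0] for b in blocks); c = sum(b.shape[1] for b in blocks)
    out = np.zeros((r, c)); i = j = 0
    for b in blocks:
        out[i:i+b.shape[0], j:j+b.shape[1]] = b
        i += b.shape[0]; j += b.shape[1]
    return out

# rank-k matrix H^B = Q R, with Q orthonormal of size mp x k, R of size k x mp
Q = np.linalg.qr(rng.standard_normal((m * p, k)))[0]
R = rng.standard_normal((k, m * p))

Qb = [Q[m*t:m*(t+1), :] for t in range(p)]       # blocks Q_tau (m x k) -> U^B
Rb = [R[:, m*t:m*(t+1)] for t in range(p)]       # blocks R_tau (k x m)

# factor R_tau = Rt_tau V_tau^* with V_tau orthonormal (via QR of R_tau^T)
Vb, Rt = [], []
for Rtau in Rb:
    Vtau, RtauT = np.linalg.qr(Rtau.T)           # R_tau^T = V_tau Rt_tau^T
    Vb.append(Vtau); Rt.append(RtauT.T)          # Rt_tau is k x k

UB, VB = blkdiag(Qb), blkdiag(Vb)
DB = blkdiag([Qb[t] @ Rb[t] for t in range(p)])  # diagonal blocks Q_tau R_tau
HtB = np.zeros((k * p, k * p))                   # H~^B: zero diagonal blocks,
for t in range(p):                               # off-diagonal blocks Rt_tau'
    for s in range(p):
        if s != t:
            HtB[k*t:k*(t+1), k*s:k*(s+1)] = Rt[s]

assert np.allclose(UB @ HtB @ VB.T + DB, Q @ R)
```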
Let H^B = Q R be a rank-k matrix, where Q and R* are of size mp × k. We would like to add H^B to the block separable matrix H^A. Since we already know how to add two block separable matrices, we choose to rewrite H^B in block separable form. Without loss of generality, assume Q is orthogonal. Partition Q into p blocks of size m × k; these blocks make up the matrix U^B. Likewise, partition R into p blocks of size k × m. The block matrix D^B has diagonal entries D_τ = Q_τ R_τ for τ = 1, ..., p. To construct the matrix V^B, for each τ = 1, ..., p, the matrix R_τ is factorized as R_τ = R̃_τ V*_τ, where the matrix V_τ is orthogonal. The matrices R̃_τ make up the off-diagonal entries of H̃^B.

8. Accelerating the direct solver
This section describes how the fast matrix algebra described in Sections 6 and 7 can be used to accelerate the direct solver of Section 5 to attain O(N) complexity. We recall that the O(N^{1.5}) cost of Algorithm 1 relates to the execution of lines (7) and (8) at the top levels, since these involve dense matrix algebra of matrices of size O(N^{0.5}) × O(N^{0.5}). The principal claims of this section are (with α and β denoting the two children of a box τ in the merge step):

• The matrices T^α_{1,3}, T^α_{3,1}, T^β_{2,3}, T^β_{3,2} have low numerical rank.
• The matrices T^α_{1,1}, T^β_{2,2}, T^α_{3,3}, T^β_{3,3} are HBS matrices of low HBS rank.

To be precise, the ranks that we claim are "low" scale as log(1/ν) × log(m), where m is the number of points along the boundary of Ω_τ, and ν is the computational tolerance. In practice, we found that for problems with non-oscillatory solutions, the ranks are extremely modest: when ν = 10^{−10}, the ranks range between 10 and 80, even for very large problems.

The cause of the rank deficiencies is that the matrix T^τ is a highly accurate approximation to the Dirichlet-to-Neumann operator on Ω_τ. This operator is known to have a smooth kernel that is non-oscillatory whenever the underlying PDE has non-oscillatory solutions. Since the domain boundary ∂Ω_τ is one-dimensional, this makes the expectation that the off-diagonal blocks have low rank very natural; see [8]. It is backed up by extensive numerical experiments (see Section 9), but we do not at this point have rigorous proofs to support the claim.

Once it is observed that all matrices in lines (7) and (8) of Algorithm 1 are structured, it becomes obvious how to accelerate the algorithm. For instance, line (7) is executed in three steps: (i) Add the HBS matrices T^α_{3,3} and −T^β_{3,3}. (ii) Invert the sum of the HBS matrices. (iii) Apply the inverse (in HBS form) to one of the low-rank factors of [−T^α_{3,1} | T^β_{3,2}]. The result is an approximation to S^τ, represented as a product of two thin matrices.
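In step (iii), the key saving is that the inverse is applied to a thin low-rank factor rather than to a full square matrix. The pattern can be sketched with dense stand-ins (our own illustrative code; A stands in for the invertible sum produced by steps (i)-(ii), and Q, R for the low-rank factors of the right-hand block):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 300, 12                       # boundary block size, numerical rank

# stand-ins: A for the (invertible) sum from steps (i)-(ii),
# Q @ R for a low-rank factorization of the n x 2n right-hand block
A = rng.standard_normal((n, n)) + n * np.eye(n)
Q = rng.standard_normal((n, k))
R = rng.standard_normal((k, 2 * n))

W = np.linalg.solve(A, Q)            # apply A^{-1} to the thin factor only
# S is kept as the product W @ R of two thin matrices, never formed densely
assert np.allclose(W @ R, np.linalg.solve(A, Q @ R))
```

The solve touches only k columns instead of 2n, which is where the asymptotic saving at the top levels comes from.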
Executing line (8) is analogous: First form the matrix products T^α_{1,3} S^τ and T^β_{2,3} S^τ, exploiting that all factors are of low rank. Then perform a low-rank update to a block-diagonal matrix whose blocks are provided in HBS form to construct the new HBS matrix T^τ.

Accelerating the solve stage in Algorithm 2 is trivial: simply exploit that the matrix S^τ on line (3) has low numerical rank.

Remark 8.1.
Some of the structured matrix operations (e.g., adding two HBS matrices, or the low-rank update) can algebraically lead to a large increase in the HBS ranks. We know for physical reasons that the output should have rank-structure very similar to the input, however, and we combat the rank increase by frequently recompressing the output of each operation.
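In the single-level setting, such a recompression amounts to a rank-revealing factorization of the widened basis matrices: after an addition per (37), each diagonal block [U^A_τ | U^B_τ] can be re-orthogonalized and truncated. A minimal sketch (our own illustration; the truncation tolerance and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 20, 4

# after an addition, a basis block has doubled width [U_A | U_B] (m x 2k);
# if the two bases overlap, the numerical rank stays well below 2k
U_A = np.linalg.qr(rng.standard_normal((m, k)))[0]
mix = rng.standard_normal((k, k))
U_B = np.linalg.qr(U_A @ mix)[0]          # spans the same k-dim subspace
wide = np.hstack([U_A, U_B])              # m x 2k, numerical rank k

# recompress via SVD: keep singular vectors above a relative tolerance
u, s, vt = np.linalg.svd(wide, full_matrices=False)
keep = s > 1e-12 * s[0]
U_new = u[:, keep]                        # new, narrower orthonormal basis
M = U_new.T @ wide                        # change of basis for middle factors

assert U_new.shape[1] == k                # width 2k collapsed back to k
assert np.allclose(U_new @ M, wide)       # the recompression is lossless here
```

In practice the two bases overlap only approximately, and the tolerance controls the trade-off between rank growth and accuracy.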
Remark 8.2.
In practice, we implement the algorithm to use dense matrix algebra at the finer levels, where all the DtN matrices T^τ are small. Once they get large enough that HBS algebra outperforms dense operations, we compress the dense matrices by brute force, and rely on HBS algebra in the remainder of the upwards sweep.

9. Numerical Examples
In this section, we illustrate the capabilities of the proposed method for the boundary value problem

(38) −Δu(x) − c_1(x) [∂_1 u](x) − c_2(x) [∂_2 u](x) − c(x) u(x) = 0, x ∈ Ω,
     u(x) = f(x), x ∈ Γ,

where Ω = [0, 1]^2, Γ = ∂Ω, and c_1(x), c_2(x), and c(x) are smooth, cf. (2). The choice of the functions c_1(x), c_2(x), and c(x) will vary for each example.

All experiments reported in this section were executed on a machine with two quad-core Intel Xeon E5-2643 processors and 128GB of RAM. The direct solver was implemented in Matlab, which means that the speeds reported can very likely be improved, although the asymptotic complexity should be unaffected.

In Section 9.1 we apply the direct solver to four problems with known analytic solutions. This allows us to very accurately investigate the errors incurred, but is arguably a particularly favorable environment. Section 9.2 presents results from more general situations where the exact solution is not known, and errors have to be estimated.

In all experiments, the number of Gaussian points per leaf edge q is fixed at 21, and 21 × 21 tensor product grids of Chebyshev points are used in the leaf computations. Per Remark 8.2, we switch from dense computations to HBS when a box has more than 2000 points on the boundary.

9.1.
Performance for problems with known solutions.
To illustrate the scaling and accuracy ofthe discretization technique, we apply the numerical method to four problems with known solutions.The problems are:
Laplace: Let c_1(x) = c_2(x) = c(x) = 0 in (38).

Helmholtz I: Let c_1(x) = c_2(x) = 0 and c(x) = κ^2, where κ = 80 in (38). This represents a vibration problem on a domain Ω of size roughly 12 × 12 wave-lengths. (Recall that the wave-length is given by λ = 2π/κ.)

Helmholtz II: Let c_1(x) = c_2(x) = 0 and c(x) = κ^2, where κ = 640 in (38). This corresponds to a domain of size roughly 102 × 102 wave-lengths.

Helmholtz III: We again set c_1(x) = c_2(x) = 0 and c(x) = κ^2 in (38), but now we let κ grow as the number of discretization points grows to maintain a constant 12 points per wavelength.

The boundary data in (38) is chosen to coincide with the known solutions u_exact(x) = log |x̂ − x| for the Laplace problem, and with u_exact(x) = Y_0(κ |x̂ − x|) for the three Helmholtz problems, where x̂ is a fixed point placed outside the domain and Y_0 denotes the 0'th Bessel function of the second kind.

In a first experiment, we prescribed the tolerance in the "fast" matrix algebra to be ǫ = 10^{−10}. Table 1 reports the following quantities:

N: Number of Gaussian discretization points.
N_tot: Total number of discretization points (N plus the number of Chebyshev points).
T_build: Time for building the solution operator.
T_solve: Time to solve for interior nodes.
T_apply: Time to apply the approximate Dirichlet-to-Neumann operator T.
R: Amount of memory required to store the solution operator.
E_pot = max_{k : x_k ∈ Ω} |u_app(x_k) − u_exact(x_k)|, where u_app denotes the approximate solution constructed by the direct solver.

Our expectation is for all problems except Helmholtz III to exhibit optimal linear scaling for both the build and the solve stages. Additionally, we expect the cost of applying the Dirichlet-to-Neumann operator T to scale as N^{0.5} for all problems except Helmholtz III. The numerical results clearly bear this out for
Laplace and Helmholtz I. For Helmholtz II, it appears that linear scaling has not quite taken hold for the range of problem sizes our hardware could manage. The Helmholtz III problem clearly does not exhibit linear scaling, but has not yet reached its O(N^{1.5}) asymptotic at the largest problem considered, which was of size roughly 426 × 426 wave-lengths.

Table 1. Timing results in seconds for the PDEs considered in Section 9.1 (rows: Laplace, Helmholtz I, Helmholtz II, Helmholtz III; columns: N_tot, N, T_build, T_solve, T_apply, R, and E_pot). For these experiments, ǫ = 10^{−10}.

We remark that the cost of the solve stage is tiny. For example, a problem involving 11 million unknowns (corresponding to approximately 100 million discretization points) takes 115 minutes for the build stage and then only 30 seconds for each additional solve. The cost for applying the Dirichlet-to-Neumann operator is even less, at 0.36 seconds per boundary condition. Figure 6 illustrates the scaling via log-log plots.

In a second set of experiments, we investigated the accuracy of the computed solutions, and in particular how the accuracy depends on the tolerance ǫ in the fast matrix algebra. In addition to reporting E_pot, Table 2 reports

E_grad = max_{k : x_k ∈ Γ} |u_n,app(x_k) − u_n,exact(x_k)|,

where u_n,app and u_n,exact denote the normal derivatives of the approximate and exact solutions, for several choices of the tolerance ǫ. The number of discretization points was fixed at N = 693504 (N_tot = 7252225).

Table 2. Errors for solving the PDEs in Section 9.1 with different user-prescribed tolerances ǫ when the number of discretization points is fixed at N = 693504 (N_tot = 7252225). Rows: Laplace, Helmholtz I, Helmholtz II, Helmholtz III; columns: E_pot and E_grad for each tolerance.

The solution obtains (or nearly obtains) the prescribed tolerance, while the normal derivative suffers from a three digit loss in accuracy. This loss is likely attributable to the unboundedness of the Dirichlet-to-Neumann map. The compressed representation captures the high end of the spectrum to the desired accuracy, while the low end of the spectrum is captured to three digits less than the desired accuracy.

Figure 6.
The first three graphs give the times required for building the direct solver (T_build), solving a problem (T_solve), and applying the approximate Dirichlet-to-Neumann operator on ∂Ω (T_apply). The fourth graph gives the memory R in MB required to store the solution operator.

9.2. Convergence for unknown solutions.
In this section, we apply the direct solver to threeproblems for which we do not know an exact solution:
Constant convection: Let the convection in the x_2 direction be constant by setting c_1(x) = 0, c_2(x) = −1000, and c(x) = 0.

Diffusion-Convection: Introduce a divergence-free convection by setting c_1(x) = −1000 cos(4πx_2), c_2(x) = −1000 cos(4πx_1), and c(x) = 0.

Variable Helmholtz: Consider the variable coefficient Helmholtz problem where c_1(x) = 0, c_2(x) = 0, c(x) = κ^2 (1 − (sin(4πx_1) sin(4πx_2))^2), and κ = 640.

For the three experiments, the boundary data is given by f(x) = cos(2x_1)(1 − x_2).

Table 3. Convergence results for solving the PDEs in Section 9.2 with a user-prescribed tolerance of ǫ = 10^{−10}. Rows: Constant Convection, Variable Helmholtz, Diffusion-Convection; columns: N_tot, N, u^N(x̂), E^N_int, u^N_n(ŷ), E^N_bnd.

Table 4. Times in seconds for solving the PDEs in Section 9.2 with a user-prescribed tolerance of ǫ = 10^{−10}. Rows: Constant Convection, Variable Helmholtz, Diffusion-Convection; columns: N_tot, N, T_build (seconds), T_solve (seconds), R (MB).

To check for convergence, we post-process the solution as described in Section 5.3 to get the solution on the Chebyshev grid. Let u^N denote the solution on the Chebyshev grid, and likewise let u^N_n denote the normal derivative on the boundary at the Chebyshev boundary points. We compare the solution and the normal derivative pointwise at a fixed interior location x̂ ∈ Ω and a fixed boundary location ŷ ∈ Γ, respectively, via

E^N_int = |u^N(x̂) − u^{N′}(x̂)| and E^N_bnd = |u^N_n(ŷ) − u^{N′}_n(ŷ)|,

where N′ denotes the next finer level of refinement. The tolerance for the compressed representations is set to ǫ = 10^{−10}. Table 3 reports the pointwise errors. We see that high accuracy is obtained in all cases, with solutions that have ten correct digits for the potential and about seven correct digits for the boundary flux.

The computational costs are reported in Table 4. The memory R reported now includes the memory required to store all local solution operators described in Section 5.3.

Conclusions
We have described a direct solver for variable coefficient elliptic PDEs in the plane, under the assumption that the solution and all coefficient functions are smooth. For problems with non-oscillatory solutions such as the Laplace and Stokes equations, the asymptotic complexity of the solver is O(N), with a small constant of proportionality. For problems with oscillatory solutions, high practical efficiency is retained for problems of size up to several hundred wave-lengths.

Our method is based on a composite spectral discretization. We use high order local meshes (typically of size 21 × 21) capable of solving even very large scale problems to ten correct digits or more. The direct solver is conceptually similar to the classical nested dissection method [6]. To improve the asymptotic complexity from the classical O(N^{1.5}) to O(N), we exploit internal structure in the dense "frontal matrices" at the top levels in a manner similar to recent work such as, e.g., [2, 7, 13, 14, 19]. Compared to these techniques, our method has an advantage in that high order discretizations can be incorporated without compromising the speed of the linear algebra. The reason is that we use a formulation based on Dirichlet-to-Neumann operators; as a result, we need high order convergence only in the tangential direction on patch boundaries.

The direct solver presented requires more storage than classical iterative methods, but this is partially offset by the use of high-order discretizations. More importantly, the solver is characterized by very low data-movement. This makes the method particularly well suited for implementation on parallel machines with distributed memory.

Acknowledgments:
The work was supported by NSF grants DMS-0748488 and CDI-0941476.
References

[1] S. Chandrasekaran and M. Gu, A divide-and-conquer algorithm for the eigendecomposition of symmetric block-diagonal plus semiseparable matrices, Numer. Math. (2004), no. 4, 723–731.
[2] S. Chandrasekaran, M. Gu, X.S. Li, and J. Xia, Fast algorithms for hierarchically semiseparable matrices, Numer. Linear Algebra Appl. (2010), 953–976.
[3] Y. Chen, Total wave based fast direct solver for volume scattering problems, arXiv.org (2013).
[4] Y. Chen, A fast, direct algorithm for the Lippmann-Schwinger integral equation in two dimensions, Adv. Comput. Math. (2002), no. 2-3, 175–190. Modeling and computation in optics and electromagnetics.
[5] I.S. Duff, A.M. Erisman, and J.K. Reid, Direct methods for sparse matrices, Oxford, 1989.
[6] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal. (1973), 345–363.
[7] A. Gillman, Fast direct solvers for elliptic partial differential equations, Ph.D. thesis, University of Colorado at Boulder, Applied Mathematics, 2011.
[8] A. Gillman, P. Young, and P.G. Martinsson, A direct solver with O(N) complexity for integral equations on one-dimensional domains, Frontiers of Mathematics in China (2012), no. 2, 217–247.
[9] L. Greengard, D. Gueyffier, P.G. Martinsson, and V. Rokhlin, Fast direct solvers for integral equations in complex three-dimensional domains, Acta Numer. (2009), 243–275.
[10] K.L. Ho and L. Greengard, A fast direct solver for structured linear systems by recursive skeletonization, SIAM J. Sci. Comput., to appear (2012).
[11] A.J. Hoffman, M.S. Martin, and D.J. Rose, Complexity bounds for regular finite difference and finite element grids, SIAM J. Numer. Anal. (1973), 364–369.
[12] D.A. Kopriva, A staggered-grid multidomain spectral method for the compressible Navier-Stokes equations, Journal of Computational Physics (1998), no. 1, 125–158.
[13] S. Le Borne, L. Grasedyck, and R. Kriemann, Domain-decomposition based H-LU preconditioners, Domain Decomposition Methods in Science and Engineering XVI, Lect. Notes Comput. Sci. Eng., vol. 55, Springer, Berlin, 2007, pp. 667–674.
[14] P.G. Martinsson, A fast direct solver for a class of elliptic partial differential equations, J. Sci. Comput. (2009), no. 3, 316–330.
[15] P.G. Martinsson, A composite spectral scheme for variable coefficient Helmholtz problems, arXiv.org (2012).
[16] P.G. Martinsson, A direct solver for variable coefficient elliptic PDEs discretized via a composite spectral collocation method, Journal of Computational Physics (2013), no. 1, 460–479.
[17] P.G. Martinsson and V. Rokhlin, A fast direct solver for boundary integral equations in two dimensions, J. Comput. Phys. (2005), no. 1, 1–23.
[18] H.P. Pfeiffer, L.E. Kidder, M.A. Scheel, and S.A. Teukolsky, A multidomain spectral method for solving elliptic equations, Computer Physics Communications (2003), no. 3, 253–273.
[19] P.G. Schmitz and L. Ying, A fast direct solver for elliptic problems on general meshes in 2D, Journal of Computational Physics (2012), no. 4, 1314–1338.
[20] Z. Sheng, P. Dewilde, and S. Chandrasekaran, Algorithms to solve hierarchically semi-separable systems, System Theory, the Schur Algorithm and Multidimensional Analysis, Oper. Theory Adv. Appl., vol. 176, Birkhäuser, Basel, 2007, pp. 255–294.
[21] L.N. Trefethen, Spectral methods in MATLAB, SIAM, Philadelphia, 2000.
[22] J. Xia, S. Chandrasekaran, M. Gu, and X.S. Li, Superfast multifrontal method for large structured linear systems of equations, SIAM J. Matrix Anal. Appl. (2009), no. 3, 1382–1411.
[23] B. Yang and J.S. Hesthaven, Multidomain pseudospectral computation of Maxwell's equations in 3-D general curvilinear coordinates, Applied Numerical Mathematics 33.