Interpolative Butterfly Factorization
Yingzhou Li ♯, Haizhao Yang †
♯ ICME, Stanford University
† Department of Mathematics, Duke University
October 21, 2018
Abstract
This paper introduces the interpolative butterfly factorization for nearly optimal implementation of several transforms in harmonic analysis, when their explicit formulas satisfy certain analytic properties and the matrix representations of these transforms satisfy a complementary low-rank property. A preliminary interpolative butterfly factorization is constructed based on interpolative low-rank approximations of the complementary low-rank matrix. A novel sweeping matrix compression technique further compresses the preliminary interpolative butterfly factorization via a sequence of structure-preserving low-rank approximations. The sweeping procedure propagates the low-rank property among neighboring matrix factors to compress dense submatrices in the preliminary butterfly factorization, yielding an optimal one in the butterfly scheme. For an $N \times N$ matrix, it takes $O(N \log N)$ operations and memory to construct the factorization as a product of $O(\log N)$ sparse matrices, each with $O(N)$ nonzero entries. Hence, it can be applied rapidly in $O(N \log N)$ operations. Numerical results are provided to demonstrate the effectiveness of this algorithm.

Keywords.
Data-sparse matrix, butterfly algorithm, randomized algorithm, matrix factorization, operator compression, nonuniform Fourier transform, Fourier integral operators.
AMS subject classifications: 44A55, 65R10 and 65T50.
1 Introduction

One key problem in computational harmonic analysis is the rapid evaluation of various transforms to make large-scale scientific computation feasible. These transforms are essentially matrix-vector multiplications, $u = Kg$, where the kernel matrix $K$ is the discrete representation of a transform and the vector $g$ is the discrete representation of a function to be transformed. Inspired by the idea of the butterfly algorithm initially proposed in [24] and later extended in [25], the recently proposed butterfly factorization [20, 21] factorizes a complementary low-rank matrix $K$ of size $N \times N$ into a product of $O(\log N)$ sparse matrices, each with $O(N)$ nonzero entries. After factorization, the application of $K$ has nearly optimal operation and memory complexity of order $N \log N$. (Throughout the paper, "nearly optimal" refers to a nearly optimal constant in the complexity.) Since a wide range of transforms in harmonic analysis admits a matrix representation satisfying the complementary low-rank property [7, 10, 22, 33, 25, 28], the butterfly factorization, once constructed, is a nearly optimal fast algorithm to evaluate these transforms. However, the construction of the butterfly factorization requires $O(N^2)$ operations in [25] and $O(N^{1.5})$ operations in [20, 21], which might still be too expensive in real applications. This paper introduces the interpolative butterfly factorization to construct the factorization in $O(N \log N)$ operation and memory complexity, provided the continuous kernel $K$ is explicitly available. The interpolative butterfly factorization is a combination of the butterfly algorithm in [7] and a novel structure-preserving matrix compression technique.
Hence, the proposed method can be considered as an optimized sparse matrix representation of the butterfly algorithm in [7] with a smaller prefactor in the operation complexity.

Another key problem in modern large-scale computation is the parallel scalability of fast algorithms. Although there have been various fast algorithms like the (nonuniform) fast Fourier transform (FFT) [1, 14, 15] and the FFT in polar and spherical coordinates [19, 29, 31, 33], the parallel scalability of some of these traditional algorithms might still be limited in high performance computing. This motivates much effort to improve their parallel scalability [2, 26, 30, 37]. For the same purpose, the interpolative butterfly factorization is proposed as a general framework for highly parallel scalable implementation of a wide range of transforms in harmonic analysis. Since the construction of the butterfly factorization is just a few essentially independent low-rank approximations and the application is a sequence of small matrix-vector multiplications, the butterfly factorization framework significantly reduces communication if implemented in parallel.

To be more specific, the interpolative butterfly factorization is proposed for the rapid application of integral transforms of the form
\[ u(x) = \int_{\mathbb{R}^d} a(x,\xi)\, e^{2\pi \imath \Phi(x,\xi)} g(\xi)\, d\xi, \tag{1} \]
where $d$ is the dimension and $K(x,\xi) = a(x,\xi)\, e^{2\pi \imath \Phi(x,\xi)}$ is the kernel function satisfying the following properties.

Assumption 1.1.
(Smoothness properties)

• $a(x,\xi)$ is an amplitude function that is smooth in both $x$ and $\xi$;

• $\Phi(x,\xi)$ is a phase function that is real analytic in $x$ and $\xi$ and obeys the homogeneity condition of degree $1$ in $\xi$, namely, $\Phi(x,\lambda\xi) = \lambda \Phi(x,\xi)$ for $\lambda > 0$.

This transform is also known as a Fourier integral operator (FIO). FIOs form a wide class of operators in harmonic analysis including the (nonuniform) Fourier transform, pseudo-differential operators, and the generalized Radon transform. All of these are popular tools in computational physics and chemistry [11, 17, 27, 32, 35] and imaging science [6, 8, 23, 38]. For higher dimensional FIOs, the phase function might not be smooth at $\xi = 0$. Fortunately, the interpolative butterfly factorization can be adapted to this case following the idea of the multiscale butterfly algorithm in [22].

In most examples, since $a(x,\xi)$ is a smooth symbol of order zero and type (1,
0) [3, 5, 9, 34], $a(x,\xi)$ is numerically low-rank in the joint $X$ and $\Omega$ domain and its numerical treatment is relatively easy. Therefore, we simplify the problem by assuming $a(x,\xi) = 1$ in the following discussion.

In a typical setting, it is often assumed that the function $g(\xi)$ decays sufficiently fast so that one can embed the problem in a sufficiently large periodic cell. Without loss of generality, a simple discretization considers functions given on a Cartesian grid
\[ X = \left\{ x = \left( \frac{n_1}{N^{1/d}}, \dots, \frac{n_d}{N^{1/d}} \right),\ 0 \le n_1, \dots, n_d < N^{1/d} \text{ with } n_1, \dots, n_d \in \mathbb{Z} \right\} \tag{2} \]
in a unit box in $x$ and defines the discrete integral transform by
\[ u(x) = \sum_{\xi \in \Omega} K(x,\xi)\, g(\xi), \quad x \in X, \tag{3} \]
where
\[ \Omega = \left\{ \xi = (n_1, \dots, n_d),\ -\frac{N^{1/d}}{2} \le n_1, \dots, n_d < \frac{N^{1/d}}{2} \text{ with } n_1, \dots, n_d \in \mathbb{Z} \right\}. \tag{4} \]
In the notation of numerical linear algebra, the evaluation of (3) is a matrix-vector multiplication $u = Kg$. Under Assumption 1.1, it can be proved that $K$ is essentially complementary low-rank [7, 22]. Complementary low-rank matrices have been widely studied in [12, 13, 20, 21, 25, 24, 36].

Let $X$ and $\Omega$ be point sets (not necessarily uniformly distributed) in $\mathbb{R}^d$ for some dimension $d$. When the kernel $K(x,\xi)$ is discretized on $X \times \Omega$, points in $X$ and $\Omega$ are indexed by the row and column indices of the matrix $K$. For simplicity, we also use $X$ and $\Omega$ to denote the sets of row and column indices. Two trees $T_X$ and $T_\Omega$ of the same depth $L = O(\log N)$, associated with $X$ and $\Omega$ respectively, are constructed by dyadic partitioning. Denote the root level of the tree as level $0$ and the leaf level as level $L$.
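Before formalizing the complementary low-rank property, a quick numerical sanity check may help (a sketch, not part of the paper's algorithms; $d = 1$, the pure Fourier phase $\Phi(x,\xi) = x\xi$, and all sizes and tolerances below are illustrative assumptions): the direct evaluation of (3) on the grids (2) and (4) can be verified against the FFT, and a submatrix of $K$ restricted to an $x$-box of width $1/\sqrt{N}$ and a $\xi$-box of width $\sqrt{N}$ already shows a numerical rank far below its dimensions.

```python
import numpy as np

# Sketch (d = 1, Phi(x, xi) = x*xi): direct O(N^2) evaluation of the
# discrete transform (3) on the grids (2) and (4), checked against the FFT,
# plus a numerical-rank check of one submatrix.  Sizes and tolerances are
# illustrative, not values from the paper.
N = 1024
X = np.arange(N) / N                      # x_n = n / N
Omega = np.arange(-N // 2, N // 2)        # xi = -N/2, ..., N/2 - 1
K = np.exp(2j * np.pi * np.outer(X, Omega))
g = np.random.default_rng(0).standard_normal(N)

u = K @ g                                 # direct evaluation of (3)
# Same sum via the inverse FFT, after reordering xi to 0, ..., N-1 (mod N)
u_fft = N * np.fft.ifft(np.fft.ifftshift(g))
print(np.allclose(u, u_fft))              # True

# A block with x in a box of width 1/sqrt(N) and xi in a box of width
# sqrt(N): its numerical rank stays bounded as N grows.
m = int(np.sqrt(N))                       # 32
s = np.linalg.svd(K[:m, :m], compute_uv=False)
rank = int((s > 1e-8 * s[0]).sum())
print(rank)                               # well below the block size 32
```

The first check confirms that (3) with $\Phi(x,\xi) = x\xi$ is just a reordered inverse DFT; the second previews the rank behavior that the dyadic trees $T_X$ and $T_\Omega$ exploit.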
Such a matrix $K$ of size $O(N) \times O(N)$ is said to satisfy the complementary low-rank property if, for any level $\ell$, any node $A$ in $T_X$ at level $\ell$, and any node $B$ in $T_\Omega$ at level $L - \ell$, the submatrix $K^{A,B}$, obtained by restricting $K$ to the rows indexed by the points in $A$ and the columns indexed by the points in $B$, is numerically low-rank; i.e., for a given precision $\epsilon$ there exists a low-rank approximation of $K^{A,B}$ with an error bounded by $\epsilon$ and a rank bounded polynomially in $\log(1/\epsilon)$ and independently of $N$. See Figure 1 for an illustration of low-rank submatrices in a complementary low-rank matrix of size $16 \times 16$.

Figure 1: Hierarchical decomposition of the row and column indices of a one-dimensional complementary low-rank matrix of size 16 ×
16. The tree $T_X$ ($T_\Omega$) has a root containing 16 row (column) indices and leaves each containing a single row (column) index. The rectangles indicate some of the low-rank submatrices.

It will be shown that, for a complementary low-rank matrix $K$, the matrix-vector multiplication $u = Kg$ can be carried out efficiently via a preliminary interpolative butterfly factorization (IBF) constructed by interpolative low-rank approximations:
\[ K \approx U^L G^{L-1} \cdots G^h M^h (H^h)^* \cdots (H^1)^* (V^0)^*, \]
where the depth $L = O(\log N)$ of $T_X$ and $T_\Omega$ is assumed to be even, $h = L/2$, and each factor is a sparse matrix with $O(N)$ nonzero entries but a large prefactor. Dense blocks in these sparse factors come from interpolative low-rank approximations of low-rank submatrices as illustrated in Figure 1. When the kernel function $K$ satisfies Assumption 1.1, each low-rank approximation of these submatrices can be constructed explicitly by Lagrange interpolation with $q$ Chebyshev grid points in each dimension, resulting in dense submatrices of size $q^d \times q^d$ in the sparse factors. Hence, all the matrix factors are available explicitly. However, the low-rank approximation by interpolation is not optimal in the sense that it overestimates the numerical rank $r$ of the low-rank matrix, i.e., $q^d > r$. Hence, dense blocks in these sparse factors can be further compressed to smaller submatrices of size $r \times r$ by a truncated SVD.

The preliminary IBF above can be further compressed by a novel sweeping matrix compression method in two stages. In the sweep-out stage, a sequence of structure-preserving matrix compressions is conducted:
\[ M^h \approx C^h \bar{M}^h (R^h)^*, \]
\[ G^\ell C^\ell \approx C^{\ell+1} \bar{G}^\ell \quad \text{for } \ell = h, h+1, \dots, L -
1,
\[ (H^\ell R^\ell)^* \approx (\bar{H}^\ell)^* (R^{\ell-1})^* \quad \text{for } \ell = h, h-1, \dots,
1, starting from the middle matrix $M^h$ and moving towards the outer matrices. Let $\bar{U}^L = U^L C^L$ and $\bar{V}^0 = V^0 R^0$; then we have a further compressed factorization
\[ K \approx \bar{U}^L \bar{G}^{L-1} \cdots \bar{G}^h \bar{M}^h (\bar{H}^h)^* \cdots (\bar{H}^1)^* (\bar{V}^0)^*, \]
where all sparse factors have dense submatrices of size closer to $r \times r$. $\bar{U}^L$ and $\bar{V}^0$ are block-diagonal matrices and the sizes of their diagonal blocks depend on the distribution of points in $X$ and $\Omega$. In the case of a nonuniform distribution, there might be diagonal blocks of size smaller than $r \times r$ in $\bar{U}^L$ and $\bar{V}^0$. This motivates the sweep-in stage, which contains another sequence of structure-preserving matrix compressions:
\[ \bar{U}^L \approx \dot{U}^L \bar{C}^L, \qquad \bar{V}^0 \approx \dot{V}^0 \bar{R}^0, \]
\[ \bar{C}^{\ell+1} \bar{G}^\ell \approx \dot{G}^\ell \bar{C}^\ell \quad \text{for } \ell = L-1, L-2, \dots, h, \]
\[ (\bar{H}^\ell)^* (\bar{R}^{\ell-1})^* \approx (\bar{R}^\ell)^* (\dot{H}^\ell)^* \quad \text{for } \ell = 1, 2, \dots, h, \]
starting from the outer matrices and moving towards the middle matrix. Finally, let
\[ \dot{M}^h = \bar{C}^h \bar{M}^h (\bar{R}^h)^*, \]
and one reaches the optimal IBF
\[ K \approx \dot{U}^L \dot{G}^{L-1} \cdots \dot{G}^h \dot{M}^h (\dot{H}^h)^* \cdots (\dot{H}^1)^* (\dot{V}^0)^*, \tag{5} \]
where all sparse factors have dense submatrices of nearly optimal size.

The optimal IBF represents $K$ as a product of $L + 3$ sparse matrices, where all factors are sparse matrices with $O(N)$ nonzero entries and the prefactor of $O(N)$ is nearly optimal. Once constructed, the cost of applying $K$ to a given vector $g \in \mathbb{C}^N$ is $O(N \log N)$.

1.2 Content

The rest of this paper is organized as follows. Section 2 briefly reviews low-rank factorization techniques and the butterfly algorithm in [7]. Section 3 describes the one-dimensional preliminary interpolative butterfly factorization based on interpolative low-rank factorization. Section 4 introduces a structure-preserving matrix compression technique and a sweeping method to further compress the preliminary IBF into an optimal one.
The multidimensional extension and the general case where the phase function has a singularity at $\xi = 0$ are discussed in Section 5. In Section 6, numerical examples are provided to demonstrate the efficiency of the proposed algorithms. Finally, Section 7 concludes this paper with a short discussion.

This section reviews basic tools for efficient low-rank approximations that are repeatedly used in this paper.
Randomized low-rank approximation
For a matrix $Z \in \mathbb{C}^{m \times n}$, we define a rank-$r$ approximate singular value decomposition (SVD) of $Z$ as
\[ Z \approx U_0 \Sigma_0 V_0^*, \]
where $U_0 \in \mathbb{C}^{m \times r}$ has orthonormal columns, $\Sigma_0 \in \mathbb{R}^{r \times r}$ is diagonal, and $V_0 \in \mathbb{C}^{n \times r}$ has orthonormal columns. A straightforward method to obtain the optimal rank-$r$ approximation of $Z$ is to compute its truncated SVD, where $U_0$ consists of the first $r$ left singular vectors, $\Sigma_0$ is a diagonal matrix with the first $r$ singular values in decreasing order, and $V_0$ consists of the first $r$ right singular vectors.

The truncated SVD of $Z$ takes $O(mn \min(m,n))$ operations. More efficient tools have been proposed by introducing randomness into the computation of approximate SVDs for numerically low-rank matrices. To name a few, the method in [16] is based on applying the matrix to random vectors, while the one in [13, 34] relies on sampling the matrix entries randomly. Throughout this paper, the second one is applied to compute large low-rank approximations because it takes only linear operations with respect to the matrix size. Readers are referred to [13, 34] for detailed implementation.

When an approximate SVD $Z \approx U_0 \Sigma_0 V_0^*$ is ready, it can be rearranged in several equivalent ways. First, one can write $Z \approx USV^*$, where
\[ U = U_0 \Sigma_0^{1/2}, \quad S = I, \quad V^* = \Sigma_0^{1/2} V_0^*, \tag{6} \]
so that the left and right factors inherit similar singular values from the original numerically low-rank matrix.
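For illustration, the first randomized approach (applying the matrix to random test vectors, in the spirit of [16]) can be sketched as follows. This is a generic sketch with illustrative sizes and oversampling, not the sampling-based scheme of [13, 34] that the paper actually employs.

```python
import numpy as np

def randomized_svd(Z, r, p=5, seed=None):
    """Rank-r approximate SVD by applying Z to random vectors (a sketch in
    the spirit of [16]; p is an illustrative oversampling parameter)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((Z.shape[1], r + p))   # random test matrix
    Q, _ = np.linalg.qr(Z @ G)                     # orthonormal range basis
    U1, s, Vh = np.linalg.svd(Q.conj().T @ Z, full_matrices=False)
    return Q @ U1[:, :r], s[:r], Vh[:r]

# A matrix of exact rank 5 is recovered to near machine precision.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))
U, s, Vh = randomized_svd(Z, 5)
err = np.linalg.norm(Z - (U * s) @ Vh) / np.linalg.norm(Z)
print(err)   # tiny relative error
```

The rearrangements (6), (7), (8) below then amount to multiplying the diagonal factor `s` into the left factor, the right factor, or splitting its square root between both.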
Depending on the application, sometimes it is better to write the approximation as $Z \approx UV^*$, where
\[ U = U_0 \quad \text{and} \quad V^* = \Sigma_0 V_0^*, \tag{7} \]
or
\[ U = U_0 \Sigma_0 \quad \text{and} \quad V^* = V_0^*, \tag{8} \]
so that only one factor shares the singular values of $Z$.

Interpolative low-rank approximation

The randomized low-rank approximation is efficient if a kernel matrix is given, while the interpolative low-rank approximation is efficient when the explicit formula of the kernel function $K(x,\xi) = e^{2\pi \imath \Phi(x,\xi)}$ is available, because the approximation can be constructed explicitly.

Following the theorems in [7, 22], it can be proved that the kernel function $K(x,\xi) = e^{2\pi \imath \Phi(x,\xi)}$ satisfies the complementary low-rank property if it fulfills the properties in Assumption 1.1. We recall the key idea here for one-dimensional kernels. Let the sets $X$ and $\Omega$ refer to the sets defined in (2) and (4). Let $A$ and $B$ be a box pair in the dyadic trees $T_X$ and $T_\Omega$ such that their levels satisfy $\ell_A + \ell_B = L$. Then a low-rank separated representation
\[ K(x,\xi) = e^{2\pi \imath \Phi(x,\xi)} \approx \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \beta_t^{AB}(\xi) \quad \text{for } x \in A,\ \xi \in B \]
exists and can be constructed via the interpolative low-rank approximation as follows. Let
\[ R^{AB}(x,\xi) := \Phi(x,\xi) - \Phi(c_A,\xi) - \Phi(x,c_B) + \Phi(c_A,c_B), \tag{9} \]
where $c_A$ and $c_B$ are the centers of $A$ and $B$ respectively; then the kernel can be written as
\[ e^{2\pi \imath \Phi(x,\xi)} = e^{2\pi \imath \Phi(c_A,\xi)}\, e^{2\pi \imath \Phi(x,c_B)}\, e^{-2\pi \imath \Phi(c_A,c_B)}\, e^{2\pi \imath R^{AB}(x,\xi)}. \tag{10} \]
Hence, the low-rank approximation of $K(x,\xi)$ on $A \times B$ reduces to the low-rank approximation of $e^{2\pi \imath R^{AB}(x,\xi)}$.

Let $w_A$ and $w_B$ denote the lengths of the intervals $A$ and $B$, respectively. A Lagrange interpolation with Chebyshev points, in $x$ when $w_A \le 1/\sqrt{N}$ and in $\xi$ when $w_B \le \sqrt{N}$, is applied to construct the low-rank approximation of $e^{2\pi \imath R^{AB}(x,\xi)}$. For this purpose, we associate with each interval a Chebyshev grid as follows. For a fixed integer $r$, the Chebyshev grid of order $r$ on [−1/2, 1/
2] is defined by
\[ \left\{ z_t = \frac{1}{2} \cos\!\left( \frac{t\pi}{r-1} \right) \right\}_{0 \le t \le r-1}. \]
A grid adapted to an interval $A$ with center $c_A$ and length $w_A$ is then defined via shifting and scaling as
\[ \{x_t\}_{t=0,1,\dots,r-1} = \{ c_A + w_A z_t \}_{t=0,1,\dots,r-1}. \]
Given a set of grid points $\{x_t\}_{t=0,1,\dots,r-1}$ in $A$, define the Lagrange interpolation polynomial $M_t^A(x)$ taking value $1$ at $x_t$ and $0$ at the other Chebyshev grid points:
\[ M_t^A(x) = \prod_{0 \le j \le r-1,\ j \ne t} \frac{x - x_j}{x_t - x_j}. \]
Similarly, $M_t^B$ denotes the Lagrange interpolation polynomial for the interval $B$. Now we are ready to construct the low-rank approximation of $e^{2\pi \imath R^{AB}(x,\xi)}$ with $r_\epsilon$ Chebyshev points for $\epsilon$-accuracy by interpolation:

• when $w_B \le \sqrt{N}$, the Lagrange interpolation of $e^{2\pi \imath R^{AB}(x,\xi)}$ in $\xi$ on a Chebyshev grid $\{g_t^B\}_{1 \le t \le r_\epsilon}$ adapted to $B$ obeys
\[ e^{2\pi \imath R^{AB}(x,\xi)} \approx \sum_{t=1}^{r_\epsilon} e^{2\pi \imath R^{AB}(x, g_t^B)} M_t^B(\xi), \quad \forall x \in A,\ \forall \xi \in B; \tag{11} \]

• when $w_A \le 1/\sqrt{N}$, the Lagrange interpolation of $e^{2\pi \imath R^{AB}(x,\xi)}$ in $x$ on a Chebyshev grid $\{g_t^A\}_{1 \le t \le r_\epsilon}$ adapted to $A$ obeys
\[ e^{2\pi \imath R^{AB}(x,\xi)} \approx \sum_{t=1}^{r_\epsilon} M_t^A(x)\, e^{2\pi \imath R^{AB}(g_t^A, \xi)}, \quad \forall x \in A,\ \forall \xi \in B. \tag{12} \]

Finally, we are ready to construct the low-rank approximation for the kernel $e^{2\pi \imath \Phi(x,\xi)}$:

• when $w_B \le \sqrt{N}$, we multiply (11) by $e^{2\pi \imath \Phi(c_A,\xi)}\, e^{2\pi \imath \Phi(x,c_B)}\, e^{-2\pi \imath \Phi(c_A,c_B)}$, which gives, $\forall x \in A$, $\forall \xi \in B$,
\[ e^{2\pi \imath \Phi(x,\xi)} \approx \sum_{t=1}^{r_\epsilon} e^{2\pi \imath \Phi(x, g_t^B)} \left( e^{-2\pi \imath \Phi(c_A, g_t^B)} M_t^B(\xi)\, e^{2\pi \imath \Phi(c_A,\xi)} \right); \tag{13} \]

• when $w_A \le 1/\sqrt{N}$, we multiply (12) by $e^{2\pi \imath \Phi(c_A,\xi)}\, e^{2\pi \imath \Phi(x,c_B)}\, e^{-2\pi \imath \Phi(c_A,c_B)}$ and obtain, $\forall x \in A$, $\forall \xi \in B$,
\[ e^{2\pi \imath \Phi(x,\xi)} \approx \sum_{t=1}^{r_\epsilon} \left( e^{2\pi \imath \Phi(x,c_B)} M_t^A(x)\, e^{-2\pi \imath \Phi(g_t^A, c_B)} \right) e^{2\pi \imath \Phi(g_t^A, \xi)}.
(14)

The interpolative low-rank factorization can be constructed on the fly from the explicit formulas above, which is its main advantage over randomized low-rank approximations. However, since it relies on information on a fixed Chebyshev grid, the number of Chebyshev points must be sufficiently large to obtain an accurate approximation; i.e., the $\epsilon$-separation rank $r_\epsilon$ might be greater than the true numerical rank with $\epsilon$-accuracy.

The butterfly algorithm

This section provides a brief description of the overall structure of the butterfly algorithm based on the interpolative low-rank approximation in the previous section. In this section, $X$ and $\Omega$ refer to two general sets of $N$ points in $\mathbb{R}$. Without loss of generality, we assume the points in these two sets are distributed quasi-uniformly, but they are not necessarily the sets defined in (2) and (4).

Given an input $\{g(\xi),\ \xi \in \Omega\}$, the goal is to compute the potentials $\{u(x),\ x \in X\}$ defined by
\[ u(x) = \sum_{\xi \in \Omega} K(x,\xi)\, g(\xi), \quad x \in X, \]
where $K(x,\xi)$ is a kernel function. The main data structure of the butterfly algorithm is a pair of dyadic trees $T_X$ and $T_\Omega$. Recall that for any pair of intervals $A \times B \in T_X \times T_\Omega$ obeying the condition $\ell_A + \ell_B = L$, the submatrix $\{K(x,\xi)\}_{x \in A, \xi \in B}$ is approximately of constant rank. An explicit method to construct its low-rank approximation is given by the interpolative low-rank approximation. More precisely, for any $\epsilon >$
0, there exists a constant $r_\epsilon$ independent of $N$ and two sets of functions $\{\alpha_t^{AB}(x)\}_{1 \le t \le r_\epsilon}$ and $\{\beta_t^{AB}(\xi)\}_{1 \le t \le r_\epsilon}$, given in (13) or (14), such that
\[ \left| K(x,\xi) - \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \beta_t^{AB}(\xi) \right| \le \epsilon, \quad \forall x \in A,\ \forall \xi \in B. \tag{15} \]
For a given interval $B$ in $\Omega$, define $u^B(x)$ to be the potential restricted to the sources $\xi \in B$:
\[ u^B(x) = \sum_{\xi \in B} K(x,\xi)\, g(\xi). \]
Approximating $\{u^B(x)\}_{x \in A}$ by summing (15) over $\xi \in B$ with coefficients $g(\xi)$ gives
\[ \left| u^B(x) - \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x) \sum_{\xi \in B} \beta_t^{AB}(\xi)\, g(\xi) \right| \le \sum_{\xi \in B} |g(\xi)|\, \epsilon, \quad \forall x \in A. \]
Therefore, if one can find coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ obeying
\[ \lambda_t^{AB} \approx \sum_{\xi \in B} \beta_t^{AB}(\xi)\, g(\xi), \quad 1 \le t \le r_\epsilon, \tag{16} \]
then the restricted potential $\{u^B(x)\}_{x \in A}$ admits a compact expansion
\[ \left| u^B(x) - \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \lambda_t^{AB} \right| \le \sum_{\xi \in B} |g(\xi)|\, \epsilon, \quad \forall x \in A. \]
The butterfly algorithm below provides an efficient way to compute $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ recursively. The general structure of the algorithm consists of a top-down traversal of $T_X$ and a bottom-up traversal of $T_\Omega$, carried out simultaneously. A schematic illustration of the data flow is provided in Figure 2.

Algorithm 2.1.
Butterfly algorithm

1.
Preliminaries.
Construct the trees $T_X$ and $T_\Omega$.

2. Initialization.
Let $A$ be the root of $T_X$. For each leaf interval $B$ of $T_\Omega$, construct the expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ for the potential $\{u^B(x)\}_{x \in A}$ by simply setting
\[ \lambda_t^{AB} = \sum_{\xi \in B} \beta_t^{AB}(\xi)\, g(\xi), \quad 1 \le t \le r_\epsilon. \tag{17} \]
By the interpolative low-rank approximation, we can define the expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ by
\[ \lambda_t^{AB} := e^{-2\pi \imath \Phi(c_A, g_t^B)} \sum_{\xi \in B} \left( M_t^B(\xi)\, e^{2\pi \imath \Phi(c_A, \xi)}\, g(\xi) \right). \tag{18} \]

3. Recursion.
For $\ell = 1, 2, \dots, L/2$, visit level $\ell$ in $T_X$ and level $L - \ell$ in $T_\Omega$. For each pair $(A, B)$ with $\ell_A = \ell$ and $\ell_B = L - \ell$, construct the expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ for the potential $\{u^B(x)\}_{x \in A}$ using the low-rank representation constructed at the previous level. Let $P$ be $A$'s parent and $C$ be a child of $B$. Throughout, we shall use the notation $C \succ B$ when $C$ is a child of $B$. At level $\ell - 1$, the expansion coefficients $\{\lambda_s^{PC}\}_{1 \le s \le r_\epsilon}$ of $\{u^C(x)\}_{x \in P}$ are readily available and we have
\[ \left| u^C(x) - \sum_{s=1}^{r_\epsilon} \alpha_s^{PC}(x)\, \lambda_s^{PC} \right| \le \sum_{\xi \in C} |g(\xi)|\, \epsilon, \quad \forall x \in P. \]
Since $u^B(x) = \sum_{C \succ B} u^C(x)$, the previous inequality implies that
\[ \left| u^B(x) - \sum_{C \succ B} \sum_{s=1}^{r_\epsilon} \alpha_s^{PC}(x)\, \lambda_s^{PC} \right| \le \sum_{\xi \in B} |g(\xi)|\, \epsilon, \quad \forall x \in P. \]
Since $A \subset P$, the above approximation is of course true for any $x \in A$. On the other hand, since $\ell_A + \ell_B = L$, the restricted potential $\{u^B(x)\}_{x \in A}$ also has a low-rank approximation of size $r_\epsilon$, namely,
\[ \left| u^B(x) - \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \lambda_t^{AB} \right| \le \sum_{\xi \in B} |g(\xi)|\, \epsilon, \quad \forall x \in A. \]
Combining the last two approximations, we see that $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ should obey
\[ \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \lambda_t^{AB} \approx \sum_{C \succ B} \sum_{s=1}^{r_\epsilon} \alpha_s^{PC}(x)\, \lambda_s^{PC}, \quad \forall x \in A. \tag{19} \]
This is an over-determined linear system for $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ when $\{\lambda_s^{PC}\}_{1 \le s \le r_\epsilon,\, C \succ B}$ are available.
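A minimal numerical sketch of this child-to-parent coefficient update for the pure Fourier phase $\Phi(x,\xi) = x\xi$ (the sizes $N$, $L$, the interpolation order $r$, and the particular boxes are illustrative assumptions, not values from the paper): leaf coefficients are initialized as in (18), one update to a level-$1$ pair is applied (cf. the explicit transformation (20) given below), and the resulting compact expansion is compared against the directly computed restricted potential $u^B(x)$.

```python
import numpy as np

# One butterfly recursion step for Phi(x, xi) = x*xi (illustrative sketch).
N, L, r = 64, 6, 12
X = np.arange(N) / N                       # x-grid in [0, 1)
Omega = np.arange(-N // 2, N // 2)         # xi-grid; root box [-32, 32)
g = np.random.default_rng(0).standard_normal(N)

def cheb(center, width):
    # Chebyshev grid of order r adapted to an interval (shifted and scaled)
    return center + 0.5 * width * np.cos(np.arange(r) * np.pi / (r - 1))

def lagrange(nodes, pts):
    # M[t, k] = M_t(pts[k]): Lagrange basis on `nodes` evaluated at `pts`
    M = np.ones((r, len(pts)))
    for t in range(r):
        for j in range(r):
            if j != t:
                M[t] *= (pts - nodes[j]) / (nodes[t] - nodes[j])
    return M

# Initialization, as in (18): A = root of T_X (center 1/2), B = a leaf of
# T_Omega (width 1, containing one integer xi each).
cP = 0.5
lam = {}
for C in range(N):
    xiC, gC = Omega[C:C + 1], cheb(Omega[C] + 0.5, 1.0)
    lam[C] = np.exp(-2j * np.pi * cP * gC) \
        * (lagrange(gC, xiC) @ (np.exp(2j * np.pi * cP * xiC) * g[C]))

# One recursion step: A = [0, 1/2) at level 1, B = a level-(L-1) box of
# width 2 whose children C are the leaves 20 and 21.
cA = 0.25
gB = cheb(Omega[20] + 1.0, 2.0)
lamAB = np.zeros(r, dtype=complex)
for C in (20, 21):
    gC = cheb(Omega[C] + 0.5, 1.0)
    lamAB += lagrange(gB, gC) @ (np.exp(2j * np.pi * cA * gC) * lam[C])
lamAB *= np.exp(-2j * np.pi * cA * gB)

# Compact expansion u^B(x) ~ sum_t e^{2 pi i x g_t^B} lambda_t^{AB} on A
xA = X[:N // 2]
u_exact = np.exp(2j * np.pi * np.outer(xA, Omega[20:22])) @ g[20:22]
u_approx = np.exp(2j * np.pi * np.outer(xA, gB)) @ lamAB
print(np.max(np.abs(u_exact - u_approx)))  # small
```

The update never touches the sources $g(\xi)$ directly: only the $r$ child coefficients per box are re-interpolated, which is what makes each level cost $O(N)$.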
The butterfly algorithm uses an efficient linear transformation approximately mapping $\{\lambda_s^{PC}\}_{1 \le s \le r_\epsilon,\, C \succ B}$ into $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ as follows:
\[ \lambda_t^{AB} := e^{-2\pi \imath \Phi(c_A, g_t^B)} \sum_{C \succ B} \sum_{s=1}^{r_\epsilon} M_t^B(g_s^C)\, e^{2\pi \imath \Phi(c_A, g_s^C)}\, \lambda_s^{PC}. \tag{20} \]

4. Switch.
For the levels visited so far, the Chebyshev interpolation is applied in the variable $\xi$, while the interpolation is applied in the variable $x$ for levels $\ell > L/2$. Hence, we switch the interpolation method at this step. We are still working on level $\ell = L/2$ and the same domain pairs $(A, B)$ as in the last step. Let $\bar{\lambda}_s^{AB}$ denote the expansion coefficients obtained by Chebyshev interpolation in the variable $\xi$ in the last step. Correspondingly, $\{g_s^B\}_s$ are the grid points in $B$ from the last step. We take advantage of the interpolation in the variable $x$ in $A$ and generate grid points $\{g_t^A\}_{1 \le t \le r_\epsilon}$ in $A$. Then we can define new expansion coefficients
\[ \lambda_t^{AB} := \sum_{s=1}^{r_\epsilon} e^{2\pi \imath \Phi(g_t^A, g_s^B)}\, \bar{\lambda}_s^{AB}. \]

5. Recursion.
Similar to the discussion in Step 3, we go up in the tree $T_\Omega$ and down in the tree $T_X$ simultaneously until we reach level $\ell = L$. We construct the approximation functions by Chebyshev interpolation in the variable $x$ as follows:
\[ \alpha_t^{AB}(x) = e^{2\pi \imath \Phi(x, c_B)}\, M_t^A(x)\, e^{-2\pi \imath \Phi(g_t^A, c_B)}, \qquad \beta_t^{AB}(\xi) = e^{2\pi \imath \Phi(g_t^A, \xi)}. \tag{21} \]
Hence, the new expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ can be defined as
\[ \lambda_t^{AB} := \sum_{C \succ B} e^{2\pi \imath \Phi(g_t^A, c_C)} \sum_{s=1}^{r_\epsilon} \left( M_s^P(g_t^A)\, e^{-2\pi \imath \Phi(g_s^P, c_C)}\, \lambda_s^{PC} \right), \tag{22} \]
where again $P$ is $A$'s parent and $C$ is a child interval of $B$.

6. Termination.
Finally, $\ell = L$; set $B$ to be the root node of $T_\Omega$. For each leaf interval $A \in T_X$, use the constructed expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ in (22) to evaluate $u^B(x)$ for each $x \in A$:
\[ u(x) = u^B(x) = \sum_{t=1}^{r_\epsilon} \alpha_t^{AB}(x)\, \lambda_t^{AB} = e^{2\pi \imath \Phi(x, c_B)} \sum_{t=1}^{r_\epsilon} \left( M_t^A(x)\, e^{-2\pi \imath \Phi(g_t^A, c_B)}\, \lambda_t^{AB} \right). \tag{23} \]

Figure 2: Trees of the row and column indices. Left: $T_X$ for the row indices $X$. Right: $T_\Omega$ for the column indices $\Omega$. The interaction between $A \in T_X$ and $B \in T_\Omega$ starts at the root of $T_X$ and the leaves of $T_\Omega$.

This section presents the preliminary interpolative butterfly factorization (IBF) for a matrix $K \in \mathbb{C}^{N \times N}$ when its kernel function satisfies Assumption 1.1. In fact, the preliminary interpolative butterfly factorization is a matrix representation of the butterfly algorithm [7]. Similar to the butterfly algorithm, we adopt the same notation for the point sets $X$ and $\Omega$ and the trees $T_X$ and $T_\Omega$ of depth $L$ (assumed to be an even number). At each level $\ell$, $\ell = 0, \dots, L$, we denote the $i$th node at level $\ell$ in $T_X$ as $A_i^\ell$ for $i = 0, 1, \dots, 2^\ell - 1$, and the $j$th node at level $L - \ell$ in $T_\Omega$ as $B_j^{L-\ell}$ for $j = 0, 1, \dots, 2^{L-\ell} -
1. These nodes naturally partition $K$ into $O(N)$ submatrices $K^{A_i^\ell, B_j^{L-\ell}}$. For simplicity, we write $K_{i,j}^\ell := K^{A_i^\ell, B_j^{L-\ell}}$, where the superscript indicates the level (in $T_X$). With abuse of notation, we sometimes use the subscripts $i, j$ and the superscript $\ell$ on a matrix or a vector corresponding to the domain pair $(A_i^\ell, B_j^{L-\ell})$ for simplicity, even when it is not a submatrix or a subvector restricted to $(A_i^\ell, B_j^{L-\ell})$.

The preliminary IBF is based on the observation that the butterfly algorithm in Section 2.2 can be written in the form of a matrix factorization. The operations in Steps 2 to 6 of Algorithm 2.1 are essentially a sequence of $O(\log N)$ matrix-vector multiplications, each matrix of which has only $O(r_\epsilon N)$ nonzero entries. Since all the operations are given explicitly by Chebyshev interpolation, these sparse matrices can be formed by explicit formulas. By formulating these matrices step by step following the flow of Algorithm 2.1, the preliminary IBF is constructed as follows.

1. Preliminaries.
Construct the trees $T_X$ and $T_\Omega$.

2. Initialization.
At level $\ell = 0$, for each $j = 0, 1, \dots, 2^L - 1$, let $A$ be the root of $T_X$ and $B$ be $B_j^L$ of $T_\Omega$. Construct the expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ for the potential $\{u^B(x)\}_{x \in A}$ by simply setting
\[ \lambda_t^{AB} := e^{-2\pi \imath \Phi(c_A, g_t^B)} \sum_{\xi \in B} \left( M_t^B(\xi)\, e^{2\pi \imath \Phi(c_A, \xi)}\, g(\xi) \right). \tag{24} \]
For each $j = 0, 1, \dots, 2^L - 1$, a column vector $\Lambda_j^0$ corresponding to the domain pair $(A, B) = (X, B_j^L)$ is defined as
\[ \Lambda_j^0 = \begin{pmatrix} \lambda_1^{AB} \\ \vdots \\ \lambda_{r_\epsilon}^{AB} \end{pmatrix}. \]
Let $(V_j)^* \in \mathbb{C}^{r_\epsilon \times O(1)}$ represent the linear transformation in (24) and let $g_j$ be the vector representing $g(\xi)$ for $\xi \in B_j^L$. Then we have
\[ \Lambda_j^0 = (V_j)^* g_j. \tag{25} \]
As we shall see later, the conjugate transpose $*$ is applied for notational consistency. By assembling the matrix-vector multiplications in (25), all the operations in this step can be written as
\[ \Lambda^0 = \begin{pmatrix} \Lambda_0^0 \\ \vdots \\ \Lambda_{2^L-1}^0 \end{pmatrix} = (V^0)^* g, \quad \text{where} \quad V^0 = \mathrm{diag}\left\{ V_0, V_1, \cdots, V_{2^L-1} \right\}. \]

3. Recursion.
For $\ell = 1, 2, \dots, L/2$, visit level $\ell$ in $T_X$ and level $L - \ell$ in $T_\Omega$. At each level $\ell$, for each pair $(A, B)$ with $\ell_A = \ell$ and $\ell_B = L - \ell$, construct the expansion coefficients $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ using the low-rank representation constructed at the previous level ($\ell = 0$ is the initialization step). Let $P$ be $A$'s parent and $C$ be a child of $B$. An efficient linear transformation approximately mapping $\{\lambda_s^{PC}\}_{1 \le s \le r_\epsilon,\, C \succ B}$ into $\{\lambda_t^{AB}\}_{1 \le t \le r_\epsilon}$ is constructed by
\[ \lambda_t^{AB} := e^{-2\pi \imath \Phi(c_A, g_t^B)} \sum_{C \succ B} \sum_{s=1}^{r_\epsilon} M_t^B(g_s^C)\, e^{2\pi \imath \Phi(c_A, g_s^C)}\, \lambda_s^{PC}. \tag{26} \]
At each level $\ell$, for each $i = 0, 1, \dots, 2^\ell - 1$ and $j = 0, 1, \dots, 2^{L-\ell} - 1$, a column vector $\Lambda_{i,j}^\ell$ is defined as
\[ \Lambda_{i,j}^\ell = \begin{pmatrix} \lambda_1^{AB} \\ \vdots \\ \lambda_{r_\epsilon}^{AB} \end{pmatrix} \in \mathbb{C}^{r_\epsilon}, \tag{27} \]
where the domain pair $(A, B) = (A_i^\ell, B_j^{L-\ell})$. For a fixed $j$, stacking the vectors $\{\Lambda_{i,j}^\ell\}_i$ together forms a larger column vector $\Lambda_j^\ell$, and stacking the vectors $\{\Lambda_j^\ell\}_j$ together forms a column vector $\Lambda^\ell$:
\[ \Lambda_j^\ell = \begin{pmatrix} \Lambda_{0,j}^\ell \\ \vdots \\ \Lambda_{2^\ell - 1, j}^\ell \end{pmatrix}, \qquad \Lambda^\ell = \begin{pmatrix} \Lambda_0^\ell \\ \vdots \\ \Lambda_{2^{L-\ell} - 1}^\ell \end{pmatrix}. \]
The linear transformation in (26) maps the two small column vectors $\Lambda_{i, 2j}^{\ell-1}$ and $\Lambda_{i, 2j+1}^{\ell-1}$ at the previous level $\ell - 1$ to $\Lambda_{2i, j}^\ell$ if $A$ is the first child of $P$ (or $\Lambda_{2i+1, j}^\ell$ if $A$ is the second child of $P$) at the current level $\ell$, for all $i = 0, 1, \dots, 2^{\ell-1} - 1$ and $j = 0, 1, \dots, 2^{L-\ell} - 1$. Hence, for each pair $(i, j)$, the linear transformation in (26) can be written as a matrix-vector multiplication
\[ \Lambda_{2i, j}^\ell = \begin{pmatrix} H_{2i, 2j}^\ell & H_{2i, 2j+1}^\ell \end{pmatrix} \begin{pmatrix} \Lambda_{i, 2j}^{\ell-1} \\ \Lambda_{i, 2j+1}^{\ell-1} \end{pmatrix}, \tag{28} \]
where $H_{2i, 2j}^\ell \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$ and $H_{2i, 2j+1}^\ell \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$, or
\[ \Lambda_{2i+1, j}^\ell = \begin{pmatrix} H_{2i+1, 2j}^\ell & H_{2i+1, 2j+1}^\ell \end{pmatrix} \begin{pmatrix} \Lambda_{i, 2j}^{\ell-1} \\ \Lambda_{i, 2j+1}^{\ell-1} \end{pmatrix}, \tag{29} \]
where $H_{2i+1, 2j}^\ell \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$ and $H_{2i+1, 2j+1}^\ell \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$. Assemble the small matrices in (28) and (29) for all $i$ into a matrix $H_j^\ell$ such that
\[ \Lambda_j^\ell = (H_j^\ell)^* \begin{pmatrix} \Lambda_{2j}^{\ell-1} \\ \Lambda_{2j+1}^{\ell-1} \end{pmatrix} \quad \text{for } j = 0, 1, \dots, 2^{L-\ell} - 1, \]
and let $H^\ell = \mathrm{diag}\left\{ H_0^\ell, H_1^\ell, \dots, H_{2^{L-\ell}-1}^\ell \right\}$; then all the operations in this step can be represented by a large matrix-vector multiplication
\[ \Lambda^\ell = (H^\ell)^* \Lambda^{\ell-1} \quad \text{for } \ell = 1, 2, \dots, L/2. \]

4. Switch.
For the levels visited so far, the Chebyshev interpolation is applied in the variable $\xi$, while the interpolation is applied in the variable $x$ for levels $\ell > L/2$. Hence, we switch the interpolation method at this step. We are still working at level $\ell = L/2$ with the same domain pairs $(A, B) = (A_i^{L/2}, B_j^{L/2})$ as in the last step. Recall that $\Lambda^{L/2}$ denotes the expansion coefficients obtained by Chebyshev interpolation in the variable $\xi$ in the last step. Correspondingly, $\{g_s^B\}_s$ are the grid points in $B$ from the last step. We take advantage of the interpolation in the variable $x$ in $A$ and generate grid points $\{g_t^A\}_{1 \le t \le r_\epsilon}$ in $A$. Then we can define new expansion coefficients for Chebyshev interpolation in the variable $x$ for future steps as
\[ \lambda_t^{AB} := \sum_{s=1}^{r_\epsilon} e^{2\pi \imath \Phi(g_t^A, g_s^B)}\, \bar{\lambda}_s^{AB}. \tag{30} \]
Let $\tilde{M}_{i,j}^{L/2} \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$ represent the linear transformation introduced by $\{ e^{2\pi \imath \Phi(g_t^A, g_s^B)} \}_{1 \le t \le r_\epsilon,\, 1 \le s \le r_\epsilon}$ for $i, j = 0, 1, \cdots, 2^{L/2} - 1$. Then (30) is equivalent to
\[ \tilde{\Lambda}_{i,j}^{L/2} = \tilde{M}_{i,j}^{L/2} \Lambda_{i,j}^{L/2} \]
for each domain pair $(A, B) = (A_i^{L/2}, B_j^{L/2})$. Recall that
\[ \Lambda_j^{L/2} = \begin{pmatrix} \Lambda_{0,j}^{L/2} \\ \vdots \\ \Lambda_{2^{L/2}-1, j}^{L/2} \end{pmatrix}, \quad \text{and} \quad \Lambda^{L/2} = \begin{pmatrix} \Lambda_0^{L/2} \\ \vdots \\ \Lambda_{2^{L/2}-1}^{L/2} \end{pmatrix}. \]
If we define the expansion coefficients in a new order by
\[ \tilde{\Lambda}_i^{L/2} = \begin{pmatrix} \tilde{\Lambda}_{i,0}^{L/2} \\ \vdots \\ \tilde{\Lambda}_{i, 2^{L/2}-1}^{L/2} \end{pmatrix}, \quad \text{and} \quad \tilde{\Lambda}^{L/2} = \begin{pmatrix} \tilde{\Lambda}_0^{L/2} \\ \vdots \\ \tilde{\Lambda}_{2^{L/2}-1}^{L/2} \end{pmatrix}, \]
then all the operations in (30) for all domain pairs can be represented by a large matrix-vector multiplication $\tilde{\Lambda}^{L/2} = M^{L/2} \Lambda^{L/2}$, where
\[ M^{L/2} = \begin{pmatrix} M_{0,0}^{L/2} & M_{0,1}^{L/2} & \cdots & M_{0, 2^{L/2}-1}^{L/2} \\ M_{1,0}^{L/2} & M_{1,1}^{L/2} & & M_{1, 2^{L/2}-1}^{L/2} \\ \vdots & & \ddots & \vdots \\ M_{2^{L/2}-1, 0}^{L/2} & M_{2^{L/2}-1, 1}^{L/2} & \cdots & M_{2^{L/2}-1, 2^{L/2}-1}^{L/2} \end{pmatrix} \in \mathbb{C}^{r_\epsilon N \times r_\epsilon N} \]
and $M_{i,j}^{L/2} \in \mathbb{C}^{r_\epsilon \sqrt{N} \times r_\epsilon \sqrt{N}}$ is itself a $2^{L/2} \times 2^{L/2}$ block matrix whose only nonzero block, at location $(j, i)$, is $\tilde{M}_{i,j}^{L/2} \in \mathbb{C}^{r_\epsilon \times r_\epsilon}$. By abuse of notation, we drop the tilde $\tilde{\cdot}$ on the expansion coefficients in the second half of the algorithm description.

5. Recursion.
Similarly to the discussion in Step 3, we go up in the tree $T_\Omega$ and down in the tree $T_X$ at the same time for all levels $\ell = L/2+1,\dots,L$. At each level $\ell$, for any domain pair $(A,B) = (A^\ell_i, B^{L-\ell}_j)$ with $i = 0,1,\dots,2^\ell-1$ and $j = 0,1,\dots,2^{L-\ell}-1$, we construct the new expansion coefficients $\{\lambda^{AB}_t\}_{1\le t\le r_\epsilon}$ by
$$\lambda^{AB}_t := \sum_{C\succ B} e^{2\pi\imath\Phi(g^A_t, c_C)}\sum_{s=1}^{r_\epsilon}\left( M^P_s(g^A_t)\, e^{-2\pi\imath\Phi(g^P_s, c_C)}\,\lambda^{PC}_s\right), \qquad (31)$$
where again $P$ is $A$'s parent and $C$ is a child interval of $B$. The coefficients are assembled as
$$\Lambda^\ell_{i,j} = \begin{pmatrix} \lambda^{AB}_1 \\ \vdots \\ \lambda^{AB}_{r_\epsilon} \end{pmatrix}, \quad \Lambda^\ell_i = \begin{pmatrix} \Lambda^\ell_{i,0} \\ \vdots \\ \Lambda^\ell_{i,2^{L-\ell}-1} \end{pmatrix}, \quad \Lambda^\ell = \begin{pmatrix} \Lambda^\ell_0 \\ \vdots \\ \Lambda^\ell_{2^\ell-1} \end{pmatrix},$$
for $i = 0,1,\dots,2^\ell-1$ and $j = 0,1,\dots,2^{L-\ell}-1$. The linear transformation in (31) maps $\Lambda^{\ell-1}_{i,2j}$ and $\Lambda^{\ell-1}_{i,2j+1}$ at the previous level $\ell-1$ to $\Lambda^\ell_{2i,j}$ if $A$ is the first child of $P$ (or to $\Lambda^\ell_{2i+1,j}$ if $A$ is the second child of $P$) at the current level $\ell$, for all $i = 0,1,\dots,2^{\ell-1}-1$ and $j = 0,1,\dots,2^{L-\ell}-1$. Hence, for each pair $(i,j)$, the linear transformation in (31) can be written as a matrix-vector multiplication
$$\Lambda^\ell_{2i,j} = \begin{pmatrix} G^{\ell-1}_{2i,2j} & G^{\ell-1}_{2i,2j+1} \end{pmatrix}\begin{pmatrix} \Lambda^{\ell-1}_{i,2j} \\ \Lambda^{\ell-1}_{i,2j+1} \end{pmatrix}, \qquad (32)$$
where $G^{\ell-1}_{2i,2j}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$ and $G^{\ell-1}_{2i,2j+1}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$, or
$$\Lambda^\ell_{2i+1,j} = \begin{pmatrix} G^{\ell-1}_{2i+1,2j} & G^{\ell-1}_{2i+1,2j+1} \end{pmatrix}\begin{pmatrix} \Lambda^{\ell-1}_{i,2j} \\ \Lambda^{\ell-1}_{i,2j+1} \end{pmatrix}, \qquad (33)$$
where $G^{\ell-1}_{2i+1,2j}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$ and $G^{\ell-1}_{2i+1,2j+1}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$. Assembling the small matrices $\{G^{\ell-1}_{i,j}\}$ into a block matrix $G^{\ell-1}_i$ for each $i = 0,1,\dots,2^{\ell-1}-1$ and stacking the results into the block-diagonal matrix
$$G^{\ell-1} = \operatorname{diag}\left( G^{\ell-1}_0, G^{\ell-1}_1, \dots, G^{\ell-1}_{2^{\ell-1}-1}\right),$$
all the operations in this step can be represented by a large matrix-vector multiplication $\Lambda^\ell = G^{\ell-1}\Lambda^{\ell-1}$ for $\ell = L/2+1,\dots,L$.

6. Termination.
Finally, we reach level $\ell = L$ and set $B = \Omega$. For each leaf interval $A = A^L_i\in T_X$, $i = 0,1,\dots,2^L-1$, we use the expansion coefficients $\Lambda^L$ constructed in the last step to evaluate $u_B(x)$ for each $x\in A$:
$$u(x) = u_B(x) = \sum_{t=1}^{r_\epsilon}\alpha^{AB}_t(x)\lambda^{AB}_t = e^{2\pi\imath\Phi(x,c_B)}\sum_{t=1}^{r_\epsilon}\left( M^A_t(x)\, e^{-2\pi\imath\Phi(g^A_t, c_B)}\,\lambda^{AB}_t\right). \qquad (34)$$
For each domain pair $(A,B) = (A^L_i, \Omega)$, let $U^L_i\in\mathbb{C}^{O(1)\times r_\epsilon}$ be the matrix representing the linear transformation in (34). Define
$$U^L = \operatorname{diag}\left\{ U^L_0, U^L_1, \dots, U^L_{2^L-1}\right\}.$$
By the same argument as in the initialization step, the matrix-vector multiplication form of all the operations in this step is $u = U^L\Lambda^L$.

In sum, by the discussion above, $K$ is approximated by the preliminary IBF as
$$K \approx U^LG^{L-1}\cdots G^{L/2}M^{L/2}(H^{L/2})^*\cdots(H^1)^*(V^0)^*.$$
Since the total number of nonzero entries of the above sparse factors is $O(r_\epsilon^2 N\log N)$ and each factor can be constructed explicitly by Chebyshev interpolation, the operation and memory complexity of constructing and applying the preliminary IBF is $O(r_\epsilon^2 N\log N)$.

4 Optimal interpolative butterfly factorization
Recall that at each level $\ell$ the kernel matrix $K$ restricted to a domain pair $(A,B) = (A^\ell_i, B^{L-\ell}_j)$, denoted $K^\ell_{ij}$, is a numerically low-rank matrix. For a given $\epsilon$, let $r(\epsilon,\ell,i,j)$ be the numerical rank provided by a truncated SVD of $K^\ell_{ij}$; with abuse of notation, we write $r$ for simplicity. Similarly, let $r_\epsilon$ denote the numerical rank provided by the low-rank approximation by Chebyshev interpolation. Since $r_\epsilon$ may fail to reveal the optimal numerical rank $r$, i.e., $r_\epsilon > r$, the prefactor in the complexity of the preliminary IBF might be far from the optimal one.

This observation motivates the design of the novel sweeping matrix compression based on a sequence of structure-preserving matrix compressions. The sweeping matrix compression further compresses the preliminary IBF into an optimal one. The main idea is to propagate the low-rank property among matrix factors in the preliminary IBF and to shrink the size of the dense submatrices in these matrix factors by the randomized low-rank approximation in Section 2. The structure-preserving matrix compression is introduced in Section 4.1. The sweeping matrix compression consists of two stages, the sweep-out stage and the sweep-in stage, presented in Section 4.2 and Section 4.3, respectively.

4.1 Structure-preserving matrix compression

One key idea in constructing the optimal IBF is the sweeping matrix compression via a sequence of structure-preserving matrix compressions introduced below. Suppose $S$ is a block matrix with $m\times k$ blocks, i.e.,
$$S = \begin{pmatrix} S_{0,0} & S_{0,1} & \cdots & S_{0,k-1} \\ S_{1,0} & S_{1,1} & & S_{1,k-1} \\ \vdots & & \ddots & \vdots \\ S_{m-1,0} & S_{m-1,1} & \cdots & S_{m-1,k-1} \end{pmatrix} \in \mathbb{C}^{mr_1\times kr_1}.$$
For simplicity, we assume that each block in $S$ is of size $r_1\times r_1$; the structure-preserving matrix compression can easily be extended to block matrices with different block sizes. Let $D$ be a block-diagonal matrix with $k$ diagonal blocks, each of size $r_1\times r_2$ with $r_2 < r_1$, i.e.,
$$D = \begin{pmatrix} D_0 & & \\ & \ddots & \\ & & D_{k-1} \end{pmatrix} \in \mathbb{C}^{kr_1\times kr_2}.$$
Let $P$ be the product of $S$ and $D$, i.e.,
$$P = \begin{pmatrix} P_{0,0} & P_{0,1} & \cdots & P_{0,k-1} \\ P_{1,0} & P_{1,1} & & P_{1,k-1} \\ \vdots & & \ddots & \vdots \\ P_{m-1,0} & P_{m-1,1} & \cdots & P_{m-1,k-1} \end{pmatrix} = \begin{pmatrix} S_{0,0}D_0 & S_{0,1}D_1 & \cdots & S_{0,k-1}D_{k-1} \\ S_{1,0}D_0 & S_{1,1}D_1 & & S_{1,k-1}D_{k-1} \\ \vdots & & \ddots & \vdots \\ S_{m-1,0}D_0 & S_{m-1,1}D_1 & \cdots & S_{m-1,k-1}D_{k-1} \end{pmatrix} \in \mathbb{C}^{mr_1\times kr_2}.$$
We say that the matrix pencil $(S,D)$ has the structure-preserving low-rank property of type $(m,k,r_1,r_2)$ if it satisfies the following condition: for each $i = 0,1,\dots,m-1$, there exists a low-rank approximation
$$\begin{pmatrix} P_{i,0} & P_{i,1} & \cdots & P_{i,k-1}\end{pmatrix} \approx \tilde D_i\begin{pmatrix} \tilde S_{i,0} & \tilde S_{i,1} & \cdots & \tilde S_{i,k-1}\end{pmatrix}, \qquad (35)$$
where $\tilde D_i\in\mathbb{C}^{r_1\times r_2}$ and $\tilde S_{i,j}\in\mathbb{C}^{r_2\times r_2}$ for $j = 0,1,\dots,k-1$. Under the assumption of the structure-preserving low-rank property of type $(m,k,r_1,r_2)$, by assembling the low-rank approximations in (35) into two large factors $\tilde D$ and $\tilde S$, the structure-preserving matrix compression of type $(m,k,r_1,r_2)$ factorizes $SD$ as
$$SD \approx \tilde D\tilde S,$$
where $\tilde D$ is a block-diagonal matrix whose diagonal blocks are of size $r_1\times r_2$, $\tilde S$ is a block matrix with $m\times k$ blocks of size $r_2\times r_2$, and $S$ and $\tilde S$ share the same column indices of nonzero blocks in each row block (see Figure 3 for an example).

Figure 3: The structure-preserving matrix compression of type $(4,4,2,1)$, $SD = P\approx\tilde D\tilde S$, where $S\in\mathbb{C}^{8\times 8}$ (left matrix), $P\in\mathbb{C}^{8\times 4}$ (middle matrix), and $\tilde S\in\mathbb{C}^{4\times 4}$ (right matrix) are $4\times 4$ block matrices, while $D\in\mathbb{C}^{8\times 4}$ (middle left matrix) and $\tilde D\in\mathbb{C}^{8\times 4}$ (middle right matrix) are block-diagonal with $4$ diagonal blocks.

4.2 Sweep-out stage

The optimal IBF of $K$ is built in two stages. In the first stage, we further compress the matrix factors in the preliminary IBF,
$$K \approx U^LG^{L-1}\cdots G^hM^h(H^h)^*\cdots(H^1)^*(V^0)^*,$$
sweeping from the middle matrix $M^h$ and moving out towards $U^L$ and $V^0$. Recall that each nonzero submatrix $\tilde M^h_{ij}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$ in $M^h$ is the kernel function $K(x,\xi)$ evaluated at the Chebyshev grid points in the domain pair $(A,B) = (A^h_i, B^h_j)$ at level $h$. Hence, $\tilde M^h_{ij}$ is a numerically low-rank matrix if $r_\epsilon > r$ (which is usually the case in existing butterfly algorithms using Chebyshev interpolation to construct low-rank factorizations). Therefore, $M^h$ can be further compressed via a rank-revealing low-rank approximation of each $\tilde M^h_{ij}$, e.g., the truncated SVD, into $M^h\approx C^h\bar M^h(R^h)^*$, resulting in the middle level factorization
$$K \approx U^LG^{L-1}\cdots G^hC^h\bar M^h(R^h)^*(H^h)^*\cdots(H^1)^*(V^0)^*. \qquad (36)$$
The middle level factorization is described in detail in Section 4.2.1. Next, we recursively factorize
$$G^\ell C^\ell \approx C^{\ell+1}\bar G^\ell \quad\text{for } \ell = h, h+1,\dots,L-1, \qquad (H^\ell R^\ell)^* \approx (\bar H^\ell)^*(R^{\ell-1})^* \quad\text{for } \ell = h, h-1,\dots,1,$$
since the matrix pencils $(G^\ell, C^\ell)$ and $(H^\ell, R^\ell)$ satisfy the structure-preserving low-rank property. In other words, $C^\ell$ and $R^\ell$ propagate the low-rank property of $M^h$ to $G^\ell$ and $H^\ell$, respectively, so that we can further compress $G^\ell$ and $H^\ell$. After this recursive factorization, let $\bar U^L = U^LC^L$ and $\bar V^0 = V^0R^0$; then one arrives at a more compressed IBF of $K$:
$$K \approx \bar U^L\bar G^{L-1}\cdots\bar G^h\bar M^h(\bar H^h)^*\cdots(\bar H^1)^*(\bar V^0)^*, \qquad (37)$$
where all factors are sparse matrices with almost $O(r^2N)$ nonzero entries. We refer to this stage as the recursive factorization; it is discussed in detail in Sections 4.2.2 and 4.2.3.

4.2.1 Middle level factorization

Recall that we denote the $i$th node at level $\ell$ in $T_X$ by $A^\ell_i$ for $i = 0,1,\dots,2^\ell-1$, and the $j$th node at level $L-\ell$ in $T_\Omega$ by $B^{L-\ell}_j$ for $j = 0,1,\dots,2^{L-\ell}-1$. Let $h = L/2$ and $m = 2^{L/2}$. The nonzero submatrix $\tilde M^h_{ij}$ in $M^h$ is a matrix representation of the kernel function $K(x,\xi)$ evaluated at the Chebyshev grid points $\{g^A_t\}_t$ and $\{g^B_s\}_s$ for the domain pair $(A,B) = (A^h_i, B^h_j)$ at the middle level $\ell = h$; hence $\tilde M^h_{ij}\in\mathbb{C}^{r_\epsilon\times r_\epsilon}$ is of numerical rank $r$. A rank-$r$ approximation to every $\tilde M^h_{i,j}$ is computed by the SVD algorithm via random sampling in [13, 34]. In fact, when $r_\epsilon$ is already very small, a direct method for SVD truncation is efficient as well. Once the approximate SVD of $\tilde M^h_{i,j}$ is ready, it is transformed into the form
$$\tilde M^h_{i,j} \approx C^h_{i,j}S^h_{i,j}(R^h_{j,i})^*$$
following (6). We would like to emphasize that the columns of $C^h_{i,j}$ and $R^h_{j,i}$ are scaled with the singular values of the approximate SVD so that they keep track of the importance of these columns in approximating $\tilde M^h_{i,j}$.

After calculating the approximate rank-$r$ factorization of each $\tilde M^h_{i,j}$, we assemble these factors into three block matrices $C^h$, $\bar M^h$, and $R^h$ as follows:
$$M^h = \begin{pmatrix} M^h_{0,0} & \cdots & M^h_{0,m-1} \\ \vdots & \ddots & \vdots \\ M^h_{m-1,0} & \cdots & M^h_{m-1,m-1}\end{pmatrix} = \begin{pmatrix} C^h_0 & & \\ & \ddots & \\ & & C^h_{m-1}\end{pmatrix}\begin{pmatrix} \bar M^h_{0,0} & \cdots & \bar M^h_{0,m-1} \\ \vdots & \ddots & \vdots \\ \bar M^h_{m-1,0} & \cdots & \bar M^h_{m-1,m-1}\end{pmatrix}\begin{pmatrix} (R^h_0)^* & & \\ & \ddots & \\ & & (R^h_{m-1})^*\end{pmatrix} = C^h\bar M^h(R^h)^*, \qquad (38)$$
where
$$C^h_i = \operatorname{diag}\left( C^h_{i,0}, C^h_{i,1}, \dots, C^h_{i,m-1}\right)\in\mathbb{C}^{mr_\epsilon\times mr}, \qquad (39)$$
$$R^h_j = \operatorname{diag}\left( R^h_{j,0}, R^h_{j,1}, \dots, R^h_{j,m-1}\right)\in\mathbb{C}^{mr_\epsilon\times mr}, \qquad (40)$$
and $\bar M^h_{i,j}\in\mathbb{C}^{mr\times mr}$ is itself an $m\times m$ block matrix with blocks of size $r\times r$, where all blocks are zero except that the $(j,i)$ block equals the diagonal matrix $S^h_{i,j}\in\mathbb{C}^{r\times r}$. It is obvious that there are only $O(rN)$ nonzero entries in $\bar M^h$ and $O(r_\epsilon rN)$ nonzero entries in $C^h$ and $R^h$. See Figure 4 for an example of a middle level factorization of a $64\times 64$ matrix with $r = 1$ and $r_\epsilon = 4$.

Figure 4: The middle level factorization of a $64\times 64$ matrix, $M\approx C\bar M(R)^*$, assuming $r = 1$ and $r_\epsilon = 4$. Grey blocks indicate nonzero blocks. $C$ and $R$ are block-diagonal matrices with 16 blocks of size $4\times 1$. The diagonal blocks of $C$ and $R$ are assembled according to Equations (39) and (40), respectively, as indicated by black rectangles. $\bar M$ is a $4\times 4$ block matrix with each block $\bar M_{i,j}$ itself a $4\times 4$ block matrix whose only nonzero block is the $(j,i)$ block.

4.2.2 Recursive factorization towards $U^L$

Each recursive factorization at level $\ell$,
$$G^\ell C^\ell \approx C^{\ell+1}\bar G^\ell, \qquad (41)$$
results from the structure-preserving low-rank property, which originates from the low-rank property of $K^\ell_{i,j}$ for all index pairs $(i,j)$. At level $\ell$, recall that
$$G^\ell = \operatorname{diag}\left( G^\ell_0, G^\ell_1, \dots, G^\ell_{2^{\ell}-1}\right),$$
where each $G^\ell_i$ assembles the blocks $G^\ell_{2i,2j}$, $G^\ell_{2i,2j+1}$, $G^\ell_{2i+1,2j}$, and $G^\ell_{2i+1,2j+1}$ for $j = 0,1,\dots,2^{L-\ell-1}-1$. By construction, $\begin{pmatrix} G^\ell_{2i,2j} & G^\ell_{2i,2j+1}\end{pmatrix}$ is an interpolation matrix that interpolates the kernel function $K(x,\xi)$ from the Chebyshev grid points in the domain pairs $(A^\ell_i, B^{L-\ell}_{2j})$ and $(A^\ell_i, B^{L-\ell}_{2j+1})$ to the points in the domain pair $(A^{\ell+1}_{2i}, B^{L-\ell-1}_j)$. Similarly, $\begin{pmatrix} G^\ell_{2i+1,2j} & G^\ell_{2i+1,2j+1}\end{pmatrix}$ interpolates the kernel function $K(x,\xi)$ from $(A^\ell_i, B^{L-\ell}_{2j})$ and $(A^\ell_i, B^{L-\ell}_{2j+1})$ to $(A^{\ell+1}_{2i+1}, B^{L-\ell-1}_j)$. At level $\ell = h$,
$$C^h = \operatorname{diag}\left( C^h_0, \dots, C^h_{m-1}\right), \qquad C^h_i = \operatorname{diag}\left( C^h_{i,0}, C^h_{i,1}, \dots, C^h_{i,m-1}\right)\in\mathbb{C}^{r_\epsilon m\times rm}.$$
By construction, the column space of $C^h_{i,j}$ comes from the column space of $K^h_{i,j}$ for all $i = 0,1,\dots,m-1$ and $j = 0,1,\dots,m-1$. Hence, $\begin{pmatrix} G^h_{2i,2j}C^h_{i,2j} & G^h_{2i,2j+1}C^h_{i,2j+1}\end{pmatrix}$ represents the column space of $K^{h+1}_{2i,j}$, which implies that $\begin{pmatrix} G^h_{2i,2j}C^h_{i,2j} & G^h_{2i,2j+1}C^h_{i,2j+1}\end{pmatrix}$ is of numerical rank $r$. By the randomized low-rank approximation, we have
$$\begin{pmatrix} G^h_{2i,2j}C^h_{i,2j} & G^h_{2i,2j+1}C^h_{i,2j+1}\end{pmatrix} \approx C^{h+1}_{2i,j}\begin{pmatrix} \bar G^h_{2i,2j} & \bar G^h_{2i,2j+1}\end{pmatrix},$$
where $C^{h+1}_{2i,j}\in\mathbb{C}^{r_\epsilon\times r}$, $\bar G^h_{2i,2j}\in\mathbb{C}^{r\times r}$, and $\bar G^h_{2i,2j+1}\in\mathbb{C}^{r\times r}$. By a similar argument, we have
$$\begin{pmatrix} G^h_{2i+1,2j}C^h_{i,2j} & G^h_{2i+1,2j+1}C^h_{i,2j+1}\end{pmatrix} \approx C^{h+1}_{2i+1,j}\begin{pmatrix} \bar G^h_{2i+1,2j} & \bar G^h_{2i+1,2j+1}\end{pmatrix},$$
where $C^{h+1}_{2i+1,j}\in\mathbb{C}^{r_\epsilon\times r}$, $\bar G^h_{2i+1,2j}\in\mathbb{C}^{r\times r}$, and $\bar G^h_{2i+1,2j+1}\in\mathbb{C}^{r\times r}$. Hence, the matrix pencil $(G^h, C^h)$ satisfies the conditions of the structure-preserving low-rank property. Assembling $\{\bar G^h_{i,j}\}$ and $\{C^{h+1}_{i,j}\}$ together to generate $\bar G^h\in\mathbb{C}^{rN\times rN}$ and $C^{h+1}\in\mathbb{C}^{r_\epsilon N\times rN}$, respectively, in the same way as generating $G^h$ and $C^h$, we have $G^hC^h\approx C^{h+1}\bar G^h$.

At the other levels $\ell = h+1,\dots,L-1$, similarly to the discussion above, the matrix pencil $(G^\ell, C^\ell)$ satisfies the structure-preserving low-rank property. By assembling the results of the randomized low-rank approximations, we have $G^\ell C^\ell\approx C^{\ell+1}\bar G^\ell$ for $\ell = h,\dots,L-1$. At $\ell = L$, both $U^L$ and $C^L$ are block-diagonal matrices, so we simply multiply them together and let $\bar U^L = U^LC^L$. After the recursive factorization of each $G^\ell C^\ell$, we have
$$K \approx \bar U^L\bar G^{L-1}\cdots\bar G^h\bar M^h(R^h)^*(H^h)^*\cdots(H^1)^*(V^0)^*, \qquad (42)$$
where each $\bar G^\ell$ contains only $O(r^2N)$ nonzero entries and $\bar U^L$ contains $O(rN)$ nonzero entries.

4.2.3 Recursive factorization towards $V^0$

The recursive factorization towards $V^0$ is similar to the one towards $U^L$. In each step of the factorization
$$(H^\ell R^\ell)^* \approx (\bar H^\ell)^*(R^{\ell-1})^* \qquad (43)$$
for all $\ell = h, h-1,\dots,1$, we take advantage of the structure-preserving low-rank property of the matrix pencil $(H^\ell, R^\ell)$. Applying the same procedure as in Section 4.2.2 to $(H^\ell, R^\ell)$ leads to the recursive factorization in (43). Define
$$\bar V^0 = V^0R^0; \qquad (44)$$
then we have
$$(R^h)^*(H^h)^*\cdots(H^1)^*(V^0)^* \approx (\bar H^h)^*\cdots(\bar H^1)^*(\bar V^0)^*,$$
where each $\bar H^\ell$ contains only $O(r^2N)$ nonzero entries and $\bar V^0$ contains $O(rN)$ nonzero entries. After the recursive factorization sweeping from the middle matrix towards $U^L$ and $V^0$, we reach a more compressed IBF
$$K \approx \bar U^L\bar G^{L-1}\cdots\bar G^h\bar M^h(\bar H^h)^*\cdots(\bar H^1)^*(\bar V^0)^*. \qquad (45)$$

4.3 Sweep-in stage

If the points in the sets $X$ and $\Omega$ are distributed uniformly, the IBF in (45) is already optimal in the butterfly factorization scheme, i.e., nearly all dense submatrices in its factors are of size $r\times r$, where $r$ is the numerical rank of the kernel function $K(x,\xi)$ sampled uniformly in a domain pair $(A,B)\in T_X\times T_\Omega$. However, when the point sets are nonuniform, e.g., in the nonuniform FFT, the number of samples in $(A,B)$ might be far smaller than $r$. This means that there might be dense submatrices of size less than $r\times r$ in the block-diagonal matrices $\bar U^L$ and $\bar V^0$. This motivates a sequence of structure-preserving low-rank matrix compressions to further compress the IBF in (45), sweeping from the outer matrices and moving towards the middle matrix $\bar M^h$, as follows:
$$\bar U^L \approx \dot U^LC^L \quad\text{and}\quad C^{\ell+1}\bar G^\ell \approx \dot G^\ell C^\ell \quad\text{for } \ell = L-1, L-2,\dots,h;$$
similarly, we have
$$(\bar V^0)^* \approx (R^0)^*(\dot V^0)^* \quad\text{and}\quad (\bar H^\ell)^*(R^{\ell-1})^* \approx (R^\ell)^*(\dot H^\ell)^* \quad\text{for } \ell = 1,2,\dots,h.$$
The sweeping matrix compression above is due to the fact that the matrix pencils $((\bar G^\ell)^*, (C^{\ell+1})^*)$ and $((\bar H^\ell)^*, (R^{\ell-1})^*)$ satisfy the structure-preserving low-rank property.
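Each of these sweeping steps is an instance of the structure-preserving compression of Section 4.1: one truncated SVD per block row of $P = SD$, with the zero blocks of $S$ automatically preserved in $\tilde S$. The following minimal sketch is illustrative Python/NumPy, not the authors' MATLAB code; the butterfly-style sparsity pattern and the rank-1 blocks are assumptions chosen only so the rank-$r_2$ truncation is exact in the demo:

```python
import numpy as np

def structure_preserving_compress(S_blocks, D_blocks, r2):
    """Factorize P = S D as P ~= Dt St with one truncated SVD per block row.

    S_blocks[i][j] : (r1, r1) block of S (an all-zero array where S has a zero block)
    D_blocks[j]    : (r1, r2) diagonal block of the block-diagonal D
    Returns the diagonal blocks of Dt and the block rows of St.
    """
    m, k = len(S_blocks), len(D_blocks)
    Dt_blocks, St_blocks = [], []
    for i in range(m):
        # i-th block row of P = S D; zero blocks of S give zero blocks of P
        P_row = np.hstack([S_blocks[i][j] @ D_blocks[j] for j in range(k)])
        U, s, Vh = np.linalg.svd(P_row, full_matrices=False)
        Dt_blocks.append(U[:, :r2] * s[:r2])  # r1 x r2, scaled by singular values
        St_blocks.append(Vh[:r2, :])          # r2 x (k r2); zero columns stay zero
    return Dt_blocks, St_blocks

rng = np.random.default_rng(0)
m = k = 4
r1, r2 = 4, 1
# rank-1 nonzero blocks at two butterfly-style positions per block row
S_blocks = [[np.zeros((r1, r1)) for _ in range(k)] for _ in range(m)]
for i in range(m):
    u = rng.standard_normal((r1, 1))
    for j in (2 * (i % 2), 2 * (i % 2) + 1):
        S_blocks[i][j] = u @ rng.standard_normal((1, r1))
D_blocks = [rng.standard_normal((r1, r2)) for _ in range(k)]

Dt, St = structure_preserving_compress(S_blocks, D_blocks, r2)
P = np.vstack([np.hstack([S_blocks[i][j] @ D_blocks[j] for j in range(k)])
               for i in range(m)])
P_approx = np.vstack([Dt[i] @ St[i] for i in range(m)])
print(np.linalg.norm(P - P_approx))  # close to machine precision here
```

Note that the sparsity preservation costs nothing: a zero block of $S$ yields a zero block of $P$, and a zero block column of the row is mapped to a zero block of $\tilde S$ by the SVD.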
In other words, $C^\ell$ and $R^\ell$ propagate the low-rank properties of $\bar U^L$ and $\bar V^0$ to $\bar G^\ell$ and $\bar H^\ell$, respectively, so that we can further compress $\bar G^\ell$ and $\bar H^\ell$. After this recursive factorization, let $\dot M^h = C^h\bar M^h(R^h)^*$; then one reaches the optimal IBF of $K$:
$$K \approx \dot U^L\dot G^{L-1}\cdots\dot G^h\dot M^h(\dot H^h)^*\cdots(\dot H^1)^*(\dot V^0)^*, \qquad (46)$$
where all factors are sparse matrices with $O(r^2N)$ or fewer nonzero entries.

For a given input vector $g\in\mathbb{C}^N$, the matrix-vector multiplication $u = Kg$, which costs $O(N^2)$ operations if carried out directly, can be approximated by a sequence of $O(\log N)$ sparse matrix-vector multiplications given by the optimal IBF. Since computing the factors in the preliminary IBF takes only $O(r_\epsilon^2 N\log N)$ operations and there are only $O(r^2N\log N)$ nonzero entries in the optimal IBF, the construction and application complexities are $O(r_\epsilon^2 N\log N)$ and $O(r^2N\log N)$, respectively. Since all the matrix factors in the preliminary IBF can be generated explicitly, the peak memory complexity $O(r_\epsilon^2 N\log N)$ occurs when the preliminary IBF is completed.

Although we have limited the discussion to one-dimensional problems so far, the interpolative butterfly factorization, along with its construction algorithm, can be easily generalized to higher dimensions. By the theorems in [7, 22], a multidimensional kernel function $K(x,\xi)$ satisfying Assumption 1.1 is complementary low-rank (e.g., the nonuniform FFT). In this case, similarly to Section 3, by writing the multidimensional butterfly algorithm [7, 22] in matrix factorization form, we obtain a preliminary multidimensional IBF
$$K \approx U^LG^{L-1}\cdots G^hM^h(H^h)^*\cdots(H^1)^*(V^0)^*,$$
where the depth $L = O(\log N)$ of $T_X$ and $T_\Omega$ is assumed to be even and $h = L/2$; each factor has $O(N)$ nonzero entries with a large prefactor.
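In any dimension, applying such a factorization is simply a chain of matrix-vector products evaluated right to left; with sparse factors, each product costs only as much as the number of nonzeros in that factor. A minimal sketch (illustrative Python with dense stand-ins for the sparse factors):

```python
import numpy as np

def apply_factorization(factors, g):
    """Apply K ~= F_0 F_1 ... F_{k-1} to g as a chain of matrix-vector
    products, rightmost factor first; with sparse factors this costs
    O(nnz) operations per factor instead of O(N^2) for a dense K."""
    u = g
    for F in reversed(factors):
        u = F @ u
    return u

rng = np.random.default_rng(0)
N = 256
factors = [rng.standard_normal((N, N)) / np.sqrt(N) for _ in range(5)]

K = factors[0]
for F in factors[1:]:
    K = K @ F                 # dense product, only to check the chain below

g = rng.standard_normal(N)
print(np.allclose(apply_factorization(factors, g), K @ g))  # True
```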
The preliminary IBF can be further compressed by the sweeping matrix compression to obtain the optimal IBF
$$K \approx \dot U^L\dot G^{L-1}\cdots\dot G^h\dot M^h(\dot H^h)^*\cdots(\dot H^1)^*(\dot V^0)^*.$$
However, many important multidimensional kernel matrices fail to satisfy Assumption 1.1 and are not complementary low-rank in the entire domain $X\times\Omega$. The most significant example, the multidimensional Fourier integral operator, typically has a singularity at $\xi = 0$ in the $\Omega$ domain. Fortunately, it has been proved that this kind of kernel function satisfies the complementary low-rank property in the domain away from $\xi = 0$. A multiscale interpolative butterfly factorization (MIBF) hierarchically partitions the domain $\Omega$ into subdomains $\{\Omega_t\}_t$ excluding the singular point $\xi = 0$ and applies the multidimensional IBF to the kernel restricted to each subdomain pair $X\times\Omega_t$. To be more specific, in two-dimensional problems, suppose
$$\Omega = \left\{ \xi = (n_1, n_2),\ -N^{1/2}/2 \le n_1, n_2 < N^{1/2}/2 \text{ with } n_1, n_2\in\mathbb{Z}\right\}.$$
Define
$$\Omega_t = \left\{ (\xi_1,\xi_2) : N^{1/2}/2^{t+2} < \max(|\xi_1|,|\xi_2|) \le N^{1/2}/2^{t+1}\right\}\cap\Omega, \qquad (47)$$
for $t = 0,1,\dots,\log n - s$, where $n = N^{1/2}$ and $s$ is a small constant, and let $\Omega_C = \Omega\setminus\cup_t\Omega_t$. Equation (47) is a corona decomposition of $\Omega$, where each $\Omega_t$ is a corona subdomain and $\Omega_C$ is a square subdomain at the center containing $O(1)$ points. The FIO kernel function satisfies the complementary low-rank property when it is restricted to each subdomain $X\times\Omega_t$, as proved in [22]. Hence, the MIBF evaluates
$$u(x) = \sum_{\xi\in\Omega} e^{2\pi\imath\Phi(x,\xi)}g(\xi)$$
via a multiscale summation,
$$u(x) = u_C(x) + \sum_{t=0}^{\log n - s} u_t(x) = \sum_{\xi\in\Omega_C} e^{2\pi\imath\Phi(x,\xi)}g(\xi) + \sum_{t=0}^{\log n - s}\sum_{\xi\in\Omega_t} e^{2\pi\imath\Phi(x,\xi)}g(\xi). \qquad (48)$$
For each $t$, the MIBF evaluates $u_t(x) = \sum_{\xi\in\Omega_t}e^{2\pi\imath\Phi(x,\xi)}g(\xi)$ with a standard multidimensional IBF, and the final piece $u_C(x)$ is evaluated directly in $O(N)$ operations.
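The corona decomposition (47) is easy to generate explicitly. The sketch below (illustrative Python; the grid size n and the cutoff s are assumptions chosen for the demo) builds index masks for the coronas and the center square and checks that they partition Omega:

```python
import numpy as np

# frequency grid Omega = {(n1, n2) : -n/2 <= n1, n2 < n/2}, n = N^(1/2)
n = 64                                     # demo size, so N = n*n = 4096
n1, n2 = np.meshgrid(np.arange(-n // 2, n // 2),
                     np.arange(-n // 2, n // 2), indexing="ij")
mx = np.maximum(np.abs(n1), np.abs(n2))    # max(|xi_1|, |xi_2|)

s = 2                                      # stop once the center square is O(1)
coronas = []
for t in range(int(np.log2(n)) - s + 1):
    # Omega_t : n / 2^(t+2) < max(|xi_1|, |xi_2|) <= n / 2^(t+1), cf. (47)
    coronas.append((mx > n / 2 ** (t + 2)) & (mx <= n / 2 ** (t + 1)))
# Omega_C : the remaining center square around xi = 0
center = mx <= n / 2 ** (len(coronas) + 1)

masks = coronas + [center]
counts = [int(mask.sum()) for mask in masks]
print(counts, sum(counts) == n * n)
```

The multiscale summation (48) then amounts to restricting $g$ to each mask and adding up the per-subdomain evaluations.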
Let $K_t$ and $K_C$ denote the matrix representations of the kernel $K(x,\xi)$ restricted to $X\times\Omega_t$ and $X\times\Omega_C$, respectively. Then, by the standard multidimensional IBF, we have
$$K_t \approx \dot U^{L_t}_t\dot G^{L_t-1}_t\cdots\dot G^{h_t}_t\dot M^{h_t}_t\left(\dot H^{h_t}_t\right)^*\cdots\left(\dot H^1_t\right)^*\left(\dot V^0_t\right)^*.$$
Once we have computed the optimal IBF in each restricted domain, the multiscale summation in (48) is approximated by
$$u = Kg \approx K_CR_Cg + \sum_{t=0}^{\log n-s}\dot U^{L_t}_t\dot G^{L_t-1}_t\cdots\dot M^{h_t}_t\cdots\left(\dot H^1_t\right)^*\left(\dot V^0_t\right)^*R_tg, \qquad (49)$$
where $R_C$ and $R_t$ are the restriction operators to the domains $\Omega_C$ and $\Omega_t$, respectively. The construction and application complexity of the MIBF is $O(N\log N)$ with an optimally small prefactor in the butterfly scheme.

This section presents several numerical examples to demonstrate the efficiency of the interpolative butterfly factorization. The numerical results were obtained on a single node of a server cluster with quad-socket Intel Xeon CPUs E5-4640 @ 2.40GHz (8 cores/socket) and 1.5 TB RAM. All implementations are in MATLAB and can be found on the authors' homepages.

Let $\{u^d(x), x\in X\}$, $\{u^i(x), x\in X\}$, and $\{u^m(x), x\in X\}$ denote the results given by the direct matrix-vector multiplication, the interpolative butterfly factorization (IBF), and the multiscale interpolative butterfly factorization (MIBF), respectively. The accuracy of applying the IBF and the MIBF is estimated by the relative errors
$$\epsilon^i = \sqrt{\frac{\sum_{x\in S}|u^i(x)-u^d(x)|^2}{\sum_{x\in S}|u^d(x)|^2}} \quad\text{and}\quad \epsilon^m = \sqrt{\frac{\sum_{x\in S}|u^m(x)-u^d(x)|^2}{\sum_{x\in S}|u^d(x)|^2}}, \qquad (50)$$
where $S$ is a point set of size 256 randomly sampled from $X$.
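The sampled estimator (50) only needs the two outputs on the small random set S. A hedged sketch (illustrative Python; the synthetic vectors stand in for the direct and factorized results):

```python
import numpy as np

def sampled_relative_error(u_approx, u_direct, sample):
    """Sampled relative error on the index set `sample`, cf. (50)."""
    num = np.sum(np.abs(u_approx[sample] - u_direct[sample]) ** 2)
    den = np.sum(np.abs(u_direct[sample]) ** 2)
    return np.sqrt(num / den)

rng = np.random.default_rng(0)
N = 4096
u_d = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # "direct" result
# synthetic "factorized" result with a relative perturbation of about 1e-8
u_i = u_d + 1e-8 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

S = rng.choice(N, size=256, replace=False)                  # sample set S
print(sampled_relative_error(u_i, u_d, S))                  # on the order of 1e-8
```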
Meanwhile, let $R_{comp}$ denote the compression ratio of the optimal interpolative butterfly factorization against the preliminary interpolative butterfly factorization, defined as
$$R_{comp} = \frac{\text{Memory usage of the preliminary interpolative butterfly factorization}}{\text{Memory usage of the optimal interpolative butterfly factorization}}. \qquad (51)$$
$R_{comp}$ accurately evaluates the rank compression ratio of the optimal interpolative butterfly factorization without loss of accuracy. Although we define $R_{comp}$ as a ratio of memory usage, it also reflects the ratio of running times: since the application of the IBF is a sequence of matrix-vector multiplications, the total running time depends linearly on the number of nonzeros, which in turn depends linearly on the memory usage. Therefore, $R_{comp}$ also equals the running time of the preliminary IBF divided by that of the optimal IBF.

We first present the numerical results for the most widely used Fourier integral operator, the Fourier transform. More specifically, we focus on the one-dimensional nonuniform Fourier transform of type I,
$$\hat u(\xi_i) = \sum_{x_j} e^{-2\pi\imath x_j\xi_i}u(x_j), \qquad i,j = 1,2,\dots,N, \qquad (52)$$
where $\{x_j\}$ are $N$ random points in $[0,1)$, each drawn from the uniform distribution on $[0,1)$, and $\xi_i = i - 1 - N/2$. The values of the input function $\{u(x_j)\}_{j=1}^N$ are randomly generated. Table 1 summarizes the results of this example for varying problem sizes $N$ and numbers of Chebyshev points $r_\epsilon$. In Table 2, we further provide a detailed comparison between the application cost of the IBF and uniform/nonuniform FFTs.

The results in Table 1 reflect the $O(N\log N)$ complexity of both the construction and the application. The relative error increases slightly as the problem size $N$ increases. In general, the nonuniform FFT requires more effort to interpolate the irregular point distribution, which means there is an underlying penalty factor coming from the problem itself. Based on the numbers in Table 1, the penalty factor is on average 9 for approximations with accuracy 1e-3 and 25 for approximations with accuracy 1e-7. This implies that if the proposed algorithm is well implemented and the code is deeply optimized, the application of the IBF for the nonuniform Fourier transform is about 9 and 25 times slower than the FFT for approximation accuracies 1e-3 and 1e-7, respectively. Table 2 provides concrete running time comparisons. The actual time penalty over the NUFFT [15] is on average about 3 and 6. As we shall discuss later, the IBF would have better scalability in distributed and parallel computing (future work) than the existing NUFFT framework. Hence, the distributed and parallel IBF could outperform the existing NUFFT framework for large-scale computing.

Our second example is to evaluate a one-dimensional FIO [20] of the following form:
$$u(x) = \int_{\mathbb{R}} e^{2\pi\imath\Phi(x,\xi)}\hat f(\xi)\,d\xi, \qquad (53)$$
where $\hat f$ is the Fourier transform of $f$, and $\Phi(x,\xi)$ is a phase function given by
$$\Phi(x,\xi) = x\cdot\xi + c(x)|\xi|, \qquad c(x) = (2+\sin(2\pi x))/8. \qquad (54)$$

[Table 1 columns: $N, r_\epsilon$ | $\epsilon^i$ | $R_{comp}$ | $T_{Factor}$(min) | $T_d$(sec) | $T_{app}$(sec) | $T_d/T_{app}$]
Table 1: Numerical results for the one-dimensional nonuniform Fourier transform given in (52). $N$ is the problem size; $r_\epsilon$ is the number of Chebyshev points; $\epsilon^i$ is the sampled relative error given in (50); $R_{comp}$ is the compression ratio defined in (51); $T_{Factor}$ is the construction time of the IBF; $T_d$ is the running time of the direct evaluation; $T_{app}$ is the application time of the IBF; $T_d/T_{app}$ is the speedup factor over the direct evaluation.

[Table 2 columns: $N$ | $\epsilon^i$ | $T_{app}$(sec) | $P_{op}$ | $T_{NUFFT}$(sec) | $T_{app}/T_{NUFFT}$]
256      4.35e-04  6.99e-04  6.64e+00  2.25e-04  3.11e+00
1024     7.80e-04  1.78e-03  7.82e+00  6.17e-04  2.88e+00
4096     8.89e-04  7.58e-03  8.65e+00  2.57e-03  2.95e+00
16384    1.09e-03  3.66e-02  9.24e+00  9.88e-03  3.70e+00
65536    1.12e-03  1.68e-01  9.71e+00  4.13e-02  4.07e+00
262144   1.20e-03  7.63e-01  1.01e+01  1.80e-01  4.25e+00
1048576  1.18e-03  3.56e+00  1.04e+01  7.74e-01  4.60e+00
256      3.57e-08  8.67e-04  1.41e+01  2.18e-04  3.97e+00
1024     5.09e-08  3.58e-03  1.93e+01  6.34e-04  5.65e+00
4096     1.02e-07  1.55e-02  2.23e+01  2.59e-03  5.97e+00
16384    1.13e-07  7.59e-02  2.44e+01  1.00e-02  7.57e+00
65536    1.27e-07  3.36e-01  2.57e+01  4.11e-02  8.18e+00
262144   1.34e-07  1.57e+00  2.66e+01  1.74e-01  9.01e+00
1048576  1.43e-07  1.44e+01  2.74e+01  7.45e-01  1.93e+01
Table 2: Numerical comparison between the IBF and the NUFFT [15] for the one-dimensional Fourier transform given in (52). $P_{op}$ is the operator-wise penalty over the uniform FFT [18, 4], i.e., the operation count divided by that of the FFT; $T_{NUFFT}$ is the running time of the NUFFT [15], whose implementation is in Fortran; $T_{app}/T_{NUFFT}$ is the time penalty over the nonuniform FFT.

The discretization of (53) is
$$u(x_i) = \sum_{\xi_j} e^{2\pi\imath\Phi(x_i,\xi_j)}\hat f(\xi_j), \qquad i,j = 1,2,\dots,N, \qquad (55)$$
where $\{x_i\}$ and $\{\xi_j\}$ are points uniformly distributed in $[0,1)$ and $[-N/2, N/2)$, respectively, following
$$x_i = (i-1)/N \quad\text{and}\quad \xi_j = j-1-N/2. \qquad (56)$$
Table 3 summarizes the results of this example for different grid sizes $N$ and numbers of Chebyshev points $r_\epsilon$.

[Table 3 columns: $N, r_\epsilon$ | $\epsilon^i$ | $R_{comp}$ | $T_{Factor}$(min) | $T_d$(sec) | $T_{app}$(sec) | $T_d/T_{app}$]
Table 3: Numerical results for the one-dimensional FIO given in (55); see [7, 22] for the low-rank analysis of this kernel.

In practice, even though the relative error increases slowly as the size of the problem increases due to the accumulation of numerical error, the error stays in the same order. The third column of the table indicates that the compression ratio is around 2.3 for the first group and 2 for the second group. This implies that the sweeping compression procedure in the nearly optimal IBF compresses the factorization by a factor greater than 2, which results in savings in both memory and application time. The savings are greater in higher-dimensional problems, as the previous analysis shows and as the later example confirms. On the time scaling side, both the factorization time and the application time strongly support the complexity analysis. Every time we quadruple the problem size, the factorization time increases on average by a factor of 5, and this increase factor decreases monotonically towards 4. The increase factor for the application time is on average lower but close to 4. The speedup factor over the direct method should appeal to users interested in applying FIOs.

This section presents a numerical example to demonstrate the efficiency of the multiscale interpolative butterfly factorization (MIBF). We revisit a similar example in [21],
$$u(x) = \sum_{\xi\in\Omega} e^{2\pi\imath\Phi(x,\xi)}g(\xi), \qquad x\in X, \qquad (57)$$
with a phase function $\Phi(x,\xi)$ given by
$$\Phi(x,\xi) = x\cdot\xi + \sqrt{c_1^2(x)\xi_1^2 + c_2^2(x)\xi_2^2}, \quad c_1(x) = (2+\sin(2\pi x_1)\sin(2\pi x_2))/16, \quad c_2(x) = (2+\cos(2\pi x_1)\cos(2\pi x_2))/16, \qquad (58)$$
where $X$ and $\Omega$ are defined as
$$X = \left\{ x = \left(\frac{n_1}{N^{1/2}}, \frac{n_2}{N^{1/2}}\right),\ 0\le n_1, n_2 < N^{1/2} \text{ with } n_1, n_2\in\mathbb{Z}\right\} \qquad (59)$$
and
$$\Omega = \left\{ \xi = (n_1, n_2),\ -N^{1/2}/2 \le n_1, n_2 < N^{1/2}/2 \text{ with } n_1, n_2\in\mathbb{Z}\right\}.$$
(60)

In the multiscale decomposition of $\Omega$, we recursively divide $\Omega$ until the center part is of size 16 by 16.

[Table 4 columns: $N, r_\epsilon$ | $\epsilon^m$ | $R_{comp}$ | $T_{Factor}$(min) | $T_d$(sec) | $T_{app}$(sec) | $T_d/T_{app}$]
32, 6    2.52e-03  2.45  7.63e-02  2.45e-01  7.52e-03  3.25e+01
64, 6    4.13e-03  2.47  4.00e-01  2.17e+00  3.71e-02  5.86e+01
128, 6   3.11e-03  2.41  2.19e+00  2.11e+01  2.65e-01  7.98e+01
256, 6   1.71e-02  3.08  1.91e+01  2.61e+02  1.19e+00  2.20e+02
512, 6   5.32e-02  3.35  9.58e+01  4.88e+03  5.59e+00  8.74e+02
32, 9    5.58e-06  3.79  1.77e-01  2.44e-01  1.17e-02  2.08e+01
64, 9    7.21e-06  2.96  1.07e+00  2.27e+00  7.68e-02  2.95e+01
128, 9   6.98e-06  2.66  5.55e+00  2.09e+01  6.12e-01  3.41e+01
256, 9   8.37e-06  3.16  6.34e+01  2.85e+02  8.48e+00  3.36e+01
512, 9   1.23e-05  2.95  3.11e+02  4.79e+03  5.25e+01  9.13e+01
Table 4: Numerical results for the two-dimensional FIO given in (57) by the MIBF.

Table 4 summarizes the results of this example by the MIBF. The results agree with the $O(N\log N)$ complexity analysis. As we double the problem size $N$, the factorization time increases by a factor of 5 on average. Similarly, the actual application time matches the theoretical complexity as well. The relative error is essentially independent of the problem size $N$, and the speedup factor is attractive. Comparing Table 3 and Table 4, we notice that $R_{comp}$ in two dimensions is larger than that in one dimension. This matches our expectation because the numerical rank given by Chebyshev interpolation is $r_\epsilon^d$, which increases with the dimension $d$. Therefore, the sweeping compression benefits more in multidimensional problems.

This paper introduces an interpolative butterfly factorization as a data-sparse approximation of complementary low-rank matrices whose kernel functions satisfy certain analytic properties. More precisely, it represents such an $N\times N$ dense matrix as a product of $O(\log N)$ sparse matrices with a nearly optimal number of nonzero entries.
The construction and application of the interpolative butterfly factorization are highly efficient, with $O(N\log N)$ operation and memory complexity. The prefactor of the complexity is nearly optimal in the butterfly scheme.

Since applying the sparse factors is essentially a sequence of sparse matrix-vector multiplications with structured sparsity, this algorithm is especially interesting for distributed parallel computing. Based on the data distribution pattern given in [28], the problem can be easily distributed in a $d$-dimensional way, which is of great interest for extreme-scale computing. In other words, for a problem of size $N = n^d$, we could distribute the problem among $P = O(N)$ processes and achieve a communication complexity of $O(\alpha\log P + \beta\frac{N}{P}r\log P)$, where $\alpha$ is the message latency and $\beta$ is the per-process inverse bandwidth. It is a promising general framework for scalable implementations of a wide range of transforms in harmonic analysis.

Acknowledgments.
Y. Li was partially supported by the National Science Foundation under award DMS-1328230 and the U.S. Department of Energy's Advanced Scientific Computing Research program under award DE-FC02-13ER26134/DE-SC0009409. H. Y. was partially supported by the National Science Foundation under grant ACI-1450280 and thanks the AMS-Simons travel award for its support.
References

[1] A. Averbuch, R. Coifman, D. Donoho, M. Elad, and M. Israeli. Fast and accurate polar Fourier transform. Applied and Computational Harmonic Analysis, 21(2):145–167, 2006.
[2] O. Ayala and L.-P. Wang. Parallel implementation and scalability analysis of 3D fast Fourier transform using 2D domain decomposition. Parallel Computing, 39(1):58–77, 2013.
[3] G. Bao and W. Symes. Computation of pseudo-differential operators. SIAM Journal on Scientific Computing, 17(2):416–429, 1996.
[4] D. J. Bernstein. The tangent FFT. In Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pages 291–300. Springer, 2007.
[5] E. Candès, L. Demanet, and L. Ying. Fast computation of Fourier integral operators. SIAM Journal on Scientific Computing, 29(6):2464–2493, 2007.
[6] E. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
[7] E. J. Candès, L. Demanet, and L. Ying. A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Modeling & Simulation, 7(4):1727–1750, 2009.
[8] L. Demanet, M. Ferrara, N. Maxwell, J. Poulson, and L. Ying. A butterfly algorithm for synthetic aperture radar imaging. SIAM Journal on Imaging Sciences, 5(1):203–243, 2012.
[9] L. Demanet and L. Ying. Discrete symbol calculus. SIAM Review, 53(1):71–104, 2011.
[10] L. Demanet and L. Ying. Fast wave computation via Fourier integral operators. Mathematics of Computation, 81:1455–1486, 2012.
[11] L. Demanet and L. Ying. Fast wave computation via Fourier integral operators. Mathematics of Computation, 81(279), 2012.
[12] B. Engquist and L. Ying. Fast directional multilevel algorithms for oscillatory kernels. SIAM Journal on Scientific Computing, 29(4):1710–1737, 2007.
[13] B. Engquist and L. Ying. A fast directional algorithm for high frequency acoustic scattering in two dimensions. Communications in Mathematical Sciences, 7(2):327–345, 2009.
[14] W. M. Gentleman and G. Sande. Fast Fourier transforms: for fun and profit. In Proceedings of the November 7–10, 1966, Fall Joint Computer Conference, AFIPS '66 (Fall), pages 563–578, New York, NY, USA, 1966. ACM.
[15] L. Greengard and J.-Y. Lee. Accelerating the nonuniform fast Fourier transform. SIAM Review, 46(3):443–454, 2004.
[16] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[17] J. Hu, S. Fomel, L. Demanet, and L. Ying. A fast butterfly algorithm for generalized Radon transforms. Geophysics, 78(4):U41–U51, 2013.
[18] S. G. Johnson and M. Frigo. A modified split-radix FFT with fewer arithmetic operations. IEEE Transactions on Signal Processing, 55(1):111–119, 2007.
[19] L. Knockaert. Fast Hankel transform by fast sine and cosine transforms: the Mellin connection. IEEE Transactions on Signal Processing, 48(6):1695–1701, 2000.
[20] Y. Li, H. Yang, E. R. Martin, K. L. Ho, and L. Ying. Butterfly factorization. Multiscale Modeling & Simulation, 13(2):714–732, 2015.
[21] Y. Li, H. Yang, and L. Ying. Multidimensional butterfly factorization. arXiv:1509.07925 [math.NA], 2015.
[22] Y. Li, H. Yang, and L. Ying. A multiscale butterfly algorithm for Fourier integral operators. Multiscale Modeling & Simulation, 13(2):614–631, 2015.
[23] W. Liao and A. Fannjiang. MUSIC for single-snapshot spectral estimation: stability and super-resolution. Applied and Computational Harmonic Analysis, 40(1):33–67, 2016.
[24] E. Michielssen and A. Boag. A multilevel matrix decomposition algorithm for analyzing scattering from large structures. IEEE Transactions on Antennas and Propagation, 44(8):1086–1093, 1996.
[25] M. O'Neil, F. Woolfe, and V. Rokhlin. An algorithm for the rapid evaluation of special function transforms. Applied and Computational Harmonic Analysis, 28(2):203–226, 2010.
[26] D. Pekurovsky. P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM Journal on Scientific Computing, 34(4):C192–C209, 2012.
[27] D. Pekurovsky. P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM Journal on Scientific Computing, 34(4):C192–C209, 2012.
[28] J. Poulson, L. Demanet, N. Maxwell, and L. Ying. A parallel butterfly algorithm. SIAM Journal on Scientific Computing, 36(1):C49–C65, 2014.
[29] V. Rokhlin and M. Tygert. Fast algorithms for spherical harmonic expansions. SIAM Journal on Scientific Computing, 27(6):1903–1928, 2005.
[30] D. S. Seljebotn. Wavemoth: fast spherical harmonic transforms by butterfly matrix compression. The Astrophysical Journal Supplement Series, 199(1):5, 2012.
[31] A. Townsend. A fast analysis-based discrete Hankel transform using asymptotic expansions. SIAM Journal on Numerical Analysis, 53(4):1897–1917, 2015.
[32] D. O. Trad, T. J. Ulrych, and M. D. Sacchi. Accurate interpolation with high-resolution time-variant Radon transforms. Geophysics, 67(2):644–656, 2002.
[33] M. Tygert. Fast algorithms for spherical harmonic expansions, III. Journal of Computational Physics, 229(18):6181–6192, 2010.
[34] H. Yang and L. Ying. A fast algorithm for multilinear operators. Applied and Computational Harmonic Analysis, 33(1):148–158, 2012.
[35] B. Yazici, L. Wang, and K. Duman. Synthetic aperture inversion with sparsity constraints. In 2011 International Conference on Electromagnetics in Advanced Applications (ICEAA), pages 1404–1407, 2011.
[36] L. Ying. Sparse Fourier transform via butterfly algorithm. SIAM Journal on Scientific Computing, 31(3):1678–1694, 2009.
[37] L. Ying, L. Demanet, and E. Candès. 3D discrete curvelet transform. In Optics & Photonics 2005, pages 591413–591413. International Society for Optics and Photonics, 2005.
[38] Z. Zhao, Y. Shkolnisky, and A. Singer. Fast steerable principal component analysis.