Wavemoth -- Fast spherical harmonic transforms by butterfly matrix compression
arXiv:… [astro-ph.IM]. Draft version August 1, 2018
Preprint typeset using LaTeX style emulateapj v. 5/2/11
D. S. Seljebotn
ABSTRACT

We present Wavemoth, an experimental open source code for computing scalar spherical harmonic transforms (SHTs). Such transforms are ubiquitous in astronomical data analysis. Our code performs substantially better than existing publicly available codes due to improvements on two fronts. First, the computational core is made more efficient by using small amounts of precomputed data, as well as by paying attention to CPU instruction pipelining and cache usage. Second, Wavemoth makes use of a fast and numerically stable algorithm based on compressing a set of linear operators in a precomputation step. The resulting SHT scales as O(L² log² L) for the resolution range of practical interest, where L denotes the spherical harmonic truncation degree. For low and medium-range resolutions, Wavemoth tends to be twice as fast as libpsht, which is the current state of the art implementation for the HEALPix grid. At the resolution of the Planck experiment, L ∼ …

Subject headings: Methods: numerical

1. BACKGROUND
The spherical harmonic transform (SHT) is the spherical analog of the Fourier transform, and is an essential tool for data analysis and simulation on the sphere. A scalar field f(θ, φ) on the unit sphere can be expressed as a weighted sum of the spherical harmonic basis functions Y_ℓm(θ, φ),

f(θ, φ) = Σ_{ℓ=0}^{∞} Σ_{m=−ℓ}^{ℓ} a_ℓm Y_ℓm(θ, φ).  (1)

The coefficients a_ℓm contain the spectral information of the field, with higher ℓ corresponding to higher frequencies. In calculations the spherical harmonic expansion is truncated for ℓ > L, and the spherical field is represented by O(L²) grid samples. Computing the sum above is known as the backward SHT or synthesis, while the inverse problem of finding the spherical harmonic coefficients a_ℓm given the field f is known as the forward SHT or analysis.

[email protected]
Institute of Theoretical Astrophysics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway

In order to compute an SHT, the first step is nearly always to employ a separation of sums, which we review in Section 2.3, to decrease the cost from O(L⁴) to O(L³). We will refer to codes that take no measures beyond this to reduce complexity as brute-force codes. Of these, HEALPix (Górski et al. 2005) is one very widely used package, in particular among CMB researchers. Recently, the libpsht package (Reinecke 2011) halved the computation time with respect to the original HEALPix implementation, simply through code optimizations. As of version 2.20, HEALPix uses libpsht as the backend for SHTs. Other packages using the brute-force algorithm include S²HAT (Hupca et al. 2010; Szydlarski et al. 2011), focusing on cluster parallelization and implementations on the GPU, as well as GLESP (Doroshkevich et al.
2005) and ssht (McEwen & Wiaux 2011), focusing on spherical grids with more accurate spherical harmonic analysis than what can be achieved on the HEALPix grid.

The discovery of Fast Fourier Transforms (FFTs) has been all-important for signal analysis over the past half century, and there is no lack of high quality commercial and open source libraries to perform FFTs with stunning speed. Unfortunately, the straightforward divide-and-conquer FFT algorithms do not generalize to SHTs, and research in fast SHT algorithms has yet to reach maturity in the sense of widely adopted algorithms and libraries. The libftsh library (Mohlenkamp 1999) uses local trigonometric expansions to compress the spherical harmonic linear operator, resulting in a computational scaling of O(L^{5/2} log L) in finite precision arithmetic. SpharmonicKit (Healy et al. 2003) implements a divide-and-conquer scheme which scales as O(L² log² L). We comment further on these in Section 4.4. Other algorithms have also been presented but either suffer from problems with numerical stability, are impractical for current resolutions, or simply lack publicly available implementations (e.g., Suda & Takami 2002; Kunis & Potts 2003; Rokhlin & Tygert 2006; Tygert 2008, 2010).

We present Wavemoth, an experimental open source implementation of the algorithm of Tygert (2010). This algorithm has several appealing features. First, it is simple to implement and optimize. Second, it is inherently numerically stable. Third, its constant prefactor is reasonable, yielding substantial gains already at L ∼ …. The computational scaling is O(L² log² L), but lowering the requested accuracy makes the constant prefactor smaller.

http://github.com/wavemoth; commit 59ec31b8 was used to produce the results of this paper.

We stress that our work consists solely in providing an optimized implementation. While we review the basics of the algorithm in Section 3, Tygert (2010) should be consulted for details and proofs.
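To make the baseline concrete, equation (1) truncated at ℓ ≤ L can be evaluated directly. The following Python sketch is our own illustrative stand-in built on SciPy's `lpmv`, not part of Wavemoth or any of the codes above; the coefficient layout (a dict over (ℓ, m) with m ≥ 0, as appropriate for a real field) is an assumption made for the example:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def p_tilde(l, m, z):
    """Normalized associated Legendre function, as in equation (2)."""
    norm = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - m) / factorial(l + m))
    return norm * lpmv(m, l, z)

def synthesize(alm, L, theta, phi):
    """Brute-force SHT synthesis of a real field: O(L^2) work per sample point.

    alm maps (l, m) with m >= 0 to a complex coefficient; negative-m terms
    enter through the real-field symmetry a_{l,-m} = (-1)^m conj(a_{lm})."""
    f = np.zeros_like(theta, dtype=float)
    z = np.cos(theta)  # z = cos(colatitude)
    for l in range(L + 1):
        for m in range(l + 1):
            a = alm.get((l, m), 0.0)
            Y = p_tilde(l, m, z) * np.exp(1j * m * phi)
            # m = 0 contributes once; m > 0 terms pair with their conjugates.
            f += np.real(a * Y) if m == 0 else 2.0 * np.real(a * Y)
    return f

# A pure monopole comes out as the constant Y_00 = 1/sqrt(4*pi) everywhere.
theta, phi = np.array([0.3, 1.2]), np.array([0.0, 2.5])
f = synthesize({(0, 0): 1.0 + 0j}, 0, theta, phi)
```

Each sample point costs O(L²) work, so a grid of O(L²) points costs O(L⁴) in total; the separation of sums reviewed in Section 2.3 is what brings this down to O(L³).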
We have focused in particular on the HEALPix grid, and use libpsht as our baseline for comparisons. However, all methods work equally well for any other grid with iso-latitude rings.

Section 2 reviews SHTs in more detail, as well as the computational methods that are widely known and used across all popular codes. Section 3 reviews the algorithm of Tygert (2010) and how we have adapted it to our purposes. Section 4 focuses on the high-level aspects of software development and provides benchmarks, while an appendix provides the low-level implementation details.

2. BASELINE ALGORITHMS
2.1. The spherical harmonic basis functions
We use the convention that points on the sphere are parameterized by a co-latitude θ ∈ [0, π], where 0 corresponds to the "north pole", and a longitude φ ∈ [0, 2π). The spherical harmonic basis functions Y_ℓm(θ, φ) can then be expressed in terms of the associated Legendre functions P_ℓ^m(z). Assuming m ≥
0, we have

Y_ℓm(θ, φ) = √[ (2ℓ+1)/(4π) · (ℓ−m)!/(ℓ+m)! ] P_ℓ^m(cos θ) e^{imφ} ≡ P̃_ℓ^m(cos θ) e^{imφ},  (2)

where we define the normalized associated Legendre function P̃_ℓ^m. Our definition follows that of Press et al. (2007); the normalization differs by a factor of √(2π). While Y_ℓm and the coefficients a_ℓm are complex, P̃_ℓ^m is real for the argument range of interest. For negative m, the symmetry Y_{ℓ,−m} = (−1)^m Y*_{ℓm} can be used, although this is only needed for complex fields. Wavemoth only supports real fields, which have spherical harmonic expansions obeying a_{ℓm} = (−1)^m a*_{ℓ,−m}.

2.2. Discretization and the forward transform
For computational work one has to assume that one is working with a band-limited signal, so that a_ℓm = 0 when ℓ > L. The SHT synthesis is then given simply by evaluating equation (1) in a set of points on the sphere.

The opposite problem of computing a_ℓm given f(θ_j, φ_j), namely spherical harmonic analysis, is less straightforward. In the limit of infinite resolution, we have

a_ℓm = ∫ f(θ, φ) Y*_ℓm(θ, φ) dΩ,  (3)

where dΩ indicates integration over the sphere. This follows easily from the orthogonality property,

∫ Y_ℓm Y*_{ℓ′m′} dΩ = δ_{ℓℓ′} δ_{mm′}.  (4)

There is no canonical way of choosing sample points on the sphere. The simplest grid conceptually is the equiangular grid. Doroshkevich et al. (2005) and McEwen & Wiaux (2011) describe grids that carry the orthogonality property of the continuous spherical harmonics over to the discretized operator. In contrast, the HEALPix grid (Górski et al. 2005) trades orthogonality for the property that each pixel has the same area, which is convenient for many operations in the pixel basis.

Independent of what grid is chosen, a natural approach to spherical harmonic analysis is to use a quadrature rule with some weights w_j, so that

a_ℓm = Σ_{j=1}^{N_pix} w_j f(θ_j, φ_j) Y*_ℓm(θ_j, φ_j).  (5)

On the HEALPix grid the numerical accuracy of this approach is limited, but it is still the most common procedure.

Some real world signal analysis problems do not need the forward transform at all. In the presence of measurement noise in the pixel basis, one can argue that the best approach is not to pull the noise part of the signal into the spherical harmonic basis at all. For instance, consider the archetypical CMB data model,

d = Ys + n,  (6)

where d represents a vector of pixels on the sky with observed data (not necessarily the full sky), s represents our signal of interest in spherical harmonic basis, and n represents instrumental noise in each pixel.
Spherical harmonic synthesis is denoted Y; note that equation (1) describes a linear operator and can be written f = Ya. If we now assume that s and n are Gaussian random vectors with vanishing mean and known covariance matrices S and N, respectively, then the maximum likelihood estimate of the signal is given by

ŝ = (S⁻¹ + Y† N⁻¹ Y)⁻¹ Y† N⁻¹ d,  (7)

with ŝ in spherical harmonic basis. This system can be solved with reasonable efficiency by iterative methods. Note that we are here only concerned with the effect of Y as a non-invertible projection, and that no spherical harmonic analysis is ever performed, only the adjoint synthesis. Thus, neither the non-orthogonality caused by the HEALPix grid, nor masking out large parts of the sky, is a concern. See Eriksen et al. (2008) and references therein for more details on this technique in the context of CMB analysis.

2.3. Applying the Fast Fourier Transform
The first step in speeding up the spherical harmonic transform beyond the O(L⁴) brute-force sum is a simple separation of sums. For this to work well, pixels must be arranged on a set of iso-latitude rings, with equidistant pixels within each ring. All grids in use for high-resolution data have this property. We show the case for SHT synthesis; analysis can be treated in the same way. Starting from equation (1), we have, for pixel j within ring k, and with z_k ≡ cos θ_k,

f(θ_k, φ_{k,j}) = Σ_{m=−L}^{L} [ Σ_{ℓ=|m|}^{L} a_ℓm P̃_ℓ^m(z_k) ] e^{imφ_{k,j}} ≡ Σ_{m=−L}^{L} q_m(z_k) e^{imφ_{k,j}},  (8)

where we introduce q_m(z_k). Assuming that ring k contains J_k pixels, their equidistant longitudes are given by

φ_{k,j} = φ_{k,0} + 2πj/J_k.  (9)

Since e^{ix} has period 2π, and since q_m(z_k) = 0 whenever |m| > L, we find that

Σ_{m=−L}^{L} q_m(z_k) e^{imφ_{k,j}} = Σ_{j′=0}^{J_k−1} τ_{j′}(z_k) e^{2πijj′/J_k}  (10)

with

τ_{j′}(z_k) = Σ_{t=−∞}^{∞} q_{J_k t + j′}(z_k) e^{iφ_{k,0}(J_k t + j′)}.  (11)

Thus one can phase-shift the coefficients q_m(z_k) to match the ring grid, wrap around or pad with zeros, and perform a regular backward FFT. The symmetries of the spherical harmonic coefficients of a real field carry over directly to the Hermitian property of real Fourier transforms. This separation of sums represents a first step in speeding up the SHT, and is implemented in all packages for high-resolution spherical harmonic transforms.

2.4. Legendre transforms and even/odd symmetry
The function q_m(z) introduced in equation (8) is known as the (associated) Legendre transform of order m,

q_m(z_k) = Σ_{ℓ=m}^{L} P̃_ℓ^m(z_k) a_ℓm,  (12)

assuming m ≥
0. The following symmetry cuts the arithmetic operations required in an SHT in half, as long as the spherical grid distributes the rings symmetrically around the equator. For any non-negative integer n, the functions P̃^m_{m+2n}(z) are even and P̃^m_{m+2n+1}(z) are odd. We define q_m^even and q_m^odd so that q_m^even contains the even-numbered and q_m^odd the odd-numbered terms of equation (12), and so that

q_m(z) = q_m^even(z) + q_m^odd(z).  (13)

Then, since q_m^even and q_m^odd are weighted sums of even and odd functions, respectively, they are themselves even and odd, so that q_m(−z) can be computed at the same time essentially for free,

q_m(−z) = q_m^even(z) − q_m^odd(z).  (14)

For spherical harmonic analysis, one uses the orthogonality property. Assuming m ≥ 0,

∫_{−1}^{1} P̃_ℓ^m(z) P̃_{ℓ′}^m(z) dz = δ_{ℓℓ′},  (15)

so that

a_ℓm = ∫_{−1}^{1} P̃_ℓ^m(z) q_m(z) dz.  (16)

As discussed in Section 2.2, the resulting quadrature used in calculations can be exact or approximate, depending on the placement of the pixel rings. One can also in this case cut computation time in half by treating even and odd ℓ − m separately.

3. FAST LEGENDRE TRANSFORMS
As the Fourier transform part is essentially a solved problem, efforts to accelerate SHTs revolve around speeding up the Legendre transforms. Let us write equation (12) as

q = Λᵀ a,  (17)

where we leave m and the odd versus even case implicit. For a full SHT, such a product must be computed for each of 2(L+1) different Λ matrices. The backwards Legendre transform required for spherical harmonic analysis is similarly

a = Λq,  (18)

give or take a set of quadrature weights.

The idea of Fast Legendre Transform algorithms is to compute equations (17) and (18) faster than O(L N_ring). The approach of Tygert (2010) is to factor Λ as a product of block-diagonal matrices in a precomputation step, which can significantly reduce the number of elements in total. This technique is known as butterfly compression, and was introduced by Michielssen & Boag (1996). The accuracy of the compression is tunable, but even nearly loss-less compression with close to double precision accuracy is able to yield significant gains as the resolution increases. We review the algorithm below, but stress again that the reader should consult Tygert (2010) for the full details.

3.1. The interpolative decomposition
The core building block of the compression algorithm is the Interpolative Decomposition (ID), described in Cheng et al. (2005). Assume that an m × n matrix A has rank k; then the ID is

A = A^{(k)} Ã.  (19)

The matrix A^{(k)}, known as the skeleton matrix, consists of k columns of A, whereas Ã, the interpolation matrix, interpolates the eliminated columns from the ones that are preserved. Of course, k of the columns of Ã must form the identity matrix.

The ID is obviously not unique; the trick is to find a decomposition that is numerically stable. The algorithm of Cheng et al. (2005) finds an interpolation matrix Ã so that no element has absolute value greater than 2, all singular values are larger than or equal to 1, and the spectral norm is bounded by √(k(n−k)+1). The numerical precision of the decomposition is tunable, as the decomposition found by the algorithm satisfies

‖A − A^{(k)} Ã‖ ≤ √(k(n−k)+1) σ_{k+1},  (20)

where σ_{k+1} is the (k+1)-th greatest singular value of A. Implementing lossy compression is simply a matter of reducing the accuracy required of the IDs we use.

3.2. Butterfly matrix compression
We now use the ID recursively to factor the matrix Λ. After applying p levels of compression, we have

Λ = R S_p P_{p−1} S_{p−1} ⋯ P_1 S_1,  (21)

Figure 1.
Illustration of the butterfly matrix compression scheme. On the first level, we use the Interpolative Decomposition to compress sub-blocks of the matrix Λ and produce the factorization Λ = R′₁S₁, where all blocks in R′₁ have full rank. We then proceed by permuting the columns of R′₁ so that Λ = R₁P₁S₁, in order to create new rank-deficient blocks. The contents of the S₁ matrix is saved as precomputed data, while we carry R₁ along for further compression on the next level. The algorithm continues in this fashion until the residual matrix R only consists of a single diagonal of full-rank blocks. The final factorization becomes Λ = RS₃P₂S₂P₁S₁. The permutations involved are known in the FFT literature as butterfly permutations; the "butterfly" can be seen twice in the pattern of P₂.

Here R is a block-diagonal residual matrix containing elements that were not compressed, the S_i are block-diagonal matrices containing compressed data, and the P_i are permutation matrices. See Figure 1 for an illustration. The structure of the permutations is very similar to the butterflies used in FFT algorithms, hence the name of the compression scheme. In fact, if one lets S_i contain a specific set of 2 × 2 blocks, one recovers an FFT-like factorization.

The first step of the algorithm is to partition Λ into 2^p column blocks. The number of levels p is mainly determined by the number of columns in the matrix, so that the column blocks are all roughly of the same predetermined width. In our case, 64 columns worked well.

We then split each block roughly in half horizontally, and compress each resulting block using the ID,

Λ = [ T_{1,1} T_{1,2} ⋯ ; B_{1,1} B_{1,2} ⋯ ]
  = [ (T^{(k)}_{1,1} T̃_{1,1}) (T^{(k)}_{1,2} T̃_{1,2}) ⋯ ; (B^{(k)}_{1,1} B̃_{1,1}) (B^{(k)}_{1,2} B̃_{1,2}) ⋯ ],

where the first subscript of each matrix refers to this being the first iteration of the algorithm. It is useful to write the above matrix as

Λ = [ T^{(k)}_{1,1} 0 T^{(k)}_{1,2} 0 ⋯ ; 0 B^{(k)}_{1,1} 0 B^{(k)}_{1,2} ⋯ ] · blkdiag( [T̃_{1,1}; B̃_{1,1}], [T̃_{1,2}; B̃_{1,2}], … ).

We denote the right matrix S₁.
It cannot be further processed and its blocks are simply saved as precomputed data, making use of the fact that each block embeds the identity matrix in a subset of its columns.

The left matrix can be permuted and further compressed. For some permutation matrix P₁ we have

Λ = [ T^{(k)}_{1,1} 0 T^{(k)}_{1,2} 0 ⋯ ; 0 B^{(k)}_{1,1} 0 B^{(k)}_{1,2} ⋯ ] S₁
  = [ T^{(k)}_{1,1} T^{(k)}_{1,2} 0 0 ⋯ ; 0 0 B^{(k)}_{1,1} B^{(k)}_{1,2} ⋯ ] P₁ S₁,

where P₁ groups neighboring skeleton blocks next to one another. Then we join blocks horizontally, split them vertically, and compress each resulting block. For the top-left corner we have

[ T^{(k)}_{1,1} T^{(k)}_{1,2} ] = [ T_{2,1} ; B_{2,1} ] = [ T^{(k)}_{2,1} T̃_{2,1} ; B^{(k)}_{2,1} B̃_{2,1} ].  (22)

Applying this to all blocks in the matrix, we get

Λ = [ (T^{(k)}_{2,1} T̃_{2,1}) ⋯ ; (B^{(k)}_{2,1} B̃_{2,1}) ⋯ ] P₁ S₁
  = [ T^{(k)}_{2,1} ⋯ ; B^{(k)}_{2,1} ⋯ ] · blkdiag( [T̃_{2,1}; B̃_{2,1}], … ) P₁ S₁.

And so the scheme continues. For each iteration the number of diagonals in the left matrix is halved, the number of blocks in each diagonal is doubled, and the height of each block is roughly halved. Eventually the left matrix consists only of a single diagonal band of blocks, and further compression is impossible. This becomes the residual matrix R of equation (21).

The efficiency of the scheme relies on the non-trivial requirement that the T^{(k)} and B^{(k)} blocks are rank-deficient at every level of the algorithm. To get a handle on which matrices exhibit this behavior, we start with assuming the rank property, namely that any contiguous rectangular sub-block of Λ, up to the numerical precision chosen, has rank proportional to the number of elements in the sub-block. That is, the rank does not depend on the location or shape of the block.
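The ID step at the heart of this scheme can be tried directly: SciPy wraps the same Fortran ID library of Martinsson, Rokhlin and Tygert that is used in Wavemoth's precomputation. The smooth test kernel below is our own stand-in for a numerically rank-deficient sub-block of Λ, chosen only for illustration:

```python
import numpy as np
import scipy.linalg.interpolative as sli

# A smooth kernel sampled on a grid: numerically rank-deficient,
# standing in for a contiguous sub-block of Lambda.
x = np.linspace(0.0, 1.0, 60)
A = np.exp(-np.subtract.outer(x, x) ** 2)

# ID at tolerance 1e-10: returns the numerical rank k, the indices of the
# k skeleton columns, and projection coefficients for the eliminated columns.
k, idx, proj = sli.interp_decomp(A, 1e-10, rand=False)
A_skel = sli.reconstruct_skel_matrix(A, k, idx)      # A^(k): k columns of A
A_interp = sli.reconstruct_interp_matrix(idx, proj)  # interpolation matrix; embeds a k x k identity

assert k < A.shape[1]                                # genuinely compressed
assert np.allclose(A, A_skel @ A_interp, atol=1e-6)  # accurate to the requested tolerance
```

Raising the tolerance (say, to 10⁻⁵) lowers the rank k, which is precisely the lossy-compression knob described in Section 3.1.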
Now, each time the butterfly algorithm joins two skeletons, such as [T^{(k)}_{1,1} T^{(k)}_{1,2}] in equation (22), the resulting matrix has roughly 2k columns while spanning a corresponding block of Λ of rank k. Therefore, half of the columns can be eliminated by applying the ID. Since the data volume is roughly halved at each compression level, and since S_i at each level has O(L) interpolation matrices of size roughly k × k with k = O(1), the resulting compressed representation of Λ has O(L log L) elements. See Tygert (2010) for a more detailed argument.

O'Neil et al. (2010) prove the rank property in the case of Fourier transforms and Fourier–Bessel transforms. It is, however, not proven in the case of associated Legendre functions P̃_ℓ^m(z). Figure 2 shows our results for resolutions up to L ∼ 131072.

3.3. Notes on interpolation
Tygert (2008) describes an elegant and exact interpolation scheme which, in the case of the HEALPix grid and L = 2N_side, reduces the number of required evaluation points for q_m(z_k) by 2/3. Although our conclusion was not to include this step in our code, we include a brief discussion in order to motivate our decision.

We focus on the even Legendre functions; the odd case is similar. Let n be an integer such that L < m + 2n.
Figure 2. Efficiency of the butterfly compression scheme, for L ranging from 32 to 131072. Left panel: Estimated size of compressed data (solid) compared to uncompressed matrix size (dashed). A line proportional to O(L log L) (dotted) is shown for comparison. In each case we use a HEALPix grid with N_side = L/2. Right panel: A closer look at the computational scaling. The size of the compressed data is shown divided by candidate scalings O(L) (dashed), O(L log L) (solid) and O(L log² L) (dotted), using an arbitrary normalization.

The function P̃^m_{m+2n}(x) has n roots in the interval (0, 1), denoted z_1, …, z_n. Now, assuming that we have evaluated q_m^even in these roots, we can interpolate to any other point y ∈ (−1, 1) by using the formula

q_m^even(y) = ω(y) Σ_{i=1}^{n} [ γ(z_i) / (y − z_i) ] q_m^even(z_i),  (23)

for some precomputed weights ω(y) and γ(z_i). The proof relies on the Christoffel–Darboux identity for the normalized associated Legendre functions (Tygert 2008; Jakob-Chien & Alpert 1997). The Fast Multipole Method (FMM) allows the computation of equation (23) for p points with an operation count of order O(p + n) rather than O(pn). The FMM was originally developed for accelerating N-body simulations, but is here motivated algebraically. For more information about the one-dimensional FMM we refer to Yarvin & Rokhlin (1999) and Dutt et al. (1996).

The reason we did not include this step in our code is that much of the interpolation is already embedded in the butterfly matrix compression. Consider for instance N_side = 2048, L = 3N_side and m = 2000. The full matrix Λ occupies 65 MiB when evaluated in the HEALPix co-latitude nodes, and only 49 MiB when evaluated in the optimal nodes as described above. However, after compression the difference is only 10.4 MiB versus 9.4 MiB. Thus the butterfly compression compensates, at least partially, for the over-sampling. Indeed, Martinsson & Rokhlin (2007) use a strongly related matrix compression technique to implement the FMM itself.

Interpolation would also make the precomputed data independent of the chosen grid and resolution. However, we found the constant prefactor in the FMM to be quite high, and including it only as a matter of convenience appears to be out of the question for our target resolutions. Since the FMM has a linear computational scaling, the question should be revisited for higher resolutions.

3.4. CPU and memory trade-offs
So far we have focused on reducing the number of floating point operations (FLOPs). However, during the past decade the speed of the CPU has increased much more rapidly than the system memory bandwidth, so that in current multi-core computers it is easy to get in a situation where the CPUs are starved for data to process. When processing only one or a few transforms concurrently, the volume of the precomputed data is much larger than the volume of the maps being transformed, so that the limitation is moving the precomputed data over the memory bus, not processing power. Note that in the case of very many simultaneous transforms the problem is alleviated, since the movement of precomputed data is amortized. Following in the footsteps of libpsht, and our own requirements in CMB analysis, we have restricted our attention to between one and ten concurrent transforms. While the butterfly algorithm probably performs well in the face of many concurrent transforms, it would require additional blocking and optimization beyond what we have implemented, so that movement of the working set in memory is properly amortized. Note that as each m is processed independently, the working set is only about 1/L of the total input.

The considerations above motivate stopping compression early, after a significant reduction in the floating-point operation count has been achieved, but before the size of the precomputed data becomes too large (see Figure 3). Butterfly compression has the convenient feature that the blocks in the residual matrix R consist of contiguous slices from columns of Λ. By orienting Λ so that rows are indexed by ℓ and columns by z, the elements of the residual blocks can be computed on the fly from three-term recurrence formulas for the associated Legendre functions. We return to this topic in Section A.2.

As an example, consider N_side = 2048 and m = 400. The uncompressed matrix Λ takes 64 MB in double precision.
This can be compressed to 20% of the original size by using five levels of compression, with the uncompressed residual R accounting for about 13% of the compressed data. If one instead stops after three levels of compression, then although the size of the compressed data has now grown to 24% of the original, 57% of this is made out of elements in R. Since one only needs to store two elements for every column of 512 elements in R, and can generate the rest on the fly, stopping compression after three levels reduces the memory bus traffic and size of precomputed data by about 40%, at the cost of some extra CPU instructions. Note that the brute-force codes may simply be seen as the limit of zero levels of compression.

Figure 3. Effect of each level of butterfly compression, showing matrix size (in elements) against the number of compression iterations. The size of the compressed data (solid) is the sum of elements in residual uncompressed blocks in R (dashed) and the interpolation matrices S_i (dotted). While R can be generated on the fly during transforms, the S_i need to be stored as precomputed data, so that the choice of compression level is a trade-off between CPU use and the size of the precomputed data. Parameters for this figure are N_side = 2048, L = 3N_side, m = 0, and an initial chunk size of 32 columns.

4. IMPLEMENTATION & RESULTS
4.1. Technology
The Wavemoth library is organized in a core part and an auxiliary part. The core is primarily written in C and contains the routines for performing spherical harmonic transforms. The auxiliary shell around the core is written as a Python package, and is responsible for generating the precomputed data using the butterfly compression algorithm, as well as for the regression and unit tests.

By writing the core in pure C we remain close to the hardware, and make sure the library can be used without Python. C remains the easiest language to call from other languages such as Fortran, C++, Java, Python, MATLAB, and so on. By using Python in the auxiliary support code we accelerate development of the parts that are not performance critical, and make writing tests a pleasant experience. Being able to quickly write unit tests is indispensable, as it allows optimizing the C code iteratively without introducing bugs. Since individual pieces of the C core are tested, there is both a public API for end-users and a private API that is used from Python to test individual C routines in isolation. Much of the support code is implemented in Cython (Behnel et al. 2011), which bridges the worlds of Python and C.

The C core depends on files containing precomputed data, a Fourier transform library and a BLAS library. For the latter two we use FFTW3 (Frigo & Johnson 2005) and ATLAS (Whaley et al. 2001), respectively. Parts of the Wavemoth core are written using templates in order to generate many slight variations of the same C routine. We use Tempita (http://pythonpaste.org/tempita/), a purely text-oriented templating language, and find this to be much more convenient for optimizing a computational core than the type-oriented templates of C++.
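To illustrate the text-oriented templating approach, here is a minimal sketch using plain Python string formatting as a stand-in for Tempita (the routine name and unrolling factors are invented for the example; Wavemoth's actual templates are more involved):

```python
# Sketch of text-oriented templating: one template expands into several
# slightly different C routines, e.g. with different unrolling factors.
# This mimics the role Tempita plays in the build; it is not Tempita itself.
C_TEMPLATE = """\
static void accumulate_{nvec}(const double *a, double *q, int n) {{
    for (int i = 0; i < n; i += {nvec}) {{
{body}    }}
}}
"""

def render(nvec):
    # Unroll the loop body nvec times so the C compiler can vectorize it.
    body = "".join(f"        q[i + {j}] += a[i + {j}];\n" for j in range(nvec))
    return C_TEMPLATE.format(nvec=nvec, body=body)

variants = {n: render(n) for n in (2, 4)}
print(variants[4])
```

A type-oriented C++ template could express the same unrolling, but a text pass like this keeps the generated C readable and debuggable, which is the trade-off mentioned above.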
During the precomputations, we use the open source Fortran 77 library ID to compute the Interpolative Decomposition, and libpsht to generate the associated Legendre functions.

Unlike libpsht, we have not focused on portability, and Wavemoth is only tested on 64-bit Linux with the GCC compiler on Intel-platform CPUs. Computational cores are written using SSE intrinsics and 128-bit registers. More work is needed for optimal performance on the latest Intel micro-architecture, which supports 256-bit registers, or on non-Intel platforms. Beyond that, we expect no hurdles in improving portability.

4.2. Benchmarks
We include benchmarks for two different systems with different memory bandwidth, as Wavemoth's performance is deeply influenced by this aspect of the hardware. Figure 4 presents benchmarks taken on a 64-core 2.27 GHz Intel Xeon X7560 (Nehalem micro-architecture), which has a compute-to-bandwidth ratio of about 45:1. Figure 5 presents benchmarks taken on a 48-core 2.2 GHz AMD Opteron 6174. The compute-to-bandwidth ratio is in this case about 64:1, significantly worse than the Intel system. The consequence is that butterfly compression gives less of an advantage, with only about four times speedup over libpsht at L = 4096, compared to the corresponding six times speedup achieved on the Intel system. In the case of ten simultaneous transforms, libpsht achieves a very consistent 2x speedup which Wavemoth is not able to fully match, as most of our tuning effort has been on the single transform path.

The highest tested accuracy of ǫ = 10^-… for the Legendre transforms was chosen because current codes using the HEALPix grid only agree to this accuracy at high resolutions (Reinecke 2011).

An important aspect of the systems for our purposes is the non-uniform memory access (NUMA). On each system, the CPU cores are grouped into eight nodes, and the RAM chips are evenly divided between the nodes. Each CPU only has direct access to RAM chips on the local node, and must go through a CPU interconnect bus to access other RAM chips. For consistent performance we need to ensure that Wavemoth distributes the precomputed data in such a way that each CPU finds the data it needs in its local RAM chips. In the benchmarks we always use a whole number of nodes, so that computation power and memory bandwidth scale together. The exception is benchmarks using a single core, but in those cases, Wavemoth's precomputed data fits in cache anyway.

Table 1 lists the sizes of the precomputed data.
To balance bandwidth and CPU requirements as described in Section 3.4, the precomputation code takes a parameter ρ, specifying the cost of floating point operations relative to memory transfers.

http://cims.nyu.edu/~tygert/software.html

The Intel system supports transfer of 13 billion numbers per second and has a theoretical peak compute power of 580 GFLOPS, using all 64 cores. The AMD system supports transfer of 6.5 billion numbers per second and has a theoretical peak compute power of 422 GFLOPS, using all 48 cores. All numbers refer to double precision floating point.

Figure 4. Benchmarks for full SHTs performed on the Intel system; the panes show CPU seconds per transform and the factor speedup as functions of L. The left pane shows timings for a single transform, the right for ten simultaneous transforms. We scale up the number of CPU cores together with the resolution. Each pane is divided into partially overlapping segments corresponding to 1, 8, 16, 32, and 64 CPU cores, respectively (indicated by white/gray backgrounds and changes in line colors). The libpsht code (red triangles) is compared to Wavemoth (blue/black) with no compression (solid, circles), compression with precision 10^-… (dashed, diamonds) and compression with precision 10^-… (dotted, crosses). In each case we use a HEALPix grid with resolution N_side = L/2. Note for instance how both codes suffer from parallelization overhead at the transition from one to eight cores, but that libpsht suffers less and catches up with Wavemoth. For a single transform at high resolutions, the situation is the contrary, with Wavemoth parallelizing better at the jump from 16 to 32 cores and from 32 to 64 cores. We repeated each benchmark multiple times both with and without HyperThreading, and report the fastest wall clock time achieved, multiplied by the number of CPU cores used and divided by the number of simultaneous transforms. Some 32-core timings for ten simultaneous transforms at L = 8192 could not be obtained due to memory limitations. The load time of the precomputed data from the hard drive is not included.
Figure 5.
Benchmarks for full SHTs performed on the AMD system, using all 48 CPU cores. The libpsht code (red triangles) is compared to Wavemoth (blue) with no compression (solid, circles), compression with precision 10− (dashed, diamonds), and compression with precision 10− (dotted, crosses). The left pane shows a single transform and the right pane ten simultaneous transforms. In each case we use a HEALPix grid with resolution N_side = L/2. The large speedup in the range L = 256 ..

Table 1
Size of precomputed data

  L    | No comp. | Intel (ρ = 7.5)       | AMD (ρ = 18)          | Precomp. time
       |          | Tol. 10−  | Tol. 10−  | Tol. 10−  | Tol. 10−  | (CPU minutes)
  32   | 130 KiB  | –         | –         | –         | –         | 0.02
  64   | 496 KiB  | –         | –         | –         | –         | 0.03
  128  | 2.0 MiB  | 2.0 MiB   | 2.0 MiB   | 1.9 MiB   | 1.9 MiB   | 0.22
  256  | 8.0 MiB  | 8.0 MiB   | 8.0 MiB   | 7.1 MiB   | 7.1 MiB   | 2.6
  512  | 27 MiB   | 174 MiB   | 187 MiB   | 27 MiB    | 27 MiB    | 7.4
  1024 | 102 MiB  | 937 MiB   | 988 MiB   | 102 MiB   | 170 MiB   | 12
  2048 | 389 MiB  | 6.0 GiB   | 5.8 GiB   | 4.4 GiB   | 4.3 GiB   | 90
  4096 | 1.5 GiB  | 38 GiB    | 35 GiB    | 28 GiB    | 27 GiB    | 536
  8192 | 5.8 GiB  | 212 GiB   | 208 GiB   | –         | –         | 4380

Note. — The precomputation time quoted is the wall time taken to compute at tolerance 10− on the Intel system, multiplied by the number of CPU cores used. We use 1 core for L = 32, and then scale up gradually to 64 cores at L = 8192. The precomputed data is saved to a network file system.

Table 2
Samples of numerical accuracy

  L | No compression | Tolerance 10− | Tolerance 10−

Note. — In each case, the transform of a single Gaussian sample is compared against libpsht double precision results.

We calibrated ρ at L = 4096, resulting in optimal choices of ρ = 7.5 on the Intel system and ρ = 18 on the AMD system. Performing the precomputations scales as O(L³). In the case of no compression, we still store the precomputed quantities necessary for the Legendre recurrence relations in memory, as described in the appendix. Loading this data from memory is not necessarily faster than computing it on the fly, but doing so saved some development time.

All methods involved are numerically stable and well understood, so we do not include a rigorous analysis of numerical accuracy. Table 2 lists the relative error from transforming a single set of standard Gaussian coefficients per configuration. We use the relative error

\epsilon = \left[ \sum_{i=1}^{N_{\rm pix}} (x_i - y_i)^2 \Big/ \sum_{i=1}^{N_{\rm pix}} x_i^2 \right]^{1/2},   (24)

where x_i denotes the result of libpsht and y_i the result of our code. The discrepancies in the no-compression, high-L cases are due to using a different recurrence for the associated Legendre functions, as described in Appendix A.2. As we did not compare with higher precision results, it is not clear whether it is our code, libpsht, or both, that lose precision with higher resolution. Note that the input data to the butterfly compression is generated using libpsht.

Higher resolutions
Due to memory constraints we have not gone to higher resolutions than L ∼ 8192 with full transforms; to probe the scaling beyond this point we instead compress individual Λ_m rather than the full SHT. At each resolution, we compress Λ^odd_m for 20 different m, and fit the cost estimate

\hat{c}_m = \alpha + \sum_{p=0}^{2} \beta_p \, m \log(1+m)^p   (25)

by least squares minimization in the parameters α, β_0, β_1 and β_2. The final cost is then estimated by

\hat{c}_{\rm total} = 2 \sum_{m=0}^{L} \hat{c}_m,   (26)

since Λ^even_m and Λ^odd_m have almost identical behavior. The results can be seen in Figure 2. Even at the highest resolutions considered, the estimated cost remains well below the O(L³) operations of the brute-force Legendre transform. At high resolutions, the O(L² log² L) trajectory is clearly a better fit than the O(L² log L) scaling conjectured by Tygert (2010). Note that the numerical evidence presented in Tygert (2010) shows that the average k increases
Figure 6.
Comparison of Wavemoth (black circles) with the algorithm of Mohlenkamp (1999) as implemented in libftsh (red triangles), with accuracy 10− (solid) and 10− (dotted). Both codes are run on a single core. Only the Legendre transform part is benchmarked, as libftsh does not implement the full SHT. Wavemoth uses a HEALPix grid with 2L − 1 rings, while libftsh uses L rings. The placement of the rings should make little difference to the performance of either code.

monotonically with m, so it may indeed be the case that the rank property is not fully satisfied, or only satisfied conditional on m. The benchmark results of Tygert (2010) seem to be in agreement with the O(L² log² L) hypothesis as well.

Comparison with other fast SHT algorithms
A widely known scheme for fast SHTs is the O(L² log² L) transform of Healy et al. (2003), implemented in SpharmonicKit. It algebraically expresses a Legendre transform of degree L as a function of two Legendre transforms of degree L/2, resulting in a divide-and-conquer scheme similar to the FFT algorithms. Unfortunately, the scheme is inherently numerically unstable, and special stabilization steps must be incorporated. Also, it is restricted to equiangular grids, so that it cannot be used directly with the HEALPix or GLESP grids. Wiaux et al. (2006) benchmark SpharmonicKit against the original HEALPix implementation (pre 2.20) and find that it is almost three times slower at L = 1024. Keep in mind that libpsht, used in present releases of HEALPix, is about twice as fast as the original HEALPix implementation. Considering the above, we stop short of a direct comparison between Wavemoth and SpharmonicKit. Note that while SpharmonicKit achieves much higher accuracy of an SHT round-trip than HEALPix does, this is an effect of the different sampling grids being used, not of the computational method, and it is straightforward to extend the Wavemoth code to use the same grid as SpharmonicKit.

Mohlenkamp (1999) uses a matrix compression technique similar to the one employed in this paper, which is independent of the pixel grid chosen. A matrix related to the Λ of the present paper is locally approximated by truncated trigonometric series. The resulting SHT algorithm scales as O(L^{5/2} log L). As shown in Figure 6, the code behaves very similarly to our code at medium resolution, as long as one does not require too much numerical accuracy. The size of the precomputed data is also of the same order, sometimes half and sometimes double that of Wavemoth's data.

Note that libftsh appears to have potential for optimization for modern platforms, and this should be taken into account when comparing the algorithms. Due to its age, libftsh makes assumptions about 32-bit array sizes which prevent comparison at higher resolutions without porting libftsh to 64-bit.
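To illustrate the flavor of such compression — a generic sketch, not Mohlenkamp's actual scheme; the block contents, angle grid, and term count below are all hypothetical — a smooth matrix block sampled at angles θ can be fit column by column with a truncated cosine series, after which only the small coefficient array needs to be stored:

```python
import numpy as np

def compress_columns(A, theta, nterms):
    """Approximate each column of the smooth block A (sampled at angles
    theta) by a truncated cosine series with nterms terms; only coef
    (nterms x ncols) has to be kept instead of A itself (n x ncols)."""
    basis = np.cos(np.outer(theta, np.arange(nterms)))  # (n, nterms)
    coef, *_ = np.linalg.lstsq(basis, A, rcond=None)
    return coef, basis

# Hypothetical smooth block whose columns happen to be low-order
# trigonometric polynomials, so six terms reconstruct it almost exactly.
theta = np.linspace(0.1, 1.0, 200)
A = np.stack([np.cos(j * theta) / (j + 1.0) for j in range(4)], axis=1)
coef, basis = compress_columns(A, theta, nterms=6)
err = np.abs(basis @ coef - A).max()
```

Here storage drops from a 200 x 4 block to a 6 x 4 coefficient array; for less benign blocks the achievable nterms is set by the requested tolerance, which is the accuracy/size trade-off visible in Figure 6.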
The libftsh code contains an implementation of the Legendre transforms only, and not of the full spherical harmonic transforms. It should be straightforward to modify Wavemoth to use libftsh for its Legendre transforms in order to perform full SHTs using this algorithm.

The compression scheme of Mohlenkamp (1999) appears to be very competitive for low accuracy transforms, but less so if higher precision is needed. It may be fruitful to hybridize the algorithms of Tygert (2010) and Mohlenkamp (1999) and use both together to compress a single matrix. Even if that does not work, one can simply use whichever performs best for a given m.

DISCUSSION
There is significant potential in speeding up spherical harmonic transforms beyond the codes in popular use today. We achieved a 2x speedup at low and medium resolutions simply due to restructuring how the brute-force computations are done, and believe there is potential for even more speedup if time is spent on profiling and micro-optimization. In particular, our code is under-optimized for multiple simultaneous transforms.

At the highest resolutions in practical use in cosmology today, L ∼ q_m(z_k) between the Legendre transforms and the Fourier transforms.

The author thanks S. K. Næss, H. K. Eriksen, M. Tygert, M. Reinecke and M. Mohlenkamp for useful discussions, and M. Omang and F. Hansen for lending the benchmark hardware. The author is funded by European Research Council grant StG2010-257080. The benchmark hardware is funded by the Norwegian Defence Estates Agency and the Research Council of Norway.

APPENDIX
IMPLEMENTATION DETAILS
Applying the compressed matrix representation to a vector
On modern computers, the primary bottleneck is often to move data around. Fundamental design decisions were made with this in mind. Looking at the compressed representations of Λ in Section 3.2, the immediate algorithm that comes to mind for computing Λx or Λ^T x is the breadth-first approach: First compute S_1 x, then permute the result, then compute S_2 (P_1 S_1 x), and so on. However, this leads to storing several temporary results for longer than they need to be, since the rightmost permutations are very local permutations, and only the leftmost permutation is fully global. Therefore we traverse the data dependency tree set up by the permutations in a depth-first manner. The advantage of this approach is that it is cache oblivious when transforming a few vectors at a time. That is, it automatically minimizes data movement for any cache hierarchy, whereas breadth-first traversal will always drop to the memory layer that is big enough to hold the entire set of input vectors. Note that for transforming many maps at the same time, cache-size dependent blocking should be implemented in addition, but we have stopped short of this. Like Tygert (2010), we also do the compression during precomputation depth-first, which ensures that, per m, memory requirements scale as O(L log L) even though computation time scales as O(L²).

The core computation during tree traversal is to apply the interpolative matrices, e.g., T̃x or T̃^T x. Keep in mind that the k-by-n matrix T̃ contains the k-by-k identity matrix in a subset of its columns; making use of this is important as it roughly halves the storage size and FLOP count. Given an ID T = T^(k) T̃, we can freely permute the rows of T̃, simply by permuting the columns of T^(k) correspondingly. We do this during precomputation to avoid the unordered memory usage pattern of arbitrary permutations.
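As a concrete numpy sketch — with hypothetical index arrays skel and rest marking the identity and dense columns — applying T̃ and T̃^T then reduces to a gather, one small dense product, and an add:

```python
import numpy as np

def apply_interp(T, skel, rest, x):
    """y = T_tilde @ x, where the k-by-n matrix T_tilde has the k-by-k
    identity in columns skel and the dense k-by-(n-k) block T in columns
    rest; the identity part costs no multiplications at all."""
    return x[skel] + T @ x[rest]

def apply_interp_transpose(T, skel, rest, n, y):
    """x = T_tilde.T @ y: scatter y into the skeleton entries and
    T.T @ y into the remaining entries."""
    x = np.empty(n)
    x[skel] = y
    x[rest] = T.T @ y
    return x
```

With k roughly n/2, skipping the identity block is the rough halving of storage and FLOP count mentioned above.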
Instead, we can simply filter the input or output vectors into the part that hits the identity sub-matrix and the part that hits the dense sub-matrix.

Efficient code for Legendre transforms
As mentioned in Section 3.4, it is necessary to balance the amount of precomputed data against the memory bandwidth, so code is required to apply the residual blocks in R to vectors without actually storing R in memory. This means computing a cropped version of the Legendre transform,

q'(z_j) = \sum_{k=k_{\rm start}}^{k_{\rm stop}} \tilde{P}^m_{m+2k+t}(z_j) \, a_{m+2k+t,\,m},   (A1)

where t = 0 for the even transforms and t = 1 for the odd transforms. To compute \tilde{P}^m_\ell we use a relation that jumps two steps in ℓ for each iteration (Tygert 2010):

\tilde{P}^m_{\ell+2}(z) = \frac{z^2 - d_{m\ell}}{c_{m\ell}} \tilde{P}^m_\ell(z) - \frac{c_{m,\ell-2}}{c_{m\ell}} \tilde{P}^m_{\ell-2}(z) \equiv (z^2 + \alpha_{m\ell}) \beta_{m\ell} \tilde{P}^m_\ell(z) + \gamma_{m\ell} \tilde{P}^m_{\ell-2}(z),   (A2)

with

c_{m\ell} = \sqrt{ \frac{(\ell-m+1)(\ell-m+2)(\ell+m+1)(\ell+m+2)}{(2\ell+1)(2\ell+3)^2(2\ell+5)} }   and   d_{m\ell} = \frac{2\ell(\ell+1) - 2m^2 - 1}{(2\ell-1)(2\ell+3)}.

This recurrence relation requires five arithmetic operations per iteration, as opposed to a more widely used relation which takes one step in ℓ and only needs four arithmetic operations per step (see, e.g., Press et al. 2007). However, since Λ^even and Λ^odd may have different columns in the residual blocks of their compressed representations, relation (A2) is a better choice in our case.

For each block in R we precompute α_{mℓ}, β_{mℓ} and γ_{mℓ}, as well as \tilde{P}^m at k_start and k_start + 1 for each z for initial conditions. Note that \tilde{P}^m_\ell(z) in parts of its domain takes values so close to zero that they cannot be represented in IEEE double precision. However, in these cases \tilde{P}^m_\ell(z) is always increasing in the direction of increasing ℓ, so we can simply increase k_start correspondingly. In fact, we follow libpsht and assume that the dynamic range of the input data is small enough, within each m, that values of \tilde{P}^m_\ell(z) smaller than 10− in magnitude can safely be neglected.
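As a scalar, unoptimized Python sketch of the cropped transform (A1) driven by recurrence (A2) — using c and d directly rather than the precomputed α, β, γ, and with none of the register blocking described below:

```python
import numpy as np

def rec_coefs(m, ell):
    """c and d of equation (A2) for advancing from ell to ell + 2."""
    c = np.sqrt((ell - m + 1) * (ell - m + 2) * (ell + m + 1) * (ell + m + 2)
                / ((2*ell + 1) * (2*ell + 3)**2 * (2*ell + 5)))
    d = (2*ell*(ell + 1) - 2*m*m - 1) / ((2*ell - 1) * (2*ell + 3))
    return c, d

def cropped_legendre_sum(z, m, t, k_start, k_stop, P0, P1, a):
    """q'(z) = sum_{k=k_start}^{k_stop} Ptilde^m_{m+2k+t}(z) * a[k - k_start],
    fused with the two-step recurrence; P0 and P1 are the initial values of
    Ptilde^m at k_start and k_start + 1 (at least two terms assumed)."""
    q = P0 * a[0] + P1 * a[1]
    P_prev, P_cur = P0, P1
    zsq = z * z
    for k in range(k_start + 2, k_stop + 1):
        ell = m + 2*(k - 1) + t        # recurrence maps (ell, ell-2) -> ell+2
        c, d = rec_coefs(m, ell)
        c2, _ = rec_coefs(m, ell - 2)
        P_prev, P_cur = P_cur, (zsq - d) / c * P_cur - c2 / c * P_prev
        q += P_cur * a[k - k_start]
    return q
```

For m = 0 the recurrence reproduces the orthonormal Legendre polynomials, which makes the coefficients easy to check against an independent evaluation.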
As far as possible we group together columns with the same k_start and k_stop in sets of six, for reasons that will soon become clear.

For an efficient implementation, the first important point is to make sure the number of loads from cache into CPU registers is balanced with the number of floating-point operations. The second is to make sure there are enough independent floating-point operations in flight simultaneously, so that operations can be pipelined. Thus,

• for performing a single transform with one real and one imaginary vector, the values of \tilde{P}^m_\ell should never need to leave the CPU registers. Rather, we fuse equation (A1) and equation (A2) in the core loop. For multiple simultaneous transforms we save \tilde{P}^m_\ell to cache, but make sure to process in small batches that easily fit in L1 cache.

• we process several z_j simultaneously. This amortizes the register loads of α_{mℓ}, β_{mℓ} and γ_{mℓ}. It also ensures that there are multiple independent chains of computation going on so that pipelining works well.

In the single transform case with one real and one imaginary vector, we do the full summation for six z_j at a time (when possible). The allocation of the 16 available 128-bit registers, each holding two double-precision numbers, then becomes three registers for \tilde{P}^m_\ell, three for \tilde{P}^m_{\ell-2}, three for the auxiliary data α_{mℓ}, β_{mℓ} and γ_{mℓ}, six accumulation registers for q'(z_j), and one work register. The z_j values are, perhaps counter-intuitively, read again from cache in each iteration, which conserves three registers and thus enables processing six z_j in each chunk instead of only four without register spills. Finally, when the time comes for multiplying \tilde{P}^m_\ell with a_{ℓm}, the auxiliary data is no longer needed, leaving room for loading a_{ℓm}.

On the Intel Xeon system, the routine performs at 6.46 GFLOP/s per core (71% of the theoretical maximum) when benchmarked on all the Legendre transforms necessary for a full SHT across 32 cores.
The effect of instruction pipelining is evident; reducing the number of columns processed in each iteration from six to four reduces performance to 5.69 GFLOP/s (63%), and when only processing two columns at a time, performance is only 4.28 GFLOP/s (47%).

We skip the details for the multiple transform case, but in short, it involves the same sort of blocking performed for matrix multiplication, including repacking the input data in blocks. Goto & van de Geijn (2008) provide an excellent introduction to blocking techniques. In this case the performance is 5.60 GFLOP/s (62%) per core when performing the Legendre transforms necessary for ten simultaneous SHTs.

The considerations above guided the choice of loop structure, which was then implemented in pure C using SSE intrinsics. We did not spend much time on optimization, so there should be room for further improvements, in particular for the multiple-transform path.

Data layout
The butterfly compression algorithm naturally leads to the following code organization for spherical harmonic synthesis:

1. Since each m is processed independently, we request input in m-major ordering. Also, for multiple simultaneous transforms, the coefficients of each map are interleaved, which is optimal both for the butterfly algorithm and the brute-force cropped Legendre transforms. In most places, the real and imaginary parts of the input can be treated as two independent vectors, since Λ is a real matrix.

2. Compute all q_m(z_j) into a 2D array. Since each m is processed independently, this ends up in m-major ordering, like the input.

3. While transposing the q_m(z_j) array into ring-major ordering, phase-shift and wrap around the coefficients, and perform FFTs on each ring. Rings must be processed in small batches in order to avoid loading cache lines multiple times.

A temporary work buffer with size of the same order as the input and output is used for q_m(z_j). An in-place code should be feasible with the use of an in-place transpose.

A drawback compared to brute-force codes is that q_m(z_j) needs to first be written to and then read from main memory. Here, libpsht is instead able to employ blocking, so that a few rings at a time are completely processed before moving on. Our benchmarks do however indicate that this is not a big problem in practice. Also, for cluster parallelization using MPI, it would be natural to follow S²HAT (Hupca et al. 2010; Szydlarski et al. 2011) in distributing the input data by m and the output data by rings, which also leads to a global transpose operation.

Wavemoth stores the output maps in interleaved order, since FFTW3 is able to deal well with such transforms.
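Step 3 can be sketched in numpy for a simplified grid — all rings sharing nphi equispaced samples starting at φ = 0, so the per-ring phase shift and wrap-around are omitted (both are needed on a real HEALPix grid), and the field is assumed real:

```python
import numpy as np

def synthesize_rings(q, nphi):
    """Given q[j, m] = q_m(z_j) for m = 0..mmax (negative m follow by
    conjugation for a real field), return f(theta_j, phi_k) on nphi
    equispaced phi samples per ring via one inverse real FFT per ring."""
    n_rings, n_m = q.shape
    assert n_m <= nphi // 2 + 1     # wrap-around omitted in this sketch
    spec = np.zeros((n_rings, nphi // 2 + 1), dtype=complex)
    spec[:, :n_m] = q
    # irfft uses the 1/n convention; rescale so f = sum_m q_m exp(i m phi)
    return np.fft.irfft(spec, n=nphi, axis=1) * nphi
```

In this simplified form the FFT stage is embarrassingly parallel over rings; the batching mentioned above only matters once the transpose from m-major to ring-major ordering is fused in.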
The libpsht code is able to support any output ordering, although stacked, non-interleaved maps are slightly faster, so that is the ordering we use for libpsht in the benchmarks.