Libpsht - algorithms for efficient spherical harmonic transforms
AAstronomy & Astrophysics manuscript no. paper c (cid:13)
ESO 2011January 11, 2011
Libpsht – algorithms for efficient spherical harmonic transforms
M. Reinecke Max-Planck-Institut f¨ur Astrophysik, Karl-Schwarzschild-Str. 1, 85741 Garching, GermanyReceived 11 October 2010 / Accepted 1 November 2010
ABSTRACT
Libpsht (or “library for performant spherical harmonic transforms”) is a collection of algorithms for e ffi cient conversion betweenspatial-domain and spectral-domain representations of data defined on the sphere. The package supports both transforms of scalarsand spin-1 and spin-2 quantities, and can be used for a wide range of pixelisations (including HEALPix, GLESP, and ECP). It will takeadvantage of hardware features such as multiple processor cores and floating-point vector operations, if available. Even without thisadditional acceleration, the employed algorithms are among the most e ffi cient (in terms of CPU time, as well as memory consumption)currently being used in the astronomical community.The library is written in strictly standard-conforming C90, ensuring portability to many di ff erent hard- and software platforms, andallowing straightforward integration with codes written in various programming languages like C, C ++ , Fortran, Python etc. Libpsht is distributed under the terms of the GNU General Public License (GPL) version 2 and can be downloaded from http://sourceforge.net/projects/libpsht/ . Key words. methods: numerical – cosmic background radiation – large-scale structure of the Universe
1. Introduction
Spherical harmonic transforms (SHTs) have a wide range of ap-plications in science and engineering. In the case of CMB sci-ence, which prompted the development of the library presentedin this paper, they are an essential building block for extractingthe cosmological power spectrum from full-sky maps, and arealso used for generating synthesised sky maps in the context ofMonte Carlo simulations. For all recent experiments in this field,the extraction of spherical harmonic coe ffi cients has to be doneup to very high multipole moments (up to 10 ), which makesSHTs a fairly costly operation and therefore a natural candidatefor optimisation. It also gives rise to a number of numerical com-plications that must be addressed by the transform algorithm.The purpose of SHTs is the conversion between functions (ormaps) on the sphere and their representation as a set of spheri-cal harmonic coe ffi cients in the spectral domain. In CMB sci-ence, such transforms are most often needed for quantities ofspin 0 (i.e. quantities which are invariant with respect to rotationof the local coordinate system) and spin 2, because the unpo-larised and polarised components of the microwave radiation arefields of these respective types. For applications concerned withgravitational lensing, SHTs of spins 1 and 3 are also sometimesrequired.The paper is organised as follows. The following sectionlists the underlying equations anf the motivations for developing libpsht , as well as the goals of the implementation. A detailedexplanation of the library’s inner workings is presented in Sect.3, and Sect. 4 contains a high-level overview of the interfaceit provides. Detailed studies regarding libpsht ’s performance,accuracy, and other quality indicators were performed, and theirresults presented in Sect. 5. Finally, a summary of the achieved(and not completely achieved) goals is given, together with anoutlook on planned future extensions. Send o ff print requests to : M. Reinecke,e-mail: [email protected]
2. Code capabilities
A continuous spin- s function f ( ϑ, ϕ ) with a spectral band limitof l max is related to its corresponding set of spin spherical har-monic coe ffi cients s a lm by the following equations: f ( ϑ, ϕ ) = l max (cid:88) m = − l max l max (cid:88) l = | m | s a lm s Y lm ( ϑ, ϕ ) (1) s a lm = (cid:90) πϑ = (cid:90) πϕ = d ϕ d ϑ sin ϑ f ( ϑ, ϕ ) s Y ∗ lm ( ϑ, ϕ ) . (2)Eqs. (1) and (2) are known as backward (or synthesis ) and forward (or analysis ) transforms, respectively. For a discretisedspherical map consisting of a vector p of N pix pixels at loca-tions ( ϑ , ϕ ) and with (potentially weighted) solid angles w , theychange to: p n = l max (cid:88) m = − l max l max (cid:88) l = | m | s a lm s Y lm ( ϑ n , ϕ n ) (3) s ˆ a lm = N pix (cid:88) n = p n w n s Y ∗ lm ( ϑ n , ϕ n ) . (4)Depending on the choice of l max , w , and the grid geometry, the p → s a lm transform may only be approximate, which is indicatedby choosing the identifier ˆ a instead of a .The main purpose of the presented code is the e ffi cient im-plementation (regarding CPU time as well as memory consump-tion) of Eqs. (3) and (4) for scalars as well as tensor quantitiesof spins ± ± a r X i v : . [ a s t r o - ph . I M ] J a n . Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms At present, several SHT implementations are in use withinthe CMB community, like those distributed with the Fortranand C ++ implementations of HEALPix (G´orski et al. 2005),GLESP (Doroshkevich et al. 2005), s2hat , spinsfast (Hu ff enberger & Wandelt 2010), ccSHT , and IGLOO (Crittenden & Turok 1998); they o ff er varying degrees of per-formance, flexibility and feature completeness. The main designgoal for libpsht was to provide a code which performs betterthan the SHTs in these packages and is versatile enough to serveas a drop-in replacement for them.This immediately leads to several technical and pragmaticrequirements which the code must fulfill: E ffi ciency: The transforms provided by the library must performat least as well (in terms of CPU time and memory) as thosein the available packages, on the same hardware. If possi-ble, all available hardware features (like multiple CPU cores,vector arithmetic instructions) should be used.
Flexibility:
The library needs to support all of the above dis-cretisations of the sphere; in addition, it should allow trans-forms of partial maps.
Portability:
It must run on all platforms supported by the soft-ware packages above.
Universality:
It must provide a clear interface that can be con-veniently called from all the programming languages usedfor the packages above.
Completeness:
If feasible, it should not depend on external li-braries, because these complicate installation and handlingby users.
Compactness:
In order to simplify code development, mainte-nance and deployment, the library source code should bekept as small as possible without sacrificing readability.It should be noted that the code was developed as a library foruse within other scientific applications, not as a scientific appli-cation by itself. As such it strives to provide a comprehensiveinterface for programmers, rather than ready-made facilities forusers.
For completely general discretisations of the sphere, the oper-ation count for SHTs is O ( l N pix ) in both directions, whichis not really a ff ordable for today’s resolution requirements.Fortunately, approaches of lower complexity exist for grids withthe following geometrical constraints: – the map pixels are arranged on N ϑ ≈ (cid:112) N pix iso-latituderings, where the colatitude of each ring is denoted as ϑ y . – within one ring, pixels are equidistant in ϕ and have identicalweight w y ; the number of pixels in each ring can vary and isdenoted as N ϕ, y , and the azimuth of the first pixel in the ringis ϕ , y . http://healpix.jpl.nasa.gov http://glesp.nbi.dk http://crd.lbl.gov/˜cmc/ccSHTlib/doc Under these assumptions, Eqs. (3) and (4) can be written as p xy = l max (cid:88) m = − l max l max (cid:88) l = | m | s a lm s λ lm ( ϑ y ) exp (cid:32) im ϕ , y + π imxN ϕ, y (cid:33) (5) s ˆ a lm = N ϑ − (cid:88) y = N ϕ, y − (cid:88) x = p xy w y s λ lm ( ϑ y ) exp (cid:32) − im ϕ , y − π imxN ϕ, y (cid:33) , (6)where s λ lm ( ϑ ) : = s Y lm ( ϑ, p xy = l max (cid:88) m = − l max F m , y exp (cid:32) im ϕ , y + π imxN ϕ, y (cid:33) with (7) F m , y : = l max (cid:88) l = | m | s a lm s λ lm ( ϑ y ), and (8) s ˆ a lm = N ϑ − (cid:88) y = G m , y s λ lm ( ϑ y ) with (9) G m , y : = w y N ϕ, y − (cid:88) x = p xy exp (cid:32) − im ϕ , y − π imxN ϕ, y (cid:33) , (10)it becomes obvious that the steps s a lm → F m , y and G m , y → s ˆ a lm require O ( N ϑ l ) operations, while a set of fast Fourier trans-forms – with the subdominant complexity of O ( N ϑ l max log l max ) –can be used for the steps F m , y → p xy and p xy → G m , y , respec-tively. For any “useful” arrangement of the rings on the sphere, N ϑ will be roughly proportional to l max , resulting in O ( l ) op-erations per SHT.Practically all spherical grids which are currently used inthe CMB field (HEALPix, ECP, GLESP, IGLOO) fall into thiscategory, so libpsht can take advantage of this simplification.A notable exception is the “quadrilateral spherical cube” gridused by the COBE mission, but this is not really an issue, sinceCOBE’s resolution is so low that nowadays even brute-forceSHTs produce results in acceptable time.For the special case of the ECP grid, which is equidis-tant in both ϑ and ϕ , algorithms of the even lower complex-ity O ( l log l max ) have been presented (see, e.g., Driscoll &Healy 1994; Wiaux et al. 2007). These were not considered for libpsht , because this would conflict with the flexibility goal.In addition, it has been observed that the constant prefactor forthese methods is fairly high, so that they do not o ff er any perfor-mance advantage for small and intermediate l max ; for transformswith high l max , on the other hand, their memory consumption isprohibitively high. Without limiting the general applicability of the provided SHTs, libpsht imposes a few constraints on the format of the datait processes. For scalar transforms, it will only work with real-valued maps (the ubiquitous scenario in CMB physics); this isnot really a limitation, since transforming a complex-valued mapcan be achieved by simply transforming its real and imaginaryparts independently and computing a linear combination of theresults.For SHTs with s (cid:44) libpsht works on the so-called gra-dient (E) and curl (B) components of the tensor field | s | E lm : = ( − H ( | s | a lm + ( − s −| s | a lm ) (11) i | s | B lm : = ( − H ( | s | a lm − ( − s −| s | a lm ) (12)
2. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms where | s | E ∗ lm = ( − m | s | E l , − m , | s | B ∗ lm = ( − m | s | B l , − m and ( − H is a sign convention (Lewis 2005). The default setting for H in libpsht is 1, following the HEALPix convention, but this canbe changed easily if necessary. On the real-space side, the tensorfield is represented as two separate maps, containing the real andimaginary part, respectively.In the following discussion, no fundamental distinction ex-ists between | s | E lm and | s | B lm , so for the sake of brevity both willbe identified as s a lm .This choice of representation was made because it is compat-ible with the data formats used by most existing packages, whichsimplifies interfacing and data exchange. It also has the impor-tant advantage that any set of spherical harmonic coe ffi cients iscompletely specified by its elements with m > =
0, since s a lm and s a l , − m are coupled by symmetry relations.
3. Algorithm description
The central building block for every SHT implementation isthe quick and accurate computation of spherical harmonics s Y lm ( ϑ, ϕ ); for libpsht , these need only be evaluated at ϕ = s λ lm ( ϑ ). For s =
0, a proven method of determining the coe ffi cients isusing the recursion relation λ lm = cos ϑ A lm λ l − , m − B lm λ l − , m (13)with A lm : = (cid:114) l − l − m and B lm : = A lm A l − , m (14)in the direction of increasing l ; the starting values λ mm can be ob-tained from λ = (4 π ) − / and the additional recursion relation λ mm = − sin ϑ (cid:114) m + m λ m − , m − (15)(Varshalovich et al. 1988). Since two values are required to startthe recursion in l , one also uses λ m + , m = λ mm √ m + λ mm can becomeextremely small for increasing m and decreasing sin ϑ , such thatthey can no longer be represented by the IEEE754 number for-mat used by virtually all of today’s computers. This problem wasaddressed using the approach described by Pr´ezeau & Reinecke(2010), i.e. by augmenting the IEEE representation with an ad-ditional integer scale factor, which serves as an “enhanced ex-ponent” and dramatically increases the dynamic range of repre-sentable numbers.As is to be expected, this extended range comes at asomewhat higher computational cost; therefore the algorithmswitches to standard IEEE numbers as soon as the λ lm havereached a region which can be safely represented without a scalefactor. From that point on, the recursion in l only requires threemultiplications and one subtraction per λ value, which are verycheap instructions on current microprocessors.From the above it follows that, depending on the exact pa-rameters of the SHT, a non-negligible fraction of the s λ lm ( ϑ ) isextremely small, and that all terms in Eqs. (8) and (9) contain-ing such values as a factor do not contribute in a measurable way to the SHT’s result, due to the limited dynamic range of thefloating-point format in which the result is produced. For thatreason, libpsht records the index l th , at which the | s λ lm ( ϑ ) | firstexceed a threshold ε th (which is set to the conservative value of10 − ) during every l recursion. The remaining operations willthen be carried out for l ≥ l th only.Another acceleration is achieved by skipping the l -recursionfor an ( m , ϑ )-tuple entirely, if another recursion was performedbefore with m c ≤ m and | cos ϑ c | ≤ | cos ϑ | < ε th . In the case that a ring at colatitude ϑ has a counterpart at π − ϑ (which is almost always the case for the popular grids), the s λ lm ( ϑ ) recursion need only be performed for one of those; thevalues for the other ring are quickly obtained by the symmetryrelation s λ lm ( π − ϑ ) = ( − l − s − s λ lm ( ϑ ). (16) Libpsht will identify such ring pairs automatically and treatthem in an optimised fashion.
Following Lewis (2005), libpsht does not literally compute thetensor harmonics ± s λ lm , but rather their linear combinations s F ± lm : = ( + s λ lm ± ( − s − s λ lm ). (17)These are not directly obtained using a recursion, but rather byfirst computing the spin-0 λ lm and subsequently applying spinraising and lowering operators; see appendix A of Lewis (2005)for the relevant formulae. Whenever scalar and tensor SHTs areperformed simultaneously, this has the advantage that the scalar λ lm can be re-used for both; in CMB science this scenario is verycommon, since any transform involving a polarised sky map re-quires SHTs for both s = s = ± | s | > libpsht ’s transforms arecurrently limited to s = , ± , ±
2. Should a need for SHTs athigher s become evident, a method making use of the relation-ships between s λ lm and the Wigner d lmm (cid:48) matrix elements seemsmost promising, since they obey very similar recursion relations(see, e.g., Kostelec & Rockmore 2008), for which a recent im-plementation is available (Pr´ezeau & Reinecke 2010). s λ lm ( ϑ )Many packages allow the user to supply an array containing pre-computed s λ lm ( ϑ ) coe ffi cients to the SHT routines, which canspeed up the computation by a factor of up to ≈ Libpsht doesnot provide such an option; this may seem surprising at first, butit was left out deliberately for the following reasons: – Even if supplied externally, the coe ffi cients have to be ob-tained by some method. Reading them from mass storage isdefinitely not a good choice, since they can be computed on-the-fly much more quickly on current hardware. Externallyprovided values can therefore only provide a benefit if theyare used in at least two consecutive SHTs. – When using multiple threads in combination with precom-puted s λ lm ( ϑ ), computation time soon becomes dominatedby memory access operations to the table of coe ffi cients, be-cause the memory bandwidth of a computer does not scalewith the number of processor cores. This will severely limit
3. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms the scalability of the code. It is safe to assume that this prob-lem will become even more pronounced in the future, sinceCPU power tends to grow more rapidly than memory band-width (Drepper 2007). – For problem sizes with l max (cid:39) N ϑ l ) become so large thatthey no longer fit into main memory. A conservative estimatefor the size of the s λ lm ( ϑ ) array at l max = s = – Libpsht supports multiple simultaneous SHTs, and willcompute the s λ lm ( ϑ ) only once during such an operation, re-using them for all individual transforms. Since this is donein a piece-by-piece fashion (see Sect. 3.3), only small sub-sets of size O ( l max ) need to be stored, which is more or lessnegligible; this approach also implies that the internally pre-computed s λ lm ( ϑ ) subsets will most likely remain in the CPUcache and therefore not be subject to the memory bandwidthlimit. Libpsht uses an adapted version of the well-known fftpack library (Swarztrauber 1982), ported to the C language (with afew minor adjustments) by the author. This package performswell for FFTs of lengths N whose prime decomposition onlycontains the factors 2, 3, and 5. For prime N , its complexity de-grades to O ( N ), which should be avoided, so for N contain-ing large prime factors, Bluestein’s algorithm (Bluestein 1968)is employed, which allows replacing an FFT of length N by aconvolution (conveniently implemented as a multiplication inFourier space) of any length ≥ N −
1. By appropriate choiceof the new length, the desired O ( N log N ) complexity can bereestablished, at the cost of a higher prefactor.Obviously it would have been possible to use the very pop-ular and e ffi cient FFTW library (Frigo & Johnson 2005) for theFourier transforms, but since this package is fairly heavyweight,this would conflict with libpsht ’s completeness or compact-ness goals. Even though libfftpack with the Bluestein modifi-cation will be somewhat slower than FFTW , the e ffi ciency goal isnot really compromised: libpsht ’s algorithms have an overallcomplexity of O ( N ϑ l ), so that the FFT part with its total com-plexity of O ( N ϑ l max log l max ) will (for nontrivial problem sizes)never dominate the total CPU time. As Eqs. (5) and (6) and the complexity of O ( N ϑ l ) alreadysuggest, a code implementing SHTs on an iso-latitude grid willcontain at least three nested loops which iterate on y , m , and l , re-spectively. The overall speed and / or memory consumption of thefinal implementation depend sensitively on the order in whichthese loops are nested, since di ff erent hierarchies are not equallywell suited for precomputation of common quantities used in theinnermost loop.The choice made for the computation of the s λ lm ( ϑ ), i.e. toobtain them by l -recursion (see Sect. 3.1) strongly favors placingthe loop over l innermost; so only the order of the y and m loopsremains to be determined.From the point of view of speed optimisation, arguably thebest approach would be to iterate over m in the outermost loop, http://fftw.org then over the ring number y , and finally over l . This allows pre-computing the A lm and B lm quantities (see Sect. 3.1) exactly onceand in small chunks, for an additional memory cost of only O ( l max ). However, this method requires additional O ( N ϑ l max )storage for the intermediate products F m , y or G m , y , which is com-parable to the size of the input and output data. This was con-sidered too wasteful, since it prohibits performing SHTs whoseinput and output data already occupy most of the available com-puter memory.Unfortunately, switching the order of the two outermostloops does not improve the situation: now, one is presented withthe choice of either precomputing the full two-dimensional A lm and B lm arrays, which is not acceptable for the same reason thatprohibits storing the entire F m , y and G m , y arrays, or recomputingtheir values over and over, which consumes too much CPU time.To solve this dilemma, it was decided not to perform theSHTs for a whole map in one go, but rather to subdivide theminto a number of partial transforms, each of which only dealswith a subset of rings (or ring pairs). An appropriate choice forthe number of parts will result in near-optimal performance com-bined with only small memory overhead; the most symmetri-cal choice of N / ϑ leads to a time overhead of O ( N / ϑ l ) andadditional memory consumption of O ( N / ϑ l max ), both of whichare small compared to the total CPU and storage requirements. Libpsht uses sub-maps of roughly 100 or N ϑ /
10 rings (or ringpairs), whichever is larger.The arrangement described above is su ffi cient for a singleSHT. Since libpsht supports multiple simultaneous SHTs, an-other loop hierarchy had to be introduced. A pseudo-code listingillustrating the complete design of the SHT algorithm (includ-ing the subdominant parts performing the FFTs) is presented inFig. 1. To make use of multiple CPU cores, if available, the algorithm’sloop over m , as well as the code blocks performing the FFTs,have been parallelised using OpenMP directives, requesting dy-namic scheduling for every iteration; this is indicated by the "OpenMP" tag in Fig. 1. As long as the SHTs are large enough,this results in very good scaling with the number of cores avail-able (see also Sect. 5.2.4).If the target machine supports the SSE2 instruction set (which is the case for all AMD / Intel CPUs introduced sincearound 2003), its ability to perform two arithmetic operations ofthe same kind with a single machine instruction is used to pro-cess two ring (pairs) simultaneously, greatly improving the over-all execution speed. The relevant loop is marked with "SSE2" inFig. 1. The necessary changes are straightforward for the mostpart; however, special attention must be paid to the l recursion,because the two sequences of s λ lm ( ϑ ) will cross the thresholdto IEEE-representable numbers at di ff erent l th . Fortunately thiscomplication can be addressed without a significant slowdownof the algorithm.The data types used and the functions called in order to usethe SSE2 instruction set are not part of the C90 language stan-dard, and will therefore not be known to all compilers and onall hardware platforms. However there is a portable way to de-tect compiler and hardware support for SSE2; if the result of thischeck is negative, only the standard-compliant floating-point im- http://openmp.org http://en.wikipedia.org/wiki/SSE2
4. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms for b =
Schematic loop structure of libpsht ’s transform algo-rithmplementation of libpsht will be compiled, without the neces-sity of user intervention.Similarly, the OpenMP functionality is accessed using so-called compiler directives, which are simply ignoredby compilers lacking the required capability, reducing the sourcecode to a single-threaded version.Both extensions are supported by the widely used gcc andIntel compilers.
4. Programming interface
Only a high-level overview of the provided functionality can begiven here. For the level of detail necessary to actually interfacewith the library, the reader is referred to the technical documen-tation provided alongside the source code.
Due to libpsht ’s universality goal, it must be able to acceptinput data and produce output data in a wide variety of storageschemes, and therefore only makes few assumptions about howmaps and a lm are arranged in memory. http://gcc.gnu.org A tesselation of the sphere and the relative location of itspixels in memory is described to libpsht by the following setof data: – the total number of rings in the map; and – for every ring – the number of pixels N ϕ, y ; – the index of the first pixel of the ring in the map array; – the stride between indices of consecutive pixels in thisring; – the azimuth of the first pixel ϕ , y ; – the colatitude ϑ y ; – the weight factor w y . This is only used for forward SHTsand is typically the solid angle of a pixel in this ring.For a lm , less information is necessary: – the maximum l and m quantum numbers available in the set – the index stride between a lm and a l + , m ; – an integer array containing the (hypothetical) index of theelement a , m .Due to the convention described in Sect. 2.4, only coe ffi cientswith m ≥ a lm , they describe all characteristics of a set of input or outputdata that libpsht needs to be aware of.It should be noted that the map description trivially supportstransforms on grids that only cover parts of the ϑ range; ringsthat should not be included in map synthesis or analysis can sim-ply be omitted in the arrays describing the grid geometry, speed-ing up the SHTs on this particular grid. More general masks,which a ff ect parts of rings, have to be applied by explicitly set-ting the respective map pixels to zero before map analysis orafter map synthesis. Libpsht will only work on data that is already laid outin main memory according to the structures mentioned above.There is no functionality to access data on mass storage directly;this task must be undertaken by the programs calling libpsht functionality. Given the bewildering variety of formats which areused for storing spherical maps and a lm on disk – and with theprospect of these formats changing more or less rapidly in thefuture – I / O was not considered a core functionality of the li-brary.
In addition to the flexibility in memory ordering, libpsht canalso work on data in single- or double-precision format, with therequirement that all inputs and outputs for any set of simulta-neous transforms must be of the same type. Internally, however,all computations are performed using double-precision numbers,so that no performance gain can be achieved by using single-precision input and output data.
As already mentioned, it is possible to compute several SHTs atonce, independent of direction and spin, as long as their mapand a lm storage schemes described above are identical. Thisis achieved by repeatedly adding SHT requests, complete with
5. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms pointers to their input and output arrays, to a “job list”, and fi-nally calling the main SHT function, providing the job list, mapinformation and a lm information as arguments. Due to the choice of C as libpsht ’s implementation language,the library can be interfaced without any complications fromother codes written in C or C ++ . Using the library from lan-guages like Python and Java should also be straightforward, butrequires some glue code.For Fortran the situation is unfortunately a bit more compli-cated, because a well-defined interface between Fortran and Chas only been introduced with the Fortran2003 standard, whichis not yet universally supported by compilers. Due to the phi-losophy of incrementally building a job list and finally execut-ing it (see Sect. 4.3), simply wrapping the C functions using cfortran.h glue code does not work reliably, so that build-ing a portable interface is a nontrivial task. Nevertheless, theFortran90 code beam2alm of the Planck mission’s simulationpackage has been successfully interfaced with libpsht .
5. Benchmarks
In order to evaluate the reliability and e ffi ciency of the library, aseries of tests and comparisons with existing implementationswere performed. All of these experiments were executed ona computer equipped with 64GB of main memory and XeonE7450 processors (24 CPU cores overall) running at 2.4 GHz. gcc version 4.4.2 was used for compilation. The validation of libpsht ’s algorithms was performed in twostages: first, the result of a backward transform was comparedto that produced with another SHT library using the same input,and afterwards it was verified that a pair of backward and for-ward SHTs reproduces the original a lm with su ffi cient precision.All calculations in this section start out with a set of s a lm thatconsists entirely of uniformly distributed random numbers in therange ( −
1; 1), except for the imaginary parts of the s a l , , whichmust be zero for symmetry reasons, and coe ffi cients with l < | s | ,which have no meaning in a spin- s transform and are set to zeroas well. For this test, a set of a lm with l max = N side = libpsht and the Fortran HEALPix fa-cility synfast . This was done for spins 0 and 2, mimicking thesynthesis of a polarised sky map. Comparison of the resultingmaps showed a maximum discrepancy between individual pix-els of 1 . × − , and the RMS error was 6 . × − , whichindicates a very good agreement. ˆ a lm An individual accuracy test consists of a map synthesis, followedby an analysis of the obtained map, producing the result s ˆ a lm . -16 -14 -12 -10 -8 -6 -4 -2
64 128 256 512 1024 2048 4096 8192 ε r m s l max ECPGaussHEALPixHEALPix4HEALPix8
Fig. 2. ε rms for libpsht transform pairs on ECP, Gaussian andHEALPix grids for various l max . Every data point is the maxi-mum error value encountered in transform pairs of all supportedspins. “Healpix4” and “Healpix8” refer to iterative analyses with4 and 8 steps, respectively. -16 -14 -12 -10 -8 -6 -4 -2
64 128 256 512 1024 2048 4096 8192 ε m a x l max ECPGaussHEALPixHEALPix4HEALPix8
Fig. 3. ε max for libpsht transform pairs on ECP, Gaussian andHEALPix grids for various l max . Every data point is the maxi-mum error value encountered in transform pairs of all supportedspins. “Healpix4” and “Healpix8” refer to iterative analyses with4 and 8 steps, respectively.Two quantities are used to assess the agreement betweenthese two sets of spherical harmonic coe ffi cients; they are de-fined as ε rms : = (cid:115) (cid:80) lm | s a lm − s ˆ a lm | (cid:80) lm | s a lm | and (18) ε max : = max lm (cid:0) |(cid:60) ( s a lm − s ˆ a lm ) | , |(cid:61) ( s a lm − s ˆ a lm ) | (cid:1) . (19)A variety of such tests was performed for di ff erent grid ge-ometries, l max , and spins; the results are shown in Figs. 2 and3. For ECP and “Gaussian” grids, each ring consisted of 2 l max + l max + l max + ϑ y for the Gaussian grid were chosen to coin-cide with the roots of the Legendre polynomial P l max + ( ϑ ). Underthese conditions, and with appropriately chosen pixel weights(taken from Driscoll & Healy 1994 and Doroshkevich et al.2005), the transform pair should reproduce the original a lm ex-actly, were it not for the limited precision of floating-point arith-
6. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms Table 1.
Single-core CPU time for various SHTs with l max = Direction Spin ECP Gaussian HEALPixforward 0 15.14 s 7.89 s 13.11 sbackward 0 14.74 s 7.53 s 12.65 sforward 1 29.49 s 15.36 s 25.05 sbackward 1 27.83 s 14.30 s 23.85 sforward 2 31.91 s 16.91 s 28.15 sbackward 2 30.33 s 15.60 s 26.51 s
Notes.
The ECP grid had 2 l max + l max + l max + l max + N side parameter of the HEALPix grid was l max / metics. In fact, the errors measured for SHTs on those two gridsare extremely small and close to the theoretical limit.For the transform pairs on HEALPix grids, a resolution pa-rameter of N side = l max / s ˆ a lm will only be an approximation of the original s a lm ; if higher accu-racy is required, this approximation can be improved by Jacobiiteration, as described in the HEALPix documentation. The fig-ures show clearly that ε rms is around 10 − for the simple trans-form pair and can be reduced dramatically by iterated analysis. Table 1 shows the single-core CPU times for SHTs of di ff erentdirections, spins and grids, but with identical band limit, in orderto illustrate their relative cost. It is fairly evident that forward andbackward SHTs with otherwise identical parameters require al-most identical time, with the a lm → map direction being slightlyslower. Since this relation also holds for other band limits, onlythe timings for one direction will be shown in the following fig-ures, in order to reduce clutter.SHTs with | s | > s = s =
1, because computing the F ± lm from λ lm requires moreoperations than computing the F ± lm .The timings also indicate the scaling of the SHT cost with N ϑ , which is practically equal for HEALPix and ECP grids, buthalf as high for the Gaussian grid.These timings, as well as all other timing results for libpsht in this paper, were obtained using SSE2-accelerated code.Switching o ff this feature will result in significantly longer ex-ecution times; on the benchmark computer the factor lies in therange of 1.6 to 1.8. l max SHTs of di ff erent spins and a wide range of band limits were per-formed on a single CPU core, in order to assess the correlationbetween CPU time and l max ; the results are shown in Fig. 4. Asexpected, the time consumption scales almost with l ; devia-tions from this behaviour only become apparent for lower bandlimits, where SHT components of lower overall complexity (likethe FFTs, computation of the A lm and B lm coe ffi cients and initial-isation of data structures) are no longer negligible. Switching toa normalised representation of the CPU time, as was done inFig. 5, makes this e ff ect much more evident; even for transforms C P U t i m e [ s e c ond s ] l max lmax spin=0spin=1spin=2polarised Fig. 4.
Timings of single-core forward SHTs on a HEALPix gridwith N side = l max /
2. The annotation “polarised” refers to a com-bined SHT of spin 0 and 2 quantities, which is a very commoncase in CMB physics. no r m a li s ed C P U t i m e [ s e c ond s / ( l m a x N θ ) ] l max spin=0spin=1spin=2polarised Fig. 5.
Same as Fig. 4, but with CPU time divided by l N ϑ of very high band limit the curves are not entirely horizontal,which would indicate a pure O ( l N ϑ ) algorithm. Fig. 6 shows the ratio of CPU times for equivalent SHTs car-ried out using the synfast facility of Fortran HEALPix and libpsht , using identical hardware resources. It allows severaldeductions: – In all tested scenarios, libpsht ’s implementation is signif-icantly faster, even when precomputed coe ffi cients are usedwith synfast . – The performance gap is markedly wider for s = libpsht ’s l -recursion is the partwhich has been accelerated by the largest factor in compari-son to the Fortran HEALPix SHT library. – The polarised synfast
SHTs only show a marginal speedimprovement when using precomputed coe ffi cients; for thescalar transforms, the di ff erence is much more notice-able. Although this observation has no direct connection to libpsht , it is nevertheless of interest: most likely the miss-ing acceleration is caused by saturation of the memory band-width, as discussed in Sect. 3.1.3, and confirms the decisionnot to introduce such a feature into libpsht .
7. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms C P U t i m e ( sy n f a s t ) / C P U t i m e ( li bp s h t ) l max spin=0polarisedspin=0, precomputedpolarised, precomputed Fig. 6.
CPU time ratios between the Fortran HEALPix synfast code and libpsht for various backward SHTs on a HEALPixgrid with N side = l max /
10 100 1 2 4 8 16 w a ll c l o ck t i m e number of threads backward, s=2ideal scaling Fig. 7.
Elapsed wall clock time for a backward SHT with s = l max = N side = As illustrated in Fig. 1, all CPU-intensive parts of libpsht ’stransform algorithm are instrumented for parallel execution onmulticore computers using OpenMP directives. Fig. 7 shows thewall clock times measured for a single SHT that was run usinga varying number of OpenMP threads. The maximum numberof threads was limited to 16 (in contrast to the maximum use-ful value of 24 on the benchmark computer), because the ma-chine was not completely idle during the libpsht performancemeasurements, and the scaling behaviour at larger numbers ofthreads would have been influenced by other running tasks.It is evident that the overhead due to workload distributionamong several threads is quite small: for the 16-processor run,the accumulated wall clock time is only 18% higher than for thesingle-processor SHT. An analogous experiment was performedwith the Fortran HEALPix code synfast ; here the overhead for16 threads was around 35%.
For this experiment, the library’s code was enhanced to count thenumber of floating-point operations that were carried out during
Table 2.
Speedup for simultaneous vs. consecutive SHTs on aHEALPix grid with N side = l max = Transforms T simul / s T separate / s Speedup1a0 12.65 12.65 1.002a0 18.06 25.30 1.403a0 23.47 37.95 1.625a0 34.77 63.25 1.8210a0 62.56 126.50 2.023a0 + + + + + + Notes.
The transforms are specified in an abbreviated fashion, where“5a2” signifies “5 a lm → map SHTs of spin 2”, and “3m0” stands for“3 map → a lm SHTs of spin 0”. The runs were performed on a singleCPU core. the SHT. This instrumentation was only done in the parts of thecode with O ( l N ϑ ) complexity, i.e. everything related to cre-ation and usage of the s λ lm ( ϑ ). Operations that were necessaryonly for the purpose of numerical stability (e.g. due to the range-extended floating-point format) were not taken into account.Dividing the resulting number of operations by the CPU timefor a single-core uninstrumented run of the same SHT provides aconservative estimate for the overall floating-point performanceof the algorithm. Measuring s = s = a lm → map SHTson a HEALPix grid with N side = l max = / s and 4.1GFlops / s, respectively. This correspondsto 0.88 and 1.71 floating-point operations per clock cycle, or22% and 43% of the theoretical peak performance. Both of thesepercentages are quite high for an implementation of a nontrivialalgorithm, which can be seen as a hint that there is not muchmore room for further optimisation on this hardware architec-ture. In other words, further speedups can only be achieved by afundamental change in the underlying algorithm.There are several reasons for the obvious performance dis-crepancy between scalar and s = λ lm recursion: even though the CPU could in principle start newoperations every clock cycle, it has to wait for some precedingoperations to finish first (which takes several cycles), becausetheir results are needed in the next steps. When computing a re-currence, the CPU is therefore spending a significant fraction oftime in a waiting state (see, e.g., Fog 2010). For the s = l -recursion becomes subdomi-nant.Other reasons for the lower performance of the recursion arethe necessity to deal with extended floating-point numbers andits rather high ratio of memory accesses compared to arithmeticoperations. The CPU time savings achievable by performing multiple SHTstogether instead of one after another were measured on aHEALPix grid with N side = l max =
8. Reinecke: Libpsht – algorithms for e ffi cient spherical harmonic transforms -2 -1
512 1024 2048 4096 8192 pea k m e m o r y u s age ( G B ) l max lmax lmax libpsht(s=0)synfast(s=0)libpsht(polarised)synfast(polarised)synfast(s=0,precomputed)synfast(polarised,precomputed) Fig. 8.
Memory consumption of various SHTs performed with libpsht and synfast on a HEALPix grid with N side = l max / s λ lm ( ϑ ) (see Sect. 3.1.3), but the presentedapproach has dramatically lower memory needs and can there-fore be used for higher-resolution transforms. It is also evidentthat forward and backward transforms, as well as transforms ofdi ff erent spins, can be freely mixed while still preserving theperformance increase. Fig. 8 shows the maximum amount of memory allocated duringa variety of SHTs carried out using libpsht and the FortranHEALPix synfast facility. Since the memory required for for-ward and backward transforms is essentially equal, only one di-rection has been plotted. The sizes of SHTs with l max < libpsht needs slightly less memory than synfast for equivalent operations; overall, the scaling is veryclose to the expected l if no precomputed s λ lm ( ϑ ) are used.For comparison purposes, a few data points for synfast runsusing precomputed scalar and tensor s λ lm ( ϑ ) were also plotted;they clearly indicate the extremely high memory usage of thesetransforms, which scales with l .During the tests with multiple simultaneous threads it wasobserved that increasing the number of cores does not have asignificant influence on the total required memory; for the con-crete test discussed in Sect. 5.2.4, memory usage for the run with16 threads was only 2% larger than for the single-threaded run.
6. Conclusions
Looking back at the goals outlined in Sect. 2.2 and the results ofthe benchmark computations above, it can be concluded that thecurrent version of the libpsht package meets almost all spec-ified requirements. It implements SHTs that have the same ora higher degree of accuracy as the other available implementa-tions, can work on all spherical grids relevant to CMB studies,and is written in standard C, which is very widely supported. Theemployed algorithms are significantly more e ffi cient than thosepublished and used in this field of research before; their memoryusage is economic and allows transforms whose input and outputdata occupy almost all available space. When appropriate, vector capabilities of the CPUs, as well as shared memory parallelismare exploited, resulting in further performance gains.Despite this range of capabilities, the package only consistsof approximately 7000 lines of code, including the FFT imple-mentation and in-line documentation for developers; except fora C compiler, it has no external dependencies. Libpsht has been integrated into the simulation package ofthe
Planck mission (
Level-S , Reinecke et al. 2006), where it is in-terfaced with a development version of the C ++ HEALPix pack-age and a Fortran90 code performing SHTs on detector beampatterns; this demonstrates the feasibility of interfacing the li-brary with C ++ and Fortran code.There are two areas in which libpsht ’s functionality shouldprobably be extended: it currently does not provide transformsfor spins larger than 2, and there is no active support for dis-tributing SHTs over several independent computing nodes. Aswas mentioned in Sect. 3.1.2, addressing the first point is prob-ably not too di ffi cult, and enhancing the library with s Y lm gen-erators based on Wigner d matrix elements is planned for oneof the next releases. Regarding the second point, a combina-tion of libpsht ’s central SHT functionality with the paralleli-sation strategy implemented by Radek Stompor’s s2hat libraryappears very promising and is currently being investigated. Libpsht is distributed under the terms of the GNU GeneralPublic License (GPL) version 2 (or later versions at the user’schoice); the most recent version, alongside online technical doc-umentation can be obtained at the URL http://sourceforge.net/projects/libpsht/ . Acknowledgements.
Some of the results in this paper have been derived us-ing the HEALPix (G´orski et al. 2005) and GLESP (Doroshkevich et al. 2005)packages. MR is supported by the German Aeronautics Center and SpaceAgency (DLR), under program 50-OP-0901, funded by the Federal Ministry ofEconomics and Technology. I am grateful to Volker Springel, Torsten Ensslin,J¨org Rachen and Georg Robbers for fruitful discussions and valuable feedbackon drafts of this paper.
References
Bluestein, L. I. 1968, Northeast Electronics Research and Engineering MeetingRecord, 10, 218Crittenden, R. G. & Turok, N. G. 1998, arXiv:astro-ph / Driscoll, J. R. & Healy, D. M. 1994, Advances in Applied Mathematics, 15, 202Fog, A. 2010, Optimizing software in C ++ , http://agner.org/optimize/optimizing_cpp.pdf Frigo, M. & Johnson, S. G. 2005, Proceedings of the IEEE, 93, 216, http://fftw.org/fftw-paper-ieee.pdf
G´orski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759Hu ff enberger, K. M. & Wandelt, B. D. 2010, ApJS, 189, 255Kostelec, P. & Rockmore, D. 2008, Journal of Fourier Analysis and Applications,14, 145, 10.1007 / s00041-008-9013-5Lewis, A. 2005, Phys. Rev. D, 71, 083008Pr´ezeau, G. & Reinecke, M. 2010, ApJS, 190, 267Reinecke, M., Dolag, K., Hell, R., Bartelmann, M., & Enßlin, T. A. 2006, A&A,445, 373Swarztrauber, P. 1982, Vectorizing the Fast Fourier Transforms (New York:Academic Press), 51Varshalovich, D. A., Moskalev, A. N., & Khersonskii, V. K. 1988, QuantumTheory of Angular Momentum (World Scientific Publishers)Wiaux, Y., Jacques, L., & Vandergheynst, P. 2007, Journal of ComputationalPhysics, 226, 2359 The next major release of HEALPix C ++ will use libpshtlibpsht