MADmap: A Massively Parallel Maximum-Likelihood Cosmic Microwave Background Map-Maker
C.M. Cantalupo, J.D. Borrill, A.H. Jaffe, T.S. Kisner, R. Stompor
C. M. Cantalupo
Computational Cosmology Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA [email protected]
J. D. Borrill
Computational Cosmology Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA [email protected]
A. H. Jaffe
Department of Physics, Blackett Laboratory, Imperial College, London SW7 2AZ, Great Britain [email protected]
T. S. Kisner
Computational Cosmology Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA [email protected]
R. Stompor
CNRS, Laboratoire AstroParticule et Cosmologie, Université Paris-7 'Denis Diderot', 75205 Paris Cedex 13, France [email protected]
ABSTRACT
MADmap is a software application used to produce maximum-likelihood images of the sky from time-ordered data which include correlated noise, such as those gathered by Cosmic Microwave Background (CMB) experiments. It works efficiently on platforms ranging from small workstations to the most massively parallel supercomputers. Map-making is a critical step in the analysis of all CMB data sets, and the maximum-likelihood approach is the most accurate and widely applicable algorithm; however, it is a computationally challenging task. This challenge will only increase with the next generation of ground-based, balloon-borne and satellite CMB polarization experiments. The faintness of the B-mode signal that these experiments seek to measure requires them to gather enormous data sets. MADmap is already being run on up to O(10^11) time samples, O(10^8) pixels and O(10^4) cores, with ongoing work to scale to the next generation of data sets and supercomputers. We describe MADmap's algorithm, based around a preconditioned conjugate gradient solver, fast Fourier transforms and sparse matrix operations. We highlight MADmap's ability to address problems typically encountered in the analysis of realistic CMB data sets and describe its application to simulations of the Planck and EBEX experiments. The massively parallel and distributed implementation is detailed and scaling complexities are given for the resources required. MADmap is capable of analysing the largest data sets now being collected on computing resources currently available, and we argue that, given Moore's Law, MADmap will be capable of reducing the most massive projected data sets.

Subject headings:
Cosmic Microwave Background, data analysis, map making, maximum likelihood

1. Introduction

The Cosmic Microwave Background (CMB) is the left-over radiation from the Big Bang. This radiation is comprised of primordial photons last scattered when the first neutral atoms formed some 400,000 years after the Big Bang. The subsequent expansion of the Universe redshifts the spectrum of these photons, originally distributed as a 3000 K black body, to appear in contemporary observations as a 3 K black body. Encoded within the image of the thermal fluctuations in the CMB are not only the signatures of the basic parameters of cosmology but also, using the Big Bang as the ultimate particle accelerator, insights into fundamental physics at the very highest energies (Dodelson 2003).

Most CMB experiments simply scan across the sky, sampling its temperature (and now polarization) at one or more microwave frequencies at some fixed rate. Once these data have been cleaned and calibrated, the first step in their scientific analysis involves producing pixelized maps of the sky and at least an estimate of the pixel noise properties. Subsequent steps include separating these sky maps into distinct CMB and foreground component maps, estimating the power spectra of the CMB from its maps, and estimating cosmological parameters from these power spectra. Since these subsequent steps depend on the quality of the original maps, it is important to generate the most accurate and well-characterised maps we can, in particular maximum-likelihood, minimum variance maps.
However, the faintness of the CMB fluctuations on the sky drives us to gather enormous data sets which must be reduced coherently if we are to preserve their scientific content, resulting in a very significant computational challenge. The computational costs of the dominant CMB analysis steps (in floating-point operations, memory, communication, and input/output) are set by the numbers of time samples n_t and sky pixels n_z. As the goals of CMB experiments have evolved, both of these numbers have grown. The growth in the number of samples is driven by the quest for fainter polarized and high multipole signals, and the number of pixels is driven by the sky-coverage and resolution required for measuring low and high multipole signals respectively. Table 1 shows this evolution for a representative selection of suborbital and all anticipated satellite CMB missions.

Since brute force dense matrix maximum-likelihood algorithms scale as O(n_z^3) (Borrill 1999) they have become computationally intractable, and we have had to adopt alternative approximate algorithms that are dominated by operations on the time-ordered data that are linear and log-linear in n_t. Over the next 15 years we can expect CMB time-ordered data volumes to grow by 3 orders of magnitude; coincidentally this exactly matches the projected growth in computing power over the same period assuming a continuation of Moore's Law. Since CMB data analysis is already pushing the limits of current state-of-the-art HPC systems, this implies that our algorithms and implementations will have to continue scaling on the leading edge of HPC technology for the next 10 Moore-doublings if we are to be able fully to support first the design and deployment of these missions and then the scientific exploitation of the data sets they gather.

The next section gives an overview of the computational context for the MADmap software: the libraries it depends on, the supporting applications, and the hardware platforms targeted.
Section 3 describes the formalism which is applied to the data by MADmap, including the statistical derivation and mathematical framework. MADmap is designed around the method of preconditioned conjugate gradients, and Section 4 describes this algorithm. Section 5 details MADmap's implementation, including subsections on data distribution, noise weighting, pointing data compression, pixel indexing, and the functional complexity of the communication, memory, CPU and disk requirements of MADmap. Section 6 explores the versatility of the MADmap data model, and describes a variety of real world problems that MADmap has been used to solve. Section 7 describes a set of MADmap runs on simulated Planck and EBEX data and explains in detail the performance characteristics of these runs, which span a range of data sizes in both the time and pixel domain and a variety of processor counts. The paper finishes with a comparison with other software, a discussion of future work, and concluding remarks.

We will use the following notation in this paper: variables in unitalicized roman font are scalars (e.g., n_t and n_i), time domain vectors are represented by italic Greek letters (e.g., γ and ν), pixel domain vectors are represented by italic Roman letters (e.g., y and z), and matrices are capitalized (e.g., A and B).

[Table 1 columns: Date, Experiment, Description, Samples (n_t), Pixels (n_z). The tabulated values, beginning with the 1990-93 COBE all-sky, low-resolution temperature mission, were lost in extraction.]

Table 1: The evolution of sample and pixel counts over time for a representative selection of suborbital and all anticipated satellite missions. Details of the proposed CMBpol satellite are from the EPIC Intermediate Mission concept study. To allow meaningful comparison, the effective number of samples and pixels for each experiment are given, which may differ from those quoted by the respective experiment teams. Specifically, sample counts are defined as the sum over all of an experiment's time-ordered data streams of the stream's sampling rate times the duration of its observation (where data streams may include various combinations of detector streams), while pixel counts are the sum over all observing frequencies of the fraction of the sky observed divided by the fiducial beam size, assuming a circular beam with diameter given by the assumed full-width at half-maximum, at that frequency.
2. MADmap
Under the assumption of piecewise stationary Gaussian noise (defined more precisely below) with known spectral properties, MADmap produces maximum-likelihood maps to user-specified precision, and in particular does so for the very largest real and simulated CMB data sets extant and on the very largest of today's supercomputers. For example, MADmap can enable a wall-clock time to solution of tens of minutes for Planck-like data volumes running on a significant fraction of the 40,000-core Cray XT4 at the US Department of Energy's National Energy Research Scientific Computing Center. Supercomputers will be a crucial resource in the analysis of forthcoming CMB data sets, and MADmap's ability to scale to use large computing resources will enable science that would be otherwise inaccessible (Bock et al. 2006). The maps' noise properties can be calculated by other tools in the MADCAP software suite. MADpre, an application distributed with MADmap, generates the auto-correlation matrix assuming time domain white noise (which can be used as a preconditioner by MADmap), while MADping calculates the full dense noise covariance matrix by explicit inversion, although this is impractical for more than O(10^5) pixels.

These software components take advantage of a set of libraries to facilitate simulation and input/output operations. M3 is a data abstraction library that uses a plug-in architecture to enable analysis applications to ingest data from a variety of experiments, which invariably adopt different data formats and distributions. Crucially, M3 can also simulate simple signals and complex noise on the fly, which reduces the potentially overwhelming I/O and disk cost of the traditionally separate simulation and analysis steps.
Similarly, the Generalized Compressed Pointing (GCP) library, described in more detail below, has been devised to calculate the pointings of individual detector samples on demand at run time from compressed pointing information (e.g., the sparse-sampled pointing of a reference frame, such as the focal-plane bore-sight).

MADmap was originally designed to map CMB data in a manner independent of any individual CMB experiment, and has been applied to real and simulated data from a number of missions including the historic BOOMERanG (de Bernardis et al. 2000) and MAXIMA (Hanany et al. 2000) experiments, and the upcoming EBEX (Oxley et al. 2004) balloon and Planck satellite missions. However, the problem it solves appears in any situation where a signal described by a sparse linear model is to be derived from data containing correlated Gaussian noise that is stationary over known intervals, and it is already being used in other domains, for example by the Herschel mission (Waskett et al. 2007).

We also note that map-making-like operations are commonly used in other stages of the data analysis, including power spectrum estimation via sampling algorithms (Jewell et al. 2004; Wandelt et al. 2004) and component separation via parametric techniques (Stompor et al. 2009), and MADmap can be employed in a straightforward manner in all those contexts.

EBEX: http://groups.physics.umn.edu/cosmology/ebex/
3. Formalism
The central assumption of our map-making is that the time-ordered data measured by each detector, γ, can be written

γ = ν + ζ (1)
ζ = Az (2)

for temporal noise ν, pixelized signal z, and a sparse pointing matrix A encoding the weights with which each pixel is observed at each time. In the simplest case (a total power CMB temperature measurement with uniform symmetric beams in the limit of small pixels compared to the beam size) the pointing matrix contains a single unit weight per observation; at time sample t with the detector pointed at sky-coordinates (θ_t, ψ_t),

A_{t,p} = { 1, if (θ_t, ψ_t) ∈ pixel p; 0, otherwise }. (3)

For ideal polarization experiments, this single weight is then replaced by one for each of the Stokes parameters in the observed pixel:

ζ_t = ½ [ i_p + q_p sin(2α_t) + u_p cos(2α_t) ] (4)

where α is a time ordered vector of the observed polarization angle, and we consider the signal as a vector z = [i, q, u]. More complex beams may be incorporated by appropriately weighting multiple pixels in each observation, and parasitic signals that are fixed in any other basis (for example, MAXIMA's chop-synchronous signal (Stompor et al. 2002)) can be simultaneously solved for by extending the pixel basis appropriately (see also Section 6). Ultimately the only requirement on A is that it be full rank, which is to say that there are at least as many linearly independent observations as there are signals being solved for.
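Because A in this simplest case carries a single unit weight per row, it never needs to be stored explicitly: Az is a gather of map values into the time domain, and the transpose operation is a scatter-add into pixels. A minimal NumPy sketch of this toy temperature-only case (all names and data here are hypothetical illustrations, not MADmap code):

```python
import numpy as np

def point(pix, z):
    """Az for a temperature-only pointing matrix with one unit
    weight per row: a gather of map values into the time domain."""
    return z[pix]

def unpoint(pix, tod, n_pix):
    """A^T applied to a time stream: a scatter-add into pixels."""
    return np.bincount(pix, weights=tod, minlength=n_pix)

# Toy scan hitting each of 4 pixels twice
pix = np.array([0, 1, 2, 3, 0, 1, 2, 3])
z = np.array([10.0, 20.0, 30.0, 40.0])
tod = point(pix, z)                        # noiseless time stream Az
hits = unpoint(pix, np.ones_like(tod), 4)  # A^T 1 = per-pixel hit counts
```

Applying A^T to a vector of ones yields the per-pixel hit counts, a quantity that recurs below in the white-noise limit.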
In the broadest sense the pointing matrix is a projection from the signal basis to the time basis that defines the linear relationship between the signal to be solved for and the recorded detector time stream.

Representing now the noise as

ν = γ − Az (5)

and making use of its Gaussian properties, we can write the likelihood function of a pixelized sky map z given our data γ as

L(z|γ) ∝ exp{ −½ ( ν^T N^{-1} ν + Tr[ln N] ) }, (6)

where N is the time-domain noise correlation matrix and we have used the matrix identity

Tr ln H = ln det H (7)

to move the usual Gaussian prefactor into the exponential. Maximizing this over all a priori equally likely signals z gives the maximum-likelihood map and its noise correlation matrix

ẑ = M A^T N^{-1} γ,    M = (A^T N^{-1} A)^{-1} (8)

which can also be derived as the minimum variance or generalized least squares solution (de Gasperis et al. 2005). In addition, the map ẑ is a sufficient statistic for the likelihood function as written. That is, the Gaussian likelihood of a sky map z depends only on the estimate ẑ and no other function of the data:

L(z|ẑ) ∝ exp{ −½ [ (z − ẑ)^T M^{-1} (z − ẑ) + Tr ln M ] } (9)

where the map noise correlation matrix is M. These equations reduce to simple averaging into pixels in the white noise limit, N_{t,t'} = σ² δ_{t,t'}. We note that the O(n_z^3) scaling of the brute force explicit dense matrix calculation arises from the inversion of the inverse pixel noise matrix, although it is important to note that the computational cost of building the matrix before inversion, O(n_t n_c), may dominate O(n_z^3) for some experiments.

Our solution ẑ is the maximum-likelihood estimate of z only if N is the true noise correlation matrix, and since the noise properties of CMB data have to be derived from the data there will inevitably be uncertainties. However, provided N is positive definite, ẑ provides an unbiased, though potentially not optimal, estimate of z.
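The reduction of Equation (8) to simple per-pixel averaging in the white noise limit can be verified numerically. The NumPy sketch below (toy sizes and random data, purely illustrative) builds a dense one-nonzero-per-row pointing matrix, evaluates the generalized least squares solution with N = σ²I, and checks that it coincides with naive binning:

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, n_z = 200, 5

# Pointing matrix with one unit weight per sample (temperature-only case)
pix = rng.integers(0, n_z, size=n_t)
A = np.zeros((n_t, n_z))
A[np.arange(n_t), pix] = 1.0

z_true = rng.standard_normal(n_z)
gamma = A @ z_true + 0.1 * rng.standard_normal(n_t)  # gamma = Az + nu

# White-noise limit: N = sigma^2 I, so Eq. (8) becomes ordinary GLS
Ninv = np.eye(n_t) / 0.1**2
M = np.linalg.inv(A.T @ Ninv @ A)
z_hat = M @ (A.T @ Ninv @ gamma)

# In this limit the estimate is just the average of samples per pixel
z_avg = (np.bincount(pix, weights=gamma, minlength=n_z)
         / np.bincount(pix, minlength=n_z))
assert np.allclose(z_hat, z_avg)
```

With correlated N the same formula holds but the dense inverse becomes the intractable step, which motivates the iterative solver of the next section.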
4. Algorithm
Brute force explicit dense matrix computation of Equation (8), requiring the construction and inversion of the full dense pixel-pixel noise correlation matrix, is computationally impractical for current and future CMB data sets with millions to hundreds of millions of pixels. However, we observe that it can be re-written in standard linear form (Oh et al. 1999),

Hx = b (10)

where H ≡ A^T N^{-1} A, b ≡ A^T N^{-1} γ and x ≡ ẑ. Since N^{-1} is positive definite (being the inverse of a correlation matrix) and A is full rank (by assumption, but see the discussion later in this section), H is also guaranteed to be positive definite. Now provided we can operate on any time-domain vector with N^{-1} (weighting) and A^T (pointing), and on any pixel-domain vector with A (unpointing), we can solve for x using preconditioned conjugate gradients (PCG), a widely-used technique for solving positive definite linear systems that provides the fastest time to solution for many problems, especially sparse systems (Barrett et al. 1994; Shewchuk 1994).

The method of conjugate gradients is an optimization algorithm similar to steepest descent except that the search directions are chosen to be orthogonal to each other. This optimization is applied to minimize the quadratic form

g(x) = ½ x^T H x − b^T x + c (11)

which has the following derivative with respect to x:

g'(x) = Hx − b. (12)

When H is positive definite g is a convex quadratic, so that when the gradient of g is zero we have both minimized g and solved the original linear system of Equation (10). Embedding the solution of a positive definite linear system into the optimization of a convex quadratic may at first seem just to complicate matters, but operationally it buys us a lot if we can operate with H on a vector quickly and with minimal memory overhead. This is an iterative technique, and each iteration involves the operation of H on a vector, which is the dominant computational cost of the algorithm.
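The matrix-free character of the method is the key point: the solver only ever needs the action of H (and of an approximate inverse) on a vector. A generic sketch, applied here to a small random symmetric positive definite system with a point Jacobi preconditioner rather than a real map-making operator (a minimal illustration, not MADmap's implementation):

```python
import numpy as np

def pcg(apply_H, b, apply_Minv, n_iter=50, eps=1e-6):
    """Preconditioned conjugate gradients for H x = b, with H symmetric
    positive definite supplied only through its action on a vector
    (standard algorithm; see, e.g., Shewchuk 1994)."""
    x = np.zeros_like(b)
    r = b - apply_H(x)          # residual
    d = apply_Minv(r)           # preconditioned search direction
    delta = r @ d
    b_norm = np.linalg.norm(b)
    for _ in range(n_iter):
        if np.linalg.norm(r) <= eps * b_norm:
            break               # relative residual small enough
        q = apply_H(d)
        alpha = delta / (d @ q)
        x += alpha * d
        r -= alpha * q
        s = apply_Minv(r)
        delta, delta_old = r @ s, delta
        d = s + (delta / delta_old) * d
    return x

# Toy SPD system; the point Jacobi preconditioner is 1 / diag(H)
rng = np.random.default_rng(1)
B = rng.standard_normal((30, 30))
H = B @ B.T + 30 * np.eye(30)
b = rng.standard_normal(30)
x = pcg(lambda v: H @ v, b, lambda r: r / np.diag(H))
```

In MADmap the callable for H chains the unpointing, weighting, and pointing operations, so H itself is never formed.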
Since the number of iterations required to converge to the solution grows with the condition number of H, having an approximate inverse of H that can multiply a vector quickly can greatly reduce the number of iterations required, by making the effective matrix in the preconditioned system approximate the identity matrix.

The time streams collected by CMB telescopes generally contain large low frequency correlations in the noise, which is often referred to as "1/f noise". The noise power spectrum [the Fourier transform of the noise correlation function n(|t − t'|)] can be modelled as

p(f) = σ² [1 + (f_k/f)^α] (13)

with a white noise level σ², knee-frequency f_k, and typically a power law 1 ≤ α ≤ 2. Having power inversely proportional to frequency clearly blows up for the constant mode (f = 0). (We note that for real detectors the power at zero frequency is clearly finite, though still high enough to lead to a numerical near degeneracy.) This (near) degeneracy produces a null space for the constant mode in the inverse pixel-pixel noise correlation matrix, and therefore the inverse problem is ill-posed (equivalently, the H matrix is only positive semi-definite). Applying the conjugate gradient technique to such ill-posed problems has been discussed in the literature (Brakhage 1996), and used in many fields. In the specific example discussed here, the total offset of the estimated map is arbitrary. In a more general case such degeneracies may result from the presence of (or removal of) a systematic effect in the time-ordered data, and need to be considered case by case. In general, when applying conjugate gradients to semi-definite linear systems, the iterates will begin by converging toward a solution before the degenerate modes eventually cause them to drift; practical remedies include monitoring the conditioning of the system (for example via the per-pixel blocks of the A^T A matrix) and excising poorly-conditioned pixels.

Removing pixels from the map, and therefore their respective samples from the time stream, affects the structure of the time-domain noise correlation matrix, making it non-Toeplitz, and thus difficult to handle efficiently, whenever the noise is correlated. This problem can be overcome by the so-called gap filling procedure (Stompor et al. 2002), which within the MADCAP software suite can be facilitated by the MADmask and MADnes applications. These tools replace time ordered data samples that occur while an excised pixel is observed by simulated noise that is consistent with the power spectrum associated with each excised sample and the correlations inferred by the noise in surrounding samples. In this way the continuity and stationarity of the time-domain noise is preserved, together with the Toeplitz character of its respective noise correlation matrix.
The rows of the pointing matrix that correspond to observations of excised pixels are set to zero, and this models the simulated noise as free of signal.
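The unconstrained half of this idea, drawing a stationary Gaussian stream whose spectrum follows Equation (13), is easy to sketch with an inverse FFT of Hermitian Fourier coefficients. The parameter values below are hypothetical, normalization conventions vary, and real gap filling must additionally condition on the surrounding samples:

```python
import numpy as np

def sim_noise(n, sigma, f_knee, alpha, rng):
    """Draw one stationary Gaussian '1/f' noise stream whose power
    spectrum follows Eq. (13), p(f) = sigma^2 [1 + (f_knee/f)^alpha],
    in units where the sampling rate is 1. The f = 0 (arbitrary
    offset) mode is zeroed."""
    f = np.fft.rfftfreq(n)
    p = np.zeros_like(f)
    p[1:] = sigma**2 * (1.0 + (f_knee / f[1:]) ** alpha)
    # Independent complex Gaussian coefficients with mean power p(f)
    coef = np.sqrt(p / 2.0) * (rng.standard_normal(f.size)
                               + 1j * rng.standard_normal(f.size))
    coef[0] = 0.0
    # irfft of Hermitian coefficients gives a real time stream
    return np.fft.irfft(coef, n) * np.sqrt(n)

noise = sim_noise(4096, sigma=1.0, f_knee=0.1, alpha=2.0,
                  rng=np.random.default_rng(3))
```

A periodogram of the output shows the expected excess of power below the knee frequency relative to the white-noise floor above it.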
5. Implementation
The core of MADmap is a massively parallel implementation of a preconditioned conjugate gradient (PCG) solver. Each PCG iteration moves in a direction in the solution space that is orthogonal to all previous steps. In the absence of numerical error and degeneracies in the system, the exact solution is guaranteed after n_z steps, because the number of possible directions has been exhausted. In practice, however, the calculation is terminated either after a fixed number of iterations n_i, or after achieving a given accuracy as measured by the relative residual ε (in the absence of user-specified values MADmap defaults to n_i = 50 and a fixed relative-residual tolerance). On termination MADmap outputs the vector which produced the lowest relative residual thus far, which is not necessarily the last iteration of the PCG, as the relative residual is not guaranteed to decrease monotonically with iterations. By algorithmic design the relative error does not increase with iteration.

It is possible to precalculate a sparse preconditioner and have MADmap read this preconditioner in from disk at run time; for example, the MADpre code will precompute and store (A^T W^{-1} A)^{-1}, where W^{-1} is simply N^{-1} with all off-diagonal elements set to zero. This is the noise correlation matrix if white noise is assumed in the time domain. If no preconditioner file is specified, then by default MADmap calculates the point Jacobi preconditioner, which is the diagonal matrix composed of the inverse of the diagonal elements of A^T A.

To minimize the impact of system failure on a calculation in progress, MADmap checkpoints periodically and can restart from any of these checkpoints. By default MADmap checkpoints every 20 PCG iterations, although this can be altered by setting an environment variable (MM_CHECKPOINT_FREQUENCY). Each checkpoint comprises five map vectors and six scalars, with the distributed maps being gathered onto the root processor before being written to disk.
At a checkpoint restart this process is inverted: the root processor reads the maps (and scalars) from disk and scatters them to the other processors. In this way the checkpoint data volume and distribution are independent of the number of processors used, and a restarted job is not required to use the same number of processors as the original. Figure 1 is an overview of the map making process.

The memory for both the time domain data and the pixel domain data is distributed over processors. The user has options regarding how the time domain data will be distributed, in ways that allow for CPU and memory resource optimization. In the multi-stacked mode with a stride of s, each processor analyzes the data from the fraction 1/s of the detectors, and the experiment's run time is evenly divided among processors that have a common remainder when divided by s.

Figure 2 is a graphical representation of the memory distribution for the detector time streams and telescope pointing. Figure 3 shows a toy mapping of each processor's locally observed pixels to the global pixel space. The pixel data that is stored in the memory of each processor corresponds to those pixels which were observed by the detector samples within the time streams analyzed by the processor. This means that the distribution of the pixel data is strongly dependent on the scanning strategy of the experiment and the relative locations of the detectors on the focal plane, in addition to the effects of the different modes of time stream distribution and the number of processors used to analyze the data.

The concatenated distribution has several advantages over the unstrided stacked mode, but there are some very critical advantages to the stacked mode that make it quite useful. The multi-stacked distribution is often a good compromise that will allow the problem to be tackled with the computational tools available, given a good choice of stride. In the unstrided stacked mode all of the detector-specific noise filters are required to be stored on each processor, compared with very few in the concatenated mode. The concatenated distribution can be used to simultaneously analyze data from different experiments, or data sets that have large gaps in time where no data are taken. In the case of run time simulation of correlated noise, the concatenated mode allows for load balanced generation of correlated noise for longer intervals of time on each processor than in the unstrided stacked mode. In this case the use of the concatenated mode provides the ability to scale this calculation to larger numbers of processors.
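The stride-s bookkeeping of the multi-stacked mode can be made concrete with a toy assignment function. This is only an illustration of the scheme as described in the text, with hypothetical names and simplifying assumptions (detector and processor counts divisible by the stride), not MADmap's actual code:

```python
def multi_stacked_assignment(rank, n_procs, n_det, n_samp, stride):
    """Toy multi-stacked distribution: with stride s, each processor
    analyzes the fraction 1/s of the detectors, and the observation is
    divided evenly in time among processors sharing a common remainder
    modulo s. Returns (detector indices, sample range) for this rank."""
    s = stride
    assert n_procs % s == 0 and n_det % s == 0
    det_group = rank % s           # which detector subset this rank holds
    time_slot = rank // s          # which interval of the observation
    n_slots = n_procs // s
    d0 = det_group * (n_det // s)
    dets = list(range(d0, d0 + n_det // s))
    chunk = n_samp // n_slots
    samples = range(time_slot * chunk, (time_slot + 1) * chunk)
    return dets, samples
```

With four detectors, six processors and stride 2, ranks 0, 2, 4 share one pair of detectors across consecutive time intervals while ranks 1, 3, 5 share the other, matching the cartoon in Figure 2.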
The unstrided stacked distribution is advantageous because it is optimal for the storage and computational requirements of expanding the compressed pointing data that is shared by all the detectors at each time sample.

If the instrumental noise is at least piecewise stationary, N will be a block-diagonal Toeplitz matrix, with each block corresponding to one of the stationary pieces. A Toeplitz matrix is defined so that each row is shifted by one entry from the one above. For an n × n Toeplitz matrix,

T_{i,j} = T_{i−1,j−1} ∀ i, j ∈ {1, . . . , n − 1}. (14)

Moreover, for a typical experiment each of the diagonal blocks will be banded. Generally, the inverse of a Toeplitz matrix is not Toeplitz; however, for banded matrices this is nearly true, with potential departures seen only at the first and last rows and columns of the diagonal block (Stompor et al. 2002). Moreover, the inverse of a banded matrix is also, to a good approximation, banded, with a bandwidth which needs to be appropriately tuned. The MADmap algorithm therefore assumes that N^{-1} is a block-diagonal, banded Toeplitz matrix:

N^{-1}_{t,t'} = { f(|t − t'|) if |t − t'| < n_c; 0 otherwise }. (15)

The action of a circulant matrix on a vector can be computed with fast Fourier transforms in O(n_f log(n_f)) operations, where n_f is the length of the vector and size of the matrix. A circulant matrix is a special type of Toeplitz matrix with the additional constraint of periodic boundary conditions (e.g., Golub & van Loan 1996):

C_{i,j} = C_{i−1,j−1} ∀ i, j ∈ {1, . . . , n − 1} (16)
C_{0,j} = C_{n−1,j−1} ∀ j ∈ {1, . . . , n − 1} (17)
C_{i,0} = C_{i−1,n−1} ∀ i ∈ {1, . . . , n − 1} (18)

The inverse time-time noise correlation matrix N^{-1} is block diagonal, and these blocks are themselves band diagonal and Toeplitz. Our objective is to use circulant matrix multiplies as a building block from which we can compute the action of N^{-1}. We can treat each diagonal block of N^{-1}
[Figure 2 panel labels: Concatenated Distribution; Stacked Distribution; Multi-stacked Distribution with Stride 2. Each panel shows detector time streams versus time, shaded by processor rank 0-5.]
Fig. 2.— This is a graphical representation of the data distribution for the detector time streams and the telescope pointing time stream. The cartoon represents the distribution of four detector time streams over six processors in the three different modes of operation. The numbers labeling the figure all represent processor ranks. Note that the detector time streams are distributed over the processors, whereas the telescope pointing is only fully distributed in the stacked mode. For this reason we can show the distribution of the detector time streams over all of the processors in one image, but the telescope pointing data distribution must be shown for each processor individually, where gray indicates telescope pointing that is stored on a given processor.

Fig. 3.— This is a graphical representation of the distribution of pixel domain data between processors. Each processor has a set of time samples and the minimal set of "local pixels" needed for the pointing in those samples. These local pixels have a further mapping to global pixels on the sphere.

as an independent problem, so let us simply consider the problem of multiplying a band diagonal Toeplitz matrix B, with a bandwidth of n_c, by a vector, where

B_{i,j} = B_{i−1,j−1} ∀ i, j ∈ {1, . . . , n − 1} (19)
B_{i,j} = 0 ∀ i, j s.t. |i − j| > n_c (20)

We can approximate B with a circulant matrix C which differs from B only in the upper right and lower left corners of the matrix,

B_{i,j} = C_{i+b,j+b} ∀ i, j s.t. |i − j| < n − n_c (21)

for an offset of one bandwidth, b = n_c. In order to compute the action of B on a vector by using a circulant matrix C, we construct C to be two bandwidths larger than B and pad the vector to be multiplied with a bandwidth of zeros at the beginning and end.
The vector that results from the multiplication of C by the zero-padded vector will be the correct solution to the original problem once a bandwidth has been trimmed from the beginning and end of the resulting vector.

The technique above scales as O(n log(n)), where here n is the length of a stationary period. In the case where n ≫ n_c we can do even better than this. As shown above, each time we do a circulant matrix multiply in place of a banded Toeplitz matrix multiply, there is corrupted data within one bandwidth of either end of the resulting vector. We can use this fact to break the problem up into sub-problems, each handled by a circulant matrix multiply of some shorter length n_f: each FFT costs O(n_f log(n_f)), and the number of uncorrupted elements determined by an FFT is n_f − 2n_c. In order to calculate n_t uncorrupted elements there must be ⌈n_t / (n_f − 2n_c)⌉ FFTs performed, so the computational complexity is

O( ⌈n_t / (n_f − 2n_c)⌉ n_f log(n_f) ). (22)

To a certain extent n_f is a free parameter and we can choose it to be optimal. This is done by minimizing the complexity. For simplicity of analysis we consider the function h(n_f):

h(n_f) = n_f ln(n_f) / (n_f − 2n_c) (23)

h′(n_f) = (1 + ln(n_f)) / (n_f − 2n_c) − n_f ln(n_f) / (n_f − 2n_c)² (24)

h″(n_f) = 2 n_f ln(n_f) / (n_f − 2n_c)³ − 2 (1 + ln(n_f)) / (n_f − 2n_c)² + 1 / (n_f (n_f − 2n_c)) (25)

Equation (22) and Equation (23) differ in that Equation (22) includes a discretization and a linear factor of n_t / ln(2). We will revisit the discrete nature of the problem later, as there is another discretization to consider which impacts performance: the radix of the FFT. Note as well that we only consider the cases where n_f > 2n_c, as all values determined by the FFT would be corrupted by the boundary conditions in cases where n_f ≤ 2n_c.
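The padded-circulant multiply described above can be demonstrated directly. The sketch below (toy band values, NumPy FFTs standing in for an optimized FFT library) applies a symmetric banded Toeplitz matrix to a vector and checks the result against a dense construction:

```python
import numpy as np

def banded_toeplitz_apply(band, v):
    """Apply a symmetric banded Toeplitz matrix B, with
    B[i, j] = band[|i - j|] for |i - j| <= n_c (else 0), to a vector v:
    embed B in a circulant two bandwidths larger, pad v with one
    bandwidth of zeros at each end, multiply via FFTs, then trim one
    bandwidth from each end of the result."""
    n, n_c = v.size, band.size - 1
    m = n + 2 * n_c                      # circulant size
    c = np.zeros(m)
    c[: n_c + 1] = band                  # first column of the circulant
    c[m - n_c:] = band[1:][::-1]         # wrap-around part of the band
    vp = np.zeros(m)
    vp[n_c : n_c + n] = v                # zero-padded vector
    y = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(vp), m)
    return y[n_c : n_c + n]

# Check against a dense construction of B (bandwidth n_c = 2)
rng = np.random.default_rng(2)
n, band = 64, np.array([4.0, 2.0, 1.0])
B = sum(band[k] * (np.eye(n, k=k) + (np.eye(n, k=-k) if k else 0))
        for k in range(band.size))
v = rng.standard_normal(n)
assert np.allclose(banded_toeplitz_apply(band, v), B @ v)
```

Because the corrupted wrap-around entries all fall in the trimmed padding, the result here is exact, not approximate, which is what makes the overlap decomposition into shorter FFTs of length n_f possible.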
Setting h′(n_f) = 0 gives the following relationship between n_f and n_c:

n_f − 2n_c (1 + ln(n_f)) = 0 (26)

which, unfortunately, does not have an analytical solution for n_f in terms of n_c. We can solve for n_c, however, and check that h″(n_f) > 0 there, confirming a minimum. In practice MADmap considers power-of-two FFT lengths: if a given n_f gives an h′(n_f) which is positive, n_f is decreased by a factor of two until h′(n_f/2) < 0, so that the minimizer lies between n_f/2 and n_f. This value of n_f is the length of the FFT used by MADmap in the case where the user has not specified a length by way of an environment variable (MM_FFT_LENGTH).

MADmap uses the FFTW library by default (Frigo & Johnson 2005); however, at compile time it is also possible to specify the ACML library to do FFTs. The ACML library is significantly faster than FFTW on AMD's Opteron platform. Future extensions to MADmap will allow the use of the Intel Math Kernel Library and IBM's libmass, which are other common high performance math libraries that include FFT functionality.

The pointing matrix is at the core of our time stream model, as outlined in Equation (1). This matrix determines how a time stream of data observed by a detector relates to the discretized signal z. In its simplest form the pointing matrix can have one non-zero entry per row, with value unity. The column assigned this unity value is determined by the pixel index of the element in the map that is being observed by a detector at the time sample associated with the row. More sophisticated pointing matrices result when there is polarization sensitivity, or when the pointing matrix includes information about the detector beam sensitivity pattern. Each row of the pointing matrix determines the linear combination of the discrete signals being measured that were observed in a particular time sample.

As discussed in the introduction, the compression of the pointing matrix is critical to reducing the memory requirement, especially for analysis of data collected by contemporary CMB experiments with large numbers of detectors on a single focal plane. The calculation of the "right hand side" (RHS), A^T N^{-1} γ, uses γ, which is a length n_t vector, but this vector is used only once with a single pass through the time stream, so there is no need to store all of γ in memory simultaneously. The data can be read into small buffers, reduced to pixel domain data, and then overwritten with the next buffer of data.

The pointing matrix A, however, appears in the right hand side, and twice in the linear system inverted, which means that A is used on each iteration of our PCG solver. The simplest solution to this problem is the "high memory mode" of MADmap, where the sparse representation of A is stored concurrently in memory (in a distributed fashion) for the entire execution of the application. This method is fast, but can be overwhelmingly memory intensive, as the memory requirement scales as O(n_t).

One method of averting this problem is to attempt to read the pointing matrix into memory from disk on each iteration of the PCG, but for almost every system configuration this will result in an I/O bound problem, and leave the processors idle while they wait on the disk subsystem.
We have not yet attempted to use the asynchronous I/O options afforded by the MPI-2 standard, but a better solution exists which avoids repeated reading altogether while requiring a more modest memory footprint than is required for holding all of the detector pointing in memory simultaneously.

To overcome this problem, we make use of the fact that our pointing matrix is derived from data which are smaller than the data required for the sparse representation. The compression of the pointing matrix is experiment and signal specific, but in very broad terms, MADmap can work with any representation of compressed pointing that is sampled at a constant rate with a fixed number of double precision floating point numbers per sample. These data can be ingested by the GCP library, which will produce detector-sampled pointing in row compressed sparse form (column index and weight pairs).

The pointing matrix for CMB experiments is often derived from telemetry data which measure the direction in which some fiducial center of the focal plane is pointed, at a rate that is significantly lower than the detector sample rate. This is true for real data sets and often for simulated data as well. The sparse representation of A is derived by an interpolation of the bore-sight pointing to the detector sample rate, and a rotation of the bore-sight pointing to each detector's relative offset. This location is then discretized into a pixel index and assigned a weight, which usually differs from unity if the detector is polarization sensitive; the calculation of this weight is experiment specific.

In the past, when the number of detectors being analyzed simultaneously was more modest, the pointing solution for each detector would be stored on disk at the detector sample rate. This would sometimes be stored as ordinates (e.g., Euler angles) or as a list of pixel indices (with or without weights, depending on the experiment's capacity to measure a polarized signal).
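The expansion of compressed, boresight-rate pointing to detector-rate pixel indices can be sketched as below. The one-dimensional "sky", the linear interpolation, the trivial additive detector offset, and the function name are all illustrative assumptions, not the GCP library's actual interface (which handles full sky rotations and experiment-specific weights).

```python
import numpy as np

def expand_pointing(boresight_t, boresight_ang, det_offset, det_t, pix_width):
    """Interpolate low-rate boresight pointing to the detector sample rate,
    apply a per-detector offset, and discretize into pixel indices."""
    ang = np.interp(det_t, boresight_t, boresight_ang) + det_offset
    pix = np.floor(ang / pix_width).astype(int)   # pixel index per sample
    wgt = np.ones_like(det_t)                     # unit weight (no polarization)
    return pix, wgt

boresight_t = np.array([0.0, 1.0])       # telemetry sampled at 1 Hz
boresight_ang = np.array([0.0, 4.0])     # boresight angle, in "pixel" units
det_t = np.linspace(0.0, 1.0, 5)         # detector sampled 4x faster
pix, wgt = expand_pointing(boresight_t, boresight_ang, 0.5, det_t, 1.0)
# pix -> [0, 1, 2, 3, 4]
```

The point of the scheme is visible even in this toy: only two telemetry samples are stored, yet five detector-rate rows of the sparse pointing matrix are produced on demand.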
Reading this detector-specific pointing sampled at the detector sample rate can be quite costly, but is still a possible option in MADmap.

Regardless of how the sparse pointing matrix is obtained, once it is constructed in terms of the global pixel indexing scheme it must be re-indexed to match the distributed local pixel indexing scheme. Every pixelation of the sky has some intrinsic mapping between an index value and a position on the sky, and likewise, any discretized signal will have a mapping between index and the element of the discrete signal. There are actually three different indexing schemes used by MADmap: the global indexing scheme, which is intrinsic to the discretization of the signal; the observed indexing scheme, which includes only those index elements which are observed in any processor's data; and the local indexing scheme, which includes only those index elements which were observed within the local processor's data.

In order to distribute the signal vectors over the processors, each processor stores only those signal elements that are observed during the samples that are assigned to it, and all local signal vectors are indexed with the local scheme. MADmap stores two vectors that are used for mapping between indexing schemes. These are both the length of the number of locally observed pixels, and map from local index to observed index and from local index to global index. In order to use these vectors to map from global or observed index to local index, a binary search of the appropriate vector is done. This re-indexing is done every time the sparse pointing matrix is constructed from compressed pointing data. It requires a binary search of the locally observed pixel index for each time sample, which has a computational complexity of O(n_i n_t log(n_z)) operations.
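The global-to-local re-indexing just described can be sketched as follows; this is an illustration in Python, not MADmap's internals. Keeping the local-to-global vector sorted makes the reverse (global to local) lookup a binary search, which is where the O(n_i n_t log(n_z)) cost quoted above comes from.

```python
import numpy as np

# The local-to-global vector is kept sorted, so global -> local lookups are
# binary searches (np.searchsorted), one per non-zero of the pointing matrix.
local_to_global = np.array([7, 42, 101, 563])   # global pixels seen locally

global_pix = np.array([42, 7, 563, 42])         # column indices of A (global)
local_pix = np.searchsorted(local_to_global, global_pix)
# local_pix -> [1, 0, 3, 1]: the re-indexed sparse pointing matrix columns
assert np.all(local_to_global[local_pix] == global_pix)  # every pixel is local
```

The same pattern, searched against the local-to-observed vector instead, maps observed indices to local ones.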
These are integer operations and do not contribute to the total flop count per se, but they consume clock cycles nonetheless.

In order to construct the index remapping vectors without ever allocating a full n_z length indexing vector, some bit operations are used. A bit array is allocated that has one bit for every global pixel index (n_z bits). The elements of this array are set to one or zero depending on whether the index is observed within the local pointing matrix. This array can be used to construct the mapping from local index to global index. The array is then collectively reduced using MPI_AllReduce() with the
MPI_BOR operator, after which each processor has a bit vector that tells which pixels were observed within the data on all processors, and this bit array can be used to construct the mapping vector from local to observed pixels.
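The bit-array construction and OR-reduction described above can be mimicked in a few lines. Here `np.bitwise_or.reduce` over per-process rows stands in for MPI_AllReduce with the MPI_BOR operator; the sizes and hit patterns are invented for illustration.

```python
import numpy as np

n_z, n_proc = 16, 3
local_obs = [{1, 3, 9}, {3, 4}, {9, 15}]        # pixels seen by each "process"

# One n_z-bit array per process, packed 8 pixels per byte.
bits = np.zeros((n_proc, n_z // 8), dtype=np.uint8)
for p, obs in enumerate(local_obs):
    for g in obs:
        bits[p, g // 8] |= np.uint8(1 << (g % 8))       # set bit g locally

# Element-wise OR across processes: the analogue of the MPI_BOR all-reduce.
reduced = np.bitwise_or.reduce(bits, axis=0)
observed = [g for g in range(n_z) if reduced[g // 8] & (1 << (g % 8))]
# observed -> [1, 3, 4, 9, 15]; its ordering defines the "observed" indexing
```

The memory saving is the point: the collective state is n_z bits rather than an n_z-length integer vector, a factor of 32 or 64 for typical index widths.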
Given the size of CMB data sets, MADmap is intended for parallel machines, from small clusters to the most massively parallel systems extant. Both time-ordered and pixel-domain data are distributed over the processors, and MADmap uses the standard Message Passing Interface (MPI) for inter-processor communication. Since the available memory will vary hugely depending on the number of processors being used to solve a particular problem (from tens to tens of thousands at the time of writing, with hundreds of thousands on the immediate horizon) and the available memory per processor (typically from 256MB to 4GB), MADmap provides a number of options to trade additional calculation for reduced memory.

Fundamentally, there are three communication operations in MADmap. The first arises when each processor has computed the contribution to a map vector from the time samples assigned to it in the computation of the operation of the pointing matrix. The sub-maps on each processor must be summed with the sub-maps on all other processors. Note that each processor contains a subset of the observed pixels that overlaps with the subsets of observed pixels on other processors. This summation, in the end, is computed with a series of
MPI_AllReduce() calls. Each call is done on a fraction of the entire observed map, and each processor fills the reduction vector with the value it computed, or zero if the pixel was not observed in the processor's time stream. After the reduction is completed, the results that are pertinent to the processor are stored, and the vector is overwritten with the next buffer to be reduced. This operation is called
MM_SomeReduce(). This is, by far, the most costly communication operation. It scales like O(n_z log(n_proc)), where n_z is the number of observed pixels and n_proc is the number of processors. This must be done once on each iteration of the PCG algorithm.

The other two communication operations depend on the concept of "primary pixels." In MPI, each processor available to an application is assigned a rank, which is an integer in the set {0, 1, ..., n_proc − 1}, so that every processor has a unique rank and there is a sequential ordering of the processors. A processor's primary pixels are defined to be those pixels which are observed within the time stream distributed to the processor, and are not observed within the time stream distributed to any lower rank processor. This definition gives a disjoint division of the pixels over the processors that is comprised of a subset of the fundamental distribution of the pixels over the processors. To determine the primary pixels, an MM_SomeReduce() is called on a vector that is set to be the processor's rank if the pixel is observed by the processor, or the total number of processors in the case where the pixel is not observed within the processor's time stream, and the reduction operator is
MPI_MIN. After the call to
MM_SomeReduce(), the reduced vector identifies as primary pixels all of the pixels that are set to the value of the processor's rank.

The second operation is very similar to the first, except that it is used only in the writing of maps to disk. When writing a distributed map vector to disk there is a series of calls to
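The primary-pixel rule just described can be sketched as follows; the element-wise minimum stands in for the MM_SomeReduce() call with the MPI_MIN operator, and the hit patterns are invented for illustration.

```python
import numpy as np

n_proc = 3
observed_by = [np.array([1, 1, 0, 0, 1, 0]),    # rank 0's hit pattern
               np.array([0, 1, 1, 0, 0, 0]),    # rank 1
               np.array([0, 0, 1, 1, 1, 0])]    # rank 2

# Each rank contributes its rank where it sees the pixel, n_proc elsewhere;
# the element-wise minimum plays the role of the MPI_MIN all-reduce.
vecs = [np.where(observed_by[r], r, n_proc) for r in range(n_proc)]
reduced = np.minimum.reduce(vecs)
primary = {r: np.flatnonzero(reduced == r).tolist() for r in range(n_proc)}
# -> {0: [0, 1, 4], 1: [2], 2: [3]}; pixel 5 (value n_proc) is unobserved
```

Every observed pixel ends up primary on exactly one rank (the lowest that sees it), which is what makes the write and dot-product operations below correct.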
MPI_Reduce(). This sequence of reductions effectively constitutes a single reduction operation on a map of all observed pixels, while limiting the size of the vector used in the
MPI_Reduce() call and thereby mitigating the memory requirement. In this operation each processor sets the map value to its stored value if the pixel is a primary pixel, and sets it to zero if the pixel is not a primary pixel of the processor. In this way the root processor collects the map and writes the buffer to disk between each call to
MPI_Reduce().

The final operation is the computation of the dot product of two distributed map vectors. This is done by computing the local contribution to the dot product over the primary pixels on the processor, and then an
MPI_AllReduce() is called on a single double precision floating point element to evaluate the complete dot product, combining the calculations from each processor. Because this requires a single call to
MPI_AllReduce() on a single element for each dot product, this communication cost does not account for a significant amount of the run time.
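The distributed dot product can be sketched as below; the single-element all-reduce is stood in for by a plain Python sum, and all names and values are illustrative.

```python
import numpy as np

def local_dot(x_local, y_local, is_primary):
    """Each process sums only over its primary pixels, so every observed
    pixel contributes exactly once to the global dot product."""
    return float(np.dot(x_local[is_primary], y_local[is_primary]))

# Two "processes" whose local pixel sets overlap in one shared pixel:
x0, y0 = np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, 1.0])
x1, y1 = np.array([3.0, 4.0]),      np.array([1.0, 2.0])
prim0 = np.array([True, True, True])   # rank 0 owns all of its pixels
prim1 = np.array([False, True])        # rank 1's first pixel is owned by rank 0

partials = [local_dot(x0, y0, prim0), local_dot(x1, y1, prim1)]
total = sum(partials)   # the one-element all-reduce: 6.0 + 8.0 = 14.0
```

Without the primary-pixel restriction the shared pixel would be double-counted; with it, the partial sums are disjoint by construction.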
MADmap's computational work can be segmented into several distinct operations. The dimensions of the problem to be solved determine the relative cost of the operations, but we will discuss the operations that typically are the most time consuming. Those operations which are done on each iteration of the PCG tend to dominate the run time, because the number of iterations is usually between one and two orders of magnitude. The dominant term in the overall computational complexity of MADmap in all modes of operation is O(n_i n_t log(n_c)), which is determined by the computation of the FFTs, and this computation constitutes a significant portion of the run time. A portion of the run time is spent constructing and acting with the pointing matrix and, in low memory mode, these operations have an operational complexity of O(n_i n_t n_nz (1 + log(n_z))) with a significant pre-factor. This computation includes the expansion of the compressed pointing into the sparse pointing matrix, the re-indexing of the pixels to distributed indexing, and multiplying by the sparse re-indexed pointing matrix.

The complexity measures given below are distributed over the processors, and MADmap has been scaled to run on tens of thousands of cores. MADmap is load balanced computationally, and the memory distribution is balanced in all dimensions except n_z, which is, nonetheless, distributed. MADmap has three different modes of operation which reduce the memory requirement at the expense of more computations. The operational complexity of the high memory mode is

c_h = O(n_i n_t log(n_c)) ,    (27)

and the memory requirement is

m_h = O(n_z + n_t + n_proc m_n) .    (28)

The operational complexity of the low memory mode is

c_l = O(n_i n_t (log(n_c) + n_nz log(n_z))) ,    (29)

and the memory requirement is

m_l = O(n_z + r_p ∆t + n_proc m_n) .
(30)

The operational complexity of the extremely low memory mode is

c_e = O(n_i (n_t (log(n_c) + n_nz log(n_z)) + n_b n_c log(n_c))) ,    (31)

and the memory requirement is

m_e = O(n_z + r_p ∆t) .    (32)

The variable m_n is the per-processor memory requirement for the noise filters and depends on the data distribution, as given by the variables m_ns and m_nc defined in Section 5.7. Note that the low memory mode footprint is nearly invariant with the number of detectors, and the footprint of the extremely low memory mode is completely invariant with the number of detectors. Next generation CMB experiments will operate with large numbers of detectors sampled at a high rate, so the memory savings afforded by low memory mode will be critical to the tractability of the maximum-likelihood data analysis of these next generation experiments.

The disk is accessed for five purposes, three of them reading: detector time streams, pointing information, and noise filters; and two of them writing: checkpointing and the final maps. There are many options which allow most of this disk access to be minimized. In some cases the time stream data are simulated at run time using the M3 on-the-fly simulation capabilities, and require only a small amount of input data from disk (usually in the form of maps). The pointing information can take on a wide variety of forms. Noise filters can be input parametrically, and in this case the filters are calculated at run time without reading from disk. Checkpointing is optional, and its frequency can be chosen, but by default it occurs after every 20 iterations. The output of the final maps is a relatively small amount of data, and only one processor writes the final maps to disk.

The detector time stream samples are distributed in a balanced way over the processors, independent of the data distribution mode chosen. Each processor reads or simulates the total volume of the detector time samples divided by the number of processors.
Recall that there are three data distribution modes in MADmap: concatenated, stacked and multi-stacked. The choice of distribution has an impact on the amount of telescope pointing data read, and on the number of noise filters read and stored, but does not impact the volume of detector data required by each processor. If the time stream data are read from disk, then every processor reads

d_t = O( n_t / n_proc )    (33)

samples, which are typically stored in eight byte precision.

The amount of pointing data to be read from disk depends on how the MADmap job was configured. In the minimal case, MADmap is run in the stacked data time stream distribution, and telescope centroid pointing sampled at a slow rate is used and interpolated to detector pointing at run time. The amount of reading required in this case is the number of telescope centroid pointing samples divided by the number of processors. The use of the concatenated data distribution requires an additional factor of the number of focal plane detectors of extra reading for centroid pointing, since different processors are analyzing data from the same time period. The telescope centroid pointing data in the concatenated data distribution are not balanced, and different processors will require the same pointing data. When using telescope centroid pointing it is usually optimal to use the stacked data distribution. In the typical case of reading telescope centroid pointing and using the stacked distribution, MADmap reads

d_ps = O( n_t r_p / (n_d r_d n_proc) )    (34)

elements of the telescope pointing. Each element of telescope pointing is typically between three and six numbers stored in eight byte precision. Using the concatenated distribution requires another factor of n_d on each processor, bringing the number of elements to

d_pc = O( n_t r_p / (r_d n_proc) ) .    (35)

If individual detector pointing is read, the number of elements required is

d_pd = O( n_t / n_proc ) .
(36)

The potential hazard of the stacked data distribution lies in the reading and memory requirement associated with the noise filters. In the concatenated distribution each processor has to load and store only of order one noise filter, requiring

d_nc = m_nc = O(n_c)    (37)

elements read from disk, stored in eight byte precision, by each processor. In the stacked distribution each processor loads and stores a noise filter for every detector, requiring

d_ns = m_ns = O(n_d n_c)    (38)

elements stored and read from disk. To compromise between these noise filter requirements and the telescope centroid pointing requirement, the multi-stacked distribution is advised. An exceptional case for the filter memory requirement occurs if the same noise filter is used for every stationary interval analyzed; in this case MADmap stores just one filter, regardless of the data distribution. This exception is most useful for simulated data.
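To give a concrete feel for the trade-off encoded in Equations 33-38, the following sketch evaluates the scalings for a hypothetical configuration; all numbers are invented for illustration, and the function is ours, not part of MADmap.

```python
def read_volumes(n_t, n_d, n_c, n_proc, r_ratio):
    """Evaluate the read-volume scalings of Eqs. 33-38; r_ratio = r_p / r_d."""
    d_t = n_t / n_proc                      # detector samples per process (33)
    d_ps = n_t * r_ratio / (n_d * n_proc)   # boresight pointing, stacked (34)
    d_pc = n_t * r_ratio / n_proc           # boresight pointing, concatenated (35)
    d_nc = n_c                              # filter elements, concatenated (37)
    d_ns = n_d * n_c                        # filter elements, stacked (38)
    return d_t, d_ps, d_pc, d_nc, d_ns

# e.g. 10^10 samples, 100 detectors, boresight sampled at 1% of detector rate:
d_t, d_ps, d_pc, d_nc, d_ns = read_volumes(1e10, 100, 2**20, 1000, 0.01)
# stacked reads n_d times less pointing than concatenated, but stores n_d
# times more noise filter data: the trade-off multi-stacked compromises on
```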
6. Using MADmap
The MADmap algorithm, its implementation, and the structure of the associated data abstraction (M3) and compressed pointing (GCP) libraries have all been designed to ensure maximal flexibility and broad applicability of the software to CMB data analysis and, more generally, to estimation problems. The range of problems to which the MADmap software can be directly applied is indeed very wide, and significant effort has been undertaken to ensure that the generality in the problem formulation and algorithm choices does not negatively impact the computational efficiency of the software. We illustrate the efficiency issue in the next section by studying two specific applications, and compare with some other similar codes in Section 8. In this section we describe a set of example problems to which MADmap can be directly applied. We include here only the kinds of runs which have actually been performed and whose results have been validated on either simulated or real data.

MADmap's most notable flexibility is its ability to ingest and process an arbitrary pointing matrix. This is true at least on the algorithmic level; the run-time and memory requirements will depend on the sparsity and structure of the pointing matrix, and these requirements will clearly set some constraints on the practical level. Nonetheless, MADmap permits runs which not only produce the sky signal maps, but also extract the unwanted parasitic contributions typical of many experimental data, as long as they can be expressed within the linear response model of Equation 1. The examples given previously demonstrate that those include cases of real practical interest.

(1)
Non-parametric synchronous signals.
In many applications there exists a parasitic signal of an instrumental or environmental origin which can be thought of as a unique function of some parameter. We will refer to these as signals synchronous with the respective parameter. Examples of such signals include the previously mentioned primary-mirror-chop-synchronous parasitic signal observed in the MAXIMA data, as well as ground or atmospheric pick-up usually seen in ground-based experiments (e.g., Kuo et al. 2004). If we denote such a parameter, or set of parameters, by ψ, we can update Equation 1 to read

γ = ν + ζ + ψ = ν + Az + By ,    (39)

where B is a 'pointing' matrix of the parasitic signal, and y is a discretization of its amplitude. In the specific example of the primary-chop-synchronous parasitic signal, y may correspond to an orientation angle of the primary with respect to the gondola frame, and B_{t,p} is non-zero and equal to unity for all the time samples, t, for which the primary orientation fell in the interval [y_p, y_p + ∆y_p] defining the non-overlapping discretization (one-dimensional 'pixelation') of the possible angles, y.

On replacing

z → [z, y]^T ,   A → [A B] ,    (40)

in Equation 1 we see that Equation 1 and Equation 39 are clearly of the same form, and therefore can be solved using the same algorithm as implemented in MADmap, provided that suitable input data properly describing the full pointing matrix of the problem are available.

(2) Parametric synchronous signals.
In some applications a parasitic signal may not be synchronous with any easily identifiable external parameter, but there may exist a linear parametric model which can describe the time dependence of the signal, i.e., ψ_t ≡ ψ(t; {b_i}) and ψ = Cb. We note that the last equality may in fact be only an approximation, e.g., a result of a linearization procedure around the anticipated values of the parasitic signal parameters, even if the latter are inherently non-linear. In those cases, the validity of the approach may need to be evaluated a posteriori.

On adding such a term to Equation 1, we again arrive at an expression similar to Equation 39, which in turn is equivalent to Equation 1 with an appropriately redefined pointing matrix and estimate vector, Equation 40.

We note that such an option is of great current interest because of the parasitic signal due to rotating polarizers, such as, e.g., a half-wave plate (HWP). A signal of this sort is expected to be one of the major systematics of polarized experiments using this kind of technology, and accurately removing it is crucial to achieving the sensitivity levels required for a successful B-mode polarization detection. It has been shown (Johnson et al. 2007) that in cases of practical interest such a signal can be modeled with a couple of dozen parameters, and that a linear model can indeed be sufficient for such a purpose. Specifically, the particular model can be written as

ψ_HWP(t) = Σ_{k=1}^{8} [ (b_k + t b_{k+8}) cos(kω_t) + (b_{k+16} + t b_{k+24}) sin(kω_t) ] ,    (41)

where ω_t denotes a known position angle of the polarizer. Thus for any time t we recover

C_{t,i} = cos(i ω_t) ,           i = 1, ..., 8 ;
C_{t,i} = t cos((i−8) ω_t) ,     i = 9, ..., 16 ;
C_{t,i} = sin((i−16) ω_t) ,      i = 17, ..., 24 ;
C_{t,i} = t sin((i−24) ω_t) ,    i = 25, ..., 32 ;    (42)

where C complements the sky signal pointing matrix as in Equation 40. The map-making solver will then estimate simultaneously both the sky signal (z) and the HWP parasitic parameters (b).

We note that, in the case of the rotating polarimeter, the parasitic signal is indeed synchronous with the polarizer orientation and could also be subtracted using the approach described in case (1) above. Applying the parametric model as described here, however, allows us to reduce the number of extra degrees of freedom introduced to the problem and therefore to achieve higher precision in the recovered final maps of the sky. It also avoids discretization effects, which are particularly important for cases involving rapidly varying systematics. We also point out that the block of the pointing matrix defined in Equation 42 is dense, and therefore poses a particular challenge to the standard way of implementing the pointing and unpointing operators discussed earlier.

(3) 'Destriping' runs. In some particular applications the time-domain noise can be effectively modeled as white noise plus a series of random offsets assigned to some predefined time intervals. This was first discussed in the context of Planck-like experiments (Janssen et al. 1996) and gave rise to the so-called destriping algorithms (Delabrouille 1998; Maino et al. 1999; Keihänen et al. 2004). It has been shown that in the Planck case the destriper achieves a competitive precision at a fraction of the numerical load (Poutanen et al. 2006; Ashdown et al. 2007a; Stompor & White 2004). The approximate time-ordered data model in this case reads

γ = ν + ζ + ω = ν + Az + Bo ,    (43)

where ν is uncorrelated, piecewise-stationary noise, ω_t is an offset added to a sample t, and o_p is an offset common to all samples in a time interval p. B is an offset pointing matrix, essentially defining the time intervals associated with each of the offsets denoted as o_p.
Using both the pointing matrix and the solution (map) vector extended as in Equation 40, MADmap can therefore simultaneously estimate the sky signal and the offset amplitudes, assuming that the chosen time intervals do not lead to a singular system of equations. We can further assume that the noise is not merely stationary, but white, and thus that N is diagonal, which is a basic assumption in the standard destripers (Keihänen et al. 2004). In spite of its more complex pointing matrix, this kind of run will benefit considerably from the fact that no FFT is needed for performing any of the noise kernel convolutions, and this results in a significantly shorter run time when compared to a run incorporating the more complex time stream model. As mentioned before, at least in some specific cases, the speed-up may indeed be achieved at no substantial loss of precision in the final sky maps. We note, however, that though such a MADmap run would indeed be significantly faster than a full MADmap run with a complete noise model, it would not reach the speed typical of custom-made destripers. Nevertheless, the attractive feature is that MADmap provides in a single package both functionalities, as well as the entire spectrum of intermediate options.

(4) Multi-resolution map-making.
As MADmap does not interpret or make any assumptions about the pixel signal and/or pixelation schemes, it straightforwardly permits mixing sky pixels of different resolutions, for instance, enabling a higher resolution for sky areas with particularly high signal-to-noise ratio. It also accepts different resolutions for different Stokes parameters. The latter fact is again particularly interesting in the light of forthcoming polarization experiments based on total power detectors. In such cases the measured signal is a linear combination of all three Stokes parameters, and map-making therefore attempts to unscramble it to estimate all three of them simultaneously. For the CMB, the power contained in the total intensity mode, I, is significantly higher than that in the polarization. This has two consequences. First, the total intensity map can be pixelized with high resolution pixels while retaining the same signal-to-noise on the pixel scale as for the polarization. Moreover, given that the sky signal is assumed to be constant on the pixel scale, any departure from that assumption (e.g., due to residual sub-pixel power) may result in signal leakage between different Stokes parameters. The intensity signal is much higher than the polarization signal, and therefore any sub-pixel power which is a small fraction of the intensity may be substantial when compared with the polarization signal. By over-pixelizing the intensity we can lower the sub-pixel power and mitigate the leakage of this power into the Q and U basis. In this way we reduce the aliased sub-pixel intensity signal present in the estimated Q and U maps. Such an effect has indeed been observed in simulated cases (Ashdown et al. 2007a).
MADmap's ability to accommodate different pixel resolutions for different Stokes parameters allows us to robustly bypass such problems by reducing the size of the I pixels, and with it the amount of power leaked from I to Q and U. We note that all these capabilities are unique compared to other CMB map-making software packages.
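The common thread of cases (1)-(3) above, augmenting the pointing matrix with extra columns and solving for the sky and the parasitic amplitudes jointly, can be illustrated with a toy dense system. This is a sketch only: the tiny sizes, the noiseless data, and the dense least-squares solve (standing in for MADmap's sparse PCG solve) are all assumptions for illustration.

```python
import numpy as np

n_t, n_pix, n_off = 8, 2, 2
# Sky pointing A: one unit entry per row (samples alternate between 2 pixels).
A = np.zeros((n_t, n_pix))
A[np.arange(n_t), np.arange(n_t) % n_pix] = 1.0
# Offset "pointing" B: one constant offset per half of the timestream,
# as in the destriping model of Equation 43.
B = np.zeros((n_t, n_off))
B[:4, 0] = 1.0
B[4:, 1] = 1.0

z_true, o_true = np.array([5.0, -3.0]), np.array([1.0, -2.0])
tod = A @ z_true + B @ o_true            # noiseless data, for clarity

F = np.hstack([A, B])                    # the augmented pointing matrix [A B]
x, *_ = np.linalg.lstsq(F, tod, rcond=None)   # stands in for the PCG solve
# This toy system has an overall sky/offset degeneracy (the singular-system
# caveat noted in the text), so we check only that the data are reproduced:
assert np.allclose(F @ x, tod)
```

The degeneracy here (a constant can be moved between the map and the offsets) is exactly the kind of singular configuration the text warns the chosen time intervals must avoid.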
7. MADmap Map-Making Examples
We now describe the tests that have been performed on MADmap and highlight the functionality provided by the M3 data abstraction layer and the Generalized Compressed Pointing (GCP) library. MADmap has been run on an array of real and simulated data. In this paper we describe two large runs that demonstrate MADmap's capabilities and capacity. Both of these runs are analyses of simulated data, the first of a year-long Planck satellite mission (The Planck Collaboration 2005), and the second of a two-week EBEX balloon-borne measurement (Oxley et al. 2004; Grainger et al. 2008). In the following we start with a short description of the features of each experiment and its data set, and continue with a detailed discussion of MADmap map-making run performance.
Planck is an ESA/NASA satellite which will carry on board two instruments, the Low and High Frequency Instruments (LFI and HFI, respectively), with a total of 74 independent detectors (54 of which are polarization-sensitive) measuring the sky signal in 9 frequency bands. Over a period of fourteen months Planck will observe the full sky twice, at a resolution dependent on the frequency band, ranging from 33′ at 30GHz down to 5′ at 217GHz and above. The detector sampling frequency likewise ranges upward from 32 Hz. A decision on extending the mission to support two further coverings of the sky is pending. The time-ordered data for this analysis were generated only at run time, using the M3 on-the-fly simulation capability. The full Planck focal plane analysis was the largest scale MADmap run to date, which turned 325 Giga-samples into estimates of 150 Mega-pixel amplitudes.
EBEX is a balloon-borne experiment which will collect data during a roughly two-week long circumpolar flight over Antarctica. It will carry on board 1406 detectors observing in 3 different frequency bands: 140, 220 and 410GHz. The resolution at each frequency will be 8′. The sampling frequency is set to 190Hz. During its flight, EBEX will target a small area amounting to roughly 1% of the entire sky. Due to its much larger number of detectors and higher sampling rate, the total length of the EBEX time-ordered data set is comparable to that of the Planck baseline mission. The adopted observation strategy will result in a very deep integration of the probed sky area, reaching up to millions of measurements per beam-size pixel. Consequently, the recovered maps will have high signal-to-noise ratio at the resolution scale. This is driven by the scientific goals of EBEX, which are to detect and characterize the minuscule CMB B-mode polarization signature. The size of the maps produced from the EBEX data will be limited to 10^5 pixels, i.e., half a percent of one of the Planck maps. For the EBEX simulations we use the M3 simulation capability to generate time stream data on the fly when MADmap makes data requests.

The simulations presented here were conducted at the Department of Energy NERSC supercomputing center on the Cray XT4 named Franklin. Franklin has 9660 compute nodes; at the time when the Planck full focal plane simulation was run, in September 2007, these nodes each contained one dual-core 2.6 GHz AMD Opteron processor with 2 gigabytes of memory per core. When the scaling tests for the Planck and EBEX simulations were run in October 2008, the Franklin processors had been upgraded to quad-core AMD Opterons running at 2.3 GHz with two gigabytes of memory per core.
Franklin has a fast SeaStar2 switch interconnect, and runs the Lustre file system to control 350 terabytes of usable disk space.

We performed a sequence of scaling tests on MADmap to show its performance on large data sets and computing at scale. These runs are broken into three cases. The first two are nearly identical simulations of the Planck telescope's 217 GHz channel. For one of these simulations the data are precomputed and read from disk; in the other case the simulation of the time stream is done at run time. The third case is a simulation of the EBEX experiment with data simulated at run time. Through these three simulations we show the performance on high resolution full sky data in the Planck case, and on a small patch, medium resolution, very low pixel noise simulation in the EBEX case. By simulating data at the time of request we demonstrate MADmap's capacity for Monte Carlo simulation analysis with negligible disk use. We also present an analysis of a year of data collected by the entire Planck focal plane as a demonstration of computational capacity in both the pixel and time domains.

In the scaling runs each of the three cases was analyzed at four different processor concurrencies. For the Planck analyses these concurrencies were 2,196, 4,392, 8,784 and 17,568 processor cores, and the EBEX analyses were run on 1,920, 3,840, 7,680, and 15,360 processor cores. In the Planck scaling runs the number of time samples, n_t, was 7 × 10^10 and the number of observed pixels in all three polarization maps, n_z, was 1.5 × 10^8. In the EBEX case n_t was 1 × 10^11 and n_z was 1 × 10^5. The results of these runs are shown in Figure 4. There are some particular features that stand out in the plots. These are strong scaling plots, therefore measured in processor-seconds, and ideally they would be flat, implying total cost invariance with processor concurrency. The cost of the computation is close to this ideal, which implies that the problem is well balanced.
The input/output subsystem is very limited under the strain of large concurrencies of file requests and does not scale well. This is a known problem with contemporary supercomputers run at scale, and highlights the power of being able to simulate data at the time of request in order to avoid the I/O bottleneck. This point is clear when the time spent in I/O in the Planck disk case is compared to that of the Planck on-the-fly simulation case.

Communication cost for all-sky high resolution maps is significant, but in the low pixel count case of EBEX the communication is sub-dominant. The communication cost is essentially linear in the number of observed pixels. The EBEX case has just over 100 thousand pixels observed, and the Planck case has 150 million pixels observed. Over the concurrencies probed, this range of pixel counts spans the space where communication is both significantly smaller and significantly larger than the computation cost. The Planck case scaled up to over 10k processors shows the limitations of the parallelism used for the pixel domain data. Work is ongoing to improve this limitation so that MADmap will scale to hundreds of thousands of processors and hundreds of millions of pixels.

The computation of the pixel look-up actually requires fewer calculations at higher concurrencies, because its complexity is a function of the number of locally observed pixels, which decreases as the time stream is divided more finely. We can also see the effects of memory locality and cache misses in the computation of the pointing. This is highlighted by comparing the Planck runs done with the Healpix nested scheme and the Healpix ring scheme (Górski et al. 2005).
In the nested scheme, pixels nearby on the sky have similar pixel indices, and therefore show better cache performance in the calculation of the action of the pointing matrix.

In the case of the Planck full focal plane run described earlier, which constituted the largest MADmap run performed to date, the analysis was done in the concatenated data distribution in low memory mode on 16,384 cores and completed in 32 minutes. In this run, the expansion of the compressed pointing by the GCP library consumed 19.0% of the run time. The calculation of FFTs with the ACML library consumed 11.9% of the wall clock time. Pixel re-indexing constituted 6.3% of the run time and multiplying by the sparse pointing matrix 4.9%. Outside of these calculations, 24.0% of the run time was spent on communication of signal vectors, 23.1% on reading simulated time streams from disk, and 3.8% on writing maps to disk.

A sample of the results of the MADmap runs analyzing the simulated EBEX and Planck data sets is presented in Figures 5 & 6, respectively.
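The locality argument can be illustrated with the bit-interleaving (z-order) construction that underlies the nested scheme within each Healpix base face; the sketch below shows only the generic z-order idea, not the actual Healpix indexing:

```python
def z_order(x, y, bits=16):
    """Interleave the bits of (x, y) into a single index (Morton / z-order).
    Nearby (x, y) points receive nearby indices, so a scan that moves
    smoothly across the sky touches nearby map memory -- the cache
    friendliness the nested scheme exhibits and the ring scheme lacks."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# The four pixels of a 2x2 block map to four consecutive indices:
block = [z_order(x, y) for y in (0, 1) for x in (0, 1)]
print(block)   # [0, 1, 2, 3]
```

In the ring scheme, by contrast, pixel index increases along rings of constant latitude, so a scan crossing rings strides through map memory and incurs more cache misses.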
Fig. 4.— Results of the three scaling tests (Planck on-the-fly simulation with the Healpix ring scheme, Planck disk with the Healpix nested scheme, and EBEX). Each run is represented by two plots of computational cost (core-seconds) versus number of cores: one measures the overall cost of calculation, communication and I/O; the other breaks down the time spent doing computations between the algorithms that comprise them (GCP, FFT, pointing, pixel look-up and, where applicable, time stream simulation).

Fig. 5.— The maps resulting from the EBEX scaling runs, which recovered the polarized sky signal from 1.×10^10 time samples. The color scale is given in units of µK thermodynamic temperature. Plotted in the top three panels are the I, Q and U polarization components, and below each map is an image of its difference with the input signal map used to generate the simulation.

Fig. 6.— The maps resulting from the Planck scaling runs. The color scale is given in units of µK thermodynamic temperature. Plotted in the top three panels are the I, Q and U polarization components, and below each map is an image of its difference with the input signal map used to generate the simulation. Note that for the Planck 217 GHz channel the Q and U maps are noise dominated when produced at high resolution (Healpix nside parameter of 2048). The data set simulated for this run comprised 7.×10^11 samples divided between 74 detectors and 9 frequency bands.

8. Comparison With Other Codes

Map-making is one of the principal constituents of the data analysis pipeline of any CMB experiment. Consequently, a significant body of work has been devoted to different algorithmic and implementation aspects of the map-making problem (e.g., Wright et al. 1996; Janssen et al. 1996; Tegmark 1997; Delabrouille 1998; Borrill 1999; Oh et al. 1999; Maino et al. 1999; Doré et al. 2001; Stompor et al. 2002; Hinshaw et al. 2003; Keihänen et al. 2004; Stompor & White 2004; Armitage & Wandelt 2004; de Gasperis et al. 2005; Keihänen et al. 2005; Poutanen et al. 2006; Patanchon et al. 2008; Armitage-Caplan & Wandelt 2008; Sutton et al. 2009; Kurki-Suonio et al.
2009), and many numerical map-making codes have been developed over the years and reported in the literature. Few of those have ever been implemented for a massively parallel architecture. Still fewer have been applied to data sets of the size and complexity expected from Planck and the next generation of CMB experiments. Only some of those codes are flexible and general enough to be applicable to a realistic CMB data set as produced by scanning experiments. To the best of our knowledge none allows for fully arbitrary pointing matrices as MADmap does; instead they usually rely on a hard coded, parametrized version appropriate for some specific instrument. Moreover, although some of the existing codes are used by larger groups of researchers, none is readily available in the public domain.

Over the past few years the Planck CTP working group has undertaken a coordinated effort to compare some of the existing map-making codes. The final stage of the comparison involved five codes, including two Planck-specific codes referred to as 'destripers' and three maximum-likelihood codes including MADmap. The aim of these comparisons was first to demonstrate the software's capacity to deal with the requisite volume of data; then the precision of the results and the resource requirements were measured and compared. The results have been published in a series of papers (Ashdown et al. 2007a,b, 2009). Those results validated MADmap, which was shown to produce maps virtually indistinguishable from those of the other maximum-likelihood map-makers, and the produced maps were consistent with the input maps used to generate the time ordered data simulations. The MADmap results were also found to be very similar to those of the destripers, though MADmap was somewhat closer to the input map. This small advantage of the maximum-likelihood codes is understandable and expected, as they are designed to produce minimum variance, optimal maps.
In terms of resources, the superior CPU and memory performance of MADmap with respect to other maximum-likelihood estimators has been duly demonstrated in Ashdown et al. (2009). The destripers, devised as Planck-specific map-makers (but see Sutton et al. 2009), gain on speed and/or memory requirements at the price of their generality, and were therefore capable of delivering a performance comparable to MADmap at a fraction of its resource requirements. We note that the resource comparison presented in Ashdown et al. (2009) was obtained using a data set of a rather modest size, in both the total number of samples, n_t, and the number of pixels, n_z, which included maps of the three polarization Stokes parameters.

In the particular simulation presented in Ashdown et al. (2007b), when run in large memory mode, MADmap was found to require less than half of the CPU resources of the other two maximum-likelihood map-makers in the comparison (44% of the CPU resources required by MapCumba, and 40% of those required by ROMA), while MADmap had a slightly higher memory footprint. MADmap's slightly higher memory usage arises because MADmap stored a different noise filter for each stationary period, where the other codes assumed a single noise filter for the entire data set. When run in low memory mode MADmap's memory footprint was nearly one quarter the size of the other two maximum-likelihood implementations, and the CPU resources required were about two thirds of those required by the other maximum-likelihood solvers (68% of the CPU resources required by MapCumba, and 61% of those required by ROMA). This paper complements the results published in the CTP papers by showing the scalability of the MADmap code to large concurrencies and its performance when facing the most voluminous forthcoming data sets, as well as highlighting additional features not tested in the CTP runs.

The majority of the codes and methods discussed above work efficiently, or typically only, in the pixel domain.
Harmonic space map-makers have also recently been considered (Armitage & Wandelt 2004; Armitage-Caplan & Wandelt 2008), though they have been demonstrated so far only in cases with uncorrelated time-domain noise. The potential advantage of the harmonic approach is that it allows one to correct for a potential beam asymmetry in a more straightforward manner. We note that such a beam correction can, at least to some extent, also be implemented in the pixel domain (Burigana & Sáez 2003), though it would require an appropriately defined pointing matrix. Given that, MADmap can be a tool of choice for performing such a study in the future.
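To make the "appropriately defined pointing matrix" concrete: representing a compact beam in the pixel domain means each row of the pointing matrix carries weights for all pixels inside the beam footprint, rather than a single hit pixel. A toy one-dimensional sketch (the grid, scan pattern and beam weights are invented for illustration; this is not MADmap's GCP interface):

```python
import numpy as np

# Toy 1-D "sky" of npix pixels; each time sample sees a 3-pixel beam
# footprint centered on the pointed-at pixel, with hypothetical weights.
npix, nt = 16, 8
center = np.arange(nt) * 2 % npix
beam = np.array([0.25, 0.5, 0.25])           # invented beam profile, sums to 1

def scan(sky):
    """d = A s with a pointing matrix that has 3 nonzeros per row."""
    d = np.zeros(nt)
    for off, wgt in zip((-1, 0, 1), beam):
        d += wgt * sky[(center + off) % npix]
    return d

sky = np.zeros(npix)
sky[4] = 1.0                                 # a single point source
d = scan(sky)                                # the sample pointed at pixel 4
                                             # picks up beam-weighted signal
```

Because the beam profile lives in the matrix rather than in a parametrized per-instrument formula, detectors with disparate beams can in principle be combined into a single map by giving each its own set of row weights.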
9. Conclusions & Future Work
This paper offers an in-depth description of the MADmap software application, describing the details of the algorithm and its massively parallel implementation. We have presented MADmap's capacity to scale to very large problems and very large processor concurrencies, and discussed the variety of modes in which MADmap can be run, which allow for trade-offs between CPU and memory consumption. One of the main features that sets MADmap apart from other map-makers is the wide flexibility of its data model, which supports any sparse linear model of the signal to be estimated. This highly adaptable signal model is further enabled by the GCP library, which provides a modular plug-in architecture that can be used to compute the sparse linear model from compressed data. The functionality provided by GCP is what enables MADmap's low memory mode.

Recall that, in low memory mode, the sparse pointing matrix is created in a buffered way from cached compressed data through the GCP library. When MADmap is run in the stacked data distribution, the compressed data can be reused for multiple detectors, which reduces memory and sometimes CPU requirements for GCP. GCP's modularity allows for the addition of features which can be used to accomplish a wealth of science goals. One interesting use is calibration against a known signal (like the CMB dipole) as a byproduct of map-making, while consistently accounting for the extra degree of freedom added by this estimation. Another possibility is to do component separation during map-making, producing maps of different astrophysical processes from time streams of data collected at different observing frequency bands. This can be accomplished if the linear scaling terms for the projection from frequency band to astrophysical component are known a priori, and a GCP module is added to produce a sparse pointing matrix that corresponds to this projection.
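As a concrete picture of such a sparse linear signal model, consider the standard polarized case: each time sample responds to the I, Q and U values of one pixel with weights (1, cos 2ψ, sin 2ψ) for polarizer angle ψ, so the pointing matrix has three nonzeros per row. A minimal sketch with toy sizes and angles (an illustration of the kind of model MADmap supports, not its actual interface):

```python
import numpy as np

# Toy polarized data model d = A s: each sample reads one pixel's (I, Q, U)
# with weights (1, cos 2psi, sin 2psi), i.e. three nonzeros per row of the
# pointing matrix A. Sizes and angles below are invented for illustration.
npix, nt = 4, 12
pix = np.arange(nt) % npix                      # which pixel each sample hits
psi = np.linspace(0.0, np.pi, nt, endpoint=False)
w = np.stack([np.ones(nt), np.cos(2 * psi), np.sin(2 * psi)], axis=1)

def A(s):                 # s: (npix, 3) map of I, Q, U  ->  time stream (nt,)
    return (w * s[pix]).sum(axis=1)

def AT(d):                # transpose: accumulate samples back into the map
    z = np.zeros((npix, 3))
    np.add.at(z, pix, w * d[:, None])
    return z

# Adjoint check: <A s, d> must equal <s, A^T d> for any s and d.
rng = np.random.default_rng(0)
s, d = rng.standard_normal((npix, 3)), rng.standard_normal(nt)
assert np.isclose(A(s) @ d, (s * AT(d)).sum())
```

Calibration or component separation amounts to appending further columns (or per-band scaling weights) to the same sparse structure, which is why the solver itself needs no changes to support them.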
A capability which currently exists is the removal of systematic effects corresponding to a linear scaling of a measured template. These systematic effects range from the temperature of the cold stage of the cryogenic electronics, to a measured template for thermal pick-up from the ground as a function of a ground-based telescope's orientation, to any other a priori known signal to be marginalized from the output maps. GCP also promises the ability to deconvolve compact beam effects during map-making by representing some portion of the beam profile in the pixel domain in the sparse pointing matrix. This can be used to account for asymmetries in the beam, to combine detectors with disparate beam profiles into a single map (usually required for component separation during map-making), or even to deconvolve in the pixel domain the effects of a bolometric detector's time-domain response curve.

MADmap is capable of solving massive problems quickly and efficiently given appropriate computational resources, and such large runs have been presented in this paper. MADmap is also a useful and efficient tool on much more modest computing resources, and the scale of the runs presented should not misinform the reader about MADmap's minimum resource requirements. MADmap can be quite computationally efficient on small cluster sized resources, and when run in low memory mode it can reduce large time ordered data sets in core memory. The limiting factor in this case is run time, but at low concurrencies, and when using a reasonable communication fabric, MADmap is very efficient. In addition, on these small clusters a limited user pool often allows for the execution of jobs with very long wall clock times, and in these situations MADmap can tackle non-trivial problems with a modest computer. MADmap implements efficient checkpointing, which allows for the exploitation of long-running jobs without fear of wasting resources in the case of a system failure during execution.

The massively parallel computing community is on a road map to push its systems to the exascale, which is 10^18 calculations per second, by about 2018. The primary impediment to scaling MADmap up to the exascale and beyond is the collective communication calls used to reduce maps. It is the subject of ongoing research to optimize our collective communication to take best advantage of the character of our data distribution, its sparsity and the communication fabric topology.

MADmap does not currently have a facility for modeling correlations in the noise that may exist between different data sets which are to be processed simultaneously, such as measurements coming from different detectors of the same experiment. Such correlations may be due to electronic cross talk, thermal drifts of a cryogenic focal plane, or, in the case of ground based experiments, atmospheric noise that is seen by the whole focal plane simultaneously. These correlations can be very important in some collected data sets, and incorporating this functionality is a focus of ongoing work.

In future publications we will present more applications of MADmap, in particular those demonstrating the GCP library functionality on real and simulated data. A package containing MADmap and all its dependencies is available for download at the LBNL CodeForge website. MADmap, GCP and M3 are free software distributed under the GNU public license, with the hope that MADmap will become a software package of choice in the CMB community for efficient computation of microwave sky maps.

We are grateful to NASA (through the Planck Project) and DoE for funding and facilities. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S.
Department of Energy under Contract No. DE-AC02-05CH11231. Thanks to Charles Lawrence for his help and guidance in this project and the submission of this paper. We acknowledge the use of simulations of telescope pointing provided by the Planck and EBEX collaborations. In particular, the simulation of the Planck mission was done by the simmission (Reinecke et al. 2006) application, primarily written by Daniel Mortlock and managed by Martin Reinecke. The sky scanned in the Planck simulation was provided by the Planck collaboration and is the model derived by the Planck foregrounds working group. The EBEX telescope pointing simulation was provided by Samuel Leach, and the sky simulation was provided by Nicolas Ponthieu. The parametrization of the noise properties of EBEX was provided by Shaul Hanany, and the simulation was generally a product of the EBEX data analysis working group.

LBNL CodeForge: https://codeforge.lbl.gov/projects/cmb/files

A. Proof that the critical FFT length minimizes the complexity of the convolution algorithm
In Section (5.2) we need to show that h''(n_f) > 0 so as to ensure that our critical point minimizes the complexity. Recall that h''(n_f) is defined as

h''(n_f) = 2 n_f ln(n_f) / (n_f - 2 n_c)^3 - 2 (1 + ln(n_f)) / (n_f - 2 n_c)^2 + 1 / (n_f (n_f - 2 n_c))

Let us assume that h''(n_f) <= 0. Then

2 n_f ln(n_f) / (n_f - 2 n_c)^3 + 1 / (n_f (n_f - 2 n_c)) <= 2 (1 + ln(n_f)) / (n_f - 2 n_c)^2

Multiplying through by n_f (n_f - 2 n_c)^3 > 0,

2 n_f^2 ln(n_f) + (n_f - 2 n_c)^2 <= 2 n_f (n_f - 2 n_c) (1 + ln(n_f))
(n_f - 2 n_c)^2 <= 2 n_f (n_f - 2 n_c) (1 + ln(n_f)) - 2 n_f^2 ln(n_f)
4 n_c^2 + 4 n_c n_f ln(n_f) <= n_f^2

Substituting n_c in terms of the critical n_f, n_c = n_f / (2 (1 + ln(n_f))),

4 (n_f / (2 (1 + ln(n_f))))^2 + 4 (n_f / (2 (1 + ln(n_f)))) n_f ln(n_f) <= n_f^2
1 + 2 (1 + ln(n_f)) ln(n_f) <= (1 + ln(n_f))^2
2 ln^2(n_f) + 2 ln(n_f) <= ln^2(n_f) + 2 ln(n_f)
ln^2(n_f) <= 0
n_f <= 1

But n_f > 2 n_c >= 2 since n_c is non-zero, and we have derived a contradiction; therefore, h'(n_f) = 0 implies h''(n_f) > 0.

B. Table of variables used in text

c_h : The aggregate number of operations required for high memory mode
c_l : The aggregate number of operations required for low memory mode
c_e : The aggregate number of operations required for extremely low memory mode
m_h : The aggregate distributed memory requirement for high memory mode
m_l : The aggregate distributed memory requirement for low memory mode
m_e : The aggregate distributed memory requirement for extremely low memory mode
m_ns : The per processor memory requirement for the noise filters in the stacked distribution mode
m_nc : The per processor memory requirement for the noise filters in the concatenated distribution mode
d_t : The per processor disk requirement for reading the time stream of data
d_ps : The per processor disk requirement for reading compressed pointing in the stacked distribution mode
d_pc : The per processor disk requirement for reading compressed pointing in the concatenated distribution mode
d_pd : The per processor disk requirement for reading detector pointing without compressed pointing
d_ns : The per processor disk requirement for reading noise filters in the stacked distribution
d_nc : The per processor disk requirement for reading noise filters in the concatenated distribution
n_z : The number of pixels in the map
n_d : The number of detectors
r_d : The detector sample rate
Δt : The integral time over which the telescope observed data
n_t : The total number of samples measured by all detectors
n_i : The number of iterations required to solve the PCG
n_c : The correlation length of the noise
n_b : The number of stationary periods
r_p : The telescope telemetry data sample rate
n_nz : The number of non-zero elements per row of the pointing matrix
n_proc :
The number of processors used in a run.

Table 2: Some notes on the variables: The number of pixels in the map, n_z, is more generally the number of free parameters in the signal model, which could include signals in addition to the pixels in the map and may include multiple maps. Note that for detectors sampled simultaneously and at the same rate, n_t = n_d r_d Δt. The number of iterations required to solve the PCG, n_i, is proportional to the condition number of the pixel-pixel noise correlation matrix after preconditioning. The correlation length of the noise, n_c, is also the bandwidth of the diagonal blocks of the inverse time-time noise correlation matrix. The number of stationary periods, n_b, corresponds to the number of diagonal blocks in the inverse time-time noise correlation matrix. The telescope telemetry data sample rate, r_p, is more generally the minimum sample rate required to accurately reconstruct the pointing matrix A.

REFERENCES
Armitage, C., & Wandelt, B. D. 2004, Phys. Rev. D, 70, 123007
Armitage-Caplan, C., & Wandelt, B. D. 2008, ArXiv e-prints
Ashdown, M. A. J., et al. 2007a, A&A, 471, 361
—. 2007b, A&A, 467, 761
—. 2009, A&A, 493, 753
Barrett, R., et al. 1994, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition (Philadelphia, PA: SIAM)
Bock, J., et al. 2006, ArXiv Astrophysics e-prints
Borrill, J. 1999, ArXiv Astrophysics e-prints
Brakhage, H. 1996, SIAM Review, 38, 682
Burigana, C., & Sáez, D. 2003, A&A, 409, 423
Cooley, J. W., & Tukey, J. W. 1965, Mathematics of Computation, 19, 297
de Bernardis, P., et al. 2000, Nature, 404, 955
de Gasperis, G., Balbi, A., Cabella, P., Natoli, P., & Vittorio, N. 2005, A&A, 436, 1159
Delabrouille, J. 1998, A&AS, 127, 555
Dodelson, S. 2003, Modern Cosmology (San Diego, CA: Academic Press)
Doré, O., Teyssier, R., Bouchet, F. R., Vibert, D., & Prunet, S. 2001, A&A, 374, 358
Frigo, M., & Johnson, S. G. 2005, Proceedings of the IEEE, 93, 216, special issue on "Program Generation, Optimization, and Platform Adaptation"
Golub, G. H., & van Loan, C. F. 1996, Matrix Computations (Baltimore, MD: Johns Hopkins University Press)
Górski, K. M., Hivon, E., Banday, A. J., Wandelt, B. D., Hansen, F. K., Reinecke, M., & Bartelmann, M. 2005, ApJ, 622, 759
Grainger, W., et al. 2008, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 7020
Hanany, S., et al. 2000, ApJ, 545, L5
Hinshaw, G., et al. 2003, ApJS, 148, 135
Janssen, M. A., et al. 1996, ArXiv Astrophysics e-prints
Jewell, J., Levin, S., & Anderson, C. H. 2004, ApJ, 609, 1
Johnson, B. R., et al. 2007, ApJ, 665, 42
Keihänen, E., Kurki-Suonio, H., & Poutanen, T. 2005, MNRAS, 360, 390
Keihänen, E., Kurki-Suonio, H., Poutanen, T., Maino, D., & Burigana, C. 2004, A&A, 428, 287
Kuo, C. L., et al. 2004, ApJ, 600, 32
Kurki-Suonio, H., Keihanen, E., Keskitalo, R., Poutanen, T., Sirvio, A., Maino, D., & Burigana, C. 2009, ArXiv e-prints
Maino, D., et al. 1999, A&AS, 140, 383
Oh, S. P., Spergel, D. N., & Hinshaw, G. 1999, ApJ, 510, 551
Oxley, P., et al. 2004, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 5543, ed. M. Strojnik, 320–331
Patanchon, G., et al. 2008, ApJ, 681, 708
Poutanen, T., et al. 2006, A&A, 449, 1311
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. 1992, Numerical Recipes in C, 2nd edn. (Cambridge, UK: Cambridge University Press)
Reinecke, M., Dolag, K., Hell, R., Bartelmann, M., & Enßlin, T. A. 2006, A&A, 445, 373
Shewchuk, J. R. 1994, published privately on the web
Stompor, R., et al. 2002, Phys. Rev. D, 65, 022003
Stompor, R., Leach, S., Stivoli, F., & Baccigalupi, C. 2009, MNRAS, 392, 216
Stompor, R., & White, M. 2004, A&A, 419, 783
Sutton, D., Johnson, B. R., Brown, M. L., Cabella, P., Ferreira, P. G., & Smith, K. M. 2009, MNRAS, 101
Tegmark, M. 1997, Phys. Rev. D, 56, 4514
The Planck Collaboration. 2005, ESA-SCI, ESA publications, 1
Wandelt, B. D., Larson, D. L., & Lakshminarayanan, A. 2004, Phys. Rev. D, 70, 083511
Waskett, T. J., Sibthorpe, B., Griffin, M. J., & Chanial, P. F. 2007, MNRAS, 381, 1583
Wright, E. L., Hinshaw, G., & Bennett, C. L. 1996, ApJ, 458, L53