MNRAS 000, 1–10 (2021)    Preprint 28 January 2021    Compiled using MNRAS LaTeX style file v3.0
A Generative Model of Galactic Dust Emission Using Variational Inference
Ben Thorne,★ Lloyd Knox, and Karthik Prabhu
Department of Physics, University of California, One Shields Avenue, Davis, CA 95616, USA
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Emission from the interstellar medium can be a significant contaminant of measurements of the intensity and polarization of the cosmic microwave background (CMB). For planning CMB observations, and for optimizing foreground-cleaning algorithms, a description of the statistical properties of such emission can be helpful. Here we examine a machine learning approach to inferring the statistical properties of dust from either observational data or physics-based simulations. In particular, we apply a type of neural network called a Variational Auto Encoder (VAE) to maps of the intensity of emission from interstellar dust as inferred from Planck sky maps and demonstrate its ability to a) simulate new samples with similar summary statistics as the training set, b) provide fits to emission maps withheld from the training set, and c) produce constrained realizations. We find VAEs are easier to train than another popular architecture, that of Generative Adversarial Networks (GANs), and are better-suited for use in Bayesian inference.
Key words: cosmology: cosmic microwave background – ISM: general – methods: statistical
Among the many research enterprises stimulated by the detection of large-scale anisotropies in the cosmic microwave background (CMB) by the COsmic Background Explorer (COBE) with its Differential Microwave Radiometer (Smoot et al. 1992) is the hunt for signatures of primordial gravitational waves (PGW). To date, only upper limits have been set, most commonly expressed as limits on the ratio of primordial tensor perturbation power to scalar perturbation power, 𝑟. Soon after the COBE detection it was realized that reliably detecting the low levels of 𝑟 now being targeted would require careful characterization of, and correction for, polarized galactic foreground emission.
The current 95% confidence upper limit is 𝑟 < 0.06 (BICEP2 Collaboration et al. 2018). So-called Stage III CMB experiments, such as the Simons Observatory (Ade et al. 2019) and BICEP Array (Hui et al. 2018) combined with SPT-3G (Benson et al. 2014), are designed to have sufficient sensitivity and systematic error control to tighten the 95% confidence upper limits by a factor of about 20. The Stage IV experiments LiteBIRD and CMB-S4 are targeting upper limits factors of 2 and 5 times more stringent still, respectively. Thus we are rapidly moving into a regime where the foreground contamination is up to two orders of magnitude larger than the signal of interest (this is for fluctuation power; the rms level of contamination in the map is up to one order of magnitude larger than the signal of interest).

★ E-mail: [email protected]
The most exciting possibility is that there will be a detection of PGW, as opposed to improved upper limits. A detection claim would essentially be a claim that there is power remaining in the map that cannot be explained as a residual instrumental systematic or residual foreground emission. Detection, therefore, requires not only foreground cleaning, but the capability to quantify the probability distribution of residual foreground power. Such capability is hampered by our lack of prior knowledge of the probability distribution of the non-Gaussian and non-isotropic galactic foreground emission. The state of the art in analysis of such observations either implicitly or explicitly has the galactic emission, or their residuals, modeled as Gaussian isotropic fields (Planck Collaboration et al. 2020; Aiola et al. 2020; BICEP2 Collaboration et al. 2018). They are modeled as such not because they are, but strictly for convenience.

At the very least, we need sufficient simulations of galactic emission to test such algorithms for bias. A more ambitious objective is to abandon assumptions of Gaussianity and isotropy altogether, and perform a complete Bayesian analysis with incorporation of an appropriate prior for the spatial distribution of interstellar emission. Groundbreaking progress toward such a Bayesian analysis has been made recently, with the development of analysis methodologies by Millea et al. (2020a), and the recent application to real data (Millea et al. 2020b).

The analysis framework in Millea et al. (2020a) was developed for “de-lensing” of the CMB; i.e., taking into account the impact of gravitational lensing on the statistical properties of CMB polarization. Although it has not been applied to multi-frequency data, or used for foreground cleaning, at a conceptual level the framework can be straightforwardly extended to analysis of foreground-contaminated multi-frequency data. Although this extension could be implemented with isotropic Gaussian priors for foreground emission, it also presents the opportunity to incorporate more realistic priors – priors that more accurately reflect what we know about such emission from other data, or from physics-based simulations.

We are thus interested both in creating simulated maps of galactic emission with the appropriate statistical properties for testing analysis algorithms to be used on real data, and also in learning, from other data and perhaps physical modeling (e.g. MHD simulations of the interstellar medium; Kim et al. 2019), the statistical properties of maps of galactic emission for use in Bayesian inference engines. Here we report on progress toward accomplishing both of these tasks with the use of neural networks. Aylor et al. (2019) studied the use of generative adversarial networks (GANs) for learning how to simulate new emission maps with statistical properties similar to those from a training set, whilst Krachmalnicoff & Puglisi (2020) trained a network to simulate non-Gaussian small-scale polarized dust emission. Here we present a similar study, this time using a different neural network architecture and training program, that of variational auto encoders (VAEs).

VAEs and GANs are examples of deep generative models. These models have had recent success in accurately modeling complicated, high-dimensional datasets, and generating realistic novel samples (Razavi et al. 2019; van den Oord et al. 2016b; Brock et al. 2018).
Generative models can be divided into two main categories: likelihood-based models, which seek to optimize the log likelihood of the data and include the VAE (Kingma & Welling 2013; Jimenez Rezende et al. 2014), flow-based methods (Dinh et al. 2014, 2016; Jimenez Rezende & Mohamed 2015; Kingma & Dhariwal 2018), and autoregressive models (van den Oord et al. 2016a); and implicit models, such as GANs (Goodfellow et al. 2014), which train a generator and discriminator in an adversarial game scenario. There are many trade-offs to consider when selecting a likelihood-based approach (Kingma & Dhariwal 2018), but here we choose to explore the use of VAEs due to their simplicity and computational scalability to higher resolution datasets.

We find some advantages of VAEs over GANs. The adversarial training process does not produce an explicit inference model, and it is hard to consistently compare model performance against some test set. Furthermore, it is also a common problem that samples from GANs do not represent the full diversity of the underlying distribution (Grover et al. 2017). In contrast, VAEs optimize the log likelihood of the data. This means both that it is possible to directly compare models, and that trained models should support the entire dataset, which is crucial when applying a trained model to real data. VAEs also tend to be easier to train, in that training success is more stable to variation of hyperparameters. As a downside, VAEs are well known for loss of resolution. We see this in our results and discuss adaptations one could make to avoid this degradation of angular resolution.

Although our work is motivated by the PGW-driven desire to understand the statistical properties of polarized foreground emission, in this paper, as was the case in Aylor et al. (2019), we restrict ourselves to intensity. Observations of polarized dust emission with high signal-to-noise over a large fraction of sky do not currently exist, which precludes the training of similar models on real data. However, in ongoing work, we are exploring the use of magnetohydrodynamical (MHD) simulations (Kim et al. 2019) to train generative models of polarized emission. In this scenario a trained model would provide a ‘compression’ of the information available in MHD simulations into a single statistical model, which could then be used either in inference, or to augment real low-resolution observations with physically-motivated small-scale realizations.

The rest of this paper is structured as follows. In Section 2 we introduce variational autoencoders, and the objective for their optimization. We then describe the network architecture we used, the training dataset we produced to train the network, and how hyperparameter values were set. In Section 3 we present the results of applying the trained VAE to test set images. Finally, in Section 4 we summarize our findings and discuss areas of current and future work.
In this Section we will introduce the idea of variational autoencoders, the specific model we implement, and the details of how we train that model.

Our goal here is to take a set of images of thermal emission from interstellar dust, x^(i) = (x_1^(i), ..., x_N^(i)) ∈ ℝ^N, and infer from them an underlying distribution, 𝑝(x), from which they could have been drawn, using the techniques of generative modeling. Variational autoencoders are a type of generative machine learning model which provide a framework by which we may infer the parameters of a joint distribution over our original data and some latent variables, z, representing the unobserved part of the model. We can factorize the joint distribution of the data and latent variables into two terms, representing the generative process of the data and the latent space responsible for the variance in the observed data:

\[
p(\mathbf{x}, \mathbf{z}) = \underbrace{p(\mathbf{x} \mid \mathbf{z})}_{\rm Generative}\;\underbrace{p(\mathbf{z})}_{\rm Variance}. \tag{1}
\]

The VAE approach is to model the conditional distribution with an appropriate family of functions with some unknown weights, 𝜃: 𝑝_𝜃(x | z) ≈ 𝑝(x | z). This conditional model encodes the generative process by which x depends on the latent set of variables z. The choice of 𝑝(z) can then be a simple, perhaps Gaussian, prior probability distribution, which encodes the dataset variation in a simple latent space. This can be seen as a type of regularization by which we separate out different sources of variation within the dataset, a process that is quite natural for physical processes, and often makes the resulting model interpretable.

The goal of training is thus to find a transformation that delivers an acceptable approximation 𝑝_𝜃(x) ≈ 𝑝(x), that is optimal (in some sense), given the training set data. Toward that end we consider the parametrized joint distribution of x and z:

\[
p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z}), \tag{2}
\]

which leads to our object of interest via marginalization over z:

\[
p_\theta(\mathbf{x}) = \int d\mathbf{z}\; p_\theta(\mathbf{x}, \mathbf{z}). \tag{3}
\]

Our tasks are thus to choose a parameterization – this is referred to as a choice of architecture – and then find a means of optimizing these parameters 𝜃 with respect to a chosen objective, via a process referred to as training.

In principle we could determine 𝜃 by maximizing the training set’s joint likelihood Π_i 𝑝_𝜃(x^(i)). In practice, however, this would involve evaluating the integral in Equation 3 for each datapoint individually, which is intractable for even moderately high-dimensional latent spaces. The VAE framework provides an objective function that bounds the maximum likelihood value and is computationally tractable.

Let a dataset D be made up of samples x^(i) = (x_1^(i), ..., x_N^(i)) ∈ ℝ^N, which we will assume to be independent and identically distributed samples from some true underlying distribution 𝑝_D(x). Absent an analytical model for 𝑝_D(x), we can instead take it to be a member of an expressive family of functions parametrized by 𝜃: 𝑝_D(x) = 𝑝_𝜃(x). This can be done by introducing an unobserved set of latent variables, z = (z_1, ..., z_d) ∈ ℝ^d, and considering the joint distribution 𝑝(x, z).
This joint distribution is specified by: the prior over the latent space, 𝑝(z), which is assumed to be some simple distribution (typically Gaussian); and the conditional distribution 𝑝(x | z), which is intended to represent most of the complexity in the true underlying distribution 𝑝_D(x). We model this distribution as a neural network with weights 𝜃: 𝑝_𝜃(x | z). The marginal likelihood is then:

\[
p_\theta(\mathbf{x}) = \int d\mathbf{z}\; p(\mathbf{z})\, p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathbb{E}_{p(\mathbf{z})}\left[ p_\theta(\mathbf{x} \mid \mathbf{z}) \right], \tag{4}
\]

where we have introduced the notation E_Y[h(y)] to indicate the expectation of the function h(y) with respect to the distribution y ∼ Y. In principle, we could determine the conditional model by fixing 𝜃 to a value that maximizes the marginal likelihood. In practice, however, the integral in Equation 4 is intractable, due to the dimensionality of the latent space, and in any case would require a per-datapoint optimization process. As a result, the posterior 𝑝_𝜃(z | x) = 𝑝_𝜃(z, x)/𝑝_𝜃(x) is also intractable.

We make progress by introducing a second approximation, this time to the posterior: 𝑞_𝜙(z | x) ≈ 𝑝_𝜃(z | x), where 𝑞_𝜙(z | x) is often referred to as an inference network. For any choice of 𝑞_𝜙(z | x), including any choice of its weights 𝜙, we can write the log likelihood of the data as:

\[
\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log p_\theta(\mathbf{x}) \right]. \tag{5}
\]

Applying the chain rule of probability, 𝑝_𝜃(x, z) = 𝑝_𝜃(z) 𝑝_𝜃(x | z), and inserting an identity, this can be split into two terms:

\[
\log p_\theta(\mathbf{x}) = \mathcal{L}_{\theta,\phi}(\mathbf{x}) + D_{\rm KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right), \tag{6}
\]

where L_{𝜃,𝜙} is referred to as the evidence lower bound (ELBO):

\[
\mathcal{L}_{\theta,\phi}(\mathbf{x}) \equiv \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right], \tag{7}
\]

and the second term is the Kullback-Leibler (KL) divergence:

\[
D_{\rm KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})} \right], \tag{8}
\]

which is a measure of the ‘distance’ between two distributions, and is always positive.
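For completeness, the split in Equation 6 follows from writing log 𝑝_𝜃(x) = log[𝑝_𝜃(x, z)/𝑝_𝜃(z | x)], which holds for any z, and inserting the identity 𝑞_𝜙(z | x)/𝑞_𝜙(z | x) inside the expectation of Equation 5:

\[
\log p_\theta(\mathbf{x})
= \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\,\frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})} \right]
= \mathcal{L}_{\theta,\phi}(\mathbf{x}) + D_{\rm KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right).
\]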
From Equation 6 we see that the bound L_{𝜃,𝜙}(x) will become tightest when D_KL(𝑞_𝜙(z | x) || 𝑝_𝜃(z | x)) → 0, such that our approximation to the posterior, 𝑞_𝜙(z | x) ≈ 𝑝_𝜃(z | x), becomes exact. However, due to the presence of the 𝑝_𝜃(z | x) term, D_KL(𝑞_𝜙(z | x) || 𝑝_𝜃(z | x)) cannot be evaluated directly, and so we are not able to directly optimize the likelihood in Equation 6. Instead, we seek to maximize the evidence lower bound, thereby achieving an ‘optimum’ set of weights 𝜃, 𝜙.

The evidence lower bound and its gradient with respect to 𝜃 can be computed straightforwardly. The gradients with respect to 𝜙 appear more problematic, since the expectation we are calculating is taken over a distribution parametrized by 𝜙. The typical Monte Carlo estimates of this expectation, and its derivatives, are unbiased, but tend to have a high variance, often making the training process unstable. Through a reparametrization presented in Kingma & Welling (2013), it is possible to rewrite this expectation such that the source of randomness is not dependent on 𝜙, and gradients with respect to 𝜙 may be calculated with standard Monte Carlo techniques. We are therefore able to optimize L_{𝜃,𝜙}(x) by stochastic gradient descent, and approximately optimize the marginal log likelihood.

In this section we describe the architecture of the networks 𝑝_𝜃(x | z) and 𝑞_𝜙(z | x), and the latent prior 𝑝(z). We adopt a convolutional architecture for both the encoder and decoder network. We choose to use a d-dimensional latent space, with a multivariate normal prior, z ∼ N(0, 𝟙_{d×d}).

The encoder maps input images x ∈ ℝ^{256×256} to latent space distribution parameters [𝝁, 𝝈], each in ℝ^d. It is worth emphasizing the point that, since we are modelling the distribution 𝑝(z | x), the output of the encoder is not a single point in the latent parameter space, but rather a distribution, parametrized by the mean and variance [𝝁, 𝝈]. The mapping from image to latent space parameters requires both a dimensionality reduction and a reshaping. We achieve these goals by using a convolutional neural network. In the following we will describe the precise network that we implemented, using the language of neural networks. For details on the motivation for these choices, and their technical meaning, we refer to introductory texts on machine learning and convolutional neural networks such as Goodfellow et al. (2016).

The encoder reduces the dimension of the input image by applying a series of strided convolutions with a rectified linear unit activation function, and then flattens the image for input to a final dense layer connected to the output latent space distribution parameters. Each convolution is characterized by a kernel shape with a number of pixels, k_i, where i indicates the layer, and a stride length, which we set to 2. The values k_i are set during the hyperparameter optimization stage described in Section 2.3.3. We apply a batch normalization with momentum parameter equal to 0.9 after each convolution. This regularizes the weights and leads to more stable training. A summary of the encoder model is given in Table 1.

The decoder is essentially the reverse process to the encoder, mapping a latent vector z ∈ ℝ^d to an image x ∈ ℝ^{256×256}. We denote the decoder g, with weights 𝜃, as g_𝜃 : z → x. The primary difference to the structure of the encoder is that we use transpose convolutions, as opposed to convolutions, in order to increase the size of each dimension. A summary of the decoder model is given in Table 2.
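To make the architecture and the reparametrized sampling step concrete, the following Python/TensorFlow sketch builds an encoder and decoder of the same general form as Tables 1 and 2 and draws z = 𝝁 + 𝝈 ⊙ 𝜀 with 𝜀 ∼ N(0, 𝟙). It is an illustration only: the kernel sizes, latent dimension and other details are placeholder values, not the ones selected by the hyperparameter search of Section 2.3.3.

import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 256   # placeholder latent dimension d, not the tuned value
IMG_SIZE = 256     # cutouts are 256 x 256 pixels

def build_encoder(latent_dim=LATENT_DIM):
    """Map an image to the parameters (mu, log sigma^2) of q_phi(z|x)."""
    x_in = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1))
    x = x_in
    for filters in (256, 128, 64):          # strided convolutions halve the map size
        x = layers.Conv2D(filters, kernel_size=4, strides=2, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    mu = layers.Dense(latent_dim)(x)
    log_var = layers.Dense(latent_dim)(x)
    return tf.keras.Model(x_in, [mu, log_var], name="encoder")

def build_decoder(latent_dim=LATENT_DIM):
    """Map a latent vector z to the mean of p_theta(x|z)."""
    z_in = layers.Input(shape=(latent_dim,))
    x = layers.Dense(16 * 16 * 32, activation="relu")(z_in)
    x = layers.Reshape((16, 16, 32))(x)
    for filters in (128, 64, 32, 16):       # transpose convolutions double the map size
        x = layers.Conv2DTranspose(filters, kernel_size=4, strides=2, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization(momentum=0.9)(x)
    x_out = layers.Conv2DTranspose(1, kernel_size=4, strides=1, padding="same")(x)
    return tf.keras.Model(z_in, x_out, name="decoder")

def sample_z(mu, log_var):
    """Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

Parametrizing the encoder output as log 𝝈² rather than 𝝈 keeps the standard deviation positive without constraining the dense layer; this is a common convenience rather than a choice documented in the paper.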
In this section we detail the process by which we optimize the weights of the VAE model described in Section 2.2 with respect to the ELBO objective introduced in Section 2.1. The training process requires us to specify the training dataset, D, the training strategy by which we make updates to the weights 𝜃, 𝜙, and the process of hyperparameter optimization by which we make concrete selections of meta-parameters of the model (such as kernel shapes and training parameters).
Layer        Layer Output Shape   Hyperparameters
Input        (256, 256, 1)
Conv2D       (128, 128, 256)      stride=2
ReLU         (128, 128, 256)
BatchNorm    (128, 128, 256)      momentum=0.9
Conv2D       (64, 64, 128)        stride=2
ReLU         (64, 64, 128)
BatchNorm    (64, 64, 128)        momentum=0.9
Conv2D       (32, 32, 64)         stride=2
ReLU         (32, 32, 64)
BatchNorm    (32, 32, 64)         momentum=0.9
Dense        (1024)
Dense        (512)
Table 1.
This table shows the structure of the encoder network, 𝑞_𝜙(z | x).

Layer             Layer Output Shape   Hyperparameters
Input             (256, 1)
Dense             (8192)
Reshape           (16, 16, 32)
BatchNorm         (16, 16, 32)         momentum=0.9
TransposeConv2D   (32, 32, 128)        stride=2
ReLU              (32, 32, 128)
BatchNorm         (32, 32, 128)        momentum=0.9
TransposeConv2D   (64, 64, 64)         stride=2
ReLU              (64, 64, 64)
BatchNorm         (64, 64, 64)         momentum=0.9
TransposeConv2D   (128, 128, 32)       stride=2
ReLU              (128, 128, 32)
BatchNorm         (128, 128, 32)       momentum=0.9
TransposeConv2D   (256, 256, 16)       stride=2
ReLU              (256, 256, 16)
BatchNorm         (256, 256, 16)       momentum=0.9
TransposeConv2D   (256, 256, 1)        stride=1

Table 2.
This table shows the structure of the decoder network, 𝑝_𝜃(x | z).

Machine learning techniques are notoriously data-hungry, and will perform best for larger datasets. Standard computer vision datasets on which algorithms are tested (e.g. ImageNet; Russakovsky et al. 2015) contain tens of thousands, sometimes millions, of images. However, we have only one sky from which to obtain observations of Galactic dust. As such, we are forced to partition the sky into patches, which we treat as separate images in the training process. In order to obtain a sufficiently large dataset we use small cutouts, ∼8° on a side. Such a small patch size has the advantage that we are then justified in projecting the cutouts onto the flat sky, and applying standard machine learning techniques to the resulting two-dimensional images, sidestepping the issue of defining neural networks that operate on spherical images (for such implementations see Perraudin et al. (2019); Krachmalnicoff & Tomasi (2019)).

We use the Planck GNILC-separated thermal dust intensity map at 545 GHz, which we download from the Planck Legacy Archive (http://pla.esac.esa.int/pla/aio/product-action?MAP.MAP_ID=COM_CompMap_Dust-GNILC-F545_2048_R2.00.fits). In order to extract cutout images from this map we follow a similar procedure to Aylor et al. (2019). We mask the Galactic plane by excluding all regions at latitudes below 15°. Then we lay down a set of centroids (l_{i+1}, b_{i+1}) = (l_i + s, b_i + s/cos(l_i)), where s is a step size parameter, and s/cos(l_i) is a step between longitudes for a given latitude, which ensures the same angular separation in the latitudinal direction. Each centroid is then rotated to the equator, and an 8° × 8° square region around the centroid is projected onto a cartesian grid with 256 pixels along each side. For this choice of s, this results in a dataset, D, of 2254 maps. We then shuffle and split D into three groups: a 70% training set, x_train, a 15% validation set, x_val, and a 15% test set, x_test.

In order to artificially increase the diversity of images in our limited sample we employ two standard data augmentation techniques. During the data preprocessing stage of training, we randomly flip each image along the horizontal and vertical directions, and rotate each image by an integer multiple of 90°. Convolutions are not invariant under these transformations, so the network sees each transformed cutout as distinct, yet the transformed maps still constitute perfectly realistic foreground images.

Here we discuss the training strategy used to learn the weights 𝜃, 𝜙. As discussed in Section 2, to train a VAE we maximize the lower bound on the log likelihood of the data given in Equation 7 with respect to the weights 𝜃, 𝜙. In practice, at each step we compute a Monte Carlo estimate of this quantity:

\[
\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right]
\approx \log p_\theta(\mathbf{x} \mid \mathbf{z}) + \log p(\mathbf{z}) - \log q_\phi(\mathbf{z} \mid \mathbf{x}), \tag{9}
\]

where x on the right-hand side is now a minibatch of the data, the size of which is a hyperparameter of the training process. The analysis we present in Section 2.3.3 shows that a batch size of 8 is preferred. For each batch we then calculate the gradients of this quantity with respect to the weights 𝜃, 𝜙 and backpropagate the errors through the network, adjusting 𝜃, 𝜙 in accordance with the learning schedule. For this schedule we used the Adam optimizer with hyperparameters determined through the optimization process described in Section 2.3.3. The training was performed by passing over the entire dataset 100 times, and in each pass splitting the data into batches of 8 images.
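A single stochastic-gradient step on the estimate of Equation 9 can then be sketched as follows, assuming the encoder and decoder of Section 2.2 and, as a simplification of the sketch, a pixel-independent Gaussian decoder likelihood with fixed unit variance (the paper does not specify that detail, so treat it as an assumption). The Adam learning rate below is a placeholder.

import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # placeholder learning rate

def log_normal_pdf(x, mu, log_var):
    """Sum of independent Gaussian log densities over all non-batch axes."""
    log2pi = float(np.log(2.0 * np.pi))
    return tf.reduce_sum(
        -0.5 * (log2pi + log_var + (x - mu) ** 2 / tf.exp(log_var)),
        axis=list(range(1, len(x.shape))),
    )

@tf.function
def train_step(x_batch, encoder, decoder):
    """One minibatch update of the negative single-sample ELBO of Equation 9."""
    with tf.GradientTape() as tape:
        mu, log_var = encoder(x_batch, training=True)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps        # reparametrized draw from q_phi(z|x)
        x_rec = decoder(z, training=True)
        log_px_z = log_normal_pdf(x_batch, x_rec, tf.zeros_like(x_rec))  # log p_theta(x|z)
        log_pz = log_normal_pdf(z, tf.zeros_like(z), tf.zeros_like(z))   # log p(z)
        log_qz_x = log_normal_pdf(z, mu, log_var)                        # log q_phi(z|x)
        loss = -tf.reduce_mean(log_px_z + log_pz - log_qz_x)
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

Looping this step over the shuffled training set in batches of 8 for 100 passes reproduces the schedule described above; the Adam hyperparameters themselves were set by the optimization of Section 2.3.3.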
To guard against overfitting we evaluated L_{𝜃,𝜙}(x_train) and L_{𝜃,𝜙}(x_val) every five epochs and checked for divergence between these quantities at late epochs. If the network had begun to overfit on the training data, its predictions for the validation set would deteriorate, which would be reflected in a worsening L_{𝜃,𝜙}(x_val). We found that L_{𝜃,𝜙}(x_train) plateaued after 50 epochs, and saw no divergence between L_{𝜃,𝜙}(x_train) and L_{𝜃,𝜙}(x_val) after training for an additional 50 epochs.

Models were built using the Tensorflow software package (Abadi et al. 2015), and trained using a Tesla V100 GPU on the Cori supercomputer at NERSC.

In this section we provide motivation for our selection of the model hyperparameters. It is not possible to optimize model hyperparameters, such as batch size or model architecture, using the same stochastic gradient descent technique that is used to optimize model weights and biases. Instead, a limited number of hyperparameter combinations can be trained, and the corresponding model that achieves the best loss after a certain amount of training time, or certain number of epochs, is used. The space of hyperparameters is high-dimensional, and so cannot be uniformly densely sampled due to computational cost. Instead, we employed a Bayesian optimization approach in which a few random combinations of hyperparameters are chosen, and trained for 20 epochs each.
From this set of hyperparameters, a Gaussian process (GP) model of the loss as a function of hyperparameters is built. From this GP model, new trial candidates are selected, and trained, with the resulting loss then being incorporated into the GP weights. We allowed this process to continue for 100 different trials, and used the hyperparameters that achieved the lowest loss after twenty epochs of training.

In this section we present reconstructions of test set images, and compare their pixel value distributions and power spectra. For a given image, x_test, we can sample the posterior as z^(i)_test ∼ 𝑞_𝜙(z | x), and push these through the decoder to get a reconstructed image x^(i)_test = g_𝜃(z^(i)_test). To summarize the distribution of reconstructed images, we draw L samples and calculate their average:

\[
\tilde{\mathbf{x}} \approx \frac{1}{L} \sum_{l=1}^{L} g_\theta\!\left( \mathbf{z}^{(l)}_{\rm test} \right). \tag{10}
\]

For the remainder of this section, a ‘reconstruction’ refers to the calculation of Equation 10 with a fixed number of samples L. Power spectra of the cutouts and their reconstructions are computed with the NaMaster code (Alonso et al. 2019). For reasons that will become clear later we are primarily interested in comparing ranges of multipoles in the signal-dominated regime, well within the resolution limit of the original maps, and so we do not make any efforts to noise debias or account for the beam present in the original maps.

First, we present the reconstructions of three randomly-selected test set images, and show the resulting maps, along with the residuals, in Figure 1. We can see that the network does very well in reconstructing the large-scale features in these test-set maps, and the visual quality is sufficient to appear ‘real’, if lower-resolution. Features are well recovered down to roughly degree scales, with features below that scale being smoothed out by the calculation of the expectation in Equation 10. The residuals shown in the bottom row of Figure 1 are well behaved and do not show any strong biases correlated with features in the map.

In Figure 2 we take a single randomly-selected test set image, and show its reconstruction, the pixel value histograms of each image, and their power spectra. As was the case for the three examples shown in Figure 1, there is excellent visual agreement between the original image and its reconstruction. This is reinforced by the excellent agreement between the distributions of pixel values in the two images, shown in the bottom left panel of Figure 2. The reconstructed power spectrum in the bottom right panel of Figure 2 also shows excellent agreement with the original up to intermediate multipoles. In Figure 3 we show the 25th percentile, median, and 75th percentile of the pixel value distributions as a function of bin center, for both the original test set images and their reconstructions. There is excellent agreement between the two sets of images, with no evidence of any aggregate bias in the reconstructions. In Figure 4 we compare the power spectra of all test set images and their reconstructions. Figure 4 shows that the same behavior as was seen in Figure 2 is displayed for the entire test set. Spectra are generally well recovered for ℓ ≲ 400, with a suppression of power at smaller angular scales.
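The reconstruction of Equation 10 and a quick power-spectrum check can be sketched as below, assuming the encoder and decoder of Section 2.2. The paper uses NaMaster for the spectra; the radially-binned FFT estimate here is only a rough, unnormalized stand-in, and the number of posterior samples is a placeholder.

import numpy as np
import tensorflow as tf

def reconstruct(x_test, encoder, decoder, n_samples=100):
    """Average of decoded posterior samples, Equation 10 (n_samples is illustrative)."""
    x = x_test.astype(np.float32)[None, ..., None]
    mu, log_var = encoder(x, training=False)
    draws = []
    for _ in range(n_samples):
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps
        draws.append(decoder(z, training=False).numpy()[0, ..., 0])
    return np.mean(draws, axis=0)

def flat_sky_spectrum(image, patch_deg=8.0, n_bins=50):
    """Crude azimuthally-binned FFT power spectrum of a square flat-sky cutout."""
    n = image.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(image - image.mean()))) ** 2
    # Multipole ell ~ 2*pi*f, with f in cycles per radian on this pixel grid.
    freqs = np.fft.fftshift(np.fft.fftfreq(n, d=np.radians(patch_deg) / n))
    lx, ly = np.meshgrid(2.0 * np.pi * freqs, 2.0 * np.pi * freqs)
    ell = np.sqrt(lx ** 2 + ly ** 2)
    bins = np.linspace(0.0, ell.max(), n_bins + 1)
    idx = np.digitize(ell.ravel(), bins)
    cl = np.array([power.ravel()[idx == i].mean() if np.any(idx == i) else np.nan
                   for i in range(1, n_bins + 1)])
    return 0.5 * (bins[1:] + bins[:-1]), cl

Comparing the spectrum of a test image with that of its reconstruction in this way is the kind of check summarized in Figures 2 and 4.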
This loss of small-scale power can be understood from the form of the training objective, which rewards the expectation E_{𝑞_𝜙(z | x)}[log 𝑝_𝜃(x, z)]. Since this expectation is taken with respect to the distribution 𝑞_𝜙(z | x), it will strongly penalize points (x, z) that are likely under 𝑞_𝜙, but unlikely under 𝑝_𝜃. On the other hand, points that are likely under 𝑝_𝜃, but are not present in the empirical data distribution, will suffer a much smaller penalty. The result is that, if the model is not sufficiently flexible to fit the data distribution exactly, it will compensate by widening the support of 𝑝_𝜃(x, z) beyond what is present in the data distribution, inflating the variance of 𝑝_𝜃(x | z). Since we have assumed a Gaussian distribution for the decoder model that is independent from pixel to pixel, and given that the signal in the training images is red-tilted (as is the case for most natural images containing extended recognizable structures), the increased variance leads to a degradation of small-scale features through the averaging process of Equation 10 (Zhao et al. 2017). A corollary of the extended support of 𝑝_𝜃(x, z) is that sampling the prior in order to generate novel images will not necessarily produce realistic samples (Kingma & Welling 2019).

One way in which the flexibility of VAEs may be enhanced is through the use of normalizing flows (Jimenez Rezende & Mohamed 2015). As the name suggests, the idea here is to start with a simple distribution, such as a multivariate normal, and ‘stack’ layers of invertible transformations, such that the output may be significantly more complex. There are certain requirements placed on these transformations such that they remain computationally efficient; for example, they must have tractable Jacobians (Jimenez Rezende & Mohamed 2015). Expanding the VAE model presented here by introducing normalizing flows could be expected to improve both the reconstruction quality and the quality of novel samples, and is the subject of current work.

As a means of investigating the structure of the encoding that has been learned, we study the ‘interpolation’ between real images, x_1 and x_2, by performing the interpolation between their latent encodings, z_1 and z_2. From the smooth nature of the changes in the resulting continuum of maps we will see that smooth variations in the latent space result in smooth variations in the map space. This study also demonstrates the ability of the VAE approach to generate novel foreground images by restricting to a region of the latent space close to the encodings of real maps, therefore avoiding the spurious regions of (x, z) that could be obtained by sampling from an ill-fitted prior, as discussed at the end of Section 3.1.

The probability mass in high-dimensional distributions tends to concentrate in a shell relatively far from the modal probability density. Therefore, traversing the latent space in a straight line (in the
Figure 1.
This figure shows the reconstruction of three randomly-selected images from the test set, not used during the training or validation of the network. The top row shows the original images, the second row the reconstructions, and the third row the residuals of the reconstructions; the colour scale is dimensionless log-scaled emission. The reconstructions clearly lose small-scale details, but manage to recover the large-scale variations well.
Euclidean sense) does not necessarily pass through areas of high probability mass. In order to keep the interpolated points within areas of high probability mass, we interpolate from z_1 to z_2 using spherical trajectories that traverse great circles in the latent space, as the distance from the origin smoothly changes from |z_1| to |z_2|. Specifically, we follow this continuous trajectory parametrized by some factor λ:

\[
\mathbf{z}_{1,2}(\lambda) = \frac{\sin\!\left((1-\lambda)\,\theta\right)}{\sin\theta}\,\mathbf{z}_1 + \frac{\sin\!\left(\lambda\,\theta\right)}{\sin\theta}\,\mathbf{z}_2, \tag{11}
\]

where cos(θ) = ẑ_1 · ẑ_2. We then take N points along this line corresponding to λ = [1/(N+1), 2/(N+1), ..., N/(N+1)], and decode to obtain the corresponding maps x_{1,2}(λ) = g_𝜃(z_{1,2}(λ)).

Figure 5 shows the smooth transition in image space between the two real images (the top left panel and the bottom right panel) randomly selected from the test set, calculated using the interpolation described above. Features, such as the strong filamentary structures in the center of the image, transition smoothly in and out of the image, demonstrating that small perturbations in the latent space result in small perturbations in decoded images.
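A minimal sketch of this spherical interpolation (Equation 11), assuming the decoder g_𝜃 sketched in Section 2.2 and latent vectors z1 and z2 obtained by encoding two test images:

import numpy as np

def slerp(z1, z2, lam):
    """Spherical interpolation between two latent vectors, Equation 11.

    Assumes z1 and z2 are not (anti-)parallel, so that sin(theta) != 0."""
    cos_theta = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1.0 - lam) * theta) * z1 + np.sin(lam * theta) * z2) / np.sin(theta)

def interpolate_maps(z1, z2, decoder, n_steps=10):
    """Decode N = n_steps points along the great-circle trajectory between z1 and z2."""
    lams = np.arange(1, n_steps + 1) / (n_steps + 1)
    zs = np.stack([slerp(z1, z2, lam) for lam in lams]).astype(np.float32)
    return decoder(zs, training=False).numpy()[..., 0]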
Figure 2.
Top left: a randomly-selected test set image, x. Top right: the reconstruction of the test set image, x̃, as computed using Equation 10. Bottom left: kernel density estimate of the distribution of pixel values of the original image and its reconstruction.
Bottom right: the log power spectra of the test set image and its reconstruction. Note that since the test set images are standardized, these quantities are unitless.
Figure 3.
In this figure we compare the pixel value distributions of the 339 test set images (black), and their reconstructions (green). We calculate quantiles across the test set, and plot the 25th and 75th percentiles (the dashed lines) and the median (the solid lines) as functions of pixel value.

In this section we consider a possible application of our trained model to the reconstruction of corrupted data. During the analysis of CMB data there are many possible reasons that data may be incomplete, from masking of point sources to corruption by uncontrolled systematics. The task of inpainting these regions is simple when the missing emission is well described by Gaussian statistics, as is the case for the CMB (Bucher & Louis 2012). The lack of a similarly simple approach for the non-Gaussian foreground signal means that previous efforts have relied on empirically-validated, simple algorithms, such as diffusive filling (Bucher et al. 2016). Future surveys will have ever-lower noise floors, and so will be increasingly contaminated by point sources, even in polarization. The aggressive masking required in this regime could lead to the failure of simple foreground inpainting techniques (Puglisi & Bai 2020). The statistical foreground model presented here allows us to take a Bayesian approach to foreground inpainting, in which we may compute a posterior distribution for the missing data, conditioned on the observed data (Böhm et al. 2019). This has the advantage of conserving the foregrounds’ statistical properties, whilst also taking into account all of the contextual information in the image, unlike methods such as diffusive inpainting. In the rest of this section we will present a toy model for corrupted data, and show that we are able to perform inpainting by optimizing the posterior distribution in the latent space.

Representing the contamination as a linear operator A, we can write down a model for the observed data d: d = A x + n, where n is a possible noise term. The posterior distribution of z is given by Bayes’ theorem:

\[
\log p(\mathbf{z} \mid \mathbf{d}) = \log p(\mathbf{z}) + \log p_\theta(\mathbf{d} \mid \mathbf{z}) - \log p(\mathbf{d}). \tag{12}
\]

For a given statistical model of the noise, we have a complete description of the term log 𝑝_𝜃(d | z), and we can work with the posterior distribution in the latent space. As a concrete example we will consider the case of a binary
Figure 4.
In this figure we compare the power spectra of the 339 test set images (black) and their reconstructions (green). Each power spectrum is plotted as an individual line.

N × N masking operator, A, with elements equal to one (zero) where pixels are (un)observed. To form simulated ‘corrupted’ images, we take random images from the test dataset, apply A, and add white Gaussian noise n, characterized by a pixel standard deviation σ: d_test = A x_test + n. The posterior distribution in the latent space is then:

\[
-2 \log p(\mathbf{z} \mid \mathbf{d}_{\rm test}) \propto \mathbf{z}^{T}\mathbf{z} + \frac{\boldsymbol{\mu}_\theta(\mathbf{z})^{T}\boldsymbol{\mu}_\theta(\mathbf{z})}{\sigma^{2}}, \tag{13}
\]

where we have written the residual vector as 𝝁_𝜃(z) = A g_𝜃(z) − d_test.

Fully sampling Equation 13 can be computationally expensive due to the dimensionality of z, and is made more challenging by the possibility of log 𝑝(z | d_test) being multi-modal.
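As described below, rather than sampling we therefore take a single representative point by maximizing the posterior. A minimal sketch of that optimization, assuming the decoder of Section 2.2, a binary mask, and placeholder values for the noise level, step count and learning rate (simple Adam gradient steps stand in for the optimizer actually used, which is named below):

import tensorflow as tf

def map_inpaint(d_obs, mask, decoder, latent_dim=256, sigma=0.2,
                n_steps=2000, learning_rate=0.05):
    """Minimize the negative log posterior of Equation 13 for a masked, noisy image.

    d_obs: corrupted image, shape (256, 256); mask: binary array of the same shape,
    1 where pixels are observed and 0 where they are masked."""
    d = tf.constant(d_obs[None, ..., None], dtype=tf.float32)
    m = tf.constant(mask[None, ..., None], dtype=tf.float32)
    z = tf.Variable(tf.zeros((1, latent_dim)))      # start from the prior mean
    opt = tf.keras.optimizers.Adam(learning_rate)

    for _ in range(n_steps):
        with tf.GradientTape() as tape:
            residual = m * decoder(z, training=False) - d    # mu_theta(z) = A g_theta(z) - d
            neg_log_post = tf.reduce_sum(z ** 2) + tf.reduce_sum(residual ** 2) / sigma ** 2
        grads = tape.gradient(neg_log_post, [z])
        opt.apply_gradients(zip(grads, [z]))

    return decoder(z, training=False).numpy()[0, ..., 0]     # decoded MAP map

The same structure accepts any differentiable forward model in place of the binary mask, e.g. a convolution with a beam, as noted in the text below.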
Figure 5.
This figure presents synthetic images generated by interpolating between real images, x_1 and x_2, shown in the top left and bottom right panels respectively. The interpolation is carried out in the latent space using Equation 11, and is parametrized by a continuous variable λ. The intermediate panels show the interpolation evaluated at N =
10 points along the trajectory.

For these reasons, applying standard Markov Chain Monte Carlo techniques can often fail to fully explore the posterior (Böhm et al. 2019), and we leave a sampling approach for future work, here taking only a single representative sample by maximizing ẑ_test = argmax_z log 𝑝(z | d_test). In the following we will take A to be a masking operator that applies a binary mask to a map. However, as long as a forward model for the corruption operation can be written down (e.g. a Gaussian convolution), the same technique could be applied. We take three randomly selected test set images, x_1, x_2, x_3, and apply three different binary masks, A_1, A_2, A_3. To each corrupted image, we add a white noise realization with a pixel standard deviation of 0.2. For each corrupted, noisy image, we then maximize the posterior in Equation 13 to find z_i^MAP using the L-BFGS algorithm. In Figure 6 we show the randomly selected test set images in the first row, the corrupted images in the second row, and the reconstructed maps g_𝜃(z_i^MAP) in the third row. We also calculate the pixel value histograms and power spectra of the input and reconstructed maps and show these in the bottom two rows of Figure 6.

One can see from Figure 6 that all the images are well reconstructed, and there is no visible effect of the masking remaining in the reconstructions. Comparing the regions in the first and third rows corresponding to the masked areas, we see that the network does not reproduce the exact features in the masked region for any of the x_i, as expected. However, the network does reconstruct plausible inpaintings, with the correct statistics, given the context in the rest of the image. For example, the reconstruction g_𝜃(z_2^MAP) does not replicate the true high-intensity filamentary structure in the input image, x_2, which would be impossible. However, it does recognize from the context that intensity is increasing towards the masked area in the bottom left of the image, and populates that area with high-variance, high-intensity features. Correspondingly, such high-intensity features are not seen in the reconstructed regions of g_𝜃(z_1^MAP) and g_𝜃(z_3^MAP), which correspond to relatively low-emission regions. The pixel value histograms and power spectra in the last two rows of Figure 6 show similar behavior.
We see good agreement between the original and reconstructed histograms and power spectra for both the x_1 and x_3 maps, up to the suppression at ℓ > 400 common to all reconstructions. On the other hand, we see a disagreement between the original and reconstructed statistics of x_2, due to the higher variance associated with the filled-in region.

These results show that the network has learned generalizable information about foreground behavior, and is able to inpaint novel foreground emission with correct statistical properties, based on the context of an image. The forward model used in this inpainting process can be easily extended to maps with multiple masks and different types of filtering and noise found in real data.

In this paper we have presented a new application of VAEs to images of Galactic thermal dust emission. Using a training set extracted
Figure 6.
This figure shows three randomly-selected test set images, x_{1,2,3}, in the top row. As described in Section 3.3, these images are corrupted with binary masks A_{1,2,3} and white noise. The corrupted images are shown in the second row. The third row shows the reconstructed images obtained by maximizing the latent space posterior in Equation 13 for each of the three corrupted images, and decoding the resulting points in the latent space. The fourth and fifth rows show the pixel value histograms and power spectra of the original and reconstructed maps.

from Planck observations of thermal dust emission, this technique allowed us to learn a transformation from a space of uncorrelated latent variables with a multivariate normal prior to the space of possible dust maps.

The training process was validated by computing and comparing summary statistics, including the distribution of pixel values and power spectra of reconstructed maps, on a test set withheld during the training process. The applicability of the trained model was also demonstrated by reconstructing data corrupted by noise and masking. This was the first use of a trained generative dust model to perform Bayesian inference, and demonstrates the applicability of this approach in the simulation of foreground images, and the Bayesian modeling of polarized CMB data.

The usefulness of this model is currently limited by the flexibility of the posterior, and its ability to fit the true underlying posterior. As was discussed in Section 3.1, this has two main consequences: i) a naïve sampling of the prior is not guaranteed to produce realistic samples, ii) reconstructed images are blurry, limiting accuracy to degree scales. Both of these issues may be tackled by increasing the expressiveness of the model (Kingma & Welling 2019), which we plan to do by introducing a normalizing flow to link the prior and latent space (Kingma et al. 2016).

As discussed in Section 1, our main goal is to model polarized dust emission. We attempted a similar analysis to that presented here by repeating the training procedure on a network that accepted an additional ‘channel’ as input, representing a tuple of Stokes 𝑄 and 𝑈 parameters, rather than only Stokes 𝐼, and using the Planck 353 GHz polarization observations to form a training set. We found that the network was not able to learn any meaningful information from this setup, consistent with what similar analyses have found (Petroff et al. 2020). In order to extend our analysis to polarization, we are therefore exploring the use of MHD simulations (Kim et al. 2019) as a training set. Kim et al. (2019) have demonstrated that simulations of a multiphase, turbulent, magnetized ISM produce synthetic observations of the ISM with statistics (such as the ratio of 𝐸 power to 𝐵 power, and the tilt of the 𝐸𝐸 and 𝐵𝐵 power spectra) matching those of real skies. Our initial results have shown that this is a promising alternative to the use of real data in training generative networks.

ACKNOWLEDGEMENTS
We would like to acknowledge useful conversations with Ethan Anderes and Kevin Aylor in the preparation of this work. This work was supported by an XSEDE start up allocation, PHY180022. This work was supported in part by the National Science Foundation via awards OPP-1852617 and AST-1836010. We also acknowledge the use of the Perlmutter preparedness GPU allocation on the Cori supercomputer at NERSC.
DATA AVAILABILITY
The data used in this study are available on the Planck Legacy Archive at the URL: http://pla.esac.esa.int/pla/aio/product-action?MAP.MAP_ID=COM_CompMap_Dust-GNILC-F545_2048_R2.00.fits
REFERENCES
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems
Ade P., et al., 2019, J. Cosmology Astropart. Phys., 2019, 056
Aiola S., et al., 2020, arXiv e-prints, p. arXiv:2007.07288
Alonso D., Sanchez J., Slosar A., LSST Dark Energy Science Collaboration, 2019, MNRAS, 484, 4127
Aylor K., Haq M., Knox L., Hezaveh Y., Perreault-Levasseur L., 2019, arXiv e-prints, p. arXiv:1909.06467
BICEP2 Collaboration et al., 2018, Phys. Rev. Lett., 121, 221301
Benson B. A., et al., 2014, in Holland W. S., Zmuidzinas J., eds, Vol. 9153, Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy VII. SPIE, pp 552–572, doi:10.1117/12.2057305, https://doi.org/10.1117/12.2057305
Böhm V., Lanusse F., Seljak U., 2019, arXiv e-prints, p. arXiv:1910.10046
Brock A., Donahue J., Simonyan K., 2018, arXiv e-prints, p. arXiv:1809.11096
Bucher M., Louis T., 2012, MNRAS, 424, 1694
Bucher M., Racine B., van Tent B., 2016, Journal of Cosmology and Astroparticle Physics, 2016, 055
Dinh L., Krueger D., Bengio Y., 2014, arXiv e-prints, p. arXiv:1410.8516
Dinh L., Sohl-Dickstein J., Bengio S., 2016, arXiv e-prints, p. arXiv:1605.08803
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, arXiv e-prints, p. arXiv:1406.2661
Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. MIT Press
Grover A., Dhar M., Ermon S., 2017, arXiv e-prints, p. arXiv:1705.08868
Hui H., et al., 2018, in Zmuidzinas J., Gao J.-R., eds, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series Vol. 10708, Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy IX. p. 1070807 (arXiv:1808.00568), doi:10.1117/12.2311725
Jimenez Rezende D., Mohamed S., 2015, arXiv e-prints, p. arXiv:1505.05770
Jimenez Rezende D., Mohamed S., Wierstra D., 2014, arXiv e-prints, p. arXiv:1401.4082
Kamionkowski M., Kosowsky A., Stebbins A., 1997, Phys. Rev. Lett., 78, 2058
Kim C.-G., Choi S. K., Flauger R., 2019, ApJ, 880, 106
Kingma D. P., Dhariwal P., 2018, arXiv e-prints, p. arXiv:1807.03039
Kingma D. P., Welling M., 2013, arXiv e-prints, p. arXiv:1312.6114
Kingma D. P., Welling M., 2019, arXiv e-prints, p. arXiv:1906.02691
Kingma D. P., Salimans T., Jozefowicz R., Chen X., Sutskever I., Welling M., 2016, arXiv e-prints, p. arXiv:1606.04934
Knox L., Turner M. S., 1994, Phys. Rev. Lett., 73, 3347
Krachmalnicoff N., Puglisi G., 2020, arXiv e-prints, p. arXiv:2011.02221
Krachmalnicoff N., Tomasi M., 2019, A&A, 628, A129
Millea M., Anderes E., Wandelt B. D., 2020a, arXiv e-prints, p. arXiv:2002.00965
Millea M., et al., 2020b, arXiv e-prints, p. arXiv:2012.01709
Perraudin N., Defferrard M., Kacprzak T., Sgier R., 2019, Astronomy and Computing, 27, 130
Petroff M. A., Addison G. E., Bennett C. L., Weiland J. L., 2020, arXiv e-prints, p. arXiv:2004.11507
Planck Collaboration et al., 2020, A&A, 641, A6
Puglisi G., Bai X., 2020, arXiv e-prints, p. arXiv:2003.13691
Razavi A., van den Oord A., Vinyals O., 2019, arXiv e-prints, p. arXiv:1906.00446
Russakovsky O., et al., 2015, International Journal of Computer Vision (IJCV), 115, 211
Seljak U., Zaldarriaga M., 1997, Phys. Rev. Lett., 78, 2054
Smoot G. F., et al., 1992, ApJ, 396, L1
Zhao S., Song J., Ermon S., 2017, arXiv e-prints, p. arXiv:1702.08658
van den Oord A., Kalchbrenner N., Kavukcuoglu K., 2016a, arXiv e-prints, p. arXiv:1601.06759
van den Oord A., et al., 2016b, arXiv e-prints, p. arXiv:1609.03499

This paper has been typeset from a TeX/LaTeX file prepared by the author.