Bayesian decomposition of the Galactic multi-frequency sky using probabilistic autoencoders
Sara Milosevic, Philipp Frank, Reimar H. Leike, Ancla Müller, Torsten A. Enßlin
Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 München, Germany
Max-Planck-Institut für Astrophysik, Karl-Schwarzschild-Str. 1, 85748 Garching, Germany
Ruhr-Universität Bochum, Universitätsstr. 150, 44801 Bochum, Germany
September 15, 2020 / Accepted date
ABSTRACT
Context.
All-sky observations of the Milky Way show both Galactic and non-Galactic diffuse emission, for example from interstellar matter or the cosmic microwave background (CMB). The different emitters are partly superimposed in the measurements, partly they obscure each other, and sometimes they dominate within a certain spectral range. The decomposition of the underlying radiative components from spectral data is a signal reconstruction problem and often associated with detailed physical modeling and substantial computational effort.
Aims.
We aim to build an effective and self-instructing algorithm that detects the essential spectral information contained in Galactic all-sky data covering spectral bands from γ-ray to radio waves.
Methods.
Utilizing principles from information theory, we develop a state-of-the-art variational autoencoder specialized in the adaptation to Gaussian noise statistics. We first derive a generic generative process that leads from a low-dimensional set of emission features to the observed high-dimensional data. We formulate a posterior distribution of these features using Bayesian methods and approximate this posterior with variational inference.
Results.
The algorithm efficiently encodes the information of 35 Galactic emission data sets in ten latent feature maps. These contain the essential information required to reconstruct the initial data with high fidelity and are ranked by the algorithm according to their significance for data regeneration. The three most significant feature maps encode astrophysical components: (1) the dense interstellar medium (ISM), (2) the hot and dilute regions of the ISM, and (3) the CMB.
Conclusions.
The machine-assisted and data-driven dimensionality reduction of spectral data is able to uncover the physical features encoding the input data. Our algorithm is able to extract the dense and dilute Galactic regions, as well as the CMB, from the sky brightness values only.
Key words.
Methods: data analysis – Methods: statistical – Techniques: image processing – Galaxy: general – ISM: structure
1. Introduction
The interstellar medium (ISM) is a key element of the Milky Way and subject to both astrophysical and cosmological studies. It consists of localized components such as molecules and interstellar dust in cold clouds, atomic and ionized hydrogen, and hot plasma in the Galactic halo, as well as components which pervade the entire ISM, like cosmic rays and magnetic fields. Our present knowledge about the existence of these components is based on the fact that they all contribute to the interstellar radiation field (e.g., Draine 2011). Each component, or the interplay of several components, generates radiation of a specific spectrum and can be reconstructed by component separation algorithms of varying complexity. Some components show very characteristic emission lines, such as CO or neutral (HI) and ionized (HII) atomic hydrogen, which permits their Galactic distribution to be determined very precisely. Other components, however, can generate radiation distributed over completely opposite areas of the electromagnetic spectrum, making the component separation task more sophisticated: For example, the interaction of cosmic rays and magnetic fields generates synchrotron radiation and can be observed in the radio regime, as in the 408 MHz radio map (Haslam et al. 1982), while the interaction of cosmic rays with interstellar matter imprints on the γ-ray regime due to hadron-nucleon collisions producing pions (e.g., Mannheim & Schlickeiser 1994). Another example is hot ionized plasma, which radiates in the X-ray regime when generated by supernovae, whereas hot molecules ionized by collisions emit in the UV regime (e.g., Ferriere 2001). In this specific case, both the UV and soft X-ray photons are again absorbed by interstellar dust, which prevents observations in regions of high dust density like the Galactic plane. This interplay also shows the high complexity of radiative extinction, which needs to be taken into account when reconstructing single emission components.

A famous component that is not subject to dust extinction, but still difficult to measure, is the non-Galactic cosmic microwave background (CMB). The reason is that the CMB is superimposed by Galactic foregrounds. Cosmological studies aim to extract the CMB radiation from multiple frequency channels by identifying and systematically removing these Galactic foregrounds. At frequencies below 100 GHz, Galactic synchrotron and free-free emission contaminate the CMB, while above 100 GHz, thermal dust emission and the cosmic infrared background dominate (Gold et al. 2011; Planck Collaboration 2016b). To distinguish between different sources of emission, members of the Planck Collaboration (2018b) developed several component separation algorithms, for example the Commander (Eriksen et al. 2008), SEVEM (Leach et al. 2008), or SMICA code (Delabrouille et al. 2003; Cardoso et al. 2008). The results obtained from those algorithms contain, among others, all-sky maps of the CMB, synchrotron, free-free, thermal and spinning dust, and CO line emission (Planck Collaboration 2016b). These approaches are, however, based on cosmological, astrophysical, and instrumental parameters, and require preprocessed spectral templates or explicit knowledge about physical correlations.
To verify these results in a manner independent of such assumptions, approaches that allow automated component identification, like machine learning techniques, have been increasingly pursued in recent years (Longo et al. 2019). A broad range of machine learning algorithms has been applied to cosmological and astrophysical problems (Fluke & Jacobs 2020). Here we just name a few: Beaumont et al. (2011) employed the Support Vector Machine (SVM) algorithm to classify structures in the ISM. The algorithm was able to identify a supernova remnant behind a molecular cloud based on a sample of manually classified data. Ucci et al. (2018a) examined the composition of the ISM of dwarf galaxies by processing the available spectral information in a machine learning code. The so-called GAME algorithm was trained on a large library of synthetic data (Ucci et al. 2018b) and recovered ISM properties such as metallicity and gas densities and their respective correlations on the basis of spectral emission lines. By training a convolutional neural network on synthetic spectra, Murray & Peek (2019) were able to decompose the thermal phases of neutral hydrogen HI. This selection of algorithms represents supervised learning approaches, which require labeled or pre-classified data in order to be trained. We want to investigate to what extent unsupervised approaches can be applied to Galactic observations.

One unsupervised machine learning approach which automatically identifies relevant features within some input data is called representation learning (RL) (Hinton 1990; Bengio et al. 2013; Goodfellow et al. 2016). Given an observation, RL methods aim to extract the underlying causes which generated the data; in other words, they learn the most informative representation of the observation (Bengio et al. 2013; Goodfellow et al. 2016). This can, for example, be achieved by reducing the dimension of the input data using a neural network to the so-called underlying, explanatory factors of variation (Hinton & Salakhutdinov 2006; Bengio et al. 2013; Goodfellow et al. 2016). This concept can be translated to the task of astrophysical component separation by investigating which underlying, data-generating components can be extracted from Galactic all-sky data. These components are constructed to encode mutually independent features of the data and are often used as the input for subsequent analyses. Especially considering the large quantities of astrophysical data produced every day by surveys such as the Sloan Digital Sky Survey (SDSS), an effective preprocessing strategy needs to be developed to be able to analyze the vast amount of data (Kremer et al. 2017; Reis et al. 2019).

In the present study, we apply RL to a data set of 35 Galactic all-sky maps recorded in multiple frequencies provided by Müller et al. (2018). On their data set, the authors learned spectral pixel-wise relations using Gaussian Mixture Models (GMMs), which permitted them to augment pixels with missing measurement data. They verified the previously stated hadronic and leptonic components of the γ-ray sky (Selig et al. 2015) and presented a higher resolved hadronic component as well as a completion of non-observed information. However, the main component maps of the GMM did not capture different astrophysical environments.
This motivated the approach in the present study to explore other latent variable models, like autoencoders (Hinton & Salakhutdinov 2006), which are able to encode useful representations of the data in their latent space. Based on the provided data set, we combine generative modeling with variational inference to learn a lower-dimensional representation of the data, which we call features. Our model approximates the posterior probability of these features, and we efficiently optimize the approximation using a state-of-the-art variational autoencoder (VAE) (Kingma & Welling 2013). Such an approach was, for example, used in the context of the previously mentioned Sloan Digital Sky Survey, as Portillo et al. (2020) successfully applied variational autoencoders to the SDSS data. The authors efficiently reduced the dimensionality of 1000 input pixels to six latent components, and the VAE was able to outperform principal component analysis at the spectral dimensionality reduction. Their latent space separates types of galaxies and detects outliers, making it a very useful preprocessing step for large astrophysical data. However, the authors note that the uncertainty quantification could be improved and suggest including the pixel-level noise as a separate feature to improve latent variance representations. Our variational autoencoder is based on the principles of information theory, meaning that we follow a Bayesian approach to track all relevant uncertainties, and we enable our model to adapt to the introduced model noise. Our mathematical derivation of the loss function is, although based on the specific signal reconstruction problem, a general approach: Starting with a generative data model, we are able to explain all terms occurring in a classical VAE's loss function, but in addition we obtain further terms which result directly from our calculations in Sect. 2 and, simultaneously, deliver robust results in Sect. 3.
2. Data and methods
Observations of the Milky Way can be visualized by all-sky maps showing the sky brightness in a certain frequency range. When we combine data from Galactic all-sky records in multiple frequencies, we can obtain a more complete picture of our Galaxy, but we also gain redundant information. Our goal is to determine a reduced representation of the observed sky which contains only non-redundant, essential information. For this analysis, we use the aforementioned data set consisting of 39 Galactic all-sky maps distributed over the entire electromagnetic spectrum compiled by Müller et al. (2018).

To generate the data set, the authors assembled information from all-sky surveys ranging from γ-ray to radio frequencies (Müller et al. 2018, and references therein). They incorporated all-sky data with at least 80% spatial coverage in HEALPix format (Gorski et al. 2005) and used the highest resolution map for each frequency. The UV regime is not part of their data set, since the respective GALEX survey (Bianchi et al. 2017) shows too many unobserved regions. To homogenize and unify the data, they (1) converted the sky brightness values in the original maps to flux magnitude values, (2) reduced the noise level in X-ray data by smoothing with a Gaussian kernel, (3) removed extra-Galactic sources and calibration artifacts from all maps except the already cleaned γ-ray data from Selig et al. (2015), and (4) unified the resolution of all maps to a common HEALPix nside.
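For orientation, steps (1), (2), and (4) of this homogenization can be sketched with the healpy library. This is our own illustration, not the pipeline of Müller et al. (2018): the smoothing width, the clipping floor, the natural-log magnitude convention, and the target resolution nside_out are assumed placeholder values.

```python
import numpy as np
import healpy as hp

def homogenize_map(sky_map, fwhm_rad=0.0, nside_out=128):
    """Sketch of the per-map preprocessing: optional Gaussian smoothing
    (used for the noisy X-ray data), degrading to a common HEALPix
    resolution, and conversion of brightness to log-based magnitudes."""
    if fwhm_rad > 0.0:
        # Reduce the noise level by convolving with a Gaussian kernel.
        sky_map = hp.smoothing(sky_map, fwhm=fwhm_rad)
    # Unify the resolution of all maps (nside_out is an assumption here).
    sky_map = hp.ud_grade(sky_map, nside_out=nside_out)
    # Convert positive brightness values to natural-log magnitudes.
    return np.log(np.clip(sky_map, a_min=1e-20, a_max=None))
```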
Based on the resulting 39 Galactic all-sky maps, we build a data set D (see Table A.1) for the present study with three modifications: (1) we discard the hadronic and leptonic component maps by Selig et al. (2015), since these are derived from the Fermi data and thus contain redundant information, (2) we remove the AKARI far-infrared map recorded at 65 µm, which exhibits a poor spatial coverage (we can only work with pixels that are covered by all maps), and (3) we neglect the CO line emission map, since this information is partly contained in the Planck 100, 217, and 353 GHz frequency channels (Planck Collaboration 2016b).
We then define our data set consisting of k = 35 all-sky maps as an indexed set D = (d₁, …, d_p), where p is the number of pixels and the magnitude vectors d_i ∈ ℝᵏ represent the magnitude flux values of all frequency maps for the i-th pixel; in other words, d_i = (d_i^(1), d_i^(2), …, d_i^(k))ᵀ, where d_i^(λ) is the magnitude flux value of the λ-th sky map at pixel i.

The determination of a non-redundant and lower-dimensional representation of the frequency information in our data set is an inverse problem: We are looking for unknown quantities, our so-called features, and procedures applied to them that could have generated the full information in D. To describe this problem, we use generative modeling to define a data model with corresponding parameters Θ. The solution to the inverse problem is then given by the posterior probability distribution P(Θ | D) of the model parameters Θ, which we specify in the next paragraph. Using Bayes' theorem, this posterior can be expressed by the data likelihood and the prior distribution up to a Θ-independent factor, reading P(Θ | D) ∝ P(D | Θ) × P(Θ). We perform a pixel-based analysis of image data by assuming magnitude vectors of different pixels to be independent. This allows us to factorize the data likelihood and prior distributions. In the following, we calculate the respective distributions per pixel and finally combine all derivations into one solution for the entire data set D.

We start by defining a generative process that leads from an abstract source S to the observations in the data set D. We do not know S at this point, but we aim to associate S a posteriori with some relevant physical quantities. We assume that each d_i ∈ D is generated from a source vector s_i ∈ ℝˡ with l ≤ k, and the collection of these vectors builds up a source set S = (s₁, …, s_p). This relation is expressed in a data model

D = D̃ + N_model ,   (1)

where we define the observed data D to be composed of D̃ ≔ (d̃₁, …, d̃_p), which is the output of a generative process G: ℝˡ → ℝᵏ, and some model noise N_model. The generative process G(·) mapping the variables of S pixel by pixel to D̃, that is D̃ = G(S), is unknown.

We factorize the prior distribution of the set of source vectors S as P(S) = ∏_{i=1}^p P(s_i). The prior P(s_i) can be arbitrarily complex, but without loss of generality we can find a transformation s_i = T(z_i) of the source vectors to a set of latent vectors z_i ∈ ℝˡ that provides a standardized prior distribution P(z_i) = 𝒢(z_i, 𝟙). This coordinate transformation into the eigenspace of the prior, also known as random variate generation using inverse transform sampling (e.g., Devroye 1986) or as the reparametrization trick (Kingma & Welling 2013; Rezende et al. 2014; Titsias & Lázaro-Gredilla 2014), allows us to absorb all complex and unknown structure of the source space in the transformation T(·), and provides us with an easy-to-calculate unit Gaussian prior distribution. Using this definition, we can rewrite the generative process as a parametrized function of the latent variables d̃_i = G(s_i) = G(T(z_i)) ≕ f_θ(z_i), with θ denoting the parameters of the transformed generative forward model.
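As a minimal, self-contained illustration of this reparametrization (our own example, not part of the paper; the exponential target prior and the rate λ = 2 are arbitrary choices), the transformation T can be written as the inverse target CDF applied to the Gaussian CDF of z:

```python
import numpy as np
from scipy.stats import norm

def T(z, lam=2.0):
    """Inverse transform sampling: map z ~ N(0, 1) to an exponential
    distribution with rate lam via s = F^{-1}(Phi(z)), where Phi is the
    standard normal CDF and F^{-1} the target quantile function."""
    u = norm.cdf(z)                # uniformly distributed in (0, 1)
    return -np.log(1.0 - u) / lam  # exponential quantile function

z = np.random.randn(100_000)       # standardized latent samples
s = T(z)                           # samples from the "complex" prior
print(s.mean())                    # close to 1/lam = 0.5
```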
Fig. 1: Model design of our algorithm (NEAT-VAE). In our mathematical derivation, we first define a generative data model and afterwards build an inference process using information theoretical methods. The combination of both processes allows the resulting algorithm to perform in the opposite direction: It uses its input d_i to infer a latent representation z_i of the data, which is (among other factors like priors) led by the generative process aiming to regenerate the initial data vector d_i from the latent space again. The minimization objective, or loss function, directing the algorithm's learning process is Eq. (8).

Table 1: Priors on model parameters. z_i are the latent variables generating the data, ξ_N is the transformed expression of the model noise covariance N, and θ is the parameter vector of the generative forward model.

Model parameter | Prior distribution
Latent variable | P(z_i) = 𝒢(z_i, 𝟙)
Latent noise parameter | P(ξ_N) = 𝒢(ξ_N, 1)
Generative model parameter | P(θ) = const.

We further define the model noise N_model = (n₁, …, n_p) as a set of p noise vectors n_i ∈ ℝᵏ and assume the pixel-wise noise to be independent and identically distributed with a Gaussian distribution of zero mean and noise covariance N ∈ ℝᵏˣᵏ, meaning P(N_model) = ∏_{i=1}^p P(n_i) = ∏_{i=1}^p 𝒢(n_i, N). Later in the discussion we will see that the noise covariance induces a metric in data space between the magnitude vectors d_i and d̃_i; N thus indicates how accurately the data can be reconstructed using the latent variables z_i. Since this accuracy is unknown, we introduce N as an inference parameter. We again perform a transformation to the prior eigenspace of the form N = t_ψ(ξ_N), with the latent noise parameter ξ_N ∈ ℝ being distributed as P(ξ_N) = 𝒢(ξ_N, 1), and with the transformation parameter ψ.
For our specific application, we choose a log-normal mapping

N = t_ψ(ξ_N) = 𝟙_{k×k} exp(µ_N + σ_N ξ_N) ,   (2)

with mean µ_N and standard deviation σ_N. In other words, this is a diagonal noise covariance matrix N with transformation parameters ψ = (µ_N, σ_N)ᵀ and a single global noise parameter ξ_N, which we aim to learn.

Combining the definitions of the noise and the generative model, we can rewrite the data model from Eq. (1) as a pixel-wise expression:

d_i = f_θ(z_i) + n_i .   (3)

At this point, we can define the data model parameter vector Θ described in Sect. 2.2. It consists of the latent vectors z_i, the latent noise parameter ξ_N, which indirectly defines the model noise n_i, and the parameters θ of the generative model. The parameters θ are the parameters of the neural network used to approximate the generative process and are specified in further detail in Sect. 2.6. Since we do not have prior knowledge on these forward model parameters, we assume a uniform prior distribution on θ (see Table 1). Thus we have Θ = (Z, θ, ξ_N) with the latent space set Z = (z₁, …, z_p).

We can include our forward model in the pixel-wise likelihood P(D | Θ) = ∏_{i=1}^p P(d_i | Z, θ, ξ_N) by marginalizing over the model noise:

P(d_i | Z, θ, ξ_N) = ∫ dn_i P(d_i, n_i | Z, θ, ξ_N)
                   = ∫ dn_i P(d_i | z_i, θ, n_i) P(n_i | ξ_N)
                   = ∫ dn_i δ(d_i − f_θ(z_i) − n_i) 𝒢(n_i, t_ψ(ξ_N))
                   = 𝒢(d_i − f_θ(z_i), t_ψ(ξ_N)) ,   (4)

where we assumed the noise n_i to be a priori independent of Z and θ.

Combining the prior distributions listed in Table 1 and the data likelihood in Eq. (4), the posterior distribution P(Θ | D) reads:

P(Z, θ, ξ_N | D) ∝ ∏_{i=1}^p P(d_i | Z, θ, ξ_N) ∏_{i=1}^p P(z_i) P(θ) P(ξ_N)
                ∝ ∏_{i=1}^p [ 𝒢(d_i − f_θ(z_i), t_ψ(ξ_N)) 𝒢(z_i, 𝟙) ] × 𝒢(ξ_N, 1) ,   (5)

where we used P(θ) = const. We are not able to calculate expectation values from this high-dimensional probability distribution, but we can use variational inference (e.g., Blei et al. 2017) to approximate the posterior distribution P(Θ | D) with an easier to integrate distribution Q_Φ(Θ | D) with variational parameters Φ. In the following, we define a suitable approximate distribution Q_Φ(Θ | D) and use the Kullback-Leibler divergence as a measure to evaluate the dissimilarity of P(Θ | D) and Q_Φ(Θ | D).

Assuming that Z, θ, and ξ_N are a posteriori independent, the approximate distribution Q_Φ can be written as

Q_Φ(Z, θ, ξ_N | D) = Q_Φ(Z | D) Q_Φ(θ | D) Q_Φ(ξ_N | D) .   (6)

For Q_Φ(Z | D) = ∏_{i=1}^p Q_Φ(z_i | d_i) we choose a pixel-wise independent Gaussian distribution 𝒢(z_i − µ_i, Σ_i), based on the maximum entropy principle (Jaynes 1982). The principle states that if only the mean µ and covariance Σ of some data are known, then the knowledge contained in that data set can best be expressed by a Gaussian distribution with exactly this mean µ and covariance Σ (e.g., Enßlin 2019). Later we will see that by construction µ_i and Σ_i are functions of the input data, making the choice of a Gaussian distribution for Q_Φ(Z | D) valid. For Q_Φ(θ | D) and Q_Φ(ξ_N | D) we choose maximum a posteriori solutions.
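The log-normal mapping of Eq. (2) is simple to implement. A minimal sketch (our own code, assuming PyTorch and a fixed number of maps k):

```python
import torch

def t_psi(xi_N: torch.Tensor, mu_N: float, sigma_N: float, k: int) -> torch.Tensor:
    """Eq. (2): map the scalar latent noise parameter xi_N to a diagonal
    k x k noise covariance via a log-normal transform, which guarantees
    a positive definite covariance for any real value of xi_N."""
    return torch.eye(k) * torch.exp(mu_N + sigma_N * xi_N)
```

The design choice of a single global ξ_N makes the learned covariance isotropic across the k frequency maps, so only one extra scalar has to be inferred alongside the network weights.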
In practice, we evaluate the approximation at the variational parameter values {θ̂, ξ̂_N} ∈ Φ, expressed by Q_Φ(θ | D) = δ(θ − θ̂) and Q_Φ(ξ_N | D) = δ(ξ_N − ξ̂_N).

To approximate P(Θ | D) with Q_Φ(Θ | D), we use the Kullback-Leibler divergence

D_KL[Q_Φ(Z, θ, ξ_N | D) || P(Z, θ, ξ_N | D)] = ∫ dZ dθ dξ_N Q_Φ(Z, θ, ξ_N | D) ln( Q_Φ(Z, θ, ξ_N | D) / P(Z, θ, ξ_N | D) ) .   (7)

We consider the spurious amount of artificial information introduced by Q when we calculate D_KL[Q(·) || P(·)], while D_KL[P(·) || Q(·)] expresses the loss of information when using Q instead of P (Leike & Enßlin 2017). The latter quantity is what one ideally would like to base approximate (conservative) inference on, but unfortunately it cannot be calculated with an intractable posterior P. For this reason, we use the variational inference approach and minimize the former KL divergence D_KL[Q(·) || P(·)], that is, we minimize the amount of spurious information added when going from P to Q.

Inserting the derived expressions from the previous sections and using Monte Carlo methods to approximate integrals with finite sums, we arrive at

D_KL[Q_Φ(·) || P(·)] = ∑_{i=1}^p [ −(1/2) tr(ln Σ_i) − (l/2)(1 + ln(2π)) + ξ̂_N²/(2p)
    + (1/2) tr( t_ψ(ξ̂_N)⁻¹ (d_i − f_θ̂(z_i))(d_i − f_θ̂(z_i))ᵀ )
    + (1/2) tr(Σ_i + µ_i µ_iᵀ) + (1/2) tr(ln t_ψ(ξ̂_N)) ] + H₀ ,   (8)

where we absorb all constant terms into H₀. The full calculations with all intermediate steps are carried out in Appendix B.

Our goal is to infer a lower-dimensional representation of the data, which in our case is expressed by the latent source space Z. We can achieve this by minimizing Eq. (8), which describes the spurious amount of artificial information introduced by the approximation of P with Q_Φ. In the following, we explain how this objective function can be translated into the framework of a latent variable model called the variational autoencoder (Kingma & Welling 2013).

A basic autoencoder learns a low-dimensional, latent representation of higher-dimensional input data. This is achieved by training the autoencoder to reconstruct the original input as accurately as possible from the reduced, latent representation (e.g., Rumelhart et al. 1985; Hinton & Salakhutdinov 2006; Goodfellow et al. 2016). In this context, training describes the minimization of an objective or loss function, for example the mean squared error between input data and the reconstructed output data. The dimensionality reduction of the input to the latent space occurs in the so-called encoder, the latent space itself is called the bottleneck layer, and in the decoder, the latent space gets translated back to data space. Variational autoencoders (VAEs) offer a probabilistic framework to jointly optimize latent variable (or generative) models and inference models (Kingma & Welling 2019). Eq. (8) can be transformed to such a VAE framework (see Fig. 1) as follows:

– Generative Process: Neural networks are generalized function approximators.
Since the exact form of the data-generating process f_θ(z_i) = d̃_i is unknown, we can use a neural network with input z_i and output d̃_i to approximate f_θ, and use back propagation to optimize the parameters θ of the network, which corresponds to the generative decoder.

– Variational Inference: The inference of the latent space variables is approximated by Q_Φ(z_i | d_i) = 𝒢(z_i − µ_i, Σ_i). We can build a function delivering posterior samples following this variational distribution, expressed by a neural network with input d_i and output z_i. Let the pixel-wise mean µ_i ∈ ℝˡ and covariance Σ_i ∈ ℝˡˣˡ be determined by a parametrized function of the input data, e_φ(d_i) = (µ_i, log √Σ_i), where φ contains the parameters of the function e_φ and the matrix logarithm of √Σ_i is calculated. This parametrization ensures the variance to be positive and the calculation to be numerically stable, since the logarithmic function maps the small values of the variance to a larger space. By inverse transform sampling, we can then define the posterior latent space variables as z_i = µ_i + exp(log √Σ_i) · ε_i = µ_i + √Σ_i ε_i, with an auxiliary variable ε_i and P(ε_i) = 𝒢(ε_i, 𝟙). In practice, we approximate the function e_φ(d_i) by the variational inference encoder. We take √Σ_i to be diagonal and describe it by its diagonal vector diag(√Σ_i), allowing us to calculate µ_i and diag(√Σ_i) as two distinct outputs of the encoder network.

– Independent representation: Based on the input data, the encoder network delivers latent space variables z_i with mean µ_i and covariance vector Σ̃_i = diag(√Σ_i √Σ_iᵀ). By using this definition, we find the optimal independent approximation to the posterior P(Θ | D). This leads to a disentanglement of the input information, meaning each dimension of z_i, which is equivalent to a hidden neuron in the bottleneck layer, encodes (approximately) mutually independent features of the data.

The minimization objective in Eq. (8) contains the loss function of a classic VAE with three modifications: (1) We include the size of the latent space l and its corresponding weight (see Appendix B for the derivation) to be able to compare different latent spaces with each other, (2) besides the network weights, we aim to optimize the noise covariance and thus the latent noise variable ξ_N; the prior on this latent noise adds an extra term proportional to ξ̂_N² to the objective function, and (3) by including noise in the data model, the likelihood contains a factor 1/t(ξ̂_N) and contributes an additional term tr(ln t(ξ̂_N)) from the normalization. Since we expand the VAE framework by the adaptation to noise, we name our algorithm NEAT-VAE (NoisE AdapTing Variational AutoEncoder).

Using the transformations illustrated before, the final objective function depends only on the generative decoder parameters θ̂, the variational encoder parameters φ, and the latent noise parameter ξ̂_N. We implement an autoencoder architecture (encoder network, bottleneck layer, decoder network) with Eq. (8) as the respective loss function in the PyTorch framework (https://pytorch.org); our implementation is publicly available at https://gitlab.mpcdf.mpg.de/msara/neat_vae.
The framework allows us to calculate derivatives of Eq. (8) with respect to θ̂, φ, and ξ̂_N using automated back propagation, and to minimize the loss with a built-in optimizer. The conducted experiments are described in Appendix C.
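To make this translation concrete, the following condensed PyTorch sketch shows one way to set up the encoder, decoder, reparametrized sampling, and the Eq. (8) objective. It is our own illustrative code, not the published NEAT-VAE implementation: the layer layout, activations, default noise parameters, and the handling of the constant terms (H₀ and the −(l/2)(1 + ln 2π) offset, both dropped here) are simplifying assumptions, and the diagonal noise covariance lets all trace terms reduce to sums over the k data channels.

```python
import torch
import torch.nn as nn

class NeatVAE(nn.Module):
    """Sketch of a noise-adapting variational autoencoder (NEAT-VAE style)."""
    def __init__(self, k=35, l=10, n_hidden=30, mu_N=-1.0, sigma_N=1.0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(k, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.enc_mu = nn.Linear(n_hidden, l)       # posterior mean mu_i
        self.enc_log_std = nn.Linear(n_hidden, l)  # log of diag(sqrt(Sigma_i))
        self.decoder = nn.Sequential(              # generative model f_theta
            nn.Linear(l, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, k),
        )
        self.xi_N = nn.Parameter(torch.zeros(()))  # global latent noise parameter
        self.mu_N, self.sigma_N = mu_N, sigma_N

    def forward(self, d):
        h = self.encoder(d)
        mu, log_std = self.enc_mu(h), self.enc_log_std(h)
        # Inverse transform sampling / reparametrization: z = mu + sqrt(Sigma) eps.
        z = mu + torch.exp(log_std) * torch.randn_like(mu)
        return self.decoder(z), mu, log_std

def neat_vae_loss(d, d_rec, mu, log_std, model, p_total):
    """Mini-batch estimate of Eq. (8), with all constant terms dropped."""
    # Diagonal noise variance from the log-normal transform t_psi, Eq. (2).
    noise_var = torch.exp(model.mu_N + model.sigma_N * model.xi_N)
    # Likelihood: 1/2 tr(N^-1 (d - f(z))(d - f(z))^T) + 1/2 tr(ln N).
    likelihood = 0.5 * ((d - d_rec) ** 2 / noise_var).sum(dim=1) \
               + 0.5 * d.shape[1] * torch.log(noise_var)
    # Latent terms: -1/2 tr(ln Sigma_i) + 1/2 tr(Sigma_i + mu_i mu_i^T).
    kl_latent = (-log_std + 0.5 * (torch.exp(2.0 * log_std) + mu ** 2)).sum(dim=1)
    # Gaussian prior on the latent noise parameter, weighted by 1/p.
    xi_prior = 0.5 * model.xi_N ** 2 / p_total
    return (likelihood + kl_latent + xi_prior).mean()
```

A training step then draws a mini-batch of magnitude vectors d, evaluates d_rec, mu, log_std = model(d), and backpropagates neat_vae_loss(d, d_rec, mu, log_std, model, p_total), so that the encoder weights φ, the decoder weights θ, and the noise parameter ξ_N are optimized jointly, as described above.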
3. Results and discussion
Applied to our set of Galactic full-sky observations, the NEAT-VAE framework yields a posterior probability distribution of the latent space variables that captures the essential information in our data. The derived loss function forces the latent variables to be statistically independent of each other, and thereby to represent individual physical components. We obtained the posterior by simultaneously optimizing two processes: an inference mechanism reducing the observed data D to a lower-dimensional latent space Z, and a generative process mapping these latent space variables back to the higher-dimensional data space D̃. These two processes, described by artificial encoder and decoder networks, support each other during training: the decoder reconstructs the data space based on the latent variables the encoder delivers. This reconstruction is constantly compared to the input data by the likelihood term in Eq. (8), ensuring that the encoder adapts the inference function and thus provides improved latent space variables to the decoder. In addition, we aim to find latent space variables that encode mutually independent features of the input data. In the following, we analyze the resulting latent space variables z_i ∈ Z and their full-sky representations.
Fig. 2: Reconstruction mean squared error (MSE) and values of the minimized Kullback-Leibler divergence (Loss) ΔD_KL (D_KL in Eq. (8) except constant terms), depending on the dimension of the latent space (x-axis). The values are determined by the NEAT-VAE with a configuration of six layers, 30 hidden neurons in the encoder and decoder layers, noise transformation parameters µ_N = −1 and σ_N = 1, learning rates of 0.005 for the network weights and 0.001 for ξ_N, and a batch size of 128. We do not track all normalization constants through the calculations, which leads to negative values for the loss.

The dimension of the latent space Z corresponds to the number of hidden neurons in the bottleneck layer. From here on, we call the subsequent hidden neurons 'features' and the resulting full-sky representations 'feature maps'.

We first analyze how the dimension of the latent space in the NEAT-VAE correlates with the reconstruction quality of the generative process. We quantify the reconstruction by the mean squared error of input maps and reconstructed maps,

MSE(d, d̃) = (1/p) ∑_{i=1}^p (d_i − d̃_i)² ,   (9)

with the number of pixels p, the 35-dimensional data vectors d_i ∈ D, and the corresponding reconstruction vectors d̃_i ∈ D̃. We observe that only three features are required to achieve an MSE below 0.1, which describes an average deviation of the reconstructed values compared to the input values of a natural logarithm-based flux magnitude. For small values, the absolute uncertainty on a logarithmic scale equals the relative uncertainty on a linear scale, meaning √MSE ≈ 0.02, that is, a relative uncertainty of ≈ 2% at approximately ten features. We interpret this as an indication of a high redundancy of the information contained in the 35 Galactic all-sky maps, since increasing the number of features encoding the input data beyond this point does not increase the quality of the reconstruction any further.
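For completeness, Eq. (9) in code form (our own sketch; the array shapes and the convention of averaging over both pixels and channels are assumptions):

```python
import numpy as np

def reconstruction_mse(d: np.ndarray, d_rec: np.ndarray) -> float:
    """Eq. (9): mean squared error between the input magnitude vectors
    d (shape: p pixels x k maps) and their reconstructions d_rec,
    averaged over all pixels and channels."""
    return float(np.mean((d - d_rec) ** 2))
```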
Based on the experiments with different latent space dimensions, we examine the spatial structures in the feature maps in more detail. We recognize some spatially correlated structures of the input data in the feature maps. We assume this to be a meaningful result, since the autoencoder only learns correlations among the magnitude flux values d_i within one data pixel and is not informed about spatial structures. We also observe feature maps with the same morphology occurring in latent spaces of different dimensions. To investigate whether there is a pattern or an order among the feature maps with respect to information content, we calculate the significance of each feature for the reconstruction of the data by using the following measure:

S_feature = (1/p) ∑_{i=1}^p (z_{feature map, i} − z̄_{feature map})² / σ²_{z,i} ,   (10)

where z_{feature map, i} is the intensity value at the i-th pixel, z̄_{feature map} is the mean intensity value of a single feature map, and σ²_{z,i} denotes the posterior variance as calculated by the algorithm at the i-th pixel. This means we calculate the feature map variance, which describes the fluctuations within the posterior mean map, weighted by the posterior feature variance averaged over all p pixels. The significance thus expresses the ratio of the magnitude of fluctuations within a map compared to the uncertainties of the map. In this context, a high significance marks the features that the autoencoder is most certain about being required for the reconstruction, while a feature significance below 1 corresponds to an insignificant feature, as the posterior uncertainty is larger compared to the posterior mean values of the map.
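A sketch of this significance measure (our own code; the posterior mean and variance maps are assumed to be flat numpy arrays over the p pixels):

```python
import numpy as np

def feature_significance(z_mean: np.ndarray, z_var: np.ndarray) -> float:
    """Eq. (10): ratio of the fluctuations within a posterior mean
    feature map to the pixel-wise posterior variance, averaged over
    all pixels. Values below 1 mark insignificant features."""
    return float(np.mean((z_mean - z_mean.mean()) ** 2 / z_var))
```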
Averaged over all the experiments we conducted (see Appendix C), we call the three most significant features A, B, and C. For the configuration shown in Fig. 3, these three features have by far the highest significance values; the remaining features have significance values starting from 8.75 and encode artifacts and other morphologies of the input data, as displayed in Appendix D. A similar behavior was also observed by Müller et al. (2018): The authors built a Gaussian mixture model (GMM) to reconstruct observations from a certain frequency range based on data of complementary frequencies. The GMM components used in their study also encoded artifacts of the input data when the number of components was increased. However, posterior samples of the latent space of our algorithm show that the NEAT-VAE does not always assign information to each and every feature: Starting from twelve features and adding further neurons to the latent space, the significance of the added features drops below one. This means that the posterior variance of a feature map is greater than the fluctuations within the map, and the resulting posterior samples show white noise statistics. On average, ten to eleven features are significant throughout our experiments with varying latent space size. We assume that from this point on, our algorithm has identified all mutually independent features of the input data. When further dimensions are added to the latent space, the algorithm 'tunes out' those degrees of freedom by making them insignificant.

In the next sections, we focus on the three most significant features of the configuration with in total ten latent space features (since from this point on the reconstruction does not change significantly, see Fig. 2), and on their physical interpretation. Visual inspections of the other features (Appendix D) indicate that the separation into independent components is not fully reached, possibly due to the finite amount of optimization. Morphological structures similar to the North Galactic Spur or the Fermi bubbles imprint onto several features simultaneously, as well as measurement artifacts.

The posterior mean values in Fig. 3 reflect the internal representation of the three most significant latent space features by our algorithm. We observe that the overall sign of the maps changes for varying hyperparameter configurations. We did not observe the change of sign to follow a specific rule or to depend on the sign of other features from the same hyperparameter configuration. Hence we assume the overall sign to have no physical interpretation and to depend only on the initialization of the network parameters. We do, however, observe that the relation of positive and negative values within single maps, that is, the overall feature map structure, is constant throughout different choices of hyperparameters. Since these signs do not change the morphology of the map, we chose the sign of the color code in the map display of Fig. 3 according to that of the input map which most closely resembles it. This is positive for stronger emitting regions and negative for weaker emitting regions.
We can investigate which data information a feature map encodes by considering the generative process of the NEAT-VAE and using the back propagation algorithm: Since the number of parameters in the latent space is much smaller than in the data space, the autoencoder is forced to extract the essential information of the input data in order to generate it again. The generative process therefore offers, at least approximately, an explanation for the information flow within the autoencoder, and we can visualize to what extent specific features generate the individual Galactic all-sky maps using sensitivity analysis (e.g., Zurada et al. 1994). Here, we compute the gradients of the reconstructed output maps with respect to the feature maps using the back propagation algorithm. The values of the resulting decoder Jacobian are displayed in HEALPix pixelization in Appendix E.

Another measure for quantifying which input information is encoded by the features is the mutual information. The mutual information

I(X; Y) = ∑_{x,y} P(x, y) ln( P(x, y) / (P(x) P(y)) )   (11)

can be calculated from two-dimensional histograms of feature maps (x) and output maps (y). For a given number of bins, the joint probability distribution P(x, y) is represented in an (x, y)-shaped matrix by counts per bin, while the marginalized distributions P(x) and P(y) are obtained by summing over the respective y- and x-axes of this matrix. In our interpretations, we use both measures to evaluate the encoded information in the latent space (mutual information of feature and input maps), as well as the generative process from the latent space (decoder Jacobian maps), in order to determine the physical content of the features A, B, and C.

Feature A, which is the most significant feature in 98% of the examined hyperparameter configurations (see Appendix C), is displayed in Fig. 3a. From a visual analysis, we recognize a positive color-coded Galactic plane and negative color-coded Galactic poles in the posterior mean. In the eastern part of the Galactic plane, we see a bright, circular structure in the Cygnus region. Further in the east, south of the Galactic plane, structures similar to the Perseus region occur. The circular shaped structure north of the Galactic center resembles the Ophiuchus region, and southwest of the Galactic center, structures similar to the Small and Large Magellanic Clouds can be recognized. In the western part of the Galactic plane, the bright structures look like the Orion region. The posterior variance shows a high certainty in the region of the Galactic plane and the structures of the surrounding latitudes, while at the southern and northern Galactic poles, the uncertainty increases.
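The mutual information values quoted in the following interpretations can be estimated with a short histogram-based routine like this sketch (our own code; the bin count of 512 follows the text):

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 512) -> float:
    """Eq. (11): histogram-based mutual information (in nats) between a
    feature map x and an output map y, both flattened over pixels."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()           # joint distribution P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal P(y)
    mask = p_xy > 0                        # avoid log(0) for empty bins
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))
```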
Interpretation.
We find compelling evidence that feature A traces the dense and dusty parts of the ISM: Based on calculations of the mutual information with 512 bins, the top three data sets contributing to feature A are the AKARI far-infrared 140 µm input data with mutual information I(X;Y) = 0.91, the IRIS infrared 100 µm input data with I(X;Y) = 0.84, and the AKARI far-infrared 160 µm input data, also with I(X;Y) = 0.84. This frequency band of the interstellar radiation field is dominated by the infrared emission of dust (e.g., Draine 2011, p. 121). Feature A shares the least mutual information with the ROSAT X-ray 1.545 keV input data, with I(X;Y) = 0.20.

Next, we analyze the decoder Jacobian maps (see Appendix E), which display the gradients of the reconstructed Galactic all-sky maps with respect to the latent space features in HEALPix format. We observe the following: The lower and mid-latitudes of the γ-ray regime strongly depend on feature A, and going from higher to lower energies, the overall dependence on feature A grows. Especially the low-energy regime of the Fermi data is dominated by hadronic interactions of cosmic rays with the interstellar medium (ISM) and thus shows the gas distribution (Selig et al. 2015). In the X-ray regime, we find small to no dependence; however, the mid- and high latitudes of the soft X-ray data (0.212 keV and 0.197 keV) show negative gradients toward feature A. Here, negative gradients state that when the values in feature A increase, the corresponding values in the reconstructed X-ray maps decrease. Such an inverse-proportional behavior is very plausible in the physical context: X-ray radiation is extincted by cold interstellar gas (Ferriere 2001), thus the absence of X-ray emission reveals regions where interstellar matter is present. The Hα map at 656.3 nm, tracing the n = 3 → n = 2 transition of hydrogen, also depends on feature A, as do the µm-ranged infrared data, and with further decreasing energies, the infrared data down to 545 GHz show an overall, positive dependence. In this regime, emission from thermal dust is observed (Klessen & Glover 2014; Planck Collaboration 2018a). With further decreasing energy, the Planck microwave data depend strongly on feature A, in particular in the Galactic plane. The 21 cm line emission shows an overall dependence on feature A and describes the total neutral atomic hydrogen column density, since the displayed emission occurs due to the transition between two levels of the ground state of atomic hydrogen (Ewen & Purcell 1951). In the synchrotron radio regime at 1420 MHz and 408 MHz, both feature A and feature B play an important role in determining the emission structures, which we will address in Sect. 3.6.

The positive dependencies of sky maps on feature A describe areas of our Galaxy with a high density of interstellar matter,
while the negative gradients correlate with extinction of emission by interstellar matter. We assume that positive gradients mark the pixels in latent space that are used to generate the corresponding pixels in data space, while negative gradients mark an anti-correlation of feature and data pixels. On the basis of this specific combination of gradients and the mutual information, we infer that feature A encodes dense regions of the ISM.

Fig. 3: Most essential components of the Galactic emission data in HEALPix pixelization. Left panels show the posterior mean and right panels show the posterior variance of the three most significant hidden neurons (also called features) in the latent space of the NEAT-VAE: (a) feature A, (b) feature B, (c) feature C. In this case, the algorithm was trained to reduce its input of 35 Galactic all-sky maps to ten features in the latent space with the configuration described in Fig. 2. The mean feature maps (left panels) are displayed in order of significance according to Eq. (10), meaning these features show the highest ratio of feature fluctuations to feature uncertainties within the latent space (see Sect. 3.2 for details). The colors in the posterior mean of feature C (left panel in (c)) are inverted for illustration purposes. The grey pixels correspond to missing values in the input data.
Fig. 4: Correlation of feature A with astrophysical components tracing interstellar matter. From top left to bottom right: (a) thermal dust component in logarithmic scaling (Planck Collaboration 2016b), (b) 2D histogram of the posterior mean intensity of feature A and the thermal dust component, (c) hadronic component of the γ-ray spectrum in logarithmic scaling (Selig et al. 2015), with white pixels denoting missing values, (d) 2D histogram of the posterior mean intensity of feature A and the hadronic component.

Discussion.
We can test our hypothesis by investigating the correlation of feature A with dust as a tracer for the ISM (Kennicutt & Evans 2012). Fig. 4b shows the relationship of feature A and the thermal dust emission (Fig. 4a) as calculated by the Planck Commander code (Eriksen et al. 2008; Planck Collaboration 2016b). The posterior mean of feature A has a strong, positive, and linear correlation with the thermal dust component. Both the Commander and the NEAT-VAE algorithm perform a Bayesian, pixel-wise analysis based on a data model containing the linear sum of a signal function and noise, but with two main differences: First, we only employ statistical information in our prior knowledge, while the Commander priors include detailed physical models for the various emission processes contributing to the radio to far-infrared sky, as well as calibration and correction factors, and a prior for the CMB. Second, the Commander algorithm has 11 million free parameters to tune (Planck Collaboration 2016b), while our model has only about 5,000. Moreover, the Commander code was especially developed to separate the Galactic foregrounds to reconstruct the CMB, while the NEAT-VAE only seeks to find an essential representation of the input data. But in this context, the NEAT-VAE summarized the input data into categories, one of which already contains dust emission. With this result, we assume it is possible to derive the dust component based on feature A with little computational effort.

We investigate another tracer for interstellar matter, namely the hadronic γ-ray component derived by Selig et al. (2015), shown in Fig. 4c. It represents the γ-ray emission due to the interaction of cosmic ray protons with interstellar matter, and it is positively correlated with feature A, see Fig. 4d. This correlation is reasonable, since Selig et al. (2015) composed the hadronic component from the low-energy γ-ray maps, on which feature A also depends. These results meet our initial aim of finding a reduced representation of the input data that combines redundant information in one feature, in this case tracers for dense regions in the ISM. The high significance of feature A is likely to result from the choice of input data, since most of the maps in our data set D represent emission from the dense interstellar matter.
Fig. 5: Correlation of feature B and feature A, with mutual information I(X;Y) = 0.33. The 2D histogram is computed as described in Fig. 4.

We did not include the CO line emission data, but from the positive gradients of the Planck 100-353 GHz channels with respect to feature A, which include the information of the CO line emission (Planck Collaboration 2016b), we assume that the Galactic plane of feature A would most likely generate the CO data. This would support our interpretation of feature A, since CO is a tracer for molecular interstellar gas (Scoville & Sanders 1987), and thus for regions of high density.
Feature B, the second most significant feature in 98% of our experiments, is displayed in Fig. 3b. We see a negative color-coded equator which resembles the Galactic plane, with positive color-coded bulges north and south of the plane. Especially the northern bulge structure looks similar to the morphology of the North Polar Spur. The positive color-coded, circular structure in the western part of the Galactic plane lies in the Vela region, whereas in the east, the location of the Cygnus region appears negative color-coded, with a positive, ring-like structure surrounding it. The uncertainty of this map appears to be low at all latitudes, but is an order of magnitude higher compared to the variance map of feature A.
Interpretation.
Feature B shows the highest mutual information with the X-ray input data of ROSAT at 0.885 keV, 0.725 keV (I(X;Y) = 0.78), and 1.145 keV. According to the decoder Jacobian maps (Appendix E), we observe positive gradients of the γ-ray data on feature B. Starting from the X-ray data at 1.545 keV, the positive gradients get stronger with decreasing energy, and especially the bulge-like area of the reconstructed X-ray data strongly depends on feature B. The soft X-ray regime around 0.25 keV, which coincides with hydrogen cavities (e.g., Sanders et al. 1977; Snowden et al. 1990), also shows a weak, positive dependence on feature B. The infrared and microwave regimes again show little to no dependence; we do, however, observe small negative gradients where the corresponding gradient maps of feature A show strong, positive gradients. The reconstructed 1420 MHz and 408 MHz radio maps show positive dependence on feature B, especially in the bulge-like areas. This data set detects the synchrotron radiation generated by the interaction of cosmic ray electrons with magnetic fields in the ISM (Ginzburg & Syrovatskii 1965), and the intensity of this radiation depends on the density of relativistic particles and the magnetic field strength. The former is predominantly found in the hot areas of the ISM, if only for the much larger volume occupation of this phase within the Milky Way (Cox 2005). Thus we assume that feature B encodes the enhanced presence of such cosmic ray electrons. With the exception of the low-energy X-ray and radio synchrotron regimes, we observe that the data predominantly generated by feature A have little to no dependence on feature B, which we interpret as feature B being complementary to dense regions of the ISM. In combination with the indications from the mutual information with the X-ray input data and the positive gradients, we assume that feature B encodes tracers for dilute and hot regions of the ISM.
Discussion.
Fig. 5, showing the correlation between features A and B, supports our interpretation: The features, in our case the dense and dilute regions of the ISM, are basically uncorrelated. This relationship is reasonable in the sense that the features describe two distinct categories of the ISM: The data generated by feature A are based on emission from cold and warm ionized gas (as observed in the 21 cm and Hα line emissions), dust, and interactions with interstellar matter (cosmic rays). The ionization of the warm gas occurs mainly due to photo-ionization by O and B stars, the hottest and most massive stars of the Milky Way. Due to their ratio of mass and luminosity, these stars have a short main sequence life cycle and are thus found near their initial birthplaces, the dense ISM (e.g., Blome et al. 1997, p. 60). Feature B, however, encodes data which represent radiation of an even hotter medium, the hot ionized plasma, which is generated by supernova explosions (e.g., Kahn 1980). The radiative processes encoded by features A and B can thus be traced back to two fundamentally different origins, namely emission from interstellar matter and from stellar explosions.

One regime of the electromagnetic spectrum missing in our input data is the ultraviolet (UV) frequency band. In this regime, the emission of very hot gas which gets ionized by collisions can be observed, but UV radiation does not penetrate dense regions of the ISM and is thus mostly absorbed at low and mid-latitudes (Ferriere 2001). Due to this extinction, it is likely that UV radiation would be interpreted by the NEAT-VAE as redundant with the soft X-ray data, which show similar dust absorption patterns. This would support our interpretation of feature B as encoding the hot and dilute ISM.

The third most significant feature in 74% of the investigated configurations, feature C, is displayed in Fig. 3c. Here, we inverted the colors of the posterior mean such that the fluctuations resemble the color-coding of the CMB, see Fig. 6a. The regions of the Galactic plane and Cygnus have a strong, positive color-coding, while most of the other structures fluctuate around zero. The uncertainty is highest in the Galactic plane, especially in the Galactic center. There are bulge-like shapes in the variance map north and south of the Galactic center denoting a medium level of uncertainty. Toward higher latitudes, the uncertainty is very low.
Fig. 6: Panel (a) shows the CMB as derived by the Planck Collaboration (2018c); panels (b)-(d) display correlations of the posterior mean of feature C with (b) the CMB, mutual information I(X;Y) = 0.88, (c) feature A, I(X;Y) = 0.29, and (d) feature B, I(X;Y) = 0.20. The histograms and mutual information are compiled as described in Fig. 4. Bright colors in the histograms denote a high number of counts.
Interpretation.
The mutual information of feature C is highest with the Planck data at 70, 100, and 143 GHz, and the least mutual information is observed with the ROSAT X-ray 1.545 keV data, with I(X;Y) = 0.12. According to the decoder Jacobian maps in Appendix E, feature C mostly generates the Planck 30-217 GHz channels, especially the 70 and 100 GHz data, in which the CMB can be observed (Planck Collaboration 2018a). All other reconstructed maps show little to no dependence on feature C, with the exception of a weak, positive dependence of the soft ROSAT X-ray data on feature C that we will address in Sect. 3.6. Based on the mutual information and the strong positive gradients in the microwave regime around 100 GHz, we assume feature C to encode the CMB.
Discussion.
Considering our physical interpretation of features A and B, it is reasonable that the CMB emission is encoded in a separate feature, since the underlying physical processes in the early Universe shaping the CMB fluctuations are independent of the processes in the ISM. To test our hypothesis, we compare feature C to the CMB all-sky map derived by the Planck Commander code in Fig. 6a. Feature C shows a positive, linear correlation (I(X;Y) = 0.88) with the CMB (Fig. 6b). A main difference between feature C and the CMB can be seen in the Galactic plane, which feature C encodes (in addition to the Cygnus region) with high intensities, but also with high uncertainty (see Fig. 3c). We assume the high intensities in the Galactic plane of feature C to originate from the model's architecture: Since we do not take spatial correlations into account, the algorithm learns the spectral relations of each pixel independently. The Galactic planes of the input maps are dominated by high intensity values, while at other latitudes, more low-intensity values are present. We assume that this generates two distinct clusters of intensity ranges, leading to two different spectral relations learned by the algorithm. We can also observe this distinction in intensity by analyzing the contributions of pixels to data generation (see Appendix E): The structure of the Galactic plane is recognizable in most of the decoder Jacobian maps, even though the algorithm has no information about spatial correlations. This again indicates that the relations learned in different intensity regimes represent different processes in the NEAT-VAE structure, which, however, are rather to be associated with internal computations than with physical properties.

We finally examine the decoder Jacobian maps which do not have a clear physical interpretation. For example, we observe
Article number, page 11 of 25 & A proofs: manuscript no. NEAT_VAE feature A (the dense ISM) contributing to the 1420 and 408 MHzradio data, as well as feature C (the CMB) to supposedly gener-ate the soft X-ray data.The NEAT-VAE algorithm recombines all features of the latentspace in a highly non-linear way to generate the output maps,meaning that the gradients in Appendix E cannot always be con-sidered independently or in a linear way. Especially when theabsolute gradient values of the reconstructed maps are large formore than one feature, a holistic analysis is required. In mostof the displayed cases in Appendix E, there is one feature dom-inantly generating the reconstruction of input data. One excep-tion, however, is the radio data, where we observe positive gradi-ents with respect to both feature A and B. A speculative physicalexplanation is that feature A marks regions of enhanced mag-netic field strength and feature B of enhanced relativistic elec-tron densities, both quantities that in combination determine thesynchrotron emission. However, we rather assume that these twofeatures have no physical meaning for synchrotron emission, butthe internal computations of our algorithm run most e ffi cientlywhen latent values of features A and B are used for radio datageneration. One always has to consider the fact that our algo-rithm has no information about physical relations and primarilyoptimizes a statistical function. Another example for such a be-havior can be found in the soft X-ray data, where we observestrong negative gradients with respect to feature A and weakpositive gradients toward both features B and C. This mixingof features in a partially contrasting manner indicates the non-linearity of the function mapping from the latent space valuesto the reconstructed output data: We hypothesize that some val-ues might be used by the algorithm just to tune some others out,since the decoder has to regenerate the data based on all avail-able features. This discussion shows that the decoder Jacobiansgive a valid first interpretation of the strongest correlations, buthave to be analyzed with care. Not each contribution to the gra-dient represents a meaningful relationship, especially since allgradients are entangled and cannot be considered independently.
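Decoder Jacobian maps such as those in Appendix E can be reproduced with automatic differentiation. The following PyTorch sketch illustrates the idea; the decoder architecture shown is a hypothetical stand-in for the trained network $f_\theta$, and only the gradient logic mirrors our analysis.

    import torch

    latent_dim, data_dim = 10, 35
    decoder = torch.nn.Sequential(                  # stand-in for the trained decoder
        torch.nn.Linear(latent_dim, 128), torch.nn.ReLU(),
        torch.nn.Linear(128, data_dim),
    )

    def decoder_jacobian(z):
        # Jacobian df/dz of shape (data_dim, latent_dim) at one latent vector z.
        return torch.autograd.functional.jacobian(decoder, z)

    # One Jacobian per pixel; jac[p, k, j] = d(map k at pixel p) / d(feature j).
    latent_means = torch.randn(1000, latent_dim)    # toy posterior means, one per pixel
    jac = torch.stack([decoder_jacobian(z) for z in latent_means])
    print(jac.shape)                                # torch.Size([1000, 35, 10])

The HEALPix image of jac[:, k, j] over all pixels then corresponds to the panel connecting reconstructed map k with feature j.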
4. Conclusions
In summary, we derived a probabilistically motivated machine learning framework, the NEAT-VAE, which successfully identifies the most significant sky emission components from pixel-wise sky brightness values across the full electromagnetic spectrum. At this task, the algorithm performs computationally efficiently and in a fully unsupervised manner. The three most significant resulting sky components express physical relationships in the considered data set and can be assigned to emission processes of the dense ISM, the hot and dilute ISM, and the CMB.
We achieved this performance by developing a Bayesian formulation of the component separation problem for astrophysical data, which serves as the minimization objective, or loss function, for a state-of-the-art variational autoencoder. Using Bayes' theorem, we combined a generative data model with a variational inference process, incorporated a very limited amount of generic prior knowledge (e.g., independence of features, Gaussian statistics of typical noise processes), and approximated unknown functions by neural networks.
The resulting algorithm is able to group the essential information contained in a set of Galactic all-sky observations into mutually independent features. This property results from the algorithm's architecture and the diagonal covariance approximation we chose: first, an autoencoder maps its input to a lower-dimensional latent space, from which it aims to reconstruct its input again. By reducing the dimension of the data in this so-called information bottleneck, the autoencoder is forced to learn a useful representation of the input data. By additionally stating a diagonal covariance matrix in the variational approximation of the latent space posterior distribution (see Sect. 2.5), we find the latent space's optimal independent approximation. We observe that there is an order among the latent space features based on their significance for encoding the data set. This computational significance correlates with physical significance, with the most significant features having a clear physical interpretation and representing emission from the dense ISM, the hot and dilute ISM, and the non-Galactic CMB, respectively. Our interpretations are based on the analysis of (1) the mutual information of the features and the input data, (2) the generative properties of the features to reconstruct the input data, and (3) the features' correlations with other astrophysical quantities. Thus we were able to verify that the NEAT-VAE algorithm detects the most informative components of emission into which the Galactic input data can be decomposed. However, as discussed in Sect. 3.6, the analysis of our generative process is not always straightforward and has to be performed with care, since the examined gradients might represent internal computation strategies of the algorithm rather than physically meaningful generative properties.
The relevance of our work lies in the insights we obtained in the context of representation learning (RL). In RL, the reduced representation of high-dimensional data is often used as a preprocessing step in order to perform the actual analysis more efficiently. For example, the task of deriving the thermal dust emission is computationally very expensive. Our feature A, however, which encodes the dense ISM, is highly correlated with the thermal dust component.
We hypothesize that a component separation algorithm analyzing only the information contained in feature A, or using feature A as a starting point, is likely to perform faster and more efficiently than one analyzing a larger data set with redundant and entangled information.
The loss function of the NEAT-VAE and the associated hyperparameter tuning mainly determine the performance of the algorithm. Although the loss function used to direct the learning is based on very general assumptions, the Bayesian framework in which it was derived describes the initial problem probabilistically. This allows us to track uncertainties and evaluate the significance of each feature. Hyperparameter tuning can be reduced by including those parameters in the Bayesian inference process, as we did for the covariance of the model noise. In general, the joint optimization of an inference and a generative process is a state-of-the-art approach to disentangle data, which we showed to be very successful for unsupervised learning applied to astrophysical observations. Especially the fact that astrophysical data are a superposition of several radiative processes, and thus are rich in relationships between emitting as well as absorbing components, makes them a very suitable target for decomposition algorithms.
The fact that we incorporated neither spatial correlations nor physical priors in the learning process shows how much information can be retrieved from data-driven approaches: the sky brightness values of single pixels in Galactic all-sky maps alone are sufficient to detect emission from the dense ISM, the dilute ISM, and the CMB. By additionally combining Bayesian methods with neural networks, we were able to track uncertainties and efficiently approximate unknown functions. This work shows that machine-assisted Bayesian signal inference can be performed with neural networks, and that the minimization of a loss function based on information-theoretic principles reveals meaningful relations in astrophysical data.
Acknowledgements.
We acknowledge Jakob Knollmüller for helpful discussions.
References
Ackermann, M., Ajello, M., Albert, A., et al. 2012, ApJS, 203, 4
Atwood, W. B., Abdo, A. A., Ackermann, M., et al. 2009, ApJ, 697, 1071
Beaumont, C. N., Williams, J. P., & Goodman, A. A. 2011, ApJ, 741, 14
Bengio, Y., Courville, A., & Vincent, P. 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1798
Bianchi, L., Shiao, B., & Thilker, D. 2017, ApJS, 230, 24
Bishop, C. M. 2006, Pattern Recognition and Machine Learning (Springer)
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. 2017, Journal of the American Statistical Association, 112, 859
Blome, H.-J., Hoell, J., & Priester, W. 1997, Bergmann-Schäfer, Bd. 8: Sterne und Weltraum
Cardoso, J.-F., Le Jeune, M., Delabrouille, J., Betoule, M., & Patanchon, G. 2008, IEEE Journal of Selected Topics in Signal Processing, 2, 735
Cox, D. P. 2005, ARA&A, 43, 337
Delabrouille, J., Cardoso, J.-F., & Patanchon, G. 2003, MNRAS, 346, 1089
Dennison, B., Simonetti, J., & Topasna, G. 1999, 31, 1455
Devroye, L. 1986, Non-Uniform Random Variate Generation (Springer)
Doi, Y., Takita, S., Ootsubo, T., et al. 2015, PASJ, 67, 50
Draine, B. T. 2011, Physics of the Interstellar and Intergalactic Medium
Enßlin, T. A. 2019, Annalen der Physik, 531, 1800127
Eriksen, H., Jewell, J., Dickinson, C., et al. 2008, ApJ, 676, 10
Ewen, H. I. & Purcell, E. M. 1951, Nature, 168, 356
Ferriere, K. M. 2001, Reviews of Modern Physics, 73, 1031
Finkbeiner, D. P. 2003, ApJS, 146, 407
Fluke, C. J. & Jacobs, C. 2020, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10, e1349
Freyberg, M. J. 1998, Astronomische Nachrichten, 319, 93
Freyberg, M. J. & Egger, R. 1999, in Highlights in X-ray Astronomy, Vol. 272, 278
Gaustad, J. E., McCullough, P. R., Rosing, W., & Van Buren, D. 2001, PASP, 113, 1326
Ginzburg, V. L. & Syrovatskii, S. 1965, ARA&A, 3, 297
Gold, B., Odegard, N., Weiland, J. L., et al. 2011, ApJS, 192, 15
Goodfellow, I., Bengio, Y., & Courville, A. 2016, Deep Learning (MIT Press)
Gorski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
Haslam, C. G. T., Salter, C. J., Stoffel, H., & Wilson, W. E. 1982, A&AS, 47, 1
HI4PI Collaboration. 2016, A&A, 594, A116
Hinton, G. E. 1990, in Machine Learning (Elsevier), 555-610
Hinton, G. E. & Salakhutdinov, R. R. 2006, Science, 313, 504
Jaynes, E. T. 1982, Proceedings of the IEEE, 70, 939
Kahn, F. D. 1980, Highlights of Astronomy, 5, 365
Kennicutt, R. C. & Evans, N. J. 2012, ARA&A, 50, 531
Kingma, D. P. & Ba, J. 2014, arXiv preprint arXiv:1412.6980
Kingma, D. P. & Welling, M. 2013, arXiv preprint arXiv:1312.6114
Kingma, D. P. & Welling, M. 2019, An Introduction to Variational Autoencoders
Klessen, R. S. & Glover, S. C. O. 2014, Physical Processes in the Interstellar Medium
Kremer, J., Stensbo-Smidt, K., Gieseke, F., Pedersen, K. S., & Igel, C. 2017, IEEE Intelligent Systems, 32, 16
Leach, S. M., Cardoso, J.-F., Baccigalupi, C., et al. 2008, A&A, 491, 597
Leike, R. & Enßlin, T. 2017, Entropy, 19, 402
Longo, G., Merényi, E., & Tiňo, P. 2019, PASP, 131, 100101
Madsen, G. J., Haffner, L. M., & Reynolds, R. J. 2001, The Wisconsin H-Alpha Mapper Northern Sky Survey of Galactic Ionized Hydrogen
Mannheim, K. & Schlickeiser, R. 1994, A&A, 286, 983
Miville-Deschênes, M.-A. & Lagache, G. 2005, ApJS, 157, 302
Müller, A., Hackstein, M., Greiner, M., et al. 2018, A&A, 620, A64
Murray, C. & Peek, J. E. 2019, in American Astronomical Society Meeting Abstracts, Vol. 233
Neugebauer, G., Habing, H. J., van Duinen, R., et al. 1984, ApJ, 278, L1
Planck Collaboration. 2018b, Diffuse component separation
Planck Collaboration. 2018c, arXiv preprint arXiv:1807.06208
Portillo, S. K. N., Parejko, J. K., Vergara, J. R., & Connolly, A. J. 2020, AJ, 160, 45
Reich, P. & Reich, W. 1986, A&AS, 63, 205
Reich, P., Testori, J. C., & Reich, W. 2001, A&A, 376, 861
Reich, W. 1982, A&AS, 48, 219
Reis, I., Rotman, M., Poznanski, D., Prochaska, J. X., & Wolf, L. 2019, Effectively Using Unsupervised Machine Learning in Next Generation Astronomical Surveys
Remazeilles, M., Dickinson, C., Banday, A. J., Bigot-Sazy, M. A., & Ghosh, T. 2015, MNRAS, 451, 4311
Rezende, D. J., Mohamed, S., & Wierstra, D. 2014, arXiv preprint arXiv:1401.4082
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. 1985, Learning Internal Representations by Error Propagation, Tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science
Sanders, W. T., Kraushaar, W. L., Nousek, J. A., & Fried, P. M. 1977, ApJ, 217, L87
Scoville, N. Z. & Sanders, D. B. 1987, in Interstellar Processes, ed. D. J. Hollenbach & H. A. Thronson (Dordrecht: Springer Netherlands), 21-50
Selig, M., Vacca, V., Oppermann, N., & Enßlin, T. A. 2015, A&A, 581, A126
Snowden, S., Cox, D., McCammon, D., & Sanders, W. 1990, ApJ, 354, 211
Snowden, S. L., Egger, R., Freyberg, M. J., et al. 1997, ApJ, 485, 125
Snowden, S. L., Freyberg, M. J., Plucinsky, P. P., et al. 1995, ApJ, 454, 643
Titsias, M. & Lázaro-Gredilla, M. 2014, in International Conference on Machine Learning, 1971-1979
Ucci, G., Ferrara, A., Gallerani, S., et al. 2018a, MNRAS, 483, 1295
Ucci, G., Ferrara, A., Pallottini, A., & Gallerani, S. 2018b, MNRAS, 477, 1484
Zurada, J. M., Malinowski, A., & Cloete, I. 1994, in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '94), Vol. 6 (IEEE), 447-450
Appendix A: Astrophysical data set
Table A.1: Overview of the data sets used in this study. The data were preprocessed by Müller et al. (2018) as described in Sect. 2.1. Our data compilation D consists of 35 data sets, with the number of data sets per mission declared in the column 'Maps'.

Mission / survey      Frequency range    Maps   References
Fermi γ-ray           0.85-217.22 GeV    9      Atwood et al. (2009), Ackermann et al. (2012), Selig et al. (2015)
ROSAT X-ray           0.…-1.545 keV      6      Snowden et al. (1995, 1997), Freyberg (1998), Freyberg & Egger (1999)
Hα line emission      656.3 nm           1      Dennison et al. (1999), Gaustad et al. (2001), Madsen et al. (2001), Finkbeiner (2003)
IRIS far-infrared     12-100 µm          4      Neugebauer et al. (1984), Miville-Deschênes & Lagache (2005)
AKARI far-infrared    90-160 µm          3      Doi et al. (2015)
Planck microwave      30-857 GHz         9      Planck Collaboration (2016a)
HI line emission      21 cm              1      HI4PI Collaboration (2016)
Reich radio           1420 MHz           1      Reich (1982), Reich & Reich (1986), Reich et al. (2001)
Haslam radio          408 MHz            1      Haslam et al. (1982), Remazeilles et al. (2015)
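As an illustration of how such a compilation can be assembled, the following Python sketch stacks preprocessed HEALPix maps into a single pixel-times-band data matrix; the file names are hypothetical placeholders, not our actual data products.

    import numpy as np
    import healpy as hp

    map_files = [f"data/map_{k:02d}.fits" for k in range(35)]   # placeholder paths
    maps = [hp.read_map(f) for f in map_files]                  # each of length 12 * nside**2
    D = np.stack(maps, axis=1)   # shape (n_pixels, 35): one 35-band spectrum per pixel
    print(D.shape)

Each row of D is then treated as an independent data point d_i by the NEAT-VAE.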
Appendix B: Calculations

Here, we provide the analytic steps between Eqs. (7) and (8). For completeness, we start our calculations with the full posterior $P(\Theta \,|\, D) = P(D \,|\, \Theta) \times P(\Theta) / P(D)$:
$$
D_{\mathrm{KL}}\left[Q_\Phi(\cdot)\,\|\,P(\cdot)\right]
= \int \mathrm{d}Z\, \mathrm{d}\theta\, \mathrm{d}\xi_N\;
Q_\Phi(Z, \theta, \xi_N \,|\, D)\,
\ln\!\left( \frac{Q_\Phi(Z, \theta, \xi_N \,|\, D)}{P(Z, \theta, \xi_N \,|\, D)} \right)
$$
$$
= \int \prod_{i=1}^{p} \mathrm{d}z_i\, Q_\Phi(z_i \,|\, d_i)
\int \mathrm{d}\theta\, Q_\Phi(\theta \,|\, D)
\int \mathrm{d}\xi_N\, Q_\Phi(\xi_N \,|\, D)\;
\ln \frac{\prod_{i=1}^{p} Q_\Phi(z_i \,|\, d_i)\; Q_\Phi(\theta \,|\, D)\; Q_\Phi(\xi_N \,|\, D)\; P(D)}
{P(\theta)\, P(\xi_N) \prod_{i=1}^{p} \mathcal{G}\!\left(d_i - f_\theta(z_i),\, t_\psi(\xi_N)\right) \mathcal{G}(z_i, \mathbb{1})}.
$$
Expanding the logarithm into a sum of terms, the normalization of the variational distributions renders most of the integrals trivial: the self-entropies of $Q_\Phi(\theta \,|\, D) = \delta(\theta - \hat{\theta})$ and $Q_\Phi(\xi_N \,|\, D) = \delta(\xi_N - \hat{\xi}_N)$ are constants independent of the variational parameters, $\int \mathrm{d}\theta\, Q_\Phi(\theta \,|\, D) \ln P(\theta) = \mathrm{const.}$ since $P(\theta) = \mathrm{const.}$, and $\ln P(D) = \mathrm{const.}$ since $P(D)$ does not depend on any parameter. Inserting further $Q_\Phi(z_i \,|\, d_i) = \mathcal{G}(z_i - \mu_i, \Sigma_i)$, we get
$$
D_{\mathrm{KL}}\left[Q_\Phi(\cdot)\,\|\,P(\cdot)\right]
= \sum_{i=1}^{p} \left\langle \ln \mathcal{G}(z_i - \mu_i, \Sigma_i) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}
- \left\langle \ln \mathcal{G}(\xi_N, \mathbb{1}) \right\rangle_{\delta(\xi_N - \hat{\xi}_N)}
- \sum_{i=1}^{p} \left\langle \ln \mathcal{G}(z_i, \mathbb{1}) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}
- \sum_{i=1}^{p} \left\langle \ln \mathcal{G}\!\left(d_i - f_{\hat{\theta}}(z_i),\, t_\psi(\hat{\xi}_N)\right) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}
+ H_0,
\tag{B.1}
$$
where $H_0$ collects all terms independent of the parameters. We name the subsequent terms of the Kullback-Leibler divergence as follows and calculate each expression in turn:

- $T_1 = \left\langle \ln \mathcal{G}(z_i - \mu_i, \Sigma_i) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}$
- $T_2 = \left\langle \ln \mathcal{G}(\xi_N, \mathbb{1}) \right\rangle_{\delta(\xi_N - \hat{\xi}_N)}$
- $T_3 = \left\langle \ln \mathcal{G}(z_i, \mathbb{1}) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}$
- $T_4 = \left\langle \ln \mathcal{G}\!\left(d_i - f_{\hat{\theta}}(z_i),\, t_\psi(\hat{\xi}_N)\right) \right\rangle_{\mathcal{G}(z_i - \mu_i, \Sigma_i)}$

First energy term $T_1$:
$$
T_1 = \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\,
\ln \frac{\exp\!\left(-\tfrac{1}{2} (z_i - \mu_i)^T \Sigma_i^{-1} (z_i - \mu_i)\right)}{\sqrt{|2\pi \Sigma_i|}}
= \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)
\left[ -\tfrac{1}{2} (z_i - \mu_i)^T \Sigma_i^{-1} (z_i - \mu_i) - \tfrac{1}{2} \ln\!\big( \underbrace{|2\pi \Sigma_i|}_{(2\pi)^l |\Sigma_i|} \big) \right].
$$
The trace of a scalar is the scalar itself, resulting in $(z_i - \mu_i)^T \Sigma_i^{-1} (z_i - \mu_i) = \operatorname{tr}\left( (z_i - \mu_i)^T \Sigma_i^{-1} (z_i - \mu_i) \right)$. Further, the trace is invariant under cyclic permutations, $\operatorname{tr}\left( (z_i - \mu_i)^T \Sigma_i^{-1} (z_i - \mu_i) \right) = \operatorname{tr}\left( \Sigma_i^{-1} (z_i - \mu_i)(z_i - \mu_i)^T \right)$. Finally, the trace is a linear operator and therefore commutes with the expectation:
$$
T_1 = -\tfrac{1}{2} \operatorname{tr} \underbrace{\int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\, \Sigma_i^{-1} (z_i - \mu_i)(z_i - \mu_i)^T}_{= \Sigma_i^{-1} \Sigma_i = \mathbb{1}}
- \tfrac{1}{2} \big[ \underbrace{\ln |\Sigma_i|}_{= \operatorname{tr}(\ln \Sigma_i)} + l \ln(2\pi) \big]
= -\tfrac{1}{2} \left[ \operatorname{tr}(\ln \Sigma_i) + l\, (1 + \ln(2\pi)) \right],
\tag{B.2}
$$
where we used $\operatorname{tr}(\mathbb{1}) = l$. We carry the term $l\,(1 + \ln(2\pi))$ through our computations, since we change the number of latent space features $l$ in our different experiments.

Second energy term $T_2$:
$$
T_2 = \int \mathrm{d}\xi_N\, \delta(\xi_N - \hat{\xi}_N)\,
\ln\!\left( \frac{1}{\sqrt{2\pi}} \exp\!\left( -\tfrac{1}{2} \xi_N^2 \right) \right)
= -\tfrac{1}{2} \hat{\xi}_N^2 - \underbrace{\tfrac{1}{2} \ln(2\pi)}_{= \mathrm{const.}}
= -\tfrac{1}{2} \hat{\xi}_N^2 + C.
\tag{B.3}
$$

Third energy term $T_3$:
$$
T_3 = \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\,
\ln \frac{\exp\!\left( -\tfrac{1}{2} z_i^T z_i \right)}{\sqrt{|2\pi \mathbb{1}|}}
= -\tfrac{1}{2} \operatorname{tr}\!\left( \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\, z_i z_i^T \right) + C.
$$
The expression $z_i z_i^T$ can be rewritten as $z_i z_i^T = (z_i - \mu_i)(z_i - \mu_i)^T - \mu_i \mu_i^T + z_i \mu_i^T + \mu_i z_i^T$. Inserting this and using $\langle (z_i - \mu_i)(z_i - \mu_i)^T \rangle = \Sigma_i$ and $\langle z_i \rangle = \mu_i$, we obtain
$$
T_3 = -\tfrac{1}{2} \operatorname{tr}\!\left( \Sigma_i - \mu_i \mu_i^T + \mu_i \mu_i^T + \mu_i \mu_i^T \right) + C
= -\tfrac{1}{2} \operatorname{tr}\!\left( \Sigma_i + \mu_i \mu_i^T \right) + C.
\tag{B.4}
$$

Fourth energy term $T_4$:
$$
T_4 = \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\,
\ln \frac{\exp\!\left( -\tfrac{1}{2} \left(d_i - f_{\hat{\theta}}(z_i)\right)^T t_\psi(\hat{\xi}_N)^{-1} \left(d_i - f_{\hat{\theta}}(z_i)\right) \right)}{\sqrt{|2\pi\, t_\psi(\hat{\xi}_N)|}}
$$
$$
= -\tfrac{1}{2} \int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\,
\operatorname{tr}\!\left[ t_\psi(\hat{\xi}_N)^{-1} \left(d_i - f_{\hat{\theta}}(z_i)\right) \left(d_i - f_{\hat{\theta}}(z_i)\right)^T \right]
- \tfrac{1}{2} \operatorname{tr}\!\left( \ln t_\psi(\hat{\xi}_N) \right) + C,
\tag{B.5}
$$
where $|2\pi\, t_\psi(\hat{\xi}_N)| = (2\pi)^k |t_\psi(\hat{\xi}_N)|$ and the constant $\tfrac{k}{2} \ln(2\pi)$ has been absorbed into $C$.

We cannot evaluate the remaining integral exactly using analytic techniques. Fortunately, we can make use of Monte Carlo methods to approximate the expectation with a finite sum:
$$
\int \mathrm{d}z_i\, \mathcal{G}(z_i - \mu_i, \Sigma_i)\,
\operatorname{tr}\!\left[ t_\psi(\hat{\xi}_N)^{-1} \left(d_i - f_{\hat{\theta}}(z_i)\right)\left(d_i - f_{\hat{\theta}}(z_i)\right)^T \right]
\approx \frac{1}{M} \sum_{m=1}^{M}
\operatorname{tr}\!\left( t_\psi(\hat{\xi}_N)^{-1} \left(d_i - f_{\hat{\theta}}(z_i^{(m)})\right)\left(d_i - f_{\hat{\theta}}(z_i^{(m)})\right)^T \right).
\tag{B.6}
$$
With this method we can achieve a high-accuracy estimator for the expectation value (even with a small number of samples), as long as we draw the samples $z_i^{(m)}$ from the distribution $\mathcal{G}(z_i - \mu_i, \Sigma_i)$ (Bishop 2006). This means the samples $z_i$ are a function of $\mu_i$ and $\Sigma_i$ via $z_i = \mu_i + \sqrt{\Sigma_i}\, \epsilon_i$, with an auxiliary variable $\epsilon_i$ and $P(\epsilon_i) = \mathcal{G}(\epsilon_i, \mathbb{1})$ (see Sect. 2.6 for details). We sample once for each pixel, i.e., $M = 1$.

The full Kullback-Leibler divergence (B.1) then reads
$$
D_{\mathrm{KL}}\left[Q_\Phi(\cdot)\,\|\,P(\cdot)\right]
= \frac{1}{2} \sum_{i=1}^{p} \Bigg[
- \operatorname{tr}(\ln \Sigma_i) - l\,(1 + \ln(2\pi)) + \frac{\hat{\xi}_N^2}{p}
+ \operatorname{tr}\!\left( \Sigma_i + \mu_i \mu_i^T \right)
+ \operatorname{tr}\!\left( t_\psi(\hat{\xi}_N)^{-1} \left(d_i - f_{\hat{\theta}}(z_i)\right)\left(d_i - f_{\hat{\theta}}(z_i)\right)^T \right)
+ \operatorname{tr}\!\left( \ln t_\psi(\hat{\xi}_N) \right)
\Bigg] + H_0,
\tag{B.7}
$$
where we absorbed all constant terms into a redefined $H_0$.
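The parameter-dependent part of Eq. (B.7) translates into a compact loss function. The following PyTorch sketch assumes a diagonal $\Sigma_i$ parameterized by its logarithm and a diagonal noise covariance obtained from a log-normal transformation, as in Appendix C; the tensor names and the illustrative values of µ_N and σ_N are ours, not prescriptions of the model.

    import torch

    def neat_vae_loss(d, recon, mu, logvar, xi, log_t, n_pixels):
        # T1 and T3 of Eq. (B.7): entropy of Q and standard-normal latent prior.
        latent = 0.5 * (-logvar + logvar.exp() + mu**2).sum(dim=1)
        # T4: noise-weighted reconstruction error plus log-determinant of t_psi.
        recon_err = 0.5 * (((d - recon)**2 / log_t.exp()).sum(dim=1) + log_t.sum())
        # T2: standard-normal prior on the noise parameter, spread over all pixels.
        prior_xi = 0.5 * (xi**2).sum() / n_pixels
        return (latent + recon_err + prior_xi).sum()

    mu, logvar = torch.zeros(4, 10), torch.zeros(4, 10)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterization, M = 1
    mu_N, sigma_N = -5.0, 1.0                              # hypothetical values
    xi = torch.zeros(35, requires_grad=True)
    log_t = mu_N + sigma_N * xi                            # diagonal t_psi(xi), log scale
    d, recon = torch.randn(4, 35), torch.randn(4, 35)      # toy data and reconstruction
    print(neat_vae_loss(d, recon, mu, logvar, xi, log_t, n_pixels=4))

Minimizing this expression with respect to mu, logvar, the network weights, and xi jointly performs the variational inference and the noise estimation.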
Appendix C: Hyperparameters

Hyperparameters of the NEAT-VAE are (1) the number of network layers, (2) the number of hidden neurons per layer, (3) the number of neurons in the bottleneck layer, (4) the mean $\mu_N$ and (5) the standard deviation $\sigma_N$ of the log-normal model for the noise covariance transformation, the learning rates of (6) the network weights and (7) the noise parameter weights, and (8) the batch size. We examined 50 configurations of the NEAT-VAE, consisting of different arrangements of hyperparameters. The values for the hyperparameters were randomly chosen from limited intervals, which we specified for each hyperparameter based on prior experiments (see Table C.1). We implemented rectified linear unit (ReLU) activation functions in each layer except the output layer and used the Adam optimizer (Kingma & Ba 2014) to update the network weights $\phi$ and $\hat{\theta}$, and the latent noise value $\hat{\xi}_N$. We trained each configuration for $(10 \times \mathrm{batch\ size})/p$ epochs, with $p$ denoting the number of pixels. We built the NEAT-VAE as a descriptive rather than a predictive model, since we aim to learn the underlying, independent features generating one particular data set. For this reason, we neither split the data into training, validation, and test sets, nor analyze overfitting. For reproducibility, we fixed the random seed to 123.

Table C.1: Hyperparameters of the NEAT-VAE. The number of layers, neurons per layer, and bottleneck neurons determine the network architecture. $\mu_N$ and $\sigma_N$ are transformation parameters of the noise covariance matrix N. The optimization of the learnable parameters is determined by the learning rates (LR) for the network weights $\phi$, $\hat{\theta}$ and the latent noise $\hat{\xi}_N$, which are tuned to minimize the objective function in Eq. (B.7). When using mini-batching, the batch size determines how many data samples are used to compute the loss function before back-propagation and a model update are performed.

Hyperparameter        Sampling sets
Layers                {…, …, …, …, …}
Hidden neurons        {…, . . . , …}
Bottleneck neurons    {…, . . . , …}
$\mu_N$               {−…, −…, −…, −…, −…}
$\sigma_N$            Unif[1, …]
LR $\phi$, $\hat{\theta}$   {0.…, 0.…, 0.…, 0.…}
LR $\hat{\xi}_N$      {0.…, 0.…, 0.…, 0.…}
Batch size            {…, …, …, …, …}
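The random search over these sampling sets is straightforward to reproduce in outline. In the following Python sketch the candidate sets are illustrative placeholders rather than the values of Table C.1; only the procedure itself (random draws, fixed seed, 50 configurations) follows the text above.

    import random

    random.seed(123)                       # fixed seed, as stated above
    search_space = {                       # placeholder candidate sets
        "layers": [3, 4, 5, 6, 7],
        "hidden_neurons": list(range(32, 257)),
        "bottleneck_neurons": list(range(4, 17)),
        "mu_N": [-10.0, -7.5, -5.0, -3.0, -1.0],
        "sigma_N": [0.5, 1.0, 1.5, 2.0],
        "lr_weights": [1e-4, 3e-4, 1e-3, 3e-3],
        "lr_xi": [1e-4, 3e-4, 1e-3, 3e-3],
        "batch_size": [128, 256, 512, 1024, 2048],
    }
    configs = [{k: random.choice(v) for k, v in search_space.items()}
               for _ in range(50)]
    print(configs[0])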
Appendix D: Other features
In the experiment discussed in Sect. 3, our NEAT-VAE algorithm mapped 35 Galactic all-sky maps to ten latent space features that encode the essential information required to reconstruct the input again. The latent space exhibits an order in the significance of the features, which we measured by the ratio of the mean fluctuations to the feature posterior variance (see Eq. (10)); a minimal sketch of this ranking is given at the end of this appendix. Features A, B, and C, which have the highest significance, are shown in Fig. 3. The remaining seven features (with their mean and variance in HEALPix representation) are displayed in the following in order of descending significance. By visual analysis, we can recognize artifacts, for example of the IRIS scanning scheme, in features D, F, G, and H. Feature H also seems to encode structures near the Galactic plane of the Hα map, and the mean of feature J encodes structures similar to the Fermi Bubbles in high-energy γ-ray data. All-sky images of the 35 Galactic input data sets are displayed in the leftmost columns of Appendix E.

Fig. D.1: Full latent space representation with 10 neurons in descending order of significance. Significance values are: S(feature D) = …, S(feature E) = 0.07, and S(feature F) = ….
Fig. D.1: Least significant latent space feature maps (continued). Significance values are: S(feature G) = …, S(feature H) = …, S(feature I) = 0.34, and S(feature J) = ….
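A minimal sketch of the significance ranking reads as follows, assuming that the measure of Eq. (10) is computed as the ratio of the fluctuations of a feature's posterior mean map to its mean posterior variance; the exact normalization is given in Sect. 3.

    import numpy as np

    def significance(mu, var):
        # mu, var: arrays of shape (n_pixels, l) holding the posterior mean and
        # variance of each latent feature; returns one score per feature.
        return mu.std(axis=0) / np.sqrt(var.mean(axis=0))

    mu = np.random.randn(1000, 10) * np.linspace(3.0, 0.1, 10)   # toy posterior means
    var = np.ones((1000, 10))                                    # toy posterior variances
    order = np.argsort(significance(mu, var))[::-1]              # descending significance
    print(order)                                                 # most significant first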
Appendix E: Decoder Jacobian maps
The following panels show the gradients of the reconstructed Galactic all-sky maps (that is, of the output of our NEAT-VAE algorithm) with respect to the latent space features A, B, and C. The Galactic input data are displayed in the leftmost column, while the single features mark the top row. The resulting gradient values in each pixel are shown as a HEALPix map connecting a Galactic map and a feature map, with red colors indicating positive gradient values and blue colors denoting negative gradient values.

Fig. E.1: Decoder Jacobian maps. Derivatives of the reconstructed Galactic all-sky maps with respect to the latent feature maps A, B, and C (top panel).
Fig. E.1: Decoder Jacobian maps continued.
Fig. E.1: Decoder Jacobian maps continued.
Fig. E.1: Decoder Jacobian maps continued.
Fig. E.1: Decoder Jacobian maps continued.