A unified framework for 21cm tomography sample generation and parameter inference with Progressively Growing GANs
MNRAS, 1–15 (2019)    Preprint 20 February 2020    Compiled using MNRAS LaTeX style file v3.0
Florian List⋆ and Geraint F. Lewis
Sydney Institute for Astronomy, School of Physics, A28, The University of Sydney, NSW 2006, Australia
⋆ E-mail: fl[email protected]
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Creating a database of 21cm brightness temperature signals from the Epoch of Reionisation (EoR) for an array of reionisation histories is a complex and computationally expensive task, given the range of astrophysical processes involved and the possibly high-dimensional parameter space that is to be probed. We utilise a specific type of neural network, a Progressively Growing Generative Adversarial Network (PGGAN), to produce realistic tomography images of the 21cm brightness temperature during the EoR, covering a continuous three-dimensional parameter space that models varying X-ray emissivity, Lyman band emissivity, and ratio between hard and soft X-rays. The GPU-trained network generates new samples at arcminute-scale resolution within a second (on a laptop CPU), and the resulting global 21cm signal, power spectrum, and pixel distribution function agree well with those of the training data, taken from the 21SSD catalogue (Semelin et al. 2017). Finally, we showcase how a trained PGGAN can be leveraged for the converse task of inferring parameters from 21cm tomography samples via Approximate Bayesian Computation.

Key words: dark ages, reionization, first stars – methods: numerical – methods: statistical
The 21cm spectral line holds great potential for probing the Dark Ages, the formation of the first generations of stars, and the Epoch of Reionisation (EoR; see Furlanetto et al. 2006; Morales & Wyithe 2010; Pritchard & Loeb 2012, for reviews of 21cm cosmology). It is produced by a spin-flip transition of H I gas in the intergalactic medium (IGM) due to the absorption of Cosmic Microwave Background (CMB) photons. This signal from the early Universe, which is received on Earth highly redshifted to radio frequencies, is a dormant treasure trove that is expected to reveal a plethora of information on inflationary and dark sector physics (Furlanetto et al. 2019a), the matter power density after recombination (Loeb & Zaldarriaga 2004), and the reionisation history of the Universe (Furlanetto et al. 2019b).

Since the signal is concealed by foreground sources that can be several orders of magnitude brighter, its detection is challenging. A first detection of the global (sky-averaged) signal has been claimed by the EDGES collaboration (Experiment to Detect the Global EoR Signature, Bowman et al. 2018).
Currently operating interferometers such as LOFAR (Low Frequency Array, van Haarlem et al. 2013), GMRT (Giant Metrewave Radio Telescope, Paciga et al. 2013), MWA (Murchison Widefield Array, Tingay et al. 2013), and PAPER (Precision Array for Probing the Epoch of Reionization, Parsons et al. 2010) are not sensitive enough to map the spatial fluctuations of the 21cm signal and aim instead at characterising the signal by means of statistics such as the power spectrum (Mellema et al. 2015), for which several upper limits at different redshifts have been derived (e.g. Beardsley et al. 2016; Patil et al. 2017; Barry et al. 2019; Eastwood et al. 2019; Gehlot et al. 2019; Kolopanis et al. 2019; Li et al. 2019). However, future radio interferometer experiments such as the SKA (Square Kilometre Array, Mellema et al. 2013) and HERA (Hydrogen Epoch of Reionization Array, DeBoer et al. 2017) will usher in an era of 21cm imaging.

With these powerful telescopes coming online within the next few years, accurate and fast emulators are needed that generate templates for a wide range of astrophysical parameters and reionisation models. The EDGES signal shows a trough with an amplitude more than twice as large as can be accommodated by the most extreme predictions (Cohen et al. 2017), which, if confirmed, could hint at an excess radio background (e.g. Feng & Holder 2018; Fialkov & Barkana 2019) from a so far unknown population of high-redshift radio sources such as black holes (Ewall-Wice et al. 2018) or supernovae (Jana et al. 2019, but see Sharma 2018 for constraints on the cooling time when resorting to astrophysical explanations), or at exotic physics such as scattering between dark matter and baryons (Barkana 2018; Hirano & Bromm 2018; Slatyer & Wu 2018). Whether or not this signal will be substantiated by other instruments, predicting signals for all physically plausible models and conducting parameter studies is vital in order to be prepared for upcoming observations. However, generating mock samples is a challenging task: the 21cm signal depends both on cosmology and on astrophysical processes, which are not well understood to date. Moreover, computationally expensive large-scale cosmological simulations with radiative transfer require both big volumes and high resolutions in order to resolve the ionising sources, which often makes the exploration of a large parameter space infeasible. Approximate methods for the computation of the 21cm signal that do not invoke 3D radiative transfer simulations have been developed, e.g. 21cmFAST (Mesinger et al. 2011) and Sim-Fast21 (Santos et al. 2010), whereas methods such as
Bears (Thomas et al. 2009) post-process dark-matter-only simulations to generate 21cm maps.

Machine learning techniques lend themselves to vastly speeding up the creation of mock samples as well as carrying out fast and accurate parameter inference. In the following, we briefly review the related literature: Kern et al. (2017) present an emulator for the 21cm power spectrum based on Gaussian processes and use it to forecast the constraining power of HERA. A neural-network-based power spectrum emulator is introduced in Schmit & Pritchard (2018), which can be harnessed for accelerating the evaluation of the likelihood function during Bayesian parameter inference; the present paper shares the spirit of their approach in that the output of a generative model is used for parameter inference. Chardin et al. (2019) predict the reionisation times in three-dimensional cubes using convolutional neural networks (CNNs). Cohen et al. (2019) generate smooth global 21cm signals with a neural network that is trained to predict the strength of the principal components depending on seven astrophysical parameters. A complementary approach to the work herein, which is based upon generative adversarial networks (GANs) as well, has been presented by Zamudio-Fernandez et al. (2019). In that paper, a GAN is trained on 3D cubes from the IllustrisTNG simulation (Pillepich et al. 2018) and learns to generate new realisations of the H I distribution. In terms of statistics, the quality of GAN-generated samples exceeds that of Halo Occupation Distributions. In contrast to modelling the sources of 21cm emission, our method addresses the prediction of the signal as received on Earth.

While the aforementioned works consider the task of artificial data creation, the reverse problem of machine-learning-aided parameter inference has also been studied in recent years. Pioneering the use of neural networks for the analysis of the 21cm signal, Shimabukuro & Semelin (2017) train a neural network on 21cm power spectra to infer astrophysical parameters. A comprehensive comparison of different machine learning methods for parameter estimation from the 21cm power spectrum is carried out in Doussot et al. (2019). To date, the majority of machine learning contributions to 21cm cosmology has focused on the power spectrum, which is arguably the simplest and most powerful statistic when it comes to constraining EoR models. However, unlike the CMB, the 21cm signal is highly non-Gaussian (e.g. Majumdar et al. 2018), and valuable additional information is encoded in the full tomographic signal. In this spirit, Gillet et al. (2019) consider the recovery of astrophysical parameters by means of a neural network that is trained directly on 21cm images. La Plante & Ntampaka (2019) determine the EoR duration from 2D images in the plane of the sky using a CNN with per-cent-level error, assuming a well-constrained midpoint of the EoR. A CNN is also employed by Hassan et al. (2019), who infer cosmological and astrophysical parameters from 21cm mock maps with SKA-like noise. A neural network classifier that discerns galaxy-dominated and Active Galactic Nuclei (AGN)-dominated models from images in the presence of realistic noise is presented in Hassan et al. (2017).

In this work, we employ a neural network, more specifically a Progressively Growing Generative Adversarial Network (PGGAN) that we name 21cmGAN, for the creation of two-dimensional tomographic 21cm signals, conditioned on a three-dimensional parameter vector.
Our implementation of 21cmGAN in TensorFlow (Abadi et al. 2016), as well as the trained neural network, are publicly available at https://github.com/FloList/21cmGAN. Different input formats for the tomography samples are supported, and adapting the code to other 21cm catalogues that probe different parameter spaces should be possible without major difficulties.

This paper is structured as follows: in Section 2, we give a bird's eye view of the evolution of the 21cm signal between Recombination and Reionisation. In Section 3, we introduce the public data set from Semelin et al. (2017, henceforth 21SSD), which we use as training data for our neural network. Then, we turn towards generative models in Section 4 and summarise the major concepts behind (PG)GANs. In Section 5, we present our results and assess the quality of the generated samples by analysing the global 21cm signal, the power spectrum at fixed redshift, and the pixel distribution function (PDF). In Section 6, we show how parameter inference can be done using the trained PGGAN. To this end, we consider two methods that use different parts of the neural network (generator and critic). We conclude this paper in Section 7.

The differential brightness temperature of the 21cm line against the CMB is commonly written as (e.g. Furlanetto et al. 2006)

δT_b = (T_s − T_γ)/(1 + z) · (1 − e^{−τ_ν})
     ≈ 27 x_HI (1 + δ) ((T_s − T_γ)/T_s) (1 + (1/H(z)) dv_∥/dr_∥)^{−1} × ((1 + z)/10)^{1/2} (Ω_b/0.044 · h/0.7) (Ω_m/0.27)^{−1/2} mK,   (1)

where τ_ν is the optical depth, T_s is the spin temperature of the 21cm transition, T_γ is the background temperature of the CMB, x_HI is the local H I fraction, H(z) is the Hubble parameter as a function of redshift z, δ is the local fractional overdensity, Ω_b and Ω_m denote the fractional energy content of baryonic matter and total matter, respectively, and h is the dimensionless Hubble constant defined by H_0 = h × 100 km s^{−1} Mpc^{−1}. Moreover, dv_∥/dr_∥ is the velocity gradient along the line of sight, which entails redshift space distortions (Barkana & Loeb 2005). The locality of the non-cosmological quantities in the above equation is what causes the spatial variations of δT_b.

The spin temperature T_s is related to the ratio of number densities in each of the two hyperfine states, n_0 (1s singlet) and n_1 (1s triplet), and is given as

n_1/n_0 = 3 exp(−T_⋆/T_s),   (2)

where T_⋆ = hν_21/k_B ≈ 0.068 K. As derived by Field (1958), the spin temperature T_s can be written as a weighted sum of the CMB temperature T_γ, the kinetic temperature T_kin, and the colour temperature T_α:

T_s = (T_γ + y_kin T_kin + y_α T_α) / (1 + y_kin + y_α).   (3)

Note that an alternative convention expresses the inverse spin temperature T_s^{−1} as the weighted sum of inverse temperatures. The colour temperature is tightly coupled to the kinetic temperature, i.e. T_α ≈ T_kin (Field 1959).

The interplay between mechanisms that drive T_s towards T_γ or T_kin, respectively, is expressed in the equation above by the coupling coefficients y_kin (accounting for atomic collisions) and y_α (accounting for the Wouthuysen–Field effect, Wouthuysen 1952; Field 1958).
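To make Equations (1)–(3) concrete, the following sketch evaluates the spin temperature and the optically thin limit of the differential brightness temperature for illustrative inputs; the coupling coefficients and local quantities are placeholder values, not outputs of the 21SSD simulations.

```python
import numpy as np

T_CMB0 = 2.725  # CMB temperature today [K]

def spin_temperature(z, T_kin, y_kin, y_alpha):
    """Equation (3), using T_alpha ~ T_kin (Field 1959)."""
    T_gamma = T_CMB0 * (1.0 + z)
    return (T_gamma + y_kin * T_kin + y_alpha * T_kin) / (1.0 + y_kin + y_alpha)

def delta_T_b(z, x_HI, delta, T_s, dv_dr_over_H=0.0,
              Omega_b=0.044, Omega_m=0.27, h=0.7):
    """Optically thin limit of Equation (1), in mK."""
    T_gamma = T_CMB0 * (1.0 + z)
    return (27.0 * x_HI * (1.0 + delta) * (T_s - T_gamma) / T_s
            / (1.0 + dv_dr_over_H)
            * np.sqrt((1.0 + z) / 10.0)
            * (Omega_b / 0.044) * (h / 0.7) * np.sqrt(0.27 / Omega_m))

# Example: a cold, Lyman-alpha-coupled IGM at z = 12 gives absorption:
T_s = spin_temperature(z=12.0, T_kin=10.0, y_kin=0.1, y_alpha=1.0)
print(T_s, delta_T_b(z=12.0, x_HI=1.0, delta=0.0, T_s=T_s))  # ~22 K, ~-19 mK
```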
This decomposition is motivated by the following considerations (see e.g. Furlanetto et al. 2006; Pritchard & Loeb 2012; Mesinger et al. 2011 for more detailed explanations): in the post-recombination Universe between 200 ≲ z ≲ 1100, the IGM is thermally coupled to the CMB via Compton scattering between CMB photons and leftover free electrons, leading to T_s = T_kin = T_γ and therefore δT_b = 0. Afterwards, the IGM cools adiabatically according to T_kin ∝ (1 + z)², faster than the CMB, whose temperature scales as T_γ ∝ (1 + z). Hence δT_b < 0, and the 21cm signal can be detected for the first time in absorption. As the IGM gradually disperses due to the expansion of the Universe, collisional coupling becomes inefficient, and the spin temperature approaches again the CMB temperature, reaching equality at z ∼ 30, where δT_b ≈ 0.

The next transition is brought about by the onset of the first astrophysical sources, which start coupling T_s again to T_kin. This coupling arises from the Wouthuysen–Field effect due to the background Lyman-α radiation. Since T_s ∼ T_kin < T_γ, the 21cm signal is once more in absorption, marking the advent of the Cosmic Dawn. Spatial fluctuations of the 21cm signal emerge from inhomogeneities of the Lyman-α background, as well as from density variations. The Lyman-α coupling gradually intensifies until it eventually saturates. From then on, the 21cm signal is no longer susceptible to Lyman-α fluctuations. X-ray heating from astrophysical sources leads to growing temperature differences between hot, overdense regions and the cold bulk of the gas, and the 21cm signal from heated regions can be seen for the first time in emission. Gas temperature fluctuations now contribute to the brightness temperature fluctuations.

As the X-ray heating continues and becomes dominant over cosmic expansion in the entire IGM, T_kin surpasses T_γ and keeps rising, eventually giving T_s ∼ T_kin ≫ T_γ. The 21cm line is globally in emission, and the contribution of the spin temperature in Equation (1) becomes negligible. As reionisation proceeds and an increasing number of regions become ionised, fluctuations are sourced by spatially varying ionisation fractions. With the end of reionisation, the 21cm signal essentially disappears (see e.g. Wyithe & Loeb 2008 for the post-reionisation signal from damped Lyman-α systems). Note, however, that the timing, duration, and order of the transitions involving astrophysical sources are uncertain and model-dependent (Pritchard & Furlanetto 2007).

We use the publicly available data set from 21SSD that contains lightcones of the 21cm brightness temperature during the EoR in a redshift range of z = 15 to 6 for a three-dimensional parameter space. In what follows, we briefly summarise the fixed and varying parameters of the simulations, but the interested reader is referred to said paper for a detailed discussion.

The lightcones have been produced from fully coupled radiative hydrodynamic simulations in a box of side length 200 h⁻¹ Mpc with 1024³ particles, run with the parallel Tree+SPH code Licorice (Semelin & Combes 2002; Semelin et al. 2007; Baek et al. 2009, 2010; Vonlanthen et al. 2011; Semelin 2016). Monte-Carlo ray-tracing on an adaptive mesh is used for the computation of X-ray and UV radiative transfer, and both hard and soft X-ray emission is enabled. The only feedback from star formation is photoheating. The Lyman-α flux is calculated in a post-processing step on a fixed grid, with photon packets emitted at regular time intervals.
A Planck Collaboration (2016) cosmology is assumed. Several astrophysical parameters have been fixed such that the simulation results comply with observational constraints (see Bouwens et al. 2015a,b and references therein): a Kennicutt–Schmidt law with exponent 1 and a constant efficiency c_eff models the star formation in overdense regions, i.e. dρ_s/dt = c_eff ρ_g, and a Salpeter initial mass function (IMF) is adopted for the star population, truncated below 1.6 and above 120 M_⊙, respectively. The escape fraction of ionising UV photons from unresolved structures is held fixed.
The 21SSD data set explores a three-dimensional parameter space, where the varying parameters are the X-ray emissivity f_X, the hard-to-soft ratio of the X-ray emission r_h/s, and the Lyman band emissivity f_α.

• f_X: After the mean kinetic temperature of the neutral gas has decreased monotonically since recombination, X-ray heating triggers a turnaround and eventually causes T_s = T_kin > T_γ. Assuming that the local proportionality between X-ray luminosity and star formation rate (SFR) holds at high redshifts, one can parametrise (Furlanetto et al. 2006)

L_X = 3.4 × 10⁴⁰ f_X (SFR / (1 M_⊙ yr⁻¹)) erg s⁻¹,   (4)

with the unknown renormalisation parameter f_X accounting for the extrapolation to high redshifts.

• r_h/s: In the simulations, hard X-rays from X-ray binaries receive a special treatment due to their large mean free path, which renders the tracking of all hard X-ray photons over the course of the entire simulation unfeasible (see 21SSD for details). The description of their spectral properties follows Fragos et al. (2013). In contrast, AGNs produce soft X-rays, modelled with a power-law spectrum. Soft X-rays are believed to carry most of the energy (Furlanetto et al. 2006), but their shorter mean free paths lead to a more localised energy deposition, which generates spatial fluctuations. The fraction of hard X-rays r_h/s is defined as

r_h/s = f_X^XRB / (f_X^XRB + f_X^AGN),   (5)

where f_X^XRB and f_X^AGN are the X-ray emissivities of X-ray binaries and AGNs, respectively.

• f_α: In the 21SSD data set, the efficiency of the Lyman band emission is described by one free parameter f_α. The energy emission from a stellar population within a frequency band [ν_1, ν_2] is given by

E(ν_1, ν_2) = ∫_{ν_1}^{ν_2} ∫_{M_min}^{M_max} ξ(M) L(M, ν) T_life(M) dM dν,   (6)

where M is the stellar mass, M_min = 1.6 M_⊙ and M_max = 120 M_⊙ are the IMF truncation masses, ξ(M) is the IMF, T_life(M) is the lifetime of a star as a function of its mass, and L(M, ν) is the energy emission per time and per frequency from a star of mass M at frequency ν. The parameter f_α is then defined as the proportionality factor between the energy as produced in the simulation, E_eff, and the theoretical energy emission in the relevant frequency band, that is

E_eff(ν_α, ν_limit) = f_α E(ν_α, ν_limit),   (7)

with the Lyman-α frequency ν_α and Lyman limit frequency ν_limit. Higher Lyman band emission efficiencies f_α produce more prominent troughs and peaks in the differential brightness temperature.

The 21SSD data set contains simulations for each of the possible combinations of the following parameter values:

f_X: 0.1, 0.3, 1, 3, 10,
r_h/s: 0, 0.5, 1,
f_α: 0.5, 1, 2,

thus 45 simulations in total.
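A quick way to reproduce this bookkeeping is to enumerate the grid (a minimal sketch; the variable names are ours):

```python
from itertools import product

f_X_vals     = [0.1, 0.3, 1.0, 3.0, 10.0]  # X-ray emissivity
r_hs_vals    = [0.0, 0.5, 1.0]             # hard-to-soft X-ray ratio
f_alpha_vals = [0.5, 1.0, 2.0]             # Lyman band emissivity

grid = list(product(f_X_vals, r_hs_vals, f_alpha_vals))
assert len(grid) == 45        # one simulation per parameter combination
print(3 * len(grid))          # 135 lightcones (three observer axes each)
```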
For each simulation, three high-resolution lightcones of the differential brightness temperature are available, corresponding to observers aligned with the three coordinate axes, giving a total of 135 lightcones. The line-of-sight dimension of 21cm images is special as it represents measurements at different frequencies, from which the associated redshifts and hence comoving distances can be inferred to reconstruct a 3D distribution. The frequency resolution is typically chosen such that the resulting spatial line-of-sight resolution matches the angular one. This is also the case for the 21SSD lightcones, whose cells have an equal ∆ν that leads to almost cubic cells (of slightly varying comoving thickness).

Although lower-resolution brightness temperature lightcones are provided in the public data set for convenience, we start from the high-resolution lightcones in order to obtain more slices, as a large data set size is crucial for deep learning. We cut slices from each lightcone by fixing each pixel index in the first and second (transverse) dimension, respectively, yielding in total 276 480 = 45 × 3 × 2048 2D tomography samples (6144 samples for each set of parameters). The shorter (vertical) dimension of the 2D samples is perpendicular to the line of sight and covers a comoving length of 200 h⁻¹ Mpc, whereas the longer (horizontal) dimension is aligned with the line of sight and comprises the redshift range between z = 6 and z = 15. If 21cmGAN is to be applied to smaller data sets, one might resort to data augmentation methods such as mirroring. We subsequently downscale the samples by averaging over adjacent pixels to obtain a resolution of 32 × 256 pixels. From these samples, we create versions from the base resolution of 4 × 32 up to 32 × 256, corresponding to each stage of the PGGAN training (see Subsection 4.2).
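The slicing and downscaling described above can be sketched as follows (the array shapes below are tiny illustrative stand-ins, not the actual 21SSD lightcone dimensions):

```python
import numpy as np

def cut_slices(lightcone):
    """Cut 2D slices from a 3D lightcone of shape (N1, N2, N_los) by
    fixing each index of the two transverse dimensions in turn,
    giving N1 + N2 slices per lightcone."""
    slices = [lightcone[i, :, :] for i in range(lightcone.shape[0])]
    slices += [lightcone[:, j, :] for j in range(lightcone.shape[1])]
    return np.stack(slices)

def downscale(img, fh, fw):
    """Downscale a 2D sample by averaging over adjacent (fh x fw) blocks."""
    H, W = img.shape
    return img.reshape(H // fh, fh, W // fw, fw).mean(axis=(1, 3))

# Tiny stand-in lightcone; the real transverse dimensions give 2048 slices.
demo = np.random.randn(8, 8, 64)
print(cut_slices(demo).shape)          # (16, 8, 64)
print(downscale(demo[0], 2, 8).shape)  # (4, 8)
```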
The two most prominent representatives within the class of generative models are Variational Auto-Encoders (VAEs, Kingma & Welling 2013) and Generative Adversarial Networks (GANs, Goodfellow et al. 2014). While GANs are notoriously hard to train, they are more flexible and generally produce images of higher quality (e.g. Arora & Zhang 2017; however, see Razavi et al. 2019 for recent progress on VAEs), which is why we choose them for this work.
The setting of a standard GAN in our context is the following: two players, the generator G and the discriminator D, compete in a min-max game against each other. Let p(x|θ) be the probability distribution function of the differential brightness temperature images x ∈ [−1, 1]^{H×W} (where H and W stand for the height and width of the images; the pixel values can also be scaled to other intervals, e.g. [0, 1] instead of [−1, 1]), given a vector of reionisation parameters θ. The generator outputs samples from a distribution p_G(x|θ) and attempts to shift p_G towards p as the training progresses. On the other hand, the discriminator tries to distinguish the images drawn from p_G from the true samples taken from p. Mathematically, the generator and discriminator are mappings G: (z; θ) ↦ G(z|θ) and D: (x; θ) ↦ D(x|θ) ∈ [0, 1]. The random noise vector z, which is usually drawn from a standard uniform or normal distribution p_z, introduces randomness and enables the generator to create a genuine distribution of samples for each θ, rather than a deterministic mapping. The discriminator input x is either a real sample, i.e. x ∼ p(x|θ), or has been produced by the generator, i.e. x ∼ p_G(x|θ), and the discriminator output estimates the probability of the former. As the probability distributions in our case are conditioned on the reionisation parameters θ, this variant of GANs is known as a conditional GAN (Mirza & Osindero 2014). The training objective is thus tantamount to finding a saddle point

(G*, D*) = arg min_G max_D L_GAN(G, D; θ),   (8)

where the function L_GAN is defined as

L_GAN(G, D; θ) = E_{x∼p(x)}[log D(x|θ)] + E_{z∼p_z(z)}[log(1 − D(G(z|θ)|θ))].   (9)

While the mathematical theory of such problems is well established (e.g. Nash 1950), the practical training of GANs is intricate and often unstable (Arjovsky & Bottou 2017). A problem commonly encountered is that of so-called mode collapse, which describes the situation when the generator output covers but a small subset of the real data distribution p. Whereas the loss L_GAN is based upon the Jensen–Shannon divergence, different loss functions have been proposed that rely on other distance measures such as f-divergences (Nowozin et al. 2016). An interesting approach was introduced by Arjovsky et al. (2017), who advocate the use of the Wasserstein distance (also known as Earth Mover's distance), which provides meaningful gradients even when p_G and p are far apart from each other. The q-Wasserstein distance W_q(p_1, p_2) between two probability distributions p_1 and p_2 is defined as

W_q(p_1, p_2) = ( inf_{γ ∈ Π(p_1, p_2)} E_{(a,b)∼γ}[‖a − b‖^q] )^{1/q},   (10)

where Π(p_1, p_2) denotes the set of all joint distributions γ(a, b) with marginals p_1 and p_2, respectively, and q ≥ 1. In the case of q = 1, the Wasserstein distance gives the minimum amount of work needed for moving the "probability mass" piled up according to p_1 to the distribution p_2, where work is defined as the product of probability mass and the distance by which the mass is moved. An equivalent characterisation with more practical relevance is obtained using the Kantorovich–Rubinstein duality, yielding

W_1(p_1, p_2) = sup_{‖f‖_L ≤ 1} ( E_{a∼p_1}[f(a)] − E_{b∼p_2}[f(b)] ),   (11)

where the supremum is taken over all 1-Lipschitz continuous functions.
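As a quick numerical illustration of the 1-Wasserstein distance (using SciPy's empirical W_1 for one-dimensional samples; this is only for intuition and unrelated to our implementation):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 100000)  # samples from p_1
b = rng.normal(3.0, 1.0, 100000)  # samples from p_2

# For two unit-variance Gaussians shifted by 3, W_1 equals the shift:
print(wasserstein_distance(a, b))  # ~3.0
```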
While this Lipschitz constraint is achieved by clipping the weights of the neural network in Arjovsky et al. (2017), Gulrajani et al. (2017) propose to make use of the fact that a differentiable function f is 1-Lipschitz if and only if its gradients are bounded by 1 everywhere. This motivates replacing the weight clipping by a penalty term, giving rise to WGAN-GPs (Wasserstein GANs with Gradient Penalty). The appropriate loss function L_WGAN-GP for our saddle point problem reads as

L_WGAN-GP(G, D; θ) = E_{x∼p(x)}[D(x|θ)] − E_{z∼p_z(z)}[D(G(z|θ)|θ)] + λ E_{x̃∼p_x̃}[(‖∇_x̃ D(x̃|θ)‖ − 1)²].   (12)

Here, x̃ = αx + (1 − α)G(z|θ) with a random number α ∼ U[0, 1], and the factor λ determines the strength of the gradient penalty. Using the gradient penalty in lieu of the weight clipping has been shown to improve the image quality and convergence. In the context of WGANs, the discriminator is referred to as the critic since it no longer measures the probability of x ∼ p(x|θ) but now produces a score that gauges how real x seems to the critic (in particular, the output of the critic is not confined to [0, 1]). The aim of the critic is to maximise the difference between scores for real and fake samples.
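For illustration, a minimal TensorFlow sketch of a WGAN-GP critic loss in the spirit of Equation (12) is given below; `generator` and `critic` are stand-ins for the actual conditional networks (assumed here to be Keras models taking [image/latent, parameters] inputs and NHWC images), and the sign convention is chosen such that minimising the loss maximises the score difference between real and fake samples.

```python
import tensorflow as tf

def critic_loss_wgan_gp(critic, generator, x_real, theta, z, lam=10.0):
    """Sketch of a WGAN-GP critic objective (cf. Equation 12)."""
    x_fake = generator([z, theta], training=True)
    d_real = critic([x_real, theta], training=True)
    d_fake = critic([x_fake, theta], training=True)

    # Gradient penalty, evaluated on random interpolates x_tilde
    batch = tf.shape(x_real)[0]
    alpha = tf.random.uniform([batch, 1, 1, 1])  # one alpha per sample
    x_tilde = alpha * x_real + (1.0 - alpha) * x_fake
    with tf.GradientTape() as tape:
        tape.watch(x_tilde)
        d_tilde = critic([x_tilde, theta], training=True)
    grads = tape.gradient(d_tilde, x_tilde)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    gp = tf.reduce_mean(tf.square(grad_norm - 1.0))

    # Optional drift term keeping critic scores near zero (Karras et al. 2017)
    drift = 1e-3 * tf.reduce_mean(tf.square(d_real))

    return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real) + lam * gp + drift
```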
In this work, we present the emulation of 21cm tomography samples consisting of 32 × 256 pixels. The corresponding angular resolution is a few arcminutes (for z = 6 to 15), which is within the typical range of the SKA. As an example, the Baseline Design of SKA1-Low (Dewdney et al. 2013) will have a projected noise of a few mK at EoR frequencies on arcminute scales (Mellema et al. 2015). With increasing resolution (and hence number of pixels per image), training GANs becomes increasingly time and memory consuming. Furthermore, the more pixels the discriminator has at its disposal, the easier it becomes to discern real and fake images, while the task of the generator to produce high-quality output consisting of many pixels becomes harder. This can lead to an imbalance and cause the GAN training to fail. Recently, much work has been focused on methods for the generation of high-resolution images with GANs, and applying techniques such as an intermediate latent space in the generator (Karras et al. 2018), self-attention layers (Zhang et al. 2018), spectral normalisation (Miyato et al. 2018), or a truncation trick (Brock et al. 2018) has led to remarkable results. A major breakthrough is the work by Karras et al. (2017, hereafter PGGAN), who progressively increase the resolution and number of layers during training, obtaining high-quality 1024 × 1024 images. Although our resolution of 32 × 256 pixels is rather modest, we use a Progressively Growing GAN (PGGAN), mainly for two reasons: firstly, we observe a lower tendency towards mode collapse when building up our GAN progressively, and secondly, our neural network architecture can easily be extended to generate higher-resolution tomography samples that might be needed for future interferometry experiments. Of course, creating 3D samples at the full resolution of the 21SSD catalogue with GANs is still a long way off.

Figure 1.
Schematic representation of 21cmGAN: given a random noise vector z and a parameter vector θ, the generator G produces an image G(z|θ) in an attempt to fool the critic D. Images from the generator and from the training data set are shown to the critic, whose task it is to assign high scores to real samples and low scores to GAN-produced samples. Importantly, the critic judges the image quality of the samples given the parameter vector θ. Accordingly, a real sample corresponding to parameters θ_1 shown to the critic together with very different parameters θ_2 will receive a low score from the critic. Blue triple dots stand for growth as the training proceeds: in order to produce samples at gradually increasing resolution, new layers are smoothly faded in during the course of the training. This is done for the generator and the critic in a symmetric way.

During the training, new layers are gradually added to both the generator and the critic. This happens in a smooth manner, allowing the GAN to slowly transition to a higher resolution. Once part of the GAN, the weights of each layer remain trainable until the end of the training. While the GAN is trained on lower-resolution images, downscaled images are shown to the critic. For more details on the progressive growing of the GAN, we refer the interested reader to PGGAN, whose approach we largely follow.

Our network architecture borrows from PGGAN, which is why we mainly describe the differences in what follows. The workflow of our PGGAN is schematically depicted in Figure 1, and the details are listed in Appendix A.

Our generator network is a CNN whose layers successively construct the image up to the desired resolution. We take the latent vector to be of length 512, where we fix the first three components to be the entries of θ = (f_X, r_h/s, f_α) and append 509 independent and identically distributed drawings from a standard normal distribution, the latter forming the vector z. Thus, we condition the generator by manually encoding information in the otherwise random latent vector. Whereas benchmark data sets for GANs are typically composed of square images, our images have an aspect ratio of 8. For this reason, we choose 4 × 32 as the base resolution of our GAN. First, the latent vector is mapped to the base resolution via a fully connected layer. Then, convolutional layers and upsampling layers follow, which gradually increase the spatial resolution until the desired resolution at the respective stage of the training is reached.

For the critic, we use a structure that is essentially symmetric with respect to the generator architecture. Since the input for the critic is an image x ∈ [−1, 1]^{H×W}, we cannot simply append the three-dimensional parameter vector to the input as we do for the generator. Instead, we concatenate the three parameters as constant additional image channels. We leave the implementation of more sophisticated conditioning methods for the critic as proposed by Miyato & Koyama (2018) for future work. While the critic determines autonomously which features are relevant for assessing the image quality, there are two major requirements for the generated samples that we point out: firstly, the global brightness temperature, that is, the vertical average in the 2D case considered herein, should agree with the samples from the 21SSD catalogue. Secondly, the fluctuations of the mock images should exhibit the same statistics as their 21SSD counterparts. We encourage the critic to evaluate these properties by appending the vertically averaged brightness temperature ⟨δT_b⟩ and the difference towards it, i.e. δT_b − ⟨δT_b⟩, as additional image channels. Altogether, the input for the critic hence takes the shape 32 × 256 × 6 (1 image channel, 2 extra channels, and 3 parameter channels).
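A minimal sketch of assembling the six critic input channels for a single sample (NumPy; the variable names are ours):

```python
import numpy as np

def critic_input(delta_T_b, theta):
    """Stack the critic input channels for one sample: the image itself,
    its vertical average <dTb>(z), the residual dTb - <dTb>, and three
    constant maps holding the parameters (f_X, r_h/s, f_alpha)."""
    H, W = delta_T_b.shape
    mean_map = np.broadcast_to(delta_T_b.mean(axis=0, keepdims=True), (H, W))
    resid_map = delta_T_b - mean_map
    param_maps = [np.full((H, W), p) for p in theta]
    return np.stack([delta_T_b, mean_map, resid_map] + param_maps, axis=-1)

x = np.random.randn(32, 256)                         # one mock sample
print(critic_input(x, theta=(1.0, 0.5, 2.0)).shape)  # (32, 256, 6)
```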
We train our PGGAN on 4 Nvidia Tesla Pascal P100 GPUs on the supercomputer Raijin, which is located in Canberra and is part of the Australian National Computational Infrastructure (NCI). We use a batch size of 8 per GPU and execute 40 000 iterations for each growth stage of the PGGAN. The training alternates between transitional stages, during which the next higher resolution (with twice as many pixels in each dimension) is faded in, and stabilising stages, where the PGGAN is trained at the respective resolution. Starting at a resolution of 4 × 32, this means that the training consists of 4 stabilising stages, separated by 3 transitional stages, to arrive at the resolution of 32 × 256, amounting to 280 000 iterations in total. We take the WGAN-GP loss function as defined in Equation (12) with λ = 10, plus a small additional penalty term ε_drift E_{x∼p(x)}[D(x|θ)²] with ε_drift = 10⁻³ in order to keep the critic scores close to zero and prevent them from diverging. The minimisation is done using an Adam optimiser (Kingma & Ba 2014).
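The smooth fade-in during the transitional stages can be sketched as follows (following the blending scheme of Karras et al. 2017; the linear schedule shown here is an assumption for illustration):

```python
import numpy as np

def fade_in_alpha(iteration, iters_per_stage=40000):
    """Blending weight during a transitional stage: rises linearly from 0
    (only the previous resolution contributes) to 1 (new layers fully
    active)."""
    return float(np.clip(iteration / iters_per_stage, 0.0, 1.0))

def blended_output(upsampled_lowres, highres, alpha):
    """Smooth transition between resolutions while new layers fade in."""
    return (1.0 - alpha) * upsampled_lowres + alpha * highres

print(fade_in_alpha(20000))  # 0.5 halfway through a transitional stage
```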
We briefly elaborate on how 21cmGAN can be utilised in practice for the generation of 21cm tomography samples and for aiding parameter inference. The first and simplest option is to use the trained PGGAN presented in this work, which readily produces tomography samples at arcminute-scale resolution. If required, instrument-specific noise and foregrounds can be added to the generated images. Another possibility is to re-train (a modified version of) 21cmGAN on a different set of tomography images. First, a parameter space needs to be defined that describes the processes impacting the 21cm signal. In this work, X-ray emissivity, ratio of hard to soft X-rays, and Lyman-α efficiency are considered; however, alternative parameter spaces are spanned for instance by the ionising efficiency of high-redshift galaxies ζ (which itself depends on the escape fraction of ionising UV photons, the fraction of galactic gas in stars, the number of times a hydrogen atom recombines, and the number of ionising photons that baryons in stars produce), the mean free path of ionising photons in ionised regions, and the minimum virial temperature for star formation in haloes (Greig & Mesinger 2015). Furthermore, different models for star formation and supernova feedback can be considered (Mesinger et al. 2016). Once the parameter space is set, a suitable sampling strategy is required, e.g. grid-based sampling, Latin Hypercube Sampling, or an adaptive metric-based approach as proposed by Eames et al. (2019). Then, 21cm samples are generated for the chosen parameter sets, either using a hydrodynamic simulation with radiative transfer or an approximate method. The produced samples (or 2D slices thereof) serve as training data for 21cmGAN, which learns the dependence of the 21cm tomography samples on the individual parameters. The trained neural network can create output samples for arbitrary points in parameter space (evidently, caution is needed when the neural network is evaluated on parameters that lie outside the range in parameter space on which the neural network was trained, or if the training data stem from an extremely coarse sampling of the parameter space).

The trained neural network represents a sampler from the probability distribution function p_G(x|θ) ≈ p(x|θ), which, in contrast to p, can be evaluated on a continuous subset of the parameter space. This sampler can be utilised for backward modelling, that is, parameter inference from a given tomography sample, as well: since the approximate likelihood function L_G(θ) = p_G(x|θ) is analytically intractable, methods that rely on an expression for the likelihood, such as MCMC, cannot be applied; however, likelihood-free methods such as Approximate Bayesian Computation (ABC) provide a suitable framework not only for finding the maximum-likelihood estimate for θ, but also for obtaining error estimates. This will be demonstrated in Section 6.
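As a preview of Section 6, a minimal ABC rejection sampler built around the generator might look as follows (the summary statistic, the distance threshold eps, and the prior are illustrative choices, not the settings used later in the paper):

```python
import numpy as np

def abc_rejection(x_obs, generator, prior_sample, n_draws=10000, eps=5.0):
    """Minimal ABC rejection sampler: keep parameter draws whose simulated
    sample lies close to the observation in a summary statistic (here the
    global signal, i.e. the vertical average of the image, in mK)."""
    s_obs = x_obs.mean(axis=0)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()               # draw parameters from the prior
        z = np.random.randn(509)             # random components of the latent
        x_sim = generator(z, theta)          # PGGAN forward pass (stand-in)
        s_sim = x_sim.mean(axis=0)
        if np.sqrt(np.mean((s_sim - s_obs) ** 2)) < eps:
            accepted.append(theta)
    return np.array(accepted)                # draws from approximate posterior
```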
We start with a visual comparison between the samples generated by 21cmGAN and those from the 21SSD catalogue. Figure 2 shows random samples for different parameter sets, sorted by the strength of the emission in decreasing order (cf. Fig. 7 in 21SSD for samples at the full resolution of the catalogue). Evidently, the PGGAN has learned to map the input parameters to the correct redshifts and amplitudes of the absorption and emission regions. While the 21cm line is in emission almost everywhere at late times for f_X = 10, the emission region for f_X = 0.1–0.3 consists of many jagged patches, which is well reproduced by the PGGAN. Also, the isolated bubbles in absorption at the low-redshift end are captured correctly. The spatial distribution of emission and absorption regions is diverse, and we do not observe any indication of mode collapse. Altogether, the visual impression of the GAN-generated tomography samples is satisfactory, and it is difficult to distinguish them from their 21SSD counterparts by eye. More random samples for selected parameter vectors in Appendix B demonstrate the diversity of the PGGAN output and their resemblance to the 21SSD slices.

Figure 3 shows the global 21cm signal ⟨δT_b⟩ as a function of redshift (solid lines, first legend column for the 21SSD samples, second column for the PGGAN samples), computed as the average over all pixel values of 6144 samples at each redshift. The 1σ-region is shaded for PGGAN samples and delimited by dotted lines for the 21SSD samples. As already seen above, the global signal is highly sensitive to f_X. Additionally, higher Lyman-α emissivities f_α cause earlier and more prominent peaks and troughs. The parameter r_h/s has an effect on the average kinetic temperature of the H I gas (see Fig. 5 in 21SSD). We consider the same parameter sets as in Figure 2. The global signal of the samples from the PGGAN agrees well with that from the 21SSD catalogue for all parameter sets, and the maximum difference amounts to a few mK at most (and much less for models with high X-ray emissivity). Also, the spread in the brightness temperature at each redshift is well reproduced, with low X-ray models exhibiting a larger spatial variability than high X-ray models. The fluctuations of the PGGAN global signal follow those of the 21SSD samples, which suggests that extending the set of training images such that the resulting global signal of the training data is smoother would lead to a smoother global signal of the PGGAN samples as well.

A standard statistic for measuring the strength of fluctuations is the power spectrum. Figure 4 shows the 1D power spectrum for two models bracketing the investigated range of f_X and one model with the intermediate value f_X = 1. The model with f_X = 0.1 has the highest power on all scales at both redshifts considered. At the higher redshift, the intermediate model has the lowest power, while its power increases towards the lower redshift. In contrast, the power of the model with f_X = 10 decreases on large scales as the 21cm line changes from absorption (at the higher redshift) to emission (at the lower redshift). The GAN-generated samples possess a similar power spectrum to the 21SSD samples at both redshifts and capture the effects of the different parameters, although the spectra are a bit noisier on small scales.

Due to the strong non-Gaussianity of the 21cm signal, the power spectrum alone is not sufficient for a complete description of the fluctuations. In the context of our PGGAN tomography emulator, it is therefore important that not only the power spectra match, but also the pixel distribution functions (PDFs) at each redshift. The PDF gives the distribution of the occurring brightness temperatures at each redshift. Figure 5 shows the PDF for the same models and redshifts as in Figure 4. The PDF of the model with the lowest X-ray emissivity has the widest temperature spread, which is clearly non-Gaussian and negatively skewed at the higher redshift. In contrast, the model with f_X = 10 has a narrow distribution that is positively skewed at the lower redshift. The PDFs from the GAN-generated samples closely resemble those from their 21SSD brethren; in particular, the width, height, and skewness are in excellent agreement.

Alternative ways of characterising the non-Gaussian structure of the 21cm fluctuations are higher-order statistics such as the bispectrum or topological measures, for instance Minkowski functionals, which are beyond the scope of this work.
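For reference, the two comparison statistics of this section can be computed along the following lines (a sketch; the binning and normalisation conventions are our own):

```python
import numpy as np

def power_spectrum_1d(row):
    """1D power spectrum of a single line of sight (one image row),
    computed from the mean-subtracted fluctuations."""
    fluct = row - row.mean()
    return np.abs(np.fft.rfft(fluct)) ** 2 / fluct.size

def pixel_distribution(samples, bin_width=2.0):
    """PDF of the brightness temperature: normalised histogram of all
    pixel values [mK]; the bin width here is an illustrative choice."""
    bins = np.arange(samples.min(), samples.max() + bin_width, bin_width)
    pdf, edges = np.histogram(samples.ravel(), bins=bins, density=True)
    return pdf, edges
```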
Figure 2.
Random tomography samples for different parameter vectors, covering the redshift range indicated below the samples. The X-ray emissivity f_X decreases from top to bottom. For the two extreme values of f_X, two different combinations of r_h/s and f_α are shown.

Figure 3.
The global 21cm signal, obtained by averaging over the vertical dimension of the samples and averaging over 6144 samples. Colours in the first / second column of the legend correspond to 21SSD / PGGAN samples, respectively. The shaded regions show the standard deviation of the brightness temperature for the PGGAN samples at each redshift, whereas the dotted lines correspond to the standard deviation for the 21SSD samples. The difference between the averages ⟨δT_b⟩_PGGAN − ⟨δT_b⟩_21SSD is plotted below.

Figure 4.
1D power spectrum of the brightness temperature for three different models at two redshifts, averaged over 6144 samples (first legend row: 21SSD, second legend row: PGGAN).

Figure 5.
PDF for three different models at two redshifts, obtained by binning the values of δT_b from 6144 samples at each redshift into temperature bins of fixed width. Colours have the same meaning as in Figure 4.

Keeping the 509 randomly drawn components of the noise vector z fixed, one can investigate how a particular sample changes as either of the reionisation parameters is varied. Figure 6 shows an exemplary interpolation in parameter space of a random sample. The three blocks of four samples correspond to varying f_X, r_h/s, and f_α (from top to bottom), while the two remaining parameters are kept fixed. The PGGAN was only trained on some of the depicted parameter vectors and has learned a continuous mapping from the parameters to the samples, rather than just memorising samples that it has seen during the training. The random noise z governs the localisation of the stochastic brightness temperature fluctuations.
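Schematically, such a sweep can be written as follows (generator is a hypothetical wrapper around the trained network that returns one tomography sample; it is a stand-in, not the actual interface):

import numpy as np

rng = np.random.default_rng(0)
z_fixed = rng.standard_normal(509)  # the 509 noise components stay fixed

def sweep_f_x(generator, f_x_values, r_hs, f_alpha):
    # Vary f_X while the noise vector and the other two parameters are
    # kept fixed, as in Figure 6.
    return [generator(z_fixed, (f_x, r_hs, f_alpha)) for f_x in f_x_values]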
The trained PGGAN has learned a probabilistic mapping from the parameter space to realistic 2D tomographic samples. We now present two approaches for harnessing the PGGAN for the inverse problem, namely for inferring parameters from a tomographic sample. First, we briefly discuss uncertainty in the context of parameter inference from 21cm images.

In uncertainty quantification, one commonly distinguishes between two types of uncertainty: aleatoric (statistical) uncertainty from the data and epistemic (systematic) uncertainty from the modelling. For the application of estimating astrophysical parameters from spatially resolved 21cm measurements from the EoR, aleatoric uncertainty comes from thermal noise and from sample variance. The latter is due to the individuality of the spatial fluctuations in each finite volume under consideration and is also present for the 3D 21cm lightcones in the 21SSD catalogue, each of which possesses slightly different statistics such as the 2D power spectrum. By going from 3D to 2D tomographic samples, the sample variance increases significantly, since the information content per sample is reduced. During training, the GAN learns to generate samples whose particular spatial fluctuations are determined by the random noise vector z, in such a way that the sample-to-sample variance agrees with that of the training data. Conversely, accurate inference of parameters from tomographic samples is limited by the sample variance. This being said, even for upcoming observations with the SKA, thermal noise will be a greater source of stochasticity than sample variance (Koopmans et al. 2015). None the less, we focus in this section on the aleatoric uncertainty stemming from sample variance, as this is the stochasticity that the GAN has learned to generate in the training process. The extension to additional uncertainty due to thermal noise is discussed in Subsection 6.5. Since aleatoric uncertainty is inherent to the tomography samples, it does not decrease as the training of the GAN proceeds. This is in contrast to the epistemic uncertainty, which describes the ignorance about the correct model that explains the data. For neural networks, it is determined by the spread of the posterior distributions of the neural network weights. In the course of the GAN training, the epistemic uncertainty should gradually decline as the distribution produced by the GAN approaches the true distribution, i.e. p_G(x | θ) ≈ p(x | θ). The estimation of the epistemic uncertainty is beyond the scope of this work, and we simply take the credible regions obtained for the parameters at face value.

A popular class of techniques that allow parameter inference given a forward simulator (the PGGAN in our case) is known as Approximate Bayesian Computation (ABC). Unlike MCMC-type methods, ABC is a likelihood-free approach to inverse modelling, meaning that an analytic expression for the likelihood L(θ) = p(x | θ) is not required.
Figure 6.
Interpolation along the three axes of the parameter space: the parameters f_X, r_h/s, and f_α are monotonically varied (from top to bottom), while the random noise vector z and the other two parameters are kept fixed. Some of the selected parameter vectors were not contained in the discrete parameter space used for training, showing that the PGGAN has learned to interpolate to a continuous parameter space.

For simplicity, we use ABC rejection sampling, the simplest representative of ABC methods, but more sophisticated schemes such as ABC Sequential Monte Carlo (ABC-SMC) or Bayesian Optimization for Likelihood-Free Inference (BOLFI, Gutmann & Corander 2015) can be applied without difficulty. Let x̂ be a tomographic sample belonging to a true unknown parameter vector θ̂. Furthermore, let p(θ) be the prior distribution of the parameters, and let Δ_i = Δ(x̂, x_i) be a discrepancy measure between the observed sample x̂ and a generated sample x_i. The underlying idea of ABC rejection sampling is simple: generate samples x_i for random parameter vectors θ_i ∼ p(θ) and compare them to x̂. If a generated sample x_i is similar to x̂, i.e. Δ_i is sufficiently small, the parameters θ_i should be similar to θ̂.

With the generator network from our trained PGGAN as a "black box" simulator, random samples drawn from p_G(x | θ) can readily be generated. Algorithm 1 schematically lists the steps needed to obtain an approximation of the posterior distribution p(θ | x̂). Due to the randomness of the sample generation, the discrepancy Δ_i is not a deterministic function of the parameter vector θ_i (assuming fixed x̂), but rather a random variable depending on the noise vector z. A new random noise sample should be drawn for each parameter vector θ_i in order to avoid a bias due to the distinct features emanating from a fixed noise sample. Possible choices for Δ are, for instance, a suitable norm between the power spectra or between the PDFs. As ε → 0, one expects the distribution defined by the accepted samples to gradually shift from the prior distribution p(θ) to the posterior distribution p(θ | x̂). Note that the approximation made for the posterior distribution is twofold, namely p(θ | x̂) ≈ p_G(θ | x̂) ≈ p_G(θ | Δ(x̂, x) < ε), and the latter is in turn approximated by a finite number of drawings of θ, which makes a careful sanity check of the obtained results mandatory.
Algorithm 1: ABC Rejection Sampling (Generator-based)
    Define ε > 0 sufficiently small
    for (i = 0; i < N_samples; i++) do
        Draw θ_i ∼ p(θ)
        Draw z ∼ N(0, I_{N_noise})
        x_i ← G(z | θ_i) ∼ p_G(x | θ_i)        ▷ generate PGGAN sample
        Δ_i ← Δ(x̂, x_i)
        if Δ_i < ε then accept θ_i else reject θ_i
    Approximate the posterior distribution p(θ | x̂) by the distribution of the accepted parameters θ_i,acc

For the generator-based approach as described above, a suitable statistic needs to be defined, based on which the proximity between a generated sample x_i and x̂ is measured. Furthermore, drawing the same θ_i multiple times generally leads to different Δ_i owing to the randomness of the generated images. Another approach to parameter inference in the ABC framework relies on the trained critic network of the PGGAN: recalling that the critic assesses the quality of each generated sample given an associated parameter vector, one expects the critic to assign a low score to a real image paired with a wrong parameter vector. This motivates the use of the critic score as a discrepancy measure for the ABC rejection sampling, as summarised in Algorithm 2. Now, x̂ is shown to the critic together with parameter vectors θ_i; those that receive a high critic score Δ_i > M are accepted, where M ∈ ℝ. Thus, there is no need to generate new samples or to manually define a statistic for the calculation of Δ_i. Moreover, Δ_i depends deterministically on θ_i (provided that the critic contains no layers that introduce randomness at evaluation time, such as dropout layers). While the generator-based approach is a classical application of an ABC technique for which a few theoretical results (e.g. Dean et al. 2014) and a solid body of empirical studies exist, the critic-based approach is clearly more heuristic and provides less interpretability, since it is not transparent based on which criteria the critic assigns a certain score to a tomographic sample, and in particular how the parameter vector enters this assessment. On the other hand, since the critic has access to all the pixel values and not just a summary statistic, it is conceivable that the critic can tell apart samples corresponding to different parameter vectors whose statistics, such as power spectra, are very similar.
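Before turning to the critic-based variant, Algorithm 1 can be summarised in a short sketch (generator and discrepancy are hypothetical stand-ins for the trained network and the chosen statistic Δ, and the prior ranges are illustrative assumptions of this sketch only):

import numpy as np

def sample_prior(rng):
    # Illustrative priors (assumed ranges): log-uniform for f_X and f_alpha,
    # uniform for r_h/s.
    f_x = 10.0 ** rng.uniform(-1.0, 1.0)
    r_hs = rng.uniform(0.0, 1.0)
    f_alpha = 2.0 ** rng.uniform(-1.0, 1.0)
    return np.array([f_x, r_hs, f_alpha])

def abc_rejection_generator(x_hat, generator, discrepancy,
                            n_draws, eps, n_noise=509, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)              # theta_i ~ p(theta)
        z = rng.standard_normal(n_noise)       # fresh noise for each draw
        x = generator(z, theta)                # x_i ~ p_G(x | theta_i)
        if discrepancy(x_hat, x) < eps:        # accept if Delta_i < eps
            accepted.append(theta)
    return np.array(accepted)  # empirical approximation of p(theta | x_hat)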
Algorithm 2: ABC Rejection Sampling (Critic-based)
    Define M ∈ ℝ
    for (i = 0; i < N_samples; i++) do
        Draw θ_i ∼ p(θ)
        Δ_i ← D(x̂ | θ_i)        ▷ obtain critic score
        if Δ_i > M then accept θ_i else reject θ_i
    Approximate the posterior distribution p(θ | x̂) by the distribution of the accepted parameters θ_i,acc
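The corresponding sketch of Algorithm 2 is even simpler, since no samples need to be generated (critic is again a hypothetical wrapper returning the scalar score D(x̂ | θ_i), and sample_prior is as in the previous sketch):

import numpy as np

def abc_rejection_critic(x_hat, critic, sample_prior, n_draws, m, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)       # theta_i ~ p(theta)
        if critic(x_hat, theta) > m:    # accept parameter vectors with a high score
            accepted.append(theta)
    return np.array(accepted)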
We repeat the training of the PGGAN from Section 5, but this time excluding all samples corresponding to one fixed combination of f_X and f_α from the training set (for all three values of r_h/s, since the effect of r_h/s is less dominant than that of the other two parameters), in order to assess the ability of the PGGAN to interpolate to new points in parameter space, which is crucial for parameter inference.

As a test case, we take the sample x̂ shown in Figure 7, which arises from a moderate reionisation history expressed by a parameter vector θ̂ = (f̂_X, r̂_h/s, f̂_α) whose f_X and f_α values coincide with the excluded combination. Since it is difficult to determine appropriate values for ε and M a priori, we draw a large number of samples θ_i for each method and determine ε and M a posteriori such that the number of accepted samples amounts to 512, a small fixed fraction of the draws.
Figure 7.
The random 2D sample x̂ from the 21SSD catalogue for which we carry out the parameter inference, without noise (top) and with simulated thermal noise with SKA-like variance at each redshift (bottom). This sample belongs to the true parameter vector θ̂ = (f̂_X, r̂_h/s, f̂_α).

We checked that the resulting posterior distributions are sufficiently robust with respect to the accepted quantile; accepting a somewhat smaller or larger fraction of the draws gives similar results.

In this example, we take non-informative priors over the parameter range spanned by the 21SSD catalogue: a uniform prior for r_h/s and log-uniform priors for the scale variables f_X and f_α.

For the generator-based approach, we take the discrepancy Δ to be the 1-Wasserstein distance between the brightness temperature distributions at each scale factor a = (1 + z)⁻¹, integrated over the range of scale factors (or, equivalently, over the range of received frequencies). The scale factor a is a proxy for the horizontal position within the tomography samples (recall that the cells in the samples have constant Δν, implying constant Δa). The motivation behind this choice of Δ is the following: for each scale factor, similar parameter vectors are expected to give similar PDFs. Binning the values of δT_b at each scale factor, one could proceed by defining bin-to-bin discrepancy measures, such as the absolute value of the difference between the PDFs in each temperature bin, summed over all bins. However, given that only a single column of pixel values of δT_b is available at each scale factor for our 2D tomographic samples, very few values would be assigned to each temperature bin, and bin-to-bin measures would depend sensitively on the chosen bin size. In contrast, the Wasserstein distance, as a cross-bin dissimilarity measure, is much less sensitive to the bin size: the amount of work required to move "probability mass" to an adjacent bin is small, and so is the resulting Wasserstein distance. To be specific, the Wasserstein distance between two Dirac delta distributions located at T₁ and T₂ collapses to |T₁ − T₂| and simply measures the difference between the two temperatures. We take the 1-Wasserstein distance and the L¹-norm over the scale factor for consistency with the loss function of the GAN. The reason for evaluating the Wasserstein distance between the PDFs at each scale factor, rather than between the PDFs of all pixel values in the image, is that two samples with a very different evolution of the brightness temperature over time should be assigned a high discrepancy, even if the distributions of all their pixel values taken in aggregate are similar. In fact, binning δT_b is not needed at all for the calculation of the 1-Wasserstein distance, which can be computed in 1D by means of a discrete form of the equivalent characterisation (Vallender 1974):

W₁(p₁, p₂) = ∫ |P₁(x) − P₂(x)| dx,    (13)

where P₁ and P₂ are the cumulative distribution functions of p₁ and p₂, respectively, and the integral runs over the whole real line. Thus, we take

Δ_i = Δ(x̂, x_i) = ∫ W₁(t̂(a), t_i(a)) da,    (14)

integrated from a_min to a_max, the scale factors corresponding to the highest and lowest redshift covered by the samples, where t̂ = t̂(a) and t_i = t_i(a) are the PDFs of x̂ and x_i at scale factor a, respectively. The integrand is non-negative since the Wasserstein distance defines a metric. Numerically, we compute the 1-Wasserstein distance between the brightness temperature distributions at each image column and sum the results over the columns.
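For two empirical distributions with the same number of points, Equation (13) reduces to the mean absolute difference of the sorted values (equivalently, scipy.stats.wasserstein_distance could be used), so the column-wise discrepancy of Equation (14) can be sketched as follows:

import numpy as np

def wasserstein_1d(u, v):
    # 1-Wasserstein distance between two equally sized 1D empirical
    # distributions: mean absolute difference of the sorted values.
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def discrepancy(x_hat, x_i):
    # Sum of column-wise distances; each column of the (height, n_columns)
    # samples corresponds to one scale factor / received frequency.
    return sum(wasserstein_1d(x_hat[:, j], x_i[:, j])
               for j in range(x_hat.shape[1]))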
Figure 8. Corner plot of the approximated posterior distribution using ABC rejection sampling with the generator-based and the critic-based approach for the sample x̂ in Figure 7 (without thermal noise). The true parameter values are given in black in the corners of the diagonal plots. We report the equi-tailed credible intervals around the median in the plot titles. The contours in the 2D histograms mark the 0.5σ, 1σ, 1.5σ, and 2σ regions. A gentle Gaussian filter has been applied to the 1D and 2D histograms for presentation.

Figure 8 shows the resulting estimates of the posterior distribution p(θ | x̂) for the two approaches. We stress again that for this PGGAN, all samples corresponding to the true combination of f_X and f_α were excluded from the training set, not only the particular sample x̂. This is because in a realistic scenario, it is unlikely that the underlying parameter vector of an observed tomography sample lies exactly on the parameter grid of the training set.

Both approaches perform similarly well, although the generator-based approach gives a somewhat more accurate estimate of the Lyman band emissivity f_α. Compared with the quantiles of the much wider marginal prior distributions, the spread of the posterior distributions for f_X and f_α is small (see the titles of the marginal distributions in Figure 8), and the ABC inference has improved the bounds by a factor of a few. Determining the X-ray hard-to-soft ratio r_h/s is harder due to its smaller imprint on the tomography samples as compared to the other two parameters (see Figure 6). The posterior distributions for both approaches peak close to the correct value of r_h/s, although the improvement with respect to the prior quantiles is small. This is in line with 21SSD, who find that r_h/s is difficult to constrain with the power spectrum and the PDF. The observed covariance between r_h/s and f_X is expected, since increasing the X-ray emissivity and decreasing the X-ray hardness both result in a higher mean kinetic temperature of the gas.

For the generator-based approach, our choice of Δ is only one out of many well-motivated discrepancy measures. We remark that replacing the integrand in Equation (14) with the absolute difference between the mean brightness temperatures at scale factor a barely changes the results. In other words, simply comparing the means at each scale factor instead of the full temperature distributions does not result in a loss of discriminatory power. Nevertheless, statistics that capture more information, such as the one proposed herein, might be useful for future high-resolution 21cm imaging.

Whereas we have assumed in the above inference that the tomography sample x̂ with unknown parameters has the same quality as the training data, real 21cm samples will be plagued by thermal noise and foregrounds that need to be removed.
Therefore, it is vital to discuss how the above approaches carry over to a more realistic setting. For the generator-based approach, accounting for noise is relatively straightforward: given an instrument-specific noise model, random noise realisations can be added to the GAN-generated samples before calculating Δ_i. We repeat the generator-based parameter inference for the same sample, subject to Gaussian random noise with a redshift-dependent variance as expected for the SKA (depicted in Figure 7), choosing the same observational parameters as in 21SSD. For simplicity, we calculate the noise root mean square integrated over all wave numbers and subsequently draw random noise realisations in real space with SKA-like variance at each redshift (taking into account that the angular resolution for the fixed-size pixels varies with redshift), neglecting the spatial correlations that would arise when carefully sampling the noise in the UV-plane and applying an inverse Fourier transform back to real space.

Interestingly, the generator-based parameter inference barely deteriorates in the presence of noise: the resulting marginalised equi-tailed credible intervals remain close to those obtained in the noiseless case. This shows that astrophysical knowledge can be extracted in realistic situations where the measurements are impacted by both sample variance and thermal noise. We emphasise again that even for upcoming 21cm surveys, thermal noise will be a greater source of uncertainty than sample variance, and one would expect the credible regions to be set mainly by the noise. In such situations, the contribution of the epistemic uncertainty merits further investigation, which we will carry out in future work.

On the other hand, dealing with noise in the critic-based approach is more intricate: without introducing noise to the PGGAN, it is not clear whether the critic, which has been trained on noiseless samples, will output high scores for a noisy image together with the correct parameter vector. A possible solution could be to add realistic noise to the generated samples during training before showing them to the critic. While adding noise to the discriminator input (or to several layers of the discriminator) is a common technique in GANs (Salimans et al. 2016) to make it harder to distinguish fake from real images, this often comes at the cost of reduced image quality. Another option, for both generator-based and critic-based inference, is to add noise to the training samples and train the PGGAN to produce noisy samples. Our proposed methods exploit the fact that the generator and critic networks are trained anyway, but of course it is also possible to employ an additional (possibly Bayesian) neural network for the task of parameter inference, as done in several references in the introduction of this work.
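A minimal sketch of this noising step (sigma_of_column is a hypothetical per-column noise RMS derived from an SKA noise model; pixel noise is drawn independently, neglecting spatial correlations as described above):

import numpy as np

def add_thermal_noise(sample, sigma_of_column, rng):
    # sample: (height, n_columns); sigma_of_column: (n_columns,), the noise
    # RMS at the redshift of each column.
    noise = rng.standard_normal(sample.shape) * sigma_of_column[None, :]
    return sample + noise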
CONCLUSIONS

We have presented a framework for the fast creation of realistic 21cm tomography samples by means of a PGGAN. During training, the PGGAN learns to correctly account for the X-ray emissivity f_X, the X-ray hard-to-soft ratio r_h/s, and the Lyman band emissivity f_α, and to generate matching 21cm tomographic samples at gradually increasing resolution. Once trained, the PGGAN produces 2D samples at SKA-like resolution in roughly a second on a laptop CPU; in comparison, the production of the 21SSD catalogue (which of course contains snapshots and lightcones at much higher resolution) required a very large number of CPU hours. The PGGAN-generated samples are diverse and accurately reproduce the global 21cm signal, the power spectrum, and the pixel distribution function of the training data. We have shown how both the generator network and the critic network can be harnessed for the task of parameter inference in the context of ABC rejection sampling. The reionisation parameters of an exemplary tomography sample are accurately recovered, and tight constraints are obtained for the parameters f_X and f_α, despite the fact that no samples for the correct parameter vector were contained in the training set. For the generator-based approach, this holds even in the presence of simulated thermal noise with SKA-like variance, which demonstrates the applicability of our approach in realistic scenarios. In the dawning era of high-redshift 21cm imaging, deep learning techniques will provide valuable tools for forward and backward modelling. In this paper, we have demonstrated how (PG)GANs can be utilised for both tasks.
ACKNOWLEDGEMENTS
The authors thank Benoît Semelin for his help with the 21SSD data set and with the SKA noise computation, and for making the data publicly available. The authors also thank the anonymous referee for their feedback that improved the quality of this work. The authors acknowledge the National Computational Infrastructure (NCI), which is supported by the Australian Government, for providing services and computational resources on the supercomputers Raijin and Gadi that have contributed to the research results reported within this paper. This research made use of the Argus Virtual Research Desktop environment funded by the University of Sydney. F. L. is supported by the University of Sydney International Scholarship (USydIS).
REFERENCES
Abadi M., et al., 2016, preprint (arXiv:1603.04467)
Arjovsky M., Bottou L., 2017, preprint (arXiv:1701.04862)
Arjovsky M., Chintala S., Bottou L., 2017, International Conference on Machine Learning, pp 214–223
Arora S., Zhang Y., 2017, preprint (arXiv:1706.08224)
Baek S., Di Matteo P., Semelin B., Combes F., Revaz Y., 2009, A&A, 495, 389
Baek S., Semelin B., Di Matteo P., Revaz Y., Combes F., 2010, A&A, 523, A4
Barkana R., 2018, Nature, 555, 71
Barkana R., Loeb A., 2005, ApJ, 624, L65
Barry N., et al., 2019, preprint (arXiv:1909.00561)
Beardsley A. P., et al., 2016, ApJ, 833, 102
Bouwens R. J., et al., 2015a, ApJ, 803, 34
Bouwens R. J., Illingworth G. D., Oesch P. A., Caruana J., Holwerda B., Smit R., Wilkins S., 2015b, ApJ, 811, 140
Bowman J. D., Rogers A. E. E., Monsalve R. A., Mozdzen T. J., Mahesh N., 2018, Nature, 555, 67
Brock A., Donahue J., Simonyan K., 2018, preprint (arXiv:1809.11096)
Chardin J., Uhlrich G., Aubert D., Deparis N., Gillet N., Ocvirk P., Lewis J., 2019, preprint (arXiv:1905.06958)
Cohen A., Fialkov A., Barkana R., Lotem M., 2017, MNRAS, 472, 1915
Cohen A., Fialkov A., Barkana R., Monsalve R., 2019, preprint (arXiv:1910.06274)
DeBoer D. R., et al., 2017, Publ. Astron. Soc. Pacific, 129, 045001
Dean T. A., Singh S. S., Jasra A., Peters G. W., 2014, Scandinavian Journal of Statistics, 41, 970
Dewdney P. E., Turner W., Millenaar R., Mccool R., Lazio J., Cornwell T. J., 2013, SKA Technical Document
Dillon J. S., et al., 2014, Phys. Rev. D, 89, 023002
Doussot A., Eames E., Semelin B., 2019, MNRAS, 490, 371
Eames E., Doussot A., Semelin B., 2019, MNRAS, 3664, 3655
Eastwood M. W., et al., 2019, The Astronomical Journal, 158, 84
Ewall-Wice A., Chang T.-C., Lazio J., Doré O., Seiffert M., Monsalve R. A., 2018, ApJ, 868, 63
Feng C., Holder G., 2018, ApJ, 858, L17
Fialkov A., Barkana R., 2019, MNRAS, 486, 1763
Field G. B., 1958, Proceedings of the IRE, 46, 240
Field G. B., 1959, ApJ, 129, 536
Fragos T., et al., 2013, ApJ, 764, 41
Furlanetto S. R., Peng Oh S., Briggs F. H., 2006, Phys. Rep., 433, 181
Furlanetto S., et al., 2019a, preprint (arXiv:1903.06212)
Furlanetto S., et al., 2019b, preprint (arXiv:1903.06204)
Gehlot B. K., et al., 2019, MNRAS, 488, 4271
Gillet N., Mesinger A., Greig B., Liu A., Ucci G., 2019, MNRAS, 293, 282
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., 2014, Advances in Neural Information Processing Systems, 27, 2672
Greig B., Mesinger A., 2015, MNRAS, 449, 4246
Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A., 2017, in Advances in Neural Information Processing Systems, pp 5768–5778 (arXiv:1704.00028)
Gutmann M. U., Corander J., 2015, Journal of Machine Learning Research, 17, 1
Hassan S., Liu A., Kohn S., Aguirre J. E., Plante P. L., Lidz A., 2017, Proceedings of the International Astronomical Union, 12, 47
Hassan S., Andrianomena S., Doughty C., 2019, preprint (arXiv:1907.07787)
Hirano S., Bromm V., 2018, MNRAS, 480, L85
Jana R., Nath B. B., Biermann P. L., 2019, MNRAS, 483, 5329
Karras T., Aila T., Laine S., Lehtinen J., 2017, preprint (arXiv:1710.10196)
Karras T., Laine S., Aila T., 2018, preprint (arXiv:1812.04948)
Kern N. S., Liu A., Parsons A. R., Mesinger A., Greig B., 2017, ApJ, 848, 23
Kingma D. P., Ba J. L., 2014, preprint (arXiv:1412.6980)
Kingma D. P., Welling M., 2013, preprint (arXiv:1312.6114)
Kolopanis M., et al., 2019, ApJ, 883, 133
Koopmans L., et al., 2015, in Proceedings of Advancing Astrophysics with the Square Kilometre Array, PoS(AASKA14). Sissa Medialab, Trieste, Italy, p. 001 (arXiv:1505.07568), doi:10.22323/1.215.0001, https://pos.sissa.it/215/001
La Plante P., Ntampaka M., 2019, ApJ, 880, 110
Li W., et al., 2019, preprint (arXiv:1911.10216)
Loeb A., Zaldarriaga M., 2004, Phys. Rev. Lett., 92, 211301
Majumdar S., Pritchard J. R., Mondal R., Watkinson C. A., Bharadwaj S., Mellema G., 2018, MNRAS, 476, 4007
Mellema G., et al., 2013, Experimental Astronomy, 36, 235
Mellema G., Koopmans L., Shukla H., Datta K. K., Mesinger A., Majumdar S., 2015, in Proceedings of Advancing Astrophysics with the Square Kilometre Array, PoS(AASKA14). Sissa Medialab, Trieste, Italy, p. 010 (arXiv:1501.04203), doi:10.22323/1.215.0010, https://pos.sissa.it/215/010
Mesinger A., Furlanetto S., Cen R., 2011, MNRAS, 411, 955
Mesinger A., Greig B., Sobacchi E., 2016, MNRAS, 459, 2342
Mirza M., Osindero S., 2014, preprint (arXiv:1411.1784)
Miyato T., Koyama M., 2018, preprint (arXiv:1802.05637)
Miyato T., Kataoka T., Koyama M., Yoshida Y., 2018, preprint (arXiv:1802.05957)
Morales M. F., Wyithe J. S. B., 2010, Annu. Rev. Astron. Astrophys., 48, 127
Nash J. F., 1950, Proc. Natl. Acad. Sci., 36, 48
Nowozin S., Cseke B., Tomioka R., 2016, in Advances in Neural Information Processing Systems, pp 271–279 (arXiv:1606.00709)
Paciga G., et al., 2013, MNRAS, 433, 639
Parsons A. R., et al., 2010, The Astronomical Journal, 139, 1468
Patil A. H., et al., 2017, ApJ, 838, 65
Pillepich A., et al., 2018, MNRAS, 473, 4077
Planck Collaboration, 2016, A&A, 594, A13
Pober J. C., et al., 2013, ApJ, 768, L36
Pritchard J. R., Furlanetto S. R., 2007, MNRAS, 376, 1680
Pritchard J. R., Loeb A., 2012, Reports on Progress in Physics, 75, 086901
Razavi A., van den Oord A., Vinyals O., 2019, preprint (arXiv:1906.00446)
Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., Chen X., 2016, Advances in Neural Information Processing Systems, pp 2234–2242
Santos M. G., Ferramacho L., Silva M. B., Amblard A., Cooray A., 2010, MNRAS, 406, 2421
Schmit C. J., Pritchard J. R., 2018, MNRAS, 475, 1213
Semelin B., 2016, MNRAS, 455, 962
Semelin B., Combes F., 2002, A&A, 388, 826
Semelin B., Combes F., Baek S., 2007, A&A, 474, 365
Semelin B., Eames E., Bolgar F., Caillat M., 2017, MNRAS, 472, 4508
Sharma P., 2018, MNRAS, 481, L6
Shimabukuro H., Semelin B., 2017, MNRAS, 468, 3869
Slatyer T. R., Wu C.-L., 2018, Phys. Rev. D, 98, 023013
Thomas R. M., et al., 2009, MNRAS, 393, 32
Tingay S. J., et al., 2013, Publications of the Astronomical Society of Australia, 30, e007
Ulyanov D., Vedaldi A., Lempitsky V., 2016, preprint (arXiv:1607.08022)
Vallender S. S., 1974, Theory of Probability & Its Applications, 18, 784
Vonlanthen P., Semelin B., Baek S., Revaz Y., 2011, A&A, 532, A97
Wouthuysen S. A., 1952, The Astronomical Journal, 57, 31
Wyithe J. S. B., Loeb A., 2008, MNRAS, 383, 606
Zamudio-Fernandez J., Okan A., Villaescusa-Navarro F., Bilaloglu S., Cengiz A. D., He S., Levasseur L. P., Ho S., 2019, preprint (arXiv:1904.12846)
Zhang H., Goodfellow I., Metaxas D., Odena A., 2018, preprint (arXiv:1805.08318)
van Haarlem M. P., et al., 2013, A&A, 556, A2
APPENDIX A: IMPLEMENTATION DETAILS OF THE NEURAL NETWORK
Table A1 lists the layers of the neural network at the final resolution. We found it beneficial in our experiments to replace pixel normalisation by instance normalisation (Ulyanov et al. 2016), except directly after the input layer of the generator. In contrast to PGGAN, we process the generator input with a fully connected layer instead of a convolutional layer. Besides, we noticed that the PGGAN becomes prone to mode collapse if the number of channels is reduced as the PGGAN grows deeper, which is why we keep the number of channels constant; this is feasible at the resolution considered herein, but for the creation of higher-resolution samples, a bottleneck architecture might be attractive. Different from the common practice for GANs, downsampling in PGGAN is not achieved by strided convolutions, but rather by average pooling, which we follow. The upsampling operation is a nearest-neighbour interpolation. Recall that the channels of the critic input correspond to δT_b, the global signal at each redshift ⟨δT_b⟩, δT_b − ⟨δT_b⟩, and the three model parameters. As suggested by PGGAN, we employ an equalised learning rate and a minibatch standard deviation layer with group size 4. The decay rates of the moments for the Adam optimiser are taken to be β₁ = . and β₂ = .

It proved advantageous to take the logarithm of the two scale variables, i.e. the X-ray and Lyman band emissivities f_X and f_α, and to normalise all components of θ before feeding them to the neural network via the mapping θ_i ↦ (θ_i − θ̄_i) / σ_i for i = 1, 2, 3, where θ̄_i and σ_i are the mean and standard deviation of each parameter computed over the set of training images. For the training, we normalise the brightness temperatures via a mapping of the form

δT_b ↦ c₁ δT_b + c₂ + c₃ arcsinh(c₄ δT_b),    (A1)

for suitably chosen constants. The inverse hyperbolic sine term leads to a steeper slope and hence to an increased sensitivity to small perturbations around δT_b = 0. The physical temperature range of the samples is monotonically mapped to the interval used for training, and since we do not apply an activation function after the last 1 × 1 convolutional layer of the generator that would confine the generator output to a certain interval, the few pixels outside this range do not require any special treatment. We calculate the inverse transformation of the PGGAN output back to temperature space numerically. The fully grown PGGAN comprises several million trainable parameters in each of the generator and the critic.
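The pre-processing can be summarised by the following sketch (the logarithm base and the constants c1 to c4 in Equation (A1) are placeholders, as the exact values are not repeated here):

import numpy as np

def normalise_parameters(theta, mean, std):
    # theta = (f_X, r_h/s, f_alpha); the scale variables are log-transformed
    # (base assumed here), then all components are standardised with the
    # training-set mean and standard deviation.
    t = np.array([np.log10(theta[0]), theta[1], np.log10(theta[2])])
    return (t - mean) / std

def normalise_temperature(dTb, c1=1.0, c2=0.0, c3=1.0, c4=1.0):
    # Placeholder constants; the arcsinh term steepens the slope around
    # delta T_b = 0 and thereby increases sensitivity to small fluctuations.
    return c1 * dTb + c2 + c3 * np.arcsinh(c4 * dTb)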
APPENDIX B: SAMPLES FOR SELECTED PARAMETER VECTORS

Figure B1 shows four random 21SSD and PGGAN samples for each of five different parameter choices. The quality of the PGGAN samples is high, and each sample features an individual distribution of the fluctuations.
This paper has been typeset from a TeX/LaTeX file prepared by the author.
Figure B1.
Four random tomography samples from the 21SSD catalogue and from the PGGAN for each of five different parameter sets (f_X, r_h/s, f_α), from top to bottom.

Table A1.
Network architecture of the generator and critic for the fully grown PGGAN. PN: pixel normalisation, FC: fully connected layer, LReLU: leaky ReLU, IN: instance normalisation, Conv k_h × k_w: convolutional layer with kernel size k_h × k_w and stride 1 × 1, Minibatch STD: minibatch standard deviation layer.