MNRAS, 1–?? (2015)    Preprint 22 April 2020    Compiled using MNRAS LaTeX style file v3.0

Constraining the astrophysics and cosmology from 21cm tomography using deep learning with the SKA

Sultan Hassan,* † Sambatra Andrianomena, Caitlin Doughty
Department of Astronomy, New Mexico State University, Las Cruces, NM 88003, USA
University of the Western Cape, Bellville, Cape Town 7535, South Africa
South African Radio Astronomy Observatory (SARAO), Black River Park, Observatory, Cape Town, 7925, South Africa
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
Future Square Kilometre Array (SKA) surveys are expected to generate huge datasets of 21cm maps on cosmological scales from the Epoch of Reionization (EoR). We assess the viability of exploiting machine learning techniques, namely convolutional neural networks (CNN), to simultaneously estimate the astrophysical and cosmological parameters from 21cm maps from semi-numerical simulations. We further convert the simulated 21cm maps into SKA-like mock maps using the detailed SKA antennae distribution, thermal noise and a recipe for foreground cleaning. We successfully design two CNN architectures (VGGNet-like and ResNet-like) that are both able to efficiently and simultaneously extract three astrophysical parameters, namely the photon escape fraction (f_esc), the ionizing emissivity power dependence on halo mass (C_ion) and the ionizing emissivity redshift evolution index (D_ion), and three cosmological parameters, namely the matter density parameter (Ω_m), the dimensionless Hubble constant (h), and the matter fluctuation amplitude (σ_8), from 21cm maps at several redshifts. In the presence of noise from the SKA, our designed CNNs are still able to recover these astrophysical and cosmological parameters with great accuracy (high R² scores), improving further towards low redshift and low neutral fraction values. Our results show that future 21cm observations can play a key role in breaking the degeneracy between models and tightly constraining the astrophysical and cosmological parameters, using only a few frequency channels.
Key words: dark ages, reionisation, first stars – methods: numerical – methods: statistical
* E-mail: [email protected]
† Tombaugh Fellow

1 INTRODUCTION

The last global phase transition in the Universe, known as the Epoch of Reionization (EoR), marks the time at which the first stars gradually reionized the Inter-Galactic Medium (IGM) and the Universe transitioned from a highly neutral-opaque to a highly ionized-transparent state (for a review, see e.g. Loeb & Barkana 2001). This epoch represents a crucial period in the Universe's history, particularly with regard to the formation and evolution of early galaxies.

Constraining the astrophysical and cosmological parameters has been the focus of most observational and theoretical studies. Several techniques have been developed to constrain the cosmological parameters (e.g. the matter density parameter Ω_m and the Hubble constant H_0), such as using Cosmic Microwave Background (CMB) anisotropy measurements (e.g. Hinshaw et al. 2013; Planck Collaboration et al. 2016), Sunyaev-Zel'dovich cluster surveys (e.g. Battye & Weller 2003), galaxy clusters in optical and X-ray bands (e.g. Moscardini et al. 2001), gamma ray burst X-ray afterglow light curves (e.g. Cardone et al. 2010), lensed GW+EM signals (e.g. Li et al. 2019), the Ly-α forest power spectrum and COBE-DMR (e.g. Phillips et al. 2001), large-scale clustering of SDSS luminous red galaxies (e.g. Padmanabhan et al. 2007), and a joint CMB and weak lensing analysis (e.g. Contaldi et al. 2003). On the other hand, several works have attempted to constrain the astrophysical parameters (e.g. the photon escape fraction, f_esc, and the ionizing emissivity evolution, Ṅ_ion), using Ly-α forest measurements (e.g. Becker & Bolton 2013), Lyman continuum (LyC) radiation from local galaxies (e.g. Leitet et al. 2013), and inferred constraints by tuning different theoretical models to other measurements (e.g. Mitra et al. 2015; Finlator et al. 2015a).

While all these methods show different levels of success in placing constraints on various parameters, tighter constraints are expected to come from the EoR through measurements of the 21cm fluctuations on cosmological scales. With its strong dependence on the ionization and density fields, the 21cm signal carries a wealth of information that is important in order to understand the early stages of galaxy formation and evolution. In this light, many radio interferometer experiments, such as the Low Frequency Array (LOFAR; van Haarlem et al. 2013), the Precision Array for Probing the Epoch of Reionization (PAPER; Parsons et al. 2010), the Murchison Widefield Array (MWA; Bowman et al. 2013), the Giant Metrewave Radio Telescope (GMRT; Paciga et al. 2011), the Hydrogen Epoch of Reionization Array (HERA; DeBoer et al. 2017) and the Square Kilometre Array (SKA; Mellema et al. 2013) are devoted to detecting reionization in the near future. These growing observational efforts require equivalent efforts on both the theoretical and statistical sides, in order to prepare for extracting all possible information and constraining the cosmological and astrophysical parameters from future 21cm surveys.

Several studies have already shown that combining the 21cm power spectrum with Markov Chain Monte Carlo (MCMC) analysis is a powerful technique to obtain tighter constraints and break the degeneracy between models (e.g. Greig & Mesinger 2015; Liu et al. 2016; Pober et al. 2016; Hassan et al. 2017; Park et al. 2019). Besides the power spectrum, future 21cm surveys, the SKA in particular, are also expected to generate huge imaging datasets of the 21cm fluctuations on large scales that will contain more information than the power spectrum. Going beyond the power spectrum has been the target of many studies (e.g. Bharadwaj & Pandey 2005; Barkana & Loeb 2008; Watkinson & Pritchard 2015; Majumdar et al. 2018), in which more information can be obtained through investigating the non-Gaussian nature of the 21cm signal using higher-order statistics such as the bispectrum. To efficiently use the information stored in 2-dimensional 21cm maps, Convolutional Neural Networks (CNNs) have been a very successful deep learning tool to recover the astrophysical parameters during reionization (Gillet et al. 2019), to learn the reionization history (La Plante & Ntampaka 2019; Mangena et al. 2019), to emulate reionization simulations (Chardin et al. 2019), and to identify reionization sources from different models (Hassan et al. 2019). However, the astrophysical parameter recovery by Gillet et al. (2019) ignores instrumental effects, as an initial proof-of-concept study.
Accounting for these effects, such as the angular resolution, foreground cleaning, and thermal noise, is crucial in order to add realism to the simulated 21cm images as we prepare for the 21cm era.

In this work, we take a step further and design two different CNNs to simultaneously estimate several parameters from 21cm maps at several redshifts and different stages through reionization. We here assume that all observations at different redshifts are performed independently. We simply take maps from different redshifts and apply the instrumental noise directly on each map, assuming a single frequency channel of size ∼50 kHz (i.e. the simulation resolution). We finally combine the maps from different redshifts to create our training datasets. We note that learning from light-cones is beyond the scope of the current work. Our aim is to provide a network that is able to predict parameters without requiring the redshift nor the neutral fraction as inputs, which is a more flexible design. Three astrophysical parameters are evaluated: the photon escape fraction (f_esc), the ionizing emissivity power dependence on halo mass (C_ion) and the redshift evolution index (D_ion). Additionally, we estimate three cosmological parameters: the matter density parameter (Ω_m), the dimensionless Hubble constant (h), and the matter fluctuation amplitude (σ_8). To assess the ability of future 21cm tomography to constrain these parameters, we follow the recipe presented in Hassan et al. (2019) to add physically motivated and realistic 21cm noise to large-scale 21cm maps that are produced using our semi-numerical model, SimFast21 (Santos et al. 2008, 2010). This paper is organized as follows: we first describe our suite of simulations of the 21cm signal and noise in §2. We then present the two network designs in §3, describe the generation of our training datasets in §4, present our results in §5, and draw our concluding remarks in §6.

2 SIMULATIONS

2.1 SimFast21
We use the Instantaneous model of our semi-numerical simulation SimFast21, which was developed in Hassan et al. (2016) to improve over previous implementations of the ionizing source and sink populations in these semi-numerical simulations. In addition, it has recently been shown that this model is in relatively good agreement with predictions from our radiative transfer simulation (ARTIST; Molaro et al. 2019), particularly in terms of the morphology and power spectrum of the ionization and 21cm fields. However, the reionization history can be quite different for the same photon escape fraction value. This arises from a violation of photon conservation, which is an intrinsic problem in the use of the excursion-set formalism in semi-numerical simulations (Zahn et al. 2007; Paranjape et al. 2016; Hassan et al. 2017). As indicated by ARTIST, as a temporary solution all our photon escape fraction predictions can be adjusted by a factor of 20% to account for the photon conservation problem. We here briefly describe the simulation ingredients, and defer to Santos et al. (2010) for the full details of the simulation algorithm, and to Hassan et al. (2016) for the development of the Instantaneous model.

The dark matter density is generated in the linear regime from a Gaussian distribution using a Monte-Carlo approach. Evolving the density field to the non-linear regime is performed through the Zel'dovich (1970) approximation. Halos are then generated using the excursion-set formalism (ESF). Ionized regions are identified using a similar form of the ESF that is based on a direct comparison between the instantaneous rates of ionization R_ion and recombination R_rec in spherical regions of decreasing sizes as specified by the ESF. Regions are flagged as ionized if:

f_esc R_ion ≥ R_rec ,   (1)

where f_esc is the escape fraction. The R_rec is obtained from a radiative transfer simulation (Finlator et al. 2015b), in order to account for clumping effects below our cell size. The R_rec is parameterized as a function of overdensity Δ and redshift z as follows:

R_rec / V = A_rec (1 + z)^β [ (Δ/B_rec)^γ / (1 + (Δ/B_rec)^γ) ] ,   (2)

where V refers to the cell volume, and the best-fitting constants A_rec, B_rec, β and γ are those derived in Hassan et al. (2016). The R_ion parameterization is derived from a combination of the radiative transfer simulation (Finlator et al. 2015b) and a larger hydrodynamic simulation (Davé et al. 2013), both of which have been shown to reproduce a wide range of observations, including low-z observations. The R_ion is parameterized as a function of halo mass M_h and redshift z as follows:

R_ion / M_h = A_ion (1 + z)^{D_ion} (M_h / B_ion)^{C_ion} exp[ −(C_0 / M_h)^{3.0} ] ,   (3)

where A_ion is the overall normalization, B_ion is a characteristic halo mass, and C_0 sets the low-mass suppression scale, with best-fitting values given in Hassan et al. (2016); the best-fitting values of the ionizing emissivity dependence on halo mass, C_ion, and on redshift, D_ion, were found to be C_ion = 0.41 and D_ion = 2.28, respectively.
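The cell-flagging criterion of Equation (1) is easy to prototype. The sketch below evaluates a schematic R_ion of the functional form of Equation (3) and applies the ionization test; the normalization constants are hypothetical placeholders (not the calibrated values of Hassan et al. 2016), and the excursion-set smoothing over spheres of decreasing radius is omitted.

```python
import numpy as np

# Hypothetical placeholder constants standing in for the calibrated
# best-fit values of Hassan et al. (2016); only the functional form
# of Equation (3) is taken from the text.
A_ION, B_ION, C_CUT = 1.0e40, 9.5e8, 1.7e8  # normalization and mass scales

def R_ion(M_h, z, C_ion=0.41, D_ion=2.28):
    """Ionization rate per halo, schematic form of Equation (3)."""
    return A_ION * M_h * (1.0 + z)**D_ion * (M_h / B_ION)**C_ion \
           * np.exp(-(C_CUT / M_h)**3.0)

def flag_ionized(R_ion_field, R_rec_field, f_esc=0.2):
    """Equation (1): a region is ionized if f_esc * R_ion >= R_rec.
    Applied here cell-by-cell; SimFast21 applies it on spheres of
    decreasing radius following the excursion-set formalism."""
    return f_esc * R_ion_field >= R_rec_field

# Toy 3-cell example: the escaping emissivity must beat recombinations.
rion = np.array([1.0, 5.0, 10.0])
rrec = np.array([2.0, 0.5, 1.0])
print(flag_ionized(rion, rrec, f_esc=0.2))  # -> [False  True  True]
```

The same comparison is what the ESF performs on progressively smaller smoothing scales, so a cell can be ionized by a larger region even when it fails the test locally.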
Later, we will change these parameters to generate the training and testing datasets. Note that Equation (3) shows that R_ion scales as M_h^1.41, which is consistent with the SFR−M_h relation previously found by Finlator et al. (2011). We defer to Hassan et al. (2016) for the full details on the derivation of the R_ion and R_rec fitting functions and their effects on several key reionization observables.

2.2 Instrumental effects

We here describe the method used to account for various instrumental effects, following the recipe developed in Hassan et al. (2019). We briefly review this method below and refer the interested reader to Hassan et al. (2019) for detailed information and the complete steps of how we convert a simulated 21cm map into a mock map according to the assumed array design. In this work, we restrict our analysis to the proposed SKA design and leave a more detailed comparison between different arrays, such as HERA and LOFAR, for future works. The instrumental noise is applied separately at each redshift, assuming a single frequency channel corresponding to the map size (∼50 kHz). We leave to future works learning from light-cones by considering many frequency channels over a large bandwidth in the analysis.

The 21cm instrument simulation pipeline consists of three parts:

• Foreground cleaning:
Foreground contaminated modes of the signal lie inside the foreground wedge in the k⊥–k∥ plane. The foreground wedge slope (m) is given by:

m = D H_0 E(z) sin θ / [ c (1 + z) ] ,   (4)

where D is the comoving distance to the observed redshift, H_0 is the Hubble constant, c is the speed of light, E(z) ≡ √(Ω_m (1+z)³ + Ω_Λ), and θ is the beam angle. To clean foregrounds, we simply zero out all modes within the wedge, satisfying k∥ < m k⊥. For the same experiment, the slope increases with redshift, which means more modes are removed at higher redshifts. We quote exact wedge slope values for the SKA at our redshifts of interest in Table 1. This is the first step of the noise pipeline, to clean foregrounds from the 3-dimensional co-eval cubes.

Table 1. Summary of our assumed SKA array design.
Array design: 866-station compact core
Station diameter, D [m]: 35
Station area, A [m²]: a function of ν [MHz]
System temperature [K] (T_sys = T_sky + T_rcvr): 1.1 T_sky + 40
Total observation time t_int [h]: 1000
Frequency resolution Δν [kHz]: 48
Redshift: 10, 9, 8, 7
Frequency [MHz]: 129, 142, 158, 178
FWHM [arcmin]: 1.37, 1.24, 1.12, 0.99
Beam angle θ [rad]: 0.066, 0.06, 0.054, 0.048
Default wedge slope m, Equation (4): 0.27, 0.23, 0.19, 0.15

• Angular resolution: we account for the angular resolution of a given array by exploiting its detailed baseline distribution, via the uv-coverage, which is a measure of the baseline intensity observing the signal modes in directions perpendicular to the sightline. The uv-coverage is computed using the 21cmSense package¹ from our assumed SKA antennae distribution. We then Fourier transform the simulated 21cm map and set the signal to zero at k⊥ modes whose uv-coverage is zero². We additionally smooth the simulated maps using a Gaussian filter whose full width at half maximum (FWHM) is given by FWHM = λ_21cm (1+z)/B, where the maximum baseline length B = 5,834 m for our assumed SKA design, and λ_21cm is the rest-frame wavelength of the 21cm signal. This sets the minimum angular resolution for our assumed SKA design. For instance, our simulated maps initially have a higher angular resolution at z = 7 than the instrument provides, and are smoothed down to the lower angular resolution set by the FWHM at this redshift. Exact angular resolution values as a function of redshift are quoted in Table 1. The angular resolution recipe is applied on maps extracted from the 3-dimensional foreground-filtered boxes from the previous step.

• Thermal noise: the thermal noise is uncorrelated between measurements, and can be drawn from a Gaussian distribution of zero mean and a standard deviation (Zaldarriaga et al. 2004) given by:

√⟨|N|²⟩ [Jy] = 2 k_B T_sys / ( A √(Δν t_int) ) ,   (5)

where t_int here is the integration time to observe a single visibility at a frequency resolution Δν, and k_B is the Boltzmann constant. The total system temperature T_sys and other parameters are summarized in Table 1.
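The wedge-cleaning and thermal-noise steps above can be sketched in a few lines of numpy. This is a minimal illustration, assuming a co-eval cube on a uniform grid; for brevity, the ∼1/√N_uv noise suppression is applied directly in map space here, whereas the pipeline applies it to the Fourier-space noise before inverse transforming.

```python
import numpy as np

def wedge_clean(cube, m, box_len):
    """Zero all Fourier modes inside the foreground wedge, i.e. those
    satisfying k_par < m * k_perp, where m is the wedge slope of
    Equation (4) (e.g. m = 0.27 at z = 10 for our SKA setup, Table 1)."""
    n = cube.shape[0]
    k = 2.0 * np.fft.fftfreq(n, d=box_len / n) * np.pi  # comoving wavenumbers
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k_perp = np.sqrt(kx**2 + ky**2)   # sky-plane modes
    k_par = np.abs(kz)                # line-of-sight modes
    ft = np.fft.fftn(cube)
    ft[k_par < m * k_perp] = 0.0      # remove foreground-contaminated modes
    return np.fft.ifftn(ft).real

def thermal_noise_map(n, sigma, n_uv, rng):
    """Gaussian thermal noise with rms sigma per Equation (5), suppressed
    by ~1/sqrt(N_uv) wherever the uv-coverage N_uv is non-zero."""
    noise = rng.normal(0.0, sigma, size=(n, n))
    return noise / np.sqrt(np.maximum(n_uv, 1.0))

rng = np.random.default_rng(0)
cube = rng.normal(size=(32, 32, 32))          # stand-in brightness cube
cleaned = wedge_clean(cube, m=0.27, box_len=150.0)
noise = thermal_noise_map(32, 1.0, n_uv=np.full((32, 32), 4.0), rng=rng)
```

Because the wedge slope grows with redshift (Table 1), the same filter removes progressively more modes from the higher-redshift cubes.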
Having generated the thermal noise in a 2D grid using the above equation in Fourier space, we further suppress the noise where the uv-coverage N_uv is non-zero, by a factor of ∼1/√N_uv. We finally inverse Fourier transform the noise map and add it to the angular-resolution- and foreground-filtered signal map to form our mock 21cm map.

¹ https://github.com/jpober/21cmSense
² Modes with zero uv-coverage lie outside the angular resolution of the experiment.

Figure 1. Examples of four randomly selected 21cm maps (top), from our training dataset, with their corresponding mock versions (bottom), using our assumed SKA design, at z = 10, 9, 8 and 7 (left to right). Red and blue colours represent neutral and ionized regions, respectively. Subtitles show the astrophysical and cosmological parameters used to generate each map. These parameters are: the photon escape fraction (f_esc), the ionizing emissivity power dependence on halo mass (C_ion), the ionizing emissivity redshift evolution index (D_ion), the matter density parameter (Ω_m), the dimensionless Hubble constant (h), and the matter fluctuation amplitude (σ_8). Coloured version is available online.

Using this pipeline with the parameters listed in Table 1, the rms brightness temperature (noise level) is small, thanks to the SKA uv-coverage that extends down to very small scales. Highly ionized maps (low x_HI) and fully neutral maps (high x_HI), as described later in §
4, are already excluded from the training sample, since they are identical for different sets of parameters. Distinguishing identical maps is challenging for neural networks, where more information, such as redshift evolution, is required to assist parameter recovery at these extreme limits. When the Universe is highly ionized (see column 4 in Figure 1), the noise dominates, but the residual neutral patches can nevertheless still be seen and recognized. These residual patches are usually different for different sets of parameters, which might help the network to distinguish between maps and parameters. On the other hand, at the beginning of reionization, the ionized regions are very small due to the small number of sources and ionizing photons. The noise then contaminates and fills these small ionized regions (see column 1 in Figure 1), and hence the maps might look similar to those from a fully neutral Universe. This makes recognizing the prominent signal features more challenging, and many of the reionization realizations for a highly neutral Universe become approximately indistinguishable. This might impact the parameter recovery from a highly neutral IGM, which mostly exists at high redshifts where the noise is stronger.

3 NETWORK ARCHITECTURES

We consider two types of network in this work. It is worth reiterating that our main objective is to be able to infer both astrophysical {f_esc, C_ion, D_ion} and cosmological {Ω_m, h, σ_8} parameters simultaneously from their corresponding 21cm maps. To this end, our main focus is simply to explore different network designs, with different layouts in width and depth, as an attempt to achieve our goal.

The first architecture (network I) considered for our investigation is given in Table 2. It is somewhat similar to VGGNet (Simonyan & Zisserman 2014) in terms of chaining convolutional layers before downsampling; the key difference here is that in each stage we have two convolutional layers with the same number of feature maps in a row³, followed by batch normalisation and a ReLU activation (Conv+Conv+BN+ReLU), as shown in Figure 2. We note that the representation N×N + M(S) denotes the kernel size (N×N) and the stride (M) of a convolutional layer. In total, we have four stages, each with a Conv+Conv+BN+ReLU layer followed by a max pooling with stride = 2 to reduce the dimensions of the inputs, and four dense layers, each with 1024 units, with the exception of the output layer, which has only 6 units corresponding to the number of inferred parameters. This network design is also similar to the previous design used in our reionization models classifier (Hassan et al. 2019), except that the convolutional layers used here are wider and no dropout seems to be needed. This is consistent with the disharmony observed between batch normalization and dropout (Li et al. 2018). As in our previous work on the classifier, we initialize the network weights

³ Which we refer to as mapping the input x without reducing its dimensions (width × height); in other words, the outputs have the same width and height as the inputs from the previous stage.

Table 2. The architecture of Network I for this study.
Layer | Output shape
1 Input | (1, 200, 200)
2–20 Four Conv+Conv+BN+ReLU stages (3×3 kernels), each followed by a stride-2 Max Pooling | –
21 Fully Connected Layer | (1024)
22 Batch Normalization | –
23 ReLU Activation | –
24 Fully Connected Layer | (1024)
25 Batch Normalization | –
26 ReLU Activation | –
27 Fully Connected Layer | (1024)
28 Batch Normalization | –
29 ReLU Activation | –
30 Fully Connected Layer | (6)
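The spatial bookkeeping implied by this layout is easy to verify: the padded 3×3 convolutions preserve the width and height, and each stride-2 max pooling halves them with ceiling rounding (the same convention visible in the Network II shapes of Table 3). A minimal sketch; the per-stage feature-map counts are not reproduced here.

```python
import math

def network1_spatial_sizes(n_in=200, n_stages=4):
    """Spatial size of the activations after each of the four
    Conv+Conv+BN+ReLU stages of network I: padded convolutions keep
    the size, and each stride-2 max pooling halves it (rounding up)."""
    sizes = [n_in]
    for _ in range(n_stages):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

# 200x200 inputs shrink to 13x13 before the three Fully Connected(1024)
# blocks and the final Fully Connected(6) output layer:
print(network1_spatial_sizes())  # -> [200, 100, 50, 25, 13]
```

The final 6-unit layer matches the number of inferred parameters, {f_esc, C_ion, D_ion, Ω_m, h, σ_8}.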
Table 3. The architecture of Network II for this study.
Layer | Output shape
1 Input | (1, 200, 200)
2 Convolutional Layer | (16, 200, 200)
3 Batch Normalization | –
4 ReLU Activation | –
5 Residual Layer (3 Residual Blocks) | (16, 100, 100)
6 Residual Layer (6 Residual Blocks) | (32, 50, 50)
7 Residual Layer (6 Residual Blocks) | (64, 25, 25)
8 Residual Layer (3 Residual Blocks) | (128, 13, 13)
9 Inception Module | (240, 13, 13)
10 Max Pooling | (240, 7, 7)
11 Inception Module | (240, 7, 7)
12 Inception Module | (256, 7, 7)
13 Inception Module | (288, 7, 7)
14 Max Pooling | (288, 4, 4)
15 Fully Connected Layer | (1024)
16 Batch Normalization | –
17 ReLU Activation | –
18 Fully Connected Layer | (1024)
19 Batch Normalization | –
20 ReLU Activation | –
21 Fully Connected Layer | (1024)
22 Batch Normalization | –
23 ReLU Activation | –
24 Fully Connected Layer | (6)
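The output shapes in Table 3 can be sanity-checked with the same shape arithmetic, assuming stride-2 downsampling with "same" padding (ceiling rounding) at each residual layer and at the two max poolings:

```python
import math

def down(n):
    """One stride-2 downsampling with 'same' padding: ceil(n / 2)."""
    return math.ceil(n / 2)

sizes = [200]                  # input width/height
for _ in range(4):             # the four residual layers each downsample once
    sizes.append(down(sizes[-1]))
sizes.append(down(sizes[-1]))  # max pooling after the first inception module
sizes.append(down(sizes[-1]))  # max pooling before the dense head
print(sizes)  # -> [200, 100, 50, 25, 13, 7, 4], matching Table 3
```

The ceiling rounding explains the odd 13×13 and 7×7 intermediate shapes in Table 3.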
Figure 2. One stage in network I: chaining two convolutional layers with the same number of feature maps, followed by batch normalisation and a ReLU function before a max pooling. Coloured version is available online.
Figure 3. Residual block in network II. Left panel: the downsampling only occurs at the first convolutional layer (blue), while the dimension is kept the same at the second convolutional layer (blue). To match the dimensions of the output from the chain of convolutional layers (blue ones), the input is fed to a convolutional layer with stride = 2 (red). Right panel: when there is no downsampling, the input is simply added to the output from the chain of convolutional layers (blue ones). Coloured version is available online.

using a generalized form of the Xavier initializer (Glorot & Bengio 2010), also called the Variance Scaling initializer, in which the random numbers are drawn from a zero-mean Gaussian distribution whose variance is equal to the inverse of the average of the number of input and output neurons. This initializer ensures that the variance of the input data is preserved as it propagates through the network layers.

Our second architecture, which we simply name network II, is based on a combination of residual layers (He et al. 2016) and inception modules (Szegedy et al. 2015), as shown in Table 3. The inputs, as described in §2, are first fed into a convolutional layer followed by a batch normalization (Ioffe & Szegedy 2015) and a ReLU activation (Conv+BN+ReLU). This is then followed by four residual layers, composed of three, six, six and three residual blocks, respectively. It was shown in He et al. (2016) that the resulting error (both training and testing) of a deeper architecture tends to be larger than that of a shallower architecture; they therefore proposed the residual layer, which allowed them to increase the depth of the model in order to gain better performance. In contrast with network I, instead of using simple convolutional layers we stack residual blocks, which are constructed following the schematic shown in Figure 3 (right panel), where the residual mapping is a Conv+BN+ReLU+Conv+BN+ReLU+Conv+BN chain. When there is downsampling (Fig. 3, left panel) through the chain of convolutional layers in a residual block, the input needs to be downsampled using a Conv+BN layer to match the dimensions of the output of the chain of convolutional layers. In each residual layer, the downsampling occurs at the first residual block. There are variants of deep residual networks, but in essence the one we consider here is such that the network performance is optimized for our specific task. As proposed by Szegedy et al. (2015), in order to
Figure 4. Inception module considered in this study. The red convolutional layers (1×1) are used for dimensionality reduction. Coloured version is available online.

improve the recognition of more complex features at the higher levels of the network, we make use of four inception modules after the residual layers. The prescription suggested in Szegedy et al. (2015) is to deal with the computational complexity related to the depth of the network, i.e. increasing the size of the network while maintaining the computational cost. The inception module used in this network design is shown in Figure 4. The idea behind convolving the inputs with a 1×1 filter before the convolutional layers with 3×3 and 5×5 kernels is to reduce the number of feature maps from the inputs, as computations are more expensive with larger kernel sizes. The features at different scales, captured by the different kernel sizes 1×1, 3×3 and 5×5, can be learned simultaneously (Szegedy et al. 2015). It is worth noting that we opt for He initialization (He et al. 2016) for all layers in network II.

For training, as usual for any machine learning task, one needs to fine-tune the hyperparameters in order for the algorithm to generalize well and hence achieve the best possible performance, where the distance between the ground truth and the network predictions is minimal. To that end, as shown in Table B1 in Appendix B, we use two completely different approaches in terms of optimisation for the two architectures. Although the two networks produce similar results, as will be presented in §5, network I converges faster. This can be explained by the capacity of network I, whose number of trainable parameters is about 167 million, whereas network II has far fewer trainable parameters and correspondingly fewer floating point operations (flops) at inference time.

For reproducibility, we have used the TensorFlow package (Abadi et al. 2015) to develop network I, which has been trained for 50,000 training steps (about 100 training steps per epoch), taking ∼15 hours on a single GPU. With a batch size of 128 images per training step, the loss effectively converges at epoch = 40 (see Figure A1 in Appendix A). This indicates that the same results can be obtained in about 6 hours on a single GPU. For network II, we have used PyTorch (Paszke et al. 2019), resorting to three GPUs to speed up the convergence. Each epoch, in which each GPU processes in parallel a batch of 128 images at a time, takes about 2 minutes, which translates to 40 hours for 1200 epochs.
4 TRAINING DATASETS

We generate the training dataset from a large simulation box of size L = 150 Mpc with N = 200 cells per side. We run 1,000 different reionization simulation realizations with 1,000 different random seeds for the initial density field fluctuations. The priors assumed for our parameters of interest cover the following:

• Cosmology:
– the matter density parameter Ω_m,
– the Hubble constant h,
– the matter fluctuation amplitude σ_8.

• Astrophysics:
– the photon escape fraction f_esc,
– the R_ion–M_h power dependence C_ion,
– the R_ion redshift evolution index D_ion.

The ranges considered for the astrophysical parameters are motivated by our previous MCMC estimates to reproduce various reionization observables (Hassan et al. 2017), and those of the cosmology are inspired by the recent parameter estimates from the Planck Collaboration 2018 (Aghanim et al. 2018). From these priors, we select 1,000 values for each parameter using Latin Hypercube Sampling (LHS), in order to efficiently explore our 6-dimensional parameter space while ensuring that the simulation does not run twice with the same set of parameters. From each simulation run, we store the 21cm brightness temperature at several redshifts, z = 10, 9, 8, 7, in order to have a sufficiently large number of maps for training. Balancing the training dataset is important to ensure equal learning at all redshifts and all neutral fraction values. This can be achieved by flattening the distribution according to the neutral fraction at each redshift. Flattening the distributions has previously been used in learning cluster masses (Ntampaka et al. 2015) to reduce the bias towards low-mass clusters. To flatten our distribution, we take the following steps: first, we ignore highly ionized and highly neutral 21cm boxes. Second, at each redshift, we bin the boxes according to their neutral fraction.
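The LHS selection above can be sketched with a simple permutation construction, a minimal stand-in for a dedicated LHS library. The prior bounds in the example are placeholders for illustration only, not the values used in this work.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng=None):
    """Draw n_samples points in len(bounds) dimensions such that each
    one-dimensional projection hits every one of n_samples equal bins
    exactly once, so no two simulations share a parameter bin."""
    rng = np.random.default_rng(rng)
    d = len(bounds)
    # one stratified uniform draw per bin, independently permuted per axis
    u = (rng.permuted(np.tile(np.arange(n_samples), (d, 1)), axis=1).T
         + rng.uniform(size=(n_samples, d))) / n_samples
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)

# placeholder bounds for (f_esc, C_ion, D_ion, Omega_m, h, sigma_8)
bounds = [(0.01, 1.0), (0.0, 1.0), (0.0, 2.0), (0.2, 0.4), (0.6, 0.8), (0.7, 0.9)]
samples = latin_hypercube(1000, bounds, rng=0)
print(samples.shape)  # -> (1000, 6)
```

Stratifying every axis this way covers the 6-dimensional space far more evenly than 1,000 independent uniform draws would.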
Since the neutral fraction changes strongly with different parameters at different redshifts, the number of boxes in each neutral fraction bin is also different. One has to choose a fixed number of boxes to select from each bin to flatten the distribution. We here choose 20 boxes, randomly selected from each bin. If a neutral fraction bin has fewer than 20 boxes, then we consider all boxes in this specific bin. If all neutral fraction bins had 20 boxes at the 4 different redshifts, then the total number of boxes would be 800. However, a few bins have fewer than 20 boxes, which reduces the total number of boxes to 763. Each selected box yields 200 different maps along each of the x, y and z directions, so each redshift has 200 × 3 maps per selected box as a pool of possible different 21cm maps.

Figure 5. The distribution of the training sample at z = 7, 8, 9, 10 (top to bottom) as a function of neutral fraction. We intentionally ignore all current reionization history constraints to check the CNNs' viability to recover parameters without imposing any priors. Coloured version is available online.

However, close maps in the same box would contain similar features, and from our own experience (e.g. Hassan et al. 2019), we have found that a smaller, well-separated subset (of order 40 maps per box) is sufficient for training. Two points are worth noting:

• Each set of the six parameters corresponds to four boxes at redshifts z = 10, 9, 8, 7. This means the network sees the same six parameters for four different maps from four different redshifts.

• The redshift information is encoded in each box through the density field contribution on small scales. This means the network sees four different levels of density field contribution in the neutral regions in all maps.

An explicit inclusion of this effect would be through creating light-cones for each set of the six parameters, to account for ionized bubble growth along the sightline (i.e. the k∥ modes), redshift-space distortions and angular scales. However, we here assume that all observations at different redshifts are performed independently and apply the noise using
Table 4. The hyperparameters and optimisers used to train the algorithms.

network   optimiser   learning rate   batch   cost function
I         Nesterov    0.005           128     ℓ norm
II        Adam        0.01            128     rmse

a single frequency channel (resolution) corresponding to the map size. The light-cone is more relevant when the full bandwidth is considered. This study sets the baseline for a more detailed analysis to compare the results from 2-dimensional maps (a single frequency channel) versus 3-dimensional light-cones (the full bandwidth). Indeed, it is expected that more information exists in the reionization window (e.g. Liu et al. 2014), which contains the bubble evolution along the frequency axis through constructing the light-cone. We leave this for future works, as it is beyond the scope of the current paper.

While our box size, 150 Mpc, might be too small to capture the large scale fluctuations and cosmic variance (Iliev et al. 2014), we have previously found that our simulation produces a convergent 21cm power spectrum with respect to the volume (see Figure 8 in Hassan et al. 2016), such that the 150 Mpc volume produces similar power to that from a 300 Mpc volume. This indicates that the ionized bubble distribution in 150 Mpc volumes is similar, on average, to that in larger volumes. In addition, it has also been found that such a volume produces a convergent reionization history (Iliev et al. 2014). However, the resolution (number of cells) is more important for the neural network performance, since higher resolution maps contain more information and structures. Our maps are composed of 200×200 pixels, which are able to resolve the small and large scale fluctuations reasonably well. We leave investigating the network performance in terms of box size and resolution for future works.
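The bin-balanced box selection used above to flatten the neutral-fraction distribution can be sketched as follows. This is a minimal illustration, not the paper's code: the bin edges, the toy skewed distribution, and all function names are our own assumptions; only the cap of 20 boxes per bin comes from the text.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed

def flatten_selection(neutral_fractions, n_bins=10, n_per_bin=20):
    """Select at most `n_per_bin` boxes per neutral-fraction bin,
    flattening an otherwise skewed distribution (sketch)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    selected = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((neutral_fractions >= lo) & (neutral_fractions < hi))[0]
        if len(idx) > n_per_bin:
            # randomly down-sample over-populated bins
            idx = rng.choice(idx, size=n_per_bin, replace=False)
        selected.extend(idx.tolist())  # under-populated bins: keep all
    return np.array(selected)

# toy example: a skewed distribution of box neutral fractions
x_hi = rng.beta(2, 5, size=500)
sel = flatten_selection(x_hi)
```

After selection, no bin contributes more than 20 boxes, so the training sample is approximately uniform in neutral fraction.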
To assess how well the algorithms perform in terms of predicting the parameters from learning the input features, we use the coefficient of determination, also known as the R² score, which is given by

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},   (6)

where ŷ_i, y_i and ȳ are the predicted value, the actual value and the average of all the actual values in the test sample, respectively. The numerator of the second term in Equation 6 – the residual sum of squares – quantifies the variation of the predicted values ŷ_i around the actual values y_i, and the denominator accounts for the variation of the actual values y_i around their mean ȳ. This metric quantifies the strength of the correlation between the inferred and true values of the parameters; in other words, an R² of unity indicates that the network predictions are identical to the ground truth. The R² also quantifies the fraction by which the predicted variance is less than the true variance. For each architecture, we carry out two types of training, depending on the input features that the regressors are meant to learn from in order to infer our astrophysical and cosmological parameters (f_esc, C_ion, D_ion, Ω_m, h, σ_8):

• feature extraction from a simulated 2D 21cm map (clean/noiseless map hereafter); this involves training and testing using clean maps,
• feature extraction from a simulated 2D 21cm map which was convolved with a simulated SKA-like noise (noisy/mock map hereafter); this involves training and testing using noisy maps.

It is worthwhile to mention that we train our networks to predict standardized parameters, meaning that we first subtract the mean and divide by the standard deviation for each array of the parameters. After training, we scale back the predictions to the prior range. Standardizing parameters is important, particularly when the parameter ranges differ, to prevent the parameter with the largest range from dominating the loss function. This step is commonly used in multi-parameter regression deep learning tasks.
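Equation 6 and the standardization step can be written out directly; this is a generic sketch (function names are ours) rather than the paper's implementation:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination (Equation 6)."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # variation around the mean
    return 1.0 - ss_res / ss_tot

def standardize(params):
    """Per-parameter standardization: subtract the mean, divide by the std."""
    mu, sigma = params.mean(axis=0), params.std(axis=0)
    return (params - mu) / sigma, (mu, sigma)

def destandardize(scaled, stats):
    """Scale network predictions back to the original prior range."""
    mu, sigma = stats
    return scaled * sigma + mu
```

Note that r2_score returns exactly 1 for perfect predictions and 0 for a regressor that always predicts the sample mean, matching the interpretation in the text.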
clean maps

The top two rows in Figure 6 show the test results when training the networks with the clean maps. The red and blue areas encompass the 15.9% and 84.1% percentiles (i.e. the ∼1σ level) of the true values given the predictions at each bin, for network I and network II respectively. Overall, the constraints on each parameter are very tight. The high value of the R² score for both network I and network II on each parameter fitting denotes a very strong correlation between the true and inferred parameters, suggesting that the algorithms are able to learn the salient features from the data. On comparing the performance of the two architectures, their R² scores for each fitting suggest that they are in fairly good agreement, and hence perform equally well.

noisy maps

The test results after training the algorithms on the noisy maps are presented in the bottom two rows in Figure 6. The constraints on all parameters are slightly weaker compared to those obtained from training the fitters using the clean maps. The overall decrease in performance, denoted by the lower values of the R² score, corroborates that finding. This can be accounted for by the fact that the relevant features are in this case convolved with noise, therefore extracting them is a bit more challenging.

Despite being convolved with noise, which essentially causes the quality of their features to degrade, all parameters are successfully recovered with high accuracy by both network I and network II, which is remarkably promising. In contrast to training the algorithms with the clean maps, it can be noticed that, overall, network II outperforms network I, as demonstrated by the R² scores of the former, which are a bit higher than those of the latter on all parameters.
[Figure 6: six panels of predicted vs. true values for Ω_m, h, σ_8, f_esc, C_ion and D_ion, for clean maps (top two rows) and noisy maps (bottom two rows); each panel quotes the R² and RMSE for networks I and II.]

Figure 6.
Correlation between the true and predicted parameters. On the top two rows, the networks have been trained with the maps without noise, whereas on the bottom two rows, simulated SKA-like noise has been injected into the maps used to train the networks. Red and blue shaded areas encompass the 15.9% and 84.1% percentiles (i.e. the ∼1σ level) of the true values given the predictions from network I and network II, respectively. Solid black lines represent the identity line, i.e. true parameters vs. true parameters. In all cases, the astrophysical parameters' recovery is better than that of the cosmological parameters. Adding the noise reduces the accuracy, but the parameter recovery is still promising. Network II outperforms network I, particularly with the mock images, since a more complex architecture seems to be needed to extract more information. Large fluctuations are due to low number statistics. Coloured version is available online.

Figure 7.
Variation of the resulting coefficient of determination R² as a function of redshift (left panel) and neutral fraction (right panel). Solid and dashed lines correspond to network I and network II, respectively. The accuracy of parameter recovery increases slowly towards low redshift, where the noise is smaller, and rapidly towards low neutral fraction values, where the image features can still be recognized (see Figure 1). Coloured version is available online.

In real observations, both foreground contamination and the thermal noise become stronger with increasing redshift. One would then expect some form of dependence of the constraints on redshift. To investigate that possibility, we bin the maps according to their redshift in the test sample and make the predictions by considering each bin separately, using the regressors trained with the noisy maps. The results presented in the left panel of Figure 7 suggest that the parameter recovery improves with decreasing redshift. While network II tends to have a slightly higher accuracy for each parameter as a function of redshift than network I, the dependence on redshift is fairly mild. This weak dependence is due to the fact that all possible neutral fraction values are present at each redshift, since we do not impose any prior knowledge on the training dataset by allowing only certain neutral fraction values for each redshift, following the current reionization history constraints.

As mentioned and seen earlier, the observed features in a 21cm map are more dependent on x_HI. To address this effect on the performance of the algorithms, we now bin the slices according to their value of x_HI. It is noticeable in the right panel of Figure 7 that the performance of each fitter on all parameters declines with increasing value of the neutral fraction. This is expected since, as previously seen in Figure 1, the noise always dominates the ionized regions.
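The binned evaluation behind Figure 7 – computing R² separately in bins of a conditioning variable such as redshift or neutral fraction – can be sketched as follows. This is our own illustrative helper, not the paper's code, and it assumes every bin is non-empty:

```python
import numpy as np

def r2_by_bin(values, y_true, y_pred, edges):
    """R^2 of the predictions evaluated separately in bins of a
    conditioning variable (e.g. redshift or x_HI). Sketch only;
    assumes each [lo, hi) bin contains at least two points."""
    scores = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (values >= lo) & (values < hi)
        ss_res = np.sum((y_true[m] - y_pred[m]) ** 2)
        ss_tot = np.sum((y_true[m] - y_true[m].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return scores
```

Applying this once per parameter and per conditioning variable yields the curves shown in the two panels of Figure 7.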
When the Universe is highly ionized, the prominent features, which are probably the recombining clumps of the remaining dense gas, can still be seen in the presence of noise. This is in contrast to the case where the Universe is highly neutral and the bubbles are small. Here, the ionized bubbles extend to much smaller scales, where the noise dominates, and hence recognition of the bubbles becomes challenging. At this limit, different realizations (with different sets of parameters) of a highly neutral Universe would look similar. This also explains the rapid increase of the accuracy of parameter recovery towards low neutral fraction values. Similar trends have recently been found when using deep learning to constrain the reionization history (e.g. Mangena et al. 2019). It is worth mentioning that this interesting dependence on the neutral fraction cannot be used in future observations, since the exact neutral fraction is not known prior to observations, although some constraints can be obtained independently from Lyα forest observations (e.g. Fan et al. 2006). However, it is beyond the scope of the current networks to use this dependence to constrain parameters. It is rather an interesting theoretical finding, and is consistent with the redshift dependence trends, since the Universe is highly ionized at low redshifts.

Having established that the constraints are tighter at lower redshift and lower neutral fraction (i.e. the ionised case), we now apply some conditions on the test sample as follows:

• select examples with x_HI < . ,
• select examples with lower neutral fraction at lower redshift, x_HI < . and z < .

We show in the top two rows of Figure 8 the resulting constraints on all parameters when considering maps with x_HI < . . The results show how the constraints greatly improve by selecting examples with lower neutral fraction from
[Figure 8: six panels of predicted vs. true values for Ω_m, h, σ_8, f_esc, C_ion and D_ion, for the x_HI cut (top two rows) and the combined x_HI and redshift cut (bottom two rows); each panel quotes the R² and RMSE for networks I and II.]

Figure 8.
Correlation between the actual and the predicted parameters using the validation sample. On the top two rows, the networks have been trained with the noisy maps, but a test sample with neutral fraction < . has been used for predictions. On the bottom two rows, the same noisy data have been used to train the networks, but cuts on both the neutral fraction ( < . ) and redshift ( z < ) have been applied to the test sample. Red and blue shaded areas encompass the 15.9% and 84.1% percentiles (i.e. the ∼1σ level) of the true values given the predictions from network I and network II, respectively. The solid black line represents the identity line, i.e. true parameters vs. true parameters. Imposing constraints on the neutral fraction and redshift of the testing sample increases the accuracy, and the performance is comparable to the case without including any instrumental effects, as seen in the top rows of Figure 6. Large fluctuations are due to low number statistics. Coloured version is available online.

the test sample. For this specific case, the high coefficient of determination values for network I and network II on all parameters indicate that the performance of the algorithms, despite considering mock maps for their training, is comparable to their performance when trained with noiseless maps.

Restricting ourselves to lower z, together with only selecting maps with low neutral fraction, in the test dataset further improves the predictions on all parameters, as indicated by the high R² for both network I and network II (see the bottom two rows of Figure 8). Inferring parameters from noisy maps at higher redshift is more challenging, since the noise is stronger (see the left panel of Figure 7). Therefore, one would expect further improvement of the predictions by combining the two criteria, x_HI < . and z < . This is an exciting result for future 21cm surveys: tighter constraints can be obtained from low redshift observations (z ∼ 6, 7), where the Universe is highly ionized. This finding is supported by the fact that the noise is higher at high redshifts, and is further confirmed by our additional tests in Appendix B. Using this technique, the SKA will be able to place its first constraints on the astrophysical and cosmological parameters in the near future, from the first cycle of imaging.
For the sake of completeness, and in order to be able to compare our results to other similar studies, we compute, for each parameter, the resulting root mean square error, rmse, as follows:
RMSE = \sqrt{ \frac{1}{N} \sum (y_{predicted} - y_{true})^2 },   (7)

where the summation runs through the whole test dataset. This metric, among others, tells us about the generalisation error inherent to our parameter estimation; in other words, the level of accuracy the fitters can achieve on average when estimating the parameters from encoding the inputs (not to be confused with accuracy, the metric used in classification tasks). We show the rmse values obtained for each parameter when considering both noiseless and noisy maps in Figure 6 (two top rows and two bottom rows, respectively). By comparing the rmse values resulting from training on clean maps and those obtained from training on mock maps, we find that in the idealised scenario the prediction is subject to a smaller average error for each parameter, in contrast to the realistic one. This finding is expected and consistent with our results based on the R² metric, in that the inference is more challenging for each parameter when the data considered for training/testing are contaminated by noise. Although the results based on the two metrics are found to be consistent, it is tempting to expect that for any two different parameters, irrespective of the case (noisy or clean maps), if the R² score of one of them is higher than that of the other, then its rmse must be lower. This trend is seen for all parameters, as quoted in the legends.

Gupta et al. (2018), by training a convolutional neural network with millions of parameters to predict cosmological parameters from simulated noiseless convergence maps, arrived at small generalisation errors for Ω_m and σ_8. Ribli et al. (2018) improved the constraints with a different neural network architecture, also using simulated lensing maps, achieving smaller rmse values for Ω_m and σ_8.
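Equation 7 is straightforward to write out; this generic sketch (the function name is ours) is what is computed per parameter over the whole test dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error (Equation 7): the average
    prediction error over the test dataset."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```

Unlike R², which is scale-free, the rmse carries the units of the parameter, so it is only directly comparable across studies when the parameter ranges are similar.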
In terms of encoding features from a 2D map using a convolutional neural network to infer cosmological parameters, our results for Ω_m and σ_8 from network I and network II are comparable to those obtained in these previous works. More importantly, our results corresponding to the realistic case, with and without imposing constraints (see the bottom rows of Figure 6, and Figure 8), where the input maps are noisy, are very promising and exciting for future 21cm surveys.

CONCLUSIONS

We have demonstrated in this work the feasibility of simultaneously inferring both astrophysical and cosmological parameters (f_esc, C_ion, D_ion, Ω_m, h, σ_8) using 21cm maps from the EoR, considering future HI surveys with the SKA. To this end, we have generated thousands of realizations, each with a different set of parameters, using simfast21, then compiled a dataset composed of 2D maps. We have considered an idealised scenario, using noiseless simulated maps, and a real world-mimicking scenario, in which the networks are trained with simulated maps contaminated by simulated SKA-like noise. We have used R² – the coefficient of determination – as a performance metric.

We summarize our findings as follows:

• The overall results for the idealised case, with high R² for both network I and network II on all parameters, suggest that the algorithms considered in this work are capable of learning the salient features from the maps in order to constrain the corresponding parameters with remarkably excellent accuracy.

• In a more realistic setup, where maps from observations are subject to noise contamination, the constraints on all parameters are slightly weaker. This is expected, since disentangling the relevant information from noise is more challenging. It has been found that network II, leveraging the combination of residual layers at the lower level and an inception module at the higher level of the architecture, outperforms network I despite the former's lower capacity.
This then points towards deploying architectures similar to network II in a real world scenario.

• In the case of learning from the noisy maps, the predictions are dependent on both the underlying neutral fraction of the map and its distance from an observer. The performance of the methods improves with decreasing neutral fraction and, as foreground contamination is more important at higher redshift, the constraints are tighter at lower z. The results obtained from the test sample show high R² with both network I and network II when only selecting maps with x_HI < . . Recovery improves further when imposing constraints on both neutral fraction and redshift (x_HI < . at lower z < ), resulting in high R² with both network I and network II. This indicates that, even in the presence of noise in the maps, our methods can still estimate the relevant parameters to an excellent level of precision, which is indeed quite promising.

• We have computed the average prediction error – rmse – on each parameter in both the optimistic and realistic cases. It has been found that the rmse is smaller in the former, in good agreement with the results obtained when using the coefficient of determination as a performance metric. Compared to other previous works, our approach has also shown great potential for inferring the underlying parameters of what is observed in future cosmological experiments, such as HI intensity mapping.

We here considered a redshift range that is consistent with an early reionization scenario, which has been increasingly favoured by galaxy-dominated models of reionization, although more recent work by Kulkarni et al. (2019) shows that galaxies can produce a low optical depth and a late reionization scenario. However, late reionization, as usually favoured by AGN-dominated scenarios, is currently disfavoured (e.g. see Qin et al. 2017; Hassan et al. 2018; Mitra et al. 2018; Parsa et al. 2018).
Regardless of the redshift range, the main result – that the accuracy increases with decreasing redshift and neutral fraction – would qualitatively remain valid if lower redshifts were included in this study, since the instrumental effects are always higher at higher redshift.

In our analysis, we have generated our training samples based on 1,000 different reionization simulations to constrain 6 parameters. For instance, Gupta et al. (2018), La Plante & Ntampaka (2019), and Gillet et al. (2019) have used 96, 1,000 and 10,000 model evaluations to constrain 2, 1 and 3 parameters, respectively. Schmit & Pritchard (2018) have further shown that 100 model evaluations are sufficient to constrain 3 parameters. In comparison to these works, the number of simulations used in this study is low, which limits the presented results. The prior range assumed in this study is also small (i.e. 0.2 - 2), which places an additional limitation on our results. Higher accuracy than reported in this study is expected with larger training samples and more model evaluations, which we will explore in future works.

Our results are entirely limited to the set of assumptions and approximations implemented in our 21cm instrument simulation. A more refined and sophisticated recipe to account for all of the implemented instrumental effects, such as the angular resolution, foreground cleaning and thermal noise, might alter our concluding remarks. The approximations and assumptions implemented in the semi-numerical simulations, through the use of the excursion set formalism to identify the ionized regions, as well as the choice of our dynamic range and resolution, place additional limitations on the presented results. While limited to the SKA, our analysis can easily be extended to include instrumental effects from other 21cm surveys, such as HERA and LOFAR; we leave a detailed comparison between different array designs and different observing strategies for future works.
Inferring parameters from the 3D light-cones might improve recovery in the presence of noise, without the need to impose constraints on the neutral fraction or redshift. Our analysis can also easily be extended to include all of the astrophysical parameters from the source and sink models, and all cosmological parameters, which we leave for future works.

This study has not only highlighted the constraining power of our methods, probing deep into the EoR in the near future with the arrival of more advanced HI instruments like the SKA, but also shown how future 21cm surveys and HI intensity mapping can help break the degeneracy between models by combining them with other experiments, such as Planck, to improve the constraints on cosmological parameters in an era of precision cosmology.
ACKNOWLEDGEMENTS
The authors acknowledge helpful discussions with Laura Boucheron, Kristian Finlator, Jon Holtzman, Tumelo Mangena, James McAteer, Mario Santos, and Rene Walterbos. We particularly thank the anonymous referee for their constructive comments, which have improved the paper quality significantly. Simulations and analysis were performed at UWC's Pumbaa and IDIA/Ilifu cloud computing facilities, and NMSU's DISCOVERY supercomputers. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562, and computational resources (Bridges) provided through the allocation AST190003P. SA acknowledges financial support from the South African Radio Astronomy Observatory (SARAO) and is grateful to Sean February and Martin Slabber from the Science Data Processing (SDP) team at SARAO for their strong technical support with regards to the computing resources. CCD thanks the LSSTC Data Science Fellowship Program, which is funded by LSSTC, NSF Cybertraining Grant
REFERENCES
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, http://tensorflow.org/
Aghanim N., et al., 2018, arXiv preprint arXiv:1807.06209
Barkana R., Loeb A., 2008, MNRAS, 384, 1069
Battye R. A., Weller J., 2003, Phys. Rev. D, 68, 083506
Becker G. D., Bolton J. S., 2013, MNRAS, 436, 1023
Bharadwaj S., Pandey S. K., 2005, MNRAS, 358, 968
Bowman J. D., et al., 2013, Publ. Astron. Soc. Australia, 30, e031
Cardone V. F., Dainotti M. G., Capozziello S., Willingale R., 2010, MNRAS, 408, 1181
Chardin J., Uhlrich G., Aubert D., Deparis N., Gillet N., Ocvirk P., Lewis J., 2019, MNRAS, 490, 1055
Contaldi C. R., Hoekstra H., Lewis A., 2003, Phys. Rev. Lett., 90, 221303
Davé R., Katz N., Oppenheimer B. D., Kollmeier J. A., Weinberg D. H., 2013, MNRAS, 434, 2645
DeBoer D. R., et al., 2017, PASP, 129, 045001
Fan X., Carilli C. L., Keating B., 2006, ARA&A, 44, 415
Finlator K., Davé R., Özel F., 2011, ApJ, 743, 169
Finlator K., Thompson R., Huang S., Davé R., Zackrisson E., Oppenheimer B. D., 2015a, MNRAS, 447, 2526
Finlator K., Thompson R., Huang S., Davé R., Zackrisson E., Oppenheimer B. D., 2015b, MNRAS, 447, 2526
Furlanetto S. R., Oh S. P., Briggs F. H., 2006, Phys. Rep., 433, 181
Gillet N., Mesinger A., Greig B., Liu A., Ucci G., 2019, MNRAS, 484, 282
Giri S. K., Mellema G., Ghara R., 2018, MNRAS, 479, 5596
Glorot X., Bengio Y., 2010, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp 249-256
Greig B., Mesinger A., 2015, MNRAS, 449, 4246
Gupta A., Matilla J. M. Z., Hsu D., Haiman Z., 2018, Phys. Rev. D, 97, 103515
Hassan S., Davé R., Finlator K., Santos M. G., 2016, MNRAS, 457, 1550
Hassan S., Davé R., Finlator K., Santos M. G., 2017, MNRAS, 468, 122
Hassan S., Davé R., Mitra S., Finlator K., Ciardi B., Santos M. G., 2018, MNRAS, 473, 227
Hassan S., Liu A., Kohn S., La Plante P., 2019, MNRAS, 483, 2524
He K., Zhang X., Ren S., Sun J., 2016, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 770-778
Hinshaw G., et al., 2013, ApJS, 208, 19
Iliev I. T., Mellema G., Ahn K., Shapiro P. R., Mao Y., Pen U.-L., 2014, MNRAS, 439, 725
Ioffe S., Szegedy C., 2015, arXiv preprint arXiv:1502.03167
Kakiichi K., et al., 2017, MNRAS, 471, 1936
Kulkarni G., Keating L. C., Haehnelt M. G., Bosman S. E. I., Puchwein E., Chardin J., Aubert D., 2019, MNRAS, 485, L24
La Plante P., Ntampaka M., 2019, ApJ, 880, 110
Leitet E., Bergvall N., Hayes M., Linné S., Zackrisson E., 2013, A&A, 553, A106
Li X., Chen S., Hu X., Yang J., 2018, arXiv e-prints, p. arXiv:1801.05134
Li Y., Fan X., Gou L., 2019, ApJ, 873, 37
Liu A., Parsons A. R., Trott C. M., 2014, Phys. Rev. D, 90, 023018
Liu A., Pritchard J. R., Allison R., Parsons A. R., Seljak U., Sherwin B. D., 2016, Phys. Rev. D, 93, 043013
Loeb A., Barkana R., 2001, ARA&A, 39, 19
Majumdar S., Pritchard J. R., Mondal R., Watkinson C. A., Bharadwaj S., Mellema G., 2018, MNRAS, 476, 4007
Mangena T., Hassan S., Santos M. G., 2019, submitted to MNRAS
Mellema G., et al., 2013, Experimental Astronomy, 36, 235
Mitra S., Choudhury T. R., Ferrara A., 2015, MNRAS, 454, L76
Mitra S., Choudhury T. R., Ferrara A., 2018, MNRAS, 473, 1416
Molaro M., Davé R., Hassan S., Santos M. G., Finlator K., 2019, arXiv e-prints, p. arXiv:1901.03340
Moscardini L., Matarrese S., Mo H., 2001, MNRAS, 327, 422
Ntampaka M., Trac H., Sutherland D. J., Battaglia N., Póczos B., Schneider J., 2015, ApJ, 803, 50
Paciga G., et al., 2011, MNRAS, 413, 1174
Padmanabhan N., et al., 2007, MNRAS, 378, 852
Paranjape A., Choudhury T. R., Padmanabhan H., 2016, MNRAS, 460, 1801
Park J., Mesinger A., Greig B., Gillet N., 2019, MNRAS, 484, 933
Parsa S., Dunlop J. S., McLure R. J., 2018, MNRAS, 474, 2904
Parsons A. R., et al., 2010, AJ, 139, 1468
Paszke A., et al., 2019, in Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., eds, Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp 8024-8035
Phillips J., Weinberg D. H., Croft R. A. C., Hernquist L., Katz N., Pettini M., 2001, ApJ, 560, 15
Planck Collaboration et al., 2016, A&A, 594, A13
Pober J. C., Greig B., Mesinger A., 2016, MNRAS, 463, L56
Qin Y., et al., 2017, MNRAS, 472, 2009
Ribli D., Pataki B. Á., Csabai I., 2018, arXiv preprint arXiv:1806.05995
Santos M. G., Amblard A., Pritchard J., Trac H., Cen R., Cooray A., 2008, ApJ, 689, 1
Santos M., Ferramacho L., Silva M., Amblard A., Cooray A., 2010, MNRAS, 406, 2421
Schmit C. J., Pritchard J. R., 2018, MNRAS, 475, 1213
Simonyan K., Zisserman A., 2014, arXiv preprint arXiv:1409.1556
Szegedy C., et al., 2015, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 1-9
Watkinson C. A., Pritchard J. R., 2015, MNRAS, 454, 1416
Zahn O., Lidz A., McQuinn M., Dutta S., Hernquist L., Zaldarriaga M., Furlanetto S. R., 2007, ApJ, 654, 12
Zaldarriaga M., Furlanetto S. R., Hernquist L., 2004, ApJ, 608, 622
Zel'dovich Y. B., 1970, A&A, 5, 84
van Haarlem M. P., et al., 2013, A&A, 556, A2
APPENDIX A: THE LOSS FUNCTION EVOLUTION DURING TRAINING
Figure A1 shows the loss evolution of networks I (left) and II (right) for the training (blue) and validation (red) samples as a function of training epoch, for the case of training on the noisy data set. In both cases, the loss decreases as training progresses, which indicates a reduction in the error rate as the predictions approach the target labels. The fluctuations in the training and validation curves are due to the random selection of batches during training. Regardless of these fluctuations, the validation loss converges and stays constant on average, which indicates that the networks are not overfitting. It is worth noting that the sudden drop in the training/validation error while training network II occurs because the learning rate is updated to 10% of its initial value in order to escape the plateau.

APPENDIX B: REDSHIFT EVOLUTION
Our main result, which suggests that the accuracy increases with decreasing redshift, was derived from a model trained on mixed maps from all redshifts. To confirm whether the accuracy increases towards low redshift, we here perform additional training runs in which the training sample is restricted to maps only from the minimum (z=7) or maximum (z=10) redshift considered in this study (referred to as "only"). We also compare with predictions at these redshifts from training with the whole dataset, including all other redshifts (referred to as "whole"), for the case of noisy maps, as reported in Table B1. In all cases, we find that the accuracy at z=7 is always higher than that at z=10. This shows that, regardless of whether we train with the whole set of mixed maps or with maps at a single redshift, the qualitative trend that the accuracy increases towards low redshift is still seen, as summarized in Table B1. In addition, the quantitative results are also similar, with a minimal difference of about ≲ 2% in accuracy for some parameters, for training with maps at individual redshifts ("only") versus those derived using a model trained on mixed maps ("whole"). Such a minimal difference is expected due to the different number of samples used in the "only" versus "whole" tests. This shows that our networks successfully recover the same qualitative and quantitative results without explicitly including the redshift information as an input to the network (e.g. fitting parameters to four maps from the four different redshifts, z = 7-10).

Figure A1.
Left panel: progression of the training of network I, where the loss function (rms) varies as a function of training epoch. Right panel: progression of the training of network II, showing the loss function (ℓ norm) as a function of the number of epochs. A coloured version is available online. Both panels correspond to training on the noisy data.

Table B1.
Comparison of the networks' accuracy between training only with the dataset from z=7 or z=10 (referred to as "only") versus predicting at these redshifts from training with the whole dataset (including all other redshifts, referred to as "whole"), for the case of noisy maps. For all parameters with all networks, the accuracy increases towards low redshift.

[Table B1: numerical values are not recoverable from the source. Rows: Ω_m, h, σ_8, f_esc, C_ion, D_ion. Columns: Network I and Network II, each with z = 7 (only), z = 7 (whole), z = 10 (only), z = 10 (whole).]
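The per-parameter accuracies compared in Table B1 are coefficients of determination (R²) between predicted and true parameter values, computed separately for the maps at each redshift. The bookkeeping can be sketched as below; the arrays and noise levels are randomly generated stand-ins (not the paper's predictions), constructed only to mimic the reported trend of higher accuracy at z=7 than at z=10:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 for a single parameter."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
params = ["Om", "h", "sigma8", "fesc", "Cion", "Dion"]
n = 500  # stand-in number of test maps per redshift

# hypothetical "true" labels and noisy "predictions"; the scatter is
# smaller at z=7 than at z=10 to mimic the trend in Table B1
truth = rng.uniform(0.2, 2.0, size=(n, len(params)))
pred_z7 = truth + rng.normal(0.0, 0.05, size=truth.shape)
pred_z10 = truth + rng.normal(0.0, 0.15, size=truth.shape)

# one R^2 per parameter per redshift, as tabulated in Table B1
scores = {z: [r2_score(truth[:, i], p[:, i]) for i in range(len(params))]
          for z, p in [("z=7", pred_z7), ("z=10", pred_z10)]}
```

With this layout, assembling a Table-B1-style comparison is just a matter of evaluating `r2_score` on each (network, redshift-subset) combination of predictions.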