PkANN - II. A non-linear matter power spectrum interpolator developed using artificial neural networks
Shankar Agarwal, Filipe B. Abdalla, Hume A. Feldman, Ofer Lahav, Shaun A. Thomas
Mon. Not. R. Astron. Soc. 000, 000–000 (0000) Printed 20 August 2018 (MN LaTeX style file v2.2)
Shankar Agarwal†, Filipe B. Abdalla§, Hume A. Feldman‡, Ofer Lahav♭ & Shaun A. Thomas⋆
CNRS, Laboratoire Univers et Théories (LUTh), UMR 8102 CNRS, Observatoire de Paris, Université Paris Diderot, 5 Place Jules Janssen, 92190 Meudon, France.
Department of Physics & Astronomy, University of Kansas, Lawrence, KS 66045, USA.
Department of Physics & Astronomy, University College London, Gower Street, London, WC1E 6BT, UK.
emails: † [email protected], § [email protected], ‡ [email protected], ♭ [email protected], ⋆ [email protected]

ABSTRACT
In this paper we introduce PkANN, a freely available software package for interpolating the non-linear matter power spectrum, constructed using Artificial Neural Networks (ANNs). Previously, using halofit to calculate the matter power spectrum, we demonstrated that ANNs can make extremely quick and accurate predictions of the power spectrum. Now, using a suite of 6380 N-body simulations spanning 580 cosmologies, we train ANNs to predict the power spectrum over the cosmological parameter space spanning the 3σ confidence level (CL) around the concordance cosmology. When presented with a set of cosmological parameters (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν and redshift z), the trained ANN interpolates the power spectrum for z ≤ 2 at scales k ≤ 0.9 h Mpc⁻¹. PkANN is faster than computationally expensive N-body simulations, yet provides a worst-case error below 1 per cent with respect to its training spectra. The overall accuracy of PkANN is set by the accuracy of our N-body simulations, at the 5 per cent level for cosmological models with Σm_ν < 0.5 eV at all redshifts z ≤ 2. For models with Σm_ν > 0.5 eV, the predictions are expected to be at the 5 (10) per cent level for redshifts z > 1 (z ≤ 1). The PkANN interpolator may be freely downloaded from http://zuserver2.star.ucl.ac.uk/~fba/PkANN.

Key words: methods: N-body simulations – cosmological parameters – cosmology: theory – large-scale structure of Universe.
1 INTRODUCTION

With the upcoming surveys promising to breach the per cent level of precision, any efforts to further improve the constraints on cosmological parameters will be predominantly theory limited. The Baryon Oscillation Spectroscopic Survey (BOSS; Eisenstein et al. 2011) aims to determine the angular diameter distance with a precision of 1 per cent at redshifts z = 0.35 and z = 0.55, and the cosmic expansion rate H(z) with 1-2 per cent precision at the same redshifts. The DES (The Dark Energy Survey Collaboration 2005) will probe the nature of dark energy through both the growth of structure in the universe as a function of time and the dependence of distances on the expansion rate. The Dark Energy Spectroscopic Instrument (DESI; Levi et al. 2013), through redshift measurements of millions of galaxies and quasars, will enable baryon acoustic oscillation (BAO) and redshift space distortion measurements. The Large Synoptic Survey Telescope (LSST; Ivezic et al. 2008) will measure the comoving distance in the redshift range z = 0.3-3.0. Current observations find w = -1.13, with no evidence for dynamical dark energy. This is consistent with a cosmological constant (w = -1) dominated flat universe. In order to distinguish between various models of dark energy, such as those with w ≠ -1, per cent level predictions of the non-linear matter power spectrum are required.

Massive neutrinos free-stream out of density perturbations, thereby suppressing the growth of cosmological structure. Accurate measurements of the matter power spectrum offer a powerful tool to constrain the absolute mass-scale of neutrinos, and complement the oscillation experiments which, being sensitive to the mass squared differences between the neutrino eigenstates, only provide a lower bound on the total neutrino mass. Specifically, mass splittings of |Δm²₃₂| = (2.43 ± 0.13) × 10⁻³ eV² and Δm²₂₁ = (7.59 ± 0.21) × 10⁻⁵ eV² (Adamson et al. 2008; KamLAND 2008) imply a lower limit for the sum of the neutrino masses of 0.06 eV and 0.1 eV for the normal and inverted hierarchies, respectively. Assuming a (Σm_ν = 0.06 eV) normal hierarchy for the neutrino masses, the Planck survey finds Σm_ν < 0.23 eV (95% CL). The WMAP 9-year analysis (Hinshaw et al. 2013) finds Σm_ν < 0.44 eV (95% CL). Lahav et al. (2010) obtained an upper limit of 0.11 eV (95% CL). Numerical studies of the scale-dependent suppression of the matter power spectrum have been performed by various groups: Brandbyge & Hannestad (2010); Viel, Haehnelt & Springel (2010); Agarwal & Feldman (2011); Bird, Viel & Haehnelt (2012); Wagner, Verde & Jimenez (2012). Agarwal & Feldman (2011) (hereafter, Paper I) and Wagner, Verde & Jimenez (2012) show that resolving the neutrino mass hierarchy may require the power spectrum to be measured at sub-per cent precision.

The non-linear matter power spectrum can be estimated using (i) fitting formulae such as halofit (Smith et al. 2003); (ii) higher order perturbation theory (PT, e.g. Saito et al. 2008; Nishimichi et al. 2009; Saito, Takada & Taruya 2009; Upadhye et al. 2013); (iii) N-body simulations (e.g. enzo, O'Shea et al. 2010, and gadget, Springel 2005); and (iv) power spectrum interpolators (e.g. Heitmann et al. 2006; Habib et al. 2007; Lawrence et al. 2010; Heitmann et al. 2013). While halofit performs well on large scales (k ≲ 0.1 h Mpc⁻¹), its performance degrades rapidly on smaller scales. Takahashi et al. (2012) re-calibrated the original halofit (Smith et al. 2003), extending it to include dark energy models with constant equation of state w ≠ -1. The accuracy of the halofit predictions is model dependent and may be as low as 5-10 per cent at k ∼ 1 h Mpc⁻¹ (Takahashi et al. 2012; Heitmann et al. 2013). Likewise, PT improves upon linear theory predictions on large scales but fails on smaller (k ≳ 0.1 h Mpc⁻¹) scales. At higher redshifts, when the perturbations are less evolved, the accuracy of both halofit and PT improves. However, since dark energy is a late-time phenomenon (z ≲ 1), one cannot avoid the inaccuracies of halofit and PT at low redshifts if one aims to develop a theoretical framework capable of predicting the non-linear matter power spectrum at per cent level. This leaves N-body simulations as the only method capable of controlling the accuracy levels as desired. Heitmann et al. (2010) show that gravity-only simulations can be used to calculate the matter power spectrum at sub-per cent accuracy up to k ≲ 1 h Mpc⁻¹. On smaller scales, baryonic physics affects the power spectrum and needs to be included in numerical simulations to maintain per cent accuracy.

A typical high-resolution dark-matter only simulation intended to probe k ≲ 1 h Mpc⁻¹ scales can cost thousands of CPU-hours. Including hydrodynamics in simulations to probe smaller scales can take prohibitively long, especially when running multiple simulations spread across the cosmological parameter space. As discussed earlier in Heitmann et al. (2006) and Habib et al. (2007), parameter estimation and model building typically involve sampling the parameter space and evaluating the power spectrum for each cosmology. As we mentioned in Agarwal et al. (2012) (hereafter, Paper II), given the multi-dimensionality of the cosmological parameter space, a brute force application of N-body simulations is beyond our current state of the art computing capabilities.

A novel alternative to running numerical simulations to determine the non-linear response from varying parameter settings is to use machine-learning techniques. Machine-learning has found use in a variety of applications such as brain-machine interfaces (Jenatton et al. 2011; Pedregosa et al. 2012), analyses of the stock market (Ghosh 2011; Hurwitz & Marwala 2012), fitting of cosmological functions (Auld et al. 2007; Fendt & Wandelt 2007; Auld, Bridges & Hobson 2008), and estimating photometric redshifts (Collister & Lahav 2004).

Using machine-learning in the form of Gaussian processes, Heitmann et al. (2009), Lawrence et al. (2010) and Heitmann et al. (2013) have developed a matter power spectrum calculator – the cosmic emulator – that is an order of magnitude improvement over the popular halofit prescription. The cosmic emulator, based on gravity-only N-body simulations, comes in two versions: h-fixed (Lawrence et al. 2010) and h-free (Heitmann et al. 2013). The h-fixed version computes the Hubble parameter h using the CMB constraint on the acoustic scale and predicts the non-linear matter power spectrum up to z ≤ 1 and k ≲ 1 h Mpc⁻¹. The h-free version has h as a free parameter that can be set by the user, and is valid over a wider range of redshifts and scales.
Both versions are restricted to cosmological models with massless neutrinos. Since the current understanding is that at least two neutrino eigenstates have non-zero masses, it is reasonable to develop a power spectrum interpolator that is suitable for cosmological models with and without massive neutrinos.

In Paper II, we developed the formalism for estimating the non-linear matter power spectrum using ANNs. Using halofit spectra as mock N-body spectra, we showed that the ANN formalism enables a remarkable fit with a manageable number of simulations. In this paper, we use a suite of 6380 N-body simulations spanning 580 cosmologies around the WMAP 7-year central values, and train ANNs to predict the power spectrum accurate at the 5-10 per cent level for k ≤ 0.9 h Mpc⁻¹ up to redshifts z ≤ 2. The PkANN package, along with instructions for use, is available at http://zuserver2.star.ucl.ac.uk/~fba/PkANN.

We trained PkANN for a range of cosmologies including w ≠ -1 and Σm_ν ≠ 0. However, the training can be easily extended to include other parameters such as time-varying dark energy and modified gravity, as well as probing small-scale baryonic effects. This will require (i) running a few N-body simulations around the cosmological parameter(s) being probed; (ii) calculating the matter power spectra from the numerical simulations; (iii) randomly dividing these power spectra into two sets, namely the training and validation sets (explained in Paper II, and here in Appendix A.1); and (iv) training PkANN using the training and validation sets. Once training is over, the trained network can be used to predict the matter power spectrum at new parameter settings.

The outline of this paper is as follows. We discuss our numerical simulations in Section 2. We develop the PkANN interpolator in Section 3. We present our results in Section 4, starting with the performance of the PkANN interpolator against spectra computed using N-body simulations (Section 4.1). The estimate of errors in PkANN's predictions is summarized in Section 4.2. In Section 4.3, we use PkANN to study the response of the matter power spectrum to variations in cosmological parameters. PkANN's performance is compared with the h-fixed cosmic emulator as well. We conclude in Section 5. In the Appendix, we detail the formulae used in developing PkANN.

2 NUMERICAL SIMULATIONS

We run N-body simulations over a range of cosmological parameters with the publicly available adaptive mesh refinement (AMR), grid-based hybrid (hydro+gravity) code enzo (Norman et al. 2007; O'Shea et al. 2010). All our simulations are hydro+gravity and run in unigrid (AMR switched off) mode. For the hydrodynamical simulations, we include radiative cooling of baryons using an analytical approximation (Sarazin & White 1987) for a fully ionized gas with fixed metallicity. The cooling approximation is valid over a wide temperature range; below 10⁴ K, the cooling rate is effectively zero. We do not account for metal-line cooling, supernova (SN) feedback or active galactic nucleus (AGN) feedback. The parameters we consider are I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν), where h, Ω_m, Ω_b, n_s, w, σ_8 and Σm_ν are the present-day normalized Hubble parameter in units of 100 km s⁻¹ Mpc⁻¹, the present-day matter and baryonic normalized energy densities, the primordial spectral index, the constant equation of state parameter for dark energy, the amplitude of fluctuations on an 8 h⁻¹ Mpc scale and the total neutrino mass, respectively. The limits (see Table 1) on this six-dimensional parameter space include the WMAP 7-year+BAO+H₀ (Komatsu et al. 2011) constraints.

For details on generating the initial conditions for the simulations, and on treating massive neutrinos, refer to Paper I. Our N-body simulations do not explicitly account for the presence of neutrino perturbations and implement neutrinos only through their effects on the background evolution. Specifically, we modified the cosmological routines of the enzo code to include the effects of massive neutrinos on the homogeneous Hubble expansion h(a) (for details, see Paper II) and the linear growth factor.
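To make the background-evolution modification concrete, the sketch below treats the total neutrino mass as an extra non-relativistic matter component in the Friedmann equation of a flat universe with constant w. This is an illustrative stand-in for the actual enzo routine changes (detailed in Paper II), not the code itself; the conversion Ω_ν h² = Σm_ν / 93.14 eV is the standard late-time value, and the function names are our own:

```python
import math

def omega_nu(sum_mnu_eV, h):
    """Present-day neutrino density parameter for non-relativistic
    neutrinos: Omega_nu h^2 = sum(m_nu) / 93.14 eV."""
    return sum_mnu_eV / (93.14 * h**2)

def hubble_E(a, omega_m, omega_de, w):
    """Dimensionless Hubble rate E(a) = H(a)/H0 for a flat cosmology.
    Omega_m already includes CDM, baryons and (non-relativistic)
    neutrinos; dark energy has constant equation of state w."""
    return math.sqrt(omega_m * a**-3 + omega_de * a**(-3.0 * (1.0 + w)))

# Example: a flat model with 0.3 eV of total neutrino mass; the
# CDM+baryon density of 0.25 is an arbitrary illustrative value.
h = 0.70
om_nu = omega_nu(0.3, h)          # neutrinos contribute to Omega_m
om_m = 0.25 + om_nu
E_today = hubble_E(1.0, om_m, 1.0 - om_m, -1.0)   # equals 1 by construction
```

The linear growth factor is then obtained by integrating over this modified E(a); as noted above, our modification neglects the scale-dependence of growth induced by neutrino free-streaming.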
Our modifications to the growth factor neglect any scale-dependence in the presence of massive neutrinos. We sample 70 (Σm_ν = 0) + 130 (Σm_ν ≠

(The enzo code is available at http://lca.ucsd.edu/projects/enzo)

Figure 1. Top panel: Matter power spectrum evaluated at redshifts z = 0, 1 and 2 for one of the cosmological models in the training set: (i) P^c_nl (long-dashed), (ii) P^b_nl (short-dashed) and (iii) P_nl (solid). The linear matter power spectrum is shown by the dot-dashed line. P_nl is constructed using P^c_nl and P^b_nl, as discussed in the text (see Eqs 1 and 2). Lower panels: The ratio of the non-linear spectra (P^c_nl, P^b_nl and P_nl) to the CDM spectrum P^c_nl.
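The combination of the component spectra into P_nl described in the caption (Eqs 1 and 2 of the text) amounts to a weighted sum of square-root amplitudes. A minimal sketch, assuming the spectra are numpy arrays sampled on a common k-grid (the array and function names are illustrative):

```python
import numpy as np

def combine_spectra(P_c_nl, P_b_nl, P_nu_lin, f_c, f_b, f_nu):
    """Combine CDM, baryon and (linear) neutrino power spectra into the
    total non-linear matter power spectrum, following Eqs 1 and 2:
      P_cb = (f_c + f_b)^-2 [f_c sqrt(P_c) + f_b sqrt(P_b)]^2
      P_nl = [(f_c + f_b) sqrt(P_cb) + f_nu sqrt(P_nu_lin)]^2
    The weights f_i = Omega_i / Omega_m must sum to 1."""
    P_cb = (f_c * np.sqrt(P_c_nl) + f_b * np.sqrt(P_b_nl))**2 / (f_c + f_b)**2
    return ((f_c + f_b) * np.sqrt(P_cb) + f_nu * np.sqrt(P_nu_lin))**2

# Sanity check: with f_nu = 0 and identical CDM and baryon spectra,
# P_nl reduces to that common spectrum.
P = np.array([100.0, 10.0, 1.0])
P_tot = combine_spectra(P, P, np.zeros(3), f_c=0.8, f_b=0.2, f_nu=0.0)
```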
0) = 200 (training set), 18 + 32 = 50 (validation set) and 150 + 180 = 330 (testing set) cosmologies from the parameter space (see Table 1) using an improved Latin hypercube technique (for details, see Paper II). The training set guides the neural network training, the validation set prevents the ANN from overfitting to the training set, and the testing set is used to evaluate the performance of the trained network. The testing set has no effect on training and provides an independent measure of network performance. For each cosmology I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν), we compute the Hubble parameter h using the WMAP 7-year+BAO constraint on the acoustic scale πd_ls/r_s = 302.54, where d_ls is the distance to the surface of last scattering and r_s is the comoving size of the sound horizon at the redshift of last scattering. The procedure to compute h is outlined in Paper II. This h value, together with the chosen Ω_m h² and Ω_b h², is used to derive Ω_m and Ω_b. The present-day normalized energy density of dark energy is fixed as Ω_de = 1 - Ω_m. Starting at redshift z = 99, all simulations are run in a comoving box of length 200 h⁻¹ Mpc, with 256³ cold dark matter (CDM) particles evolved on a 512³ grid. We take 111 snapshots of the CDM and baryon positions between redshifts z = 2 and z = 0; specifically, 100 snapshots (Δz = 0.01 apart) between 0 ≤ z ≤ 0.99, and 11 snapshots (Δz = 0.1 apart) between 1 ≤ z ≤ 2.

We transform the CDM and baryon positions into their respective mass density fields. The densities are Fast Fourier Transformed to obtain the CDM and baryon non-linear power spectra, P^c_nl and P^b_nl, respectively. Together with the neutrino linear spectrum P^ν_lin, and the weights f_i ≡ Ω_i/Ω_m, the non-linear matter power spectrum P_nl is then calculated (for details, see Paper I) as

P_nl(k) = [(f_c + f_b) √P^cb_nl(k) + f_ν √P^ν_lin(k)]²,  (1)

where

P^cb_nl(k) = (f_c + f_b)⁻² [f_c √P^c_nl(k) + f_b √P^b_nl(k)]².  (2)

The subscripts 'lin' and 'nl' indicate quantities in the linear and non-linear regimes, respectively. Throughout our analyses, we work with flat cosmological models: Ω_m (= Ω_b + Ω_c + Ω_ν) + Ω_de = 1, where Ω_c and Ω_ν are the present-day normalized energy densities of CDM and neutrinos, respectively. To suppress statistical scatter in the matter power spectrum, we average the power spectra over 11 realizations per cosmology. In Fig. 1, we show the P^c_nl, P^b_nl and P_nl spectra (long-dashed, short-dashed and solid lines, respectively) for one of the cosmological models in the training set. Baryons suppress the CDM spectrum at the per cent level on intermediate scales, with the effect growing on smaller scales (k ≳ 1 h Mpc⁻¹). This is consistent with previous studies (Rudd, Zentner & Kravtsov 2008; Casarini et al. 2011) that investigated the effect of baryonic physics on the matter power spectrum through simulations including gas cooling, star formation and SN feedback. We note that although all our simulations in this work are hydro+gravity, on large scales (k ≲ 1 h Mpc⁻¹) the matter power spectrum is minimally affected by baryonic dynamics and one can rely on gravity-only simulations.

We use the one-loop standard PT as implemented by Saito et al. (2008) to estimate the matter power spectrum on quasi-linear scales and stitch it with the non-linear power spectrum from the numerical simulations. Finally, the stitched spectrum is sampled at 50 k-values up to k = 0.9 h Mpc⁻¹. The stitched-and-sampled non-linear power spectrum is used as P_nl(k, z) for ANN training. This stitch-and-sample procedure is repeated for each cosmology I in the training set to complete the training set P_nl(k, z|I).

3 PKANN INTERPOLATOR

Fig. 2 shows a skeleton of a machine-learning network. Using a suitable training set (input parameters for which data are available), the machine-learning algorithm is trained to learn a parameterization. With this parameterization the network is capable of reproducing (as closely as possible) the output,
Figure 2.
Steps 1 and 2: A machine-learning network learns to parameterize the output for the input patterns that form the training set. Steps 3 and 4: The trained network is capable of making predictions when presented with input parameter settings. The queried input settings must lie within the parameter ranges of the patterns in the training set.

when queried with input parameter settings that are part of the training set. The trained network can now be presented with new settings of the input parameters (for which one does not have any prior data) and, by using the same parameterization learnt during the training process, the network makes predictions.

An ANN – a form of machine-learning – is a collection of nodes arranged in a series of layers, with each node in a layer connected to all other nodes in adjacent layers. A network's architecture is specified by the number of nodes from input to output as N_in : N_1 : N_2 : ... : N_n : N_out. That is, a network with an architecture 4 : 9 : 5 : 7 has 4 inputs, two hidden layers with 9 and 5 nodes respectively, and finally 7 outputs. An extra node (called the bias node) is added to the input layer as well as to each of the hidden layers. The bias nodes are added in order to compensate for the difference between the network's mean prediction and the mean of the outputs of the training set patterns (for details, refer to Bishop 1995). Each bias node connects to all the nodes in the next layer. Note that the counts N_in, N_1, N_2, ..., N_n do not include the bias nodes. The output layer has no bias node. The total number of connections (also called the weights) N_W for a generic architecture N_in : N_1 : N_2 : ... : N_n : N_out can be calculated using the formula

N_W = N_in · N_1 + Σ_{l=2}^{n} N_{l-1} · N_l + N_n · N_out + Σ_{l=1}^{n} N_l + N_out,  (3)

where the summation index l runs over the hidden layers only. Throughout this paper, we use the vector notation w to refer collectively to all the network weights.

In Fig. 3, we show a typical ANN architecture (left-hand panel) and the formulae to calculate the node activations (right-hand panels). In the network configuration depicted, there are N_in input parameters/features (x_1, ..., x_i), a single hidden layer with N_1 nodes (z_1, ..., z_j), and N_out output parameters/features (y_1, ..., y_k). The bias nodes in the input and hidden layers are x_0 and z_0, respectively. Each node in the l-th hidden layer is a neuron with an activation, z_j ≡ g(a_j), taking as its argument

a_j = Σ_{i=0} w_ji z_i,  (4)

where the sum is over all nodes i (including the bias node) of the previous layer sending connections to the j-th node

Table 1. Parameter space used in generating the ANN training and validation sets. The last column shows the corresponding WMAP 7-year+BAO+H₀ constraints at 68 per cent CL. Inside parentheses is the range for the ANN testing set, which is designed to avoid the boundaries of the parameter space. The neutrino mass being physically bounded (Σm_ν ≥ 0), its lower limit is not shrunk for the testing set.

Parameter    Range (testing set)                   WMAP 7-year+BAO+H₀ᵃ
Ω_m h²       …                                     …
Ω_b h²       …                                     …
n_s          …                                     …
w            -1.35 (-1.15) to -0.65 (-0.85)        -1.1 ± 0.14
σ_8          …                                     …
Σm_ν (eV)    0 (0) to 1.1 (0.5)                    < 0.58ᵇ

Note. ᵃ Komatsu et al. (2011); ᵇ 95 per cent CL for w = -1.

(barring the bias node) of the current layer. Note that for networks with a single hidden layer (as in Fig. 3), z_i in Eq. 4 would correspond to the input parameters x_i. The activation functions are typically taken to be sigmoid functions such as g(a_j) = 1/[1 + exp(-a_j)]. Since the range of g(a_j) is from 0 to 1, it allows the output of the neurons to be interpreted as the probability that any specific neuron will 'fire' when presented with an input parameter setting. The sigmoid functions impart some degree of non-linearity to the neural network models. A network becomes overly non-linear if the weights w deviate significantly from zero, which drives the activation g(a_j) of the nodes to saturation. The number and size of the hidden layers add to the complexity of ANNs. The activation of all bias nodes is permanently set to a value of 1, and during network training the bias parameters (namely, w_j0 and w_k0 in Fig. 3, left-hand panel) are adjusted so as to minimize the difference between the mean prediction of the network and the mean of the outputs of the training set patterns.

The activation y_k ≡ g̃(a_k) for neurons in the output layer is usually taken to be a_k itself, i.e. g̃(a_k) = a_k, with a_k being the weighted sum of all nodes in the final hidden layer,

a_k = Σ_{j=0} w_kj z_j.  (5)

For a particular input vector (x_1, ..., x_i), the output vector (y_1, ..., y_k) of the network is determined by progressing sequentially through the network layers, from inputs to outputs, calculating the activation of each node. Adjusting the weights w to get the desired mapping is called the training of the network. For matter power spectrum estimation, we use a training set of N-body simulations with known cosmological parameters: I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν).
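The forward pass of Eqs 4 and 5 (sigmoid hidden activations with a bias node, linear outputs), together with the weight count of Eq. 3, can be sketched as follows for a single-hidden-layer network of the kind used here. The random weight values are purely illustrative; a real network would use trained weights:

```python
import numpy as np

def n_weights(n_in, hidden, n_out):
    """Total number of weights (Eq. 3) for architecture n_in : hidden... : n_out,
    counting one bias node in the input layer and in each hidden layer."""
    layers = [n_in] + list(hidden) + [n_out]
    # connections between consecutive layers, plus bias connections
    # into every hidden layer and into the output layer
    return sum(a * b for a, b in zip(layers[:-1], layers[1:])) + sum(hidden) + n_out

def forward(x, W1, W2):
    """Single-hidden-layer forward pass: z_j = g(a_j) with sigmoid g (Eq. 4),
    linear output activations y_k = a_k (Eq. 5). W1 and W2 carry the bias
    weights in their first column; the bias activation is fixed at 1."""
    a_hidden = W1 @ np.concatenate(([1.0], x))   # a_j = sum_i w_ji x_i
    z = 1.0 / (1.0 + np.exp(-a_hidden))          # g(a_j) = 1/[1 + exp(-a_j)]
    return W2 @ np.concatenate(([1.0], z))       # a_k = sum_j w_kj z_j

# A PkANN-like shape: 7 inputs, 70 hidden nodes, 50 outputs.
assert n_weights(7, [70], 50) == 7*70 + 70*50 + 70 + 50   # 4110 weights
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(70, 8))    # 7 inputs + 1 bias
W2 = rng.normal(scale=0.1, size=(50, 71))   # 70 hidden + 1 bias
y = forward(rng.normal(size=7), W1, W2)     # 50 outputs
```

For the two-hidden-layer example in the text, 4 : 9 : 5 : 7, Eq. 3 gives N_W = 4·9 + 9·5 + 5·7 + (9 + 5) + 7 = 137.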
It has been shown (see Hornik 1991; Ito 1991; Bishop 1995) that networks with a single hidden layer are capable of making arbitrarily accurate approximations to a function and its derivatives. As such, for PkANN's architecture we only consider networks having a single hidden layer with sigmoidal activations and output nodes with linear (g̃(a_k) = a_k) activations, as depicted in Fig. 3.

In Appendix A.1, we develop the PkANN cost function χ_C(w). Minimizing this cost function with respect to the weights w generates a trained ANN that can be used for non-linear matter power spectrum interpolation. To minimize χ_C(w) (see Eq. 16) with respect to the weights w, we use an iterative quasi-Newton algorithm (Appendix A.2) that involves evaluating the first-order derivative (gradient) of the
Figure 3.
A typical ANN architecture (left-hand panel) with node activation formulae for the hidden and output layers (right-hand panels). There can be more than one hidden layer. Throughout our PkANN analysis, we work with a single hidden layer.
cost function. See Appendix A.3 for the derivation of the gradient. The quasi-Newton algorithm also involves information about the inverse of the Hessian (second-order derivative) matrix, which we approximate using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method (see Appendix A.4; for details, see Bishop 1995).

Starting with randomly assigned weights w, their values are re-estimated iteratively, making sure that each iteration proceeds in a direction that lowers the cost function χ_C(w). In order to avoid over-fitting to the training set, after each iteration of the weights, Eq. 16 is also calculated for what is known in neural network parlance as a validation set. The validation set for our application of neural networks is a small set of simulations with known I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν) and P_nl(k, z). The final weights w_f are chosen so as to give the best fit (minimum χ_C(w)) to the validation set. The network training is considered finished once χ_C(w) is minimized with respect to the validation set. The trained network can then be used to predict P_nl(k, z) for new cosmologies. It is important to note that starting with a different (but still random) configuration of weights may lead to a trained network with a different set of final weights w_f. As such, we train a number of networks, each starting with an alternative random configuration of weights. The trained networks are collectively called a committee of networks and give rise to better performance than any single ANN in isolation. For the final output, we average over the outputs of the committee members.

4 RESULTS

In Paper II, we compared
PkANN's performance against halofit spectra to demonstrate that a suitably trained network is capable of reproducing the halofit spectra at sub-per cent accuracy. Here, we repeat the procedure, this time using spectra calculated from N-body simulations. We selected the combination 7 : N_hidden : 50 as our PkANN architecture, where N_hidden (the number of nodes in the hidden layer) was varied from 7 to 98, in steps of 7. The number of inputs was fixed at 7, corresponding to I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν) plus the redshift z. As discussed in Section 2, we use the camb code to calculate the CDM, baryon and neutrino transfer functions. The initial conditions for CDM particles and baryons are then generated from their transfer functions using enzo. The non-linear matter power spectrum P_nl(k) is constructed using Eqs 1 and 2.

As in Paper II, we do not sample the redshift in the Latin hypercube but instead evaluate P_nl(k, z) at 111 redshifts between z = 0 and z = 2 from the numerical simulations, using Eqs 1 and 2. As discussed in Section 2, we extend the range of our spectra using the one-loop standard PT of Saito et al. (2008): we estimate the matter power spectrum on quasi-linear scales with the one-loop standard PT and stitch it with P_nl(k, z). The stitched spectrum is then sampled at 50 k-modes up to k = 0.9 h Mpc⁻¹. Since our training and validation sets have (130 + 70) and (32 + 18) cosmologies, respectively (see Paper II), we calculated P_nl(k, z) for each cosmology at 111 redshifts. These P_nl(k, z) are scaled by their respective linear spectra P_lin(k, z) (see Eq. 14) before being fed to the neural network. Thus, the overall size N_T of the training set that we train our ANN with is N_T = 200 × 111 = 22,200 patterns, while the validation set comprises 50 × 111 = 5,550 patterns. For each N_hidden setting, we trained a committee of 16 ANNs. The weights w for each ANN were randomly initialized (the random configuration being different for each ANN). The weights are allowed to evolve until χ_C(w) (see Eq. 16) is minimized with respect to the cosmologies in the validation set.

In Fig. 4, we show the percentage error in the ANN predictions with respect to the N-body results when presented with the 200 cosmologies in the training set. We average the P^ANN_nl(k, z) predictions over the 16 ANN committee members. The rows correspond to N_hidden = 14-98 (from top to bottom) in increments of 14. The columns (from left to right) correspond to z = 0, 1, 2. The mean error over all 200 cosmologies in the training set is shown by a solid line in each panel, to reveal any systematics in our ANN training scheme. With N_hidden = 70 and higher, the ANN predictions are within ±1 per cent over most of the k-range, with only marginal degradation at the smallest scales.

A more complex network (more N_hidden units) can fit the training set increasingly well; however, such complex networks can adversely affect the generalizing ability when presented with a new dataset. The validation set helps in controlling the complexity of a network, as we discussed earlier in Section 3. In Fig. 5, we show the residual cost function χ_C(w) (see Eq. 16) evaluated as a function of the number of nodes in the hidden layer, N_hidden. The residual error is a monotonically decreasing function for the training set (dashed line), while for the validation set (solid line) it increases beyond N_hidden = 70. The performance of the trained ANNs as a function of N_hidden, over the cosmologies in the validation set, is shown in Fig. 6. Increasing N_hidden beyond 70 increases the error marginally, indicating that N_hidden = 70 saturates the generalizing ability of our network.

The performance of the trained ANNs for cosmological models in the testing set is shown in Fig. 7. Increasing N_hidden beyond 70 does not contribute to a significant error reduction on the testing set, confirming our assessment that N_hidden = 70 saturates the generalizing ability of the network. With N_hidden = 70, the ANN prediction for every cosmology, at all redshifts z ≤ 2, is within ±1 per cent for k ≤ 0.9 h Mpc⁻¹.

Figure 4.
Percentage error at redshift z = 0 (left-hand panels), z = 1 (middle panels) and z = 2 (right-hand panels) between the predicted non-linear power spectrum (using PkANN) and the true underlying spectrum (using N-body simulations) for the 200 training set cosmologies. The shaded region contains the middle 99.73% (3σ) of the residuals. The rows (from top to bottom) correspond to N_hidden = 14-98 in increments of 14. The mean error over all 200 cosmologies is shown by a solid line – an indicator of any bias in the ANN training scheme.
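The committee averaging behind the predictions in Fig. 4 is straightforward: each of the 16 independently trained networks maps the input parameters to a (scaled) power spectrum, and the outputs are averaged. A minimal sketch, with toy constant predictors standing in for trained networks (all names are illustrative):

```python
import numpy as np

def committee_predict(networks, params):
    """Average the predictions of a committee of trained ANNs.
    `networks` is a list of callables, each mapping an input parameter
    vector to a predicted spectrum; averaging over the committee reduces
    the scatter due to random weight initializations."""
    preds = np.array([net(params) for net in networks])
    return preds.mean(axis=0)

# Toy committee of 16 constant predictors spread symmetrically about 1.
committee = [lambda p, c=c: np.full(50, c) for c in np.linspace(0.9, 1.1, 16)]
mean_pred = committee_predict(committee, np.zeros(7))
```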
PkANN performs exceedingly well within the boundaries of the restricted parameter space.

Next, we assess the accuracy of the PkANN network across the range of each of the six parameters, namely Ω_m h², Ω_b h², n_s, w, σ_8 and Σm_ν. We vary each parameter between its minimum and maximum values and bin the 200 cosmologies of the training set in 10 intervals across the parameter range. We calculate the prediction error for each bin. We repeat this for all six parameters and show the results for the Ω_m h² case in Fig. 8. The rows correspond to the 10 linearly spaced bins spanning the Ω_m h² range, with columns for z = 0 (left-hand panel), z = 1 (middle panel) and z = 2 (right-hand panel). As discussed above, we fix N_hidden = 70. As expected, PkANN's performance degrades near the edges of the Ω_m h² range, but remains within ±1 per cent for k ≤ 0.9 h Mpc⁻¹. Results with the other five parameters are similar to Fig. 8. We summarize the prediction errors for all six parameters in Table 2.

Our ANN framework successfully recreates the input power spectrum at the sub-per cent level up to k ≤ 0.9 h Mpc⁻¹, and the overall accuracy of the PkANN interpolator is set by
Figure 5. The residual error χ²_C(w) (see Eq. 16) evaluated as a function of the number of nodes in the hidden layer, N_hidden. The error is a monotonically decreasing function of N_hidden for the training set (dashed line), while for the validation set (solid line) it starts increasing beyond N_hidden = 70, indicating that the generalizing ability of the neural network is best with N_hidden = 70. The error bars correspond to the spread in χ²_C(w) for the 16 ANN committee members.

Table 2.
Performance of the PkANN network as a function of the range of the six parameters, namely Ω_m h², Ω_b h², n_s, w, σ_8 and Σm_ν. Each parameter range is sub-divided into 10 equal intervals and the training set cosmologies are binned accordingly. The 3σ bounds on the PkANN prediction errors (in per cent) are listed for each bin, at redshifts z = 0, 1 and 2. The Ω_m h² case is shown in Fig. 8.

Bin |   Ω_m h²    |   Ω_b h²    |    n_s      |     w       |    σ_8      |   Σm_ν
    | z=0 z=1 z=2 | z=0 z=1 z=2 | z=0 z=1 z=2 | z=0 z=1 z=2 | z=0 z=1 z=2 | z=0 z=1 z=2
 1  | 1.0 1.0 1.0 | 0.9 0.7 1.1 | 1.0 0.9 1.2 | 0.9 0.9 0.9 | 0.8 0.5 1.0 | 0.8 0.8 1.1
 2  | 1.0 0.8 1.1 | 0.6 0.7 1.1 | 0.9 0.8 0.9 | 0.8 0.8 0.9 | 0.8 0.5 1.0 | 0.8 0.7 1.0
 3  | 0.9 0.9 0.9 | 0.9 0.7 1.0 | 0.9 0.8 0.9 | 0.7 0.7 0.9 | 0.6 0.5 1.0 | 0.5 0.4 1.0
 4  | 0.7 0.8 0.9 | 0.8 0.6 0.9 | 0.8 0.6 0.9 | 0.6 0.7 0.9 | 0.6 0.5 1.0 | 0.5 0.6 0.9
 5  | 0.7 0.5 0.9 | 0.7 0.6 0.9 | 0.6 0.5 1.0 | 0.6 0.6 0.9 | 0.5 0.5 0.9 | 0.7 0.5 0.9
 6  | 0.9 0.5 1.0 | 0.4 0.5 0.8 | 0.7 0.5 1.0 | 0.7 0.5 0.9 | 0.4 0.5 0.8 | 0.8 0.7 1.0
 7  | 0.8 0.6 0.9 | 0.7 0.6 1.0 | 0.7 0.8 1.0 | 0.7 0.7 0.9 | 0.5 0.4 0.9 | 0.9 0.5 0.9
 8  | 1.0 0.6 0.9 | 0.8 0.6 0.9 | 0.6 0.7 1.0 | 0.8 0.5 0.8 | 0.6 0.4 1.0 | 0.9 0.8 0.9
 9  | 0.9 0.8 1.0 | 0.8 0.6 1.0 | 0.8 0.8 1.0 | 0.9 0.7 0.9 | 0.8 0.5 0.9 | 0.9 0.8 0.9
10  | 1.0 0.8 1.1 | 0.9 0.7 1.1 | 0.8 0.8 1.0 | 0.9 0.6 0.9 | 1.0 0.7 0.9 | 0.8 0.9 1.2

the force resolution and statistical variance from our N-body simulations. Running enzo in a 200 h⁻¹ Mpc box with a 512³ unigrid results in a matter power spectrum that is progressively suppressed, from the 1 per cent level at k = 0.1 h Mpc⁻¹ to the 5 per cent level at k = 0.9 h Mpc⁻¹, when compared to the spectrum calculated from high-resolution runs. Limited computing resources prohibited us from running higher-resolution simulations. Since PkANN is built using the conservative simulation settings described above, we expect all
PkANN predictions to be suppressed at the 1–5 per cent level between k = 0.1 and 0.9 h Mpc⁻¹.

We follow the approach outlined in Jeong & Komatsu (2009) (see their Appendix A) to roughly estimate the statistical error on our non-linear power spectrum from numerical simulations. A simulation box of length 200 h⁻¹ Mpc corresponds to a fundamental wavenumber of δk = 2π/200 = 0.0314 h Mpc⁻¹. The number of independent k-modes available in a spherical shell at k = 0.1 h Mpc⁻¹ is N_k = 2π(k/δk)² ≈ 64. With our 11 realizations per cosmology, this gives a relative error of σ_P(k)/P(k) = 1/√(11 N_k) ≈ 4 per cent at k = 0.1 h Mpc⁻¹. Higher k-modes are sampled more frequently and the corresponding sampling errors become progressively smaller, falling to ∼0.4 per cent at k = 0.9 h Mpc⁻¹.

As mentioned earlier, we match the matter power spectra from one-loop standard PT with numerical simulations at k = 0.2 h Mpc⁻¹. Heitmann et al. (2010) (their Fig. 6) showed that small simulation volumes fail to capture linear evolution on the largest scales probed by the simulation box, as well as miss the onset of non-linearity, resulting in a suppression of the matter power spectrum at the per cent level. As such, for a simulation box of length 200 h⁻¹ Mpc, we expect our spectra amplitudes to be in error at the per cent level on the largest scales probed by the box.

Figure 6. Similar to Fig. 4, using 50 validation set cosmologies.

PkANN can be used for spatially flat cosmological models with three species of degenerate massive neutrinos up to the maximum Σm_ν of our training range (see Table 1). At z = 0, our neutrino spectra for Σm_ν up to 0.475 and 0.95 eV are expected to be accurate at the few per cent level, with comparable errors at z = 1 and z = 2. The errors grow for Σm_ν > 0.475 eV; however, it is important to note that the current constraints on the total neutrino mass are around 0.3 eV. Using photometric redshifts measured from Sloan Digital Sky Survey III Data Release Eight (SDSS DR8, Aihara et al. 2011), de Putter et al. (2012) obtained constraints of Σm_ν < 0.26–0.36 eV. Using BAO and CMB data, the Planck survey (Planck Collaboration et al. 2013) finds an upper limit of 0.23 eV. Using numerical simulations, Wagner, Verde & Jimenez (2012) studied the effect of neutrinos on the non-linear matter power spectrum, finding that for the smallest masses they considered, the z = 0 non-linear neutrino corrections are at the 0.3 per cent level, and negligible at higher redshifts. Overall, for Σm_ν below the current observational limits, such non-linear corrections are negligible at all redshifts z ≥ 0.
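Returning to the sampling-error estimate above: the quoted mode-counting numbers can be reproduced in a few lines (a sketch in our notation; the 1/√(11 N_k) factor reflects our 11 realizations per cosmology):

```python
import math

L_BOX = 200.0                 # box size [Mpc/h]
DK = 2 * math.pi / L_BOX      # fundamental wavenumber, ~0.0314 h/Mpc
N_REAL = 11                   # realizations per cosmology

def relative_error(k):
    """Relative sampling error on P(k) in a spherical shell at k [h/Mpc]."""
    n_modes = 2 * math.pi * (k / DK) ** 2   # independent modes in the shell
    return 1.0 / math.sqrt(N_REAL * n_modes)

print(f"{DK:.4f}")                   # 0.0314
print(f"{relative_error(0.1):.3f}")  # 0.038  (~4 per cent at k = 0.1)
print(f"{relative_error(0.9):.4f}")  # 0.0042 (~0.4 per cent at k = 0.9)
```

The error falls as 1/k for fixed shell width, which is why the high-k end of the spectrum is far better sampled than the quasi-linear scales.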
Figure 7. Similar to Fig. 4, using 330 testing set cosmologies.
To summarize, across all cosmological models (see Table 1) with Σm_ν < 0.5 eV, the PkANN interpolator is expected to be accurate at the 5 per cent level for all redshifts z ≤ 2. For models with Σm_ν > 0.5 eV, it is expected to be accurate at the 5 per cent level for z ≳ 1 and at the 10 per cent level for z ≤ 1.

Having built the power spectrum interpolator, we now study the behaviour of the power spectrum as a function of the cosmological parameters. Similar tests were performed by Heitmann et al. (2013). In Fig. 9, we show variations in the power spectrum at redshift z = 0 (top row), z = 1 (middle row) and z = 2 (bottom row). At each redshift, Ω_m h² is varied between its minimum and maximum values (see the parameter ranges for the testing set in Table 1) while Ω_b h², n_s, w and σ_8 are fixed at their central values. We fix Σm_ν = 0 since we want to compare our PkANN predictions with the cosmic emulator, which is not trained for cosmological models with massive neutrinos. The left-hand panels show the natural logarithm of the ratio of the power spectra with different Ω_m h² to the base power spectrum. The base power spectrum corresponds to the central values of the parameters (see Table 1), with w = −1 and Σm_ν = 0. The absolute power spectra are shown in the right-hand panels. Within each panel, the power spectra (from top to bottom) correspond to increasing Ω_m h². Higher Ω_m h² reduces the large-scale normalization of the power spectrum significantly. Accurate measurements of the power spectrum amplitude on large scales can help improve the constraints
Figure 8. The 200 cosmologies of the training set are binned in 10 equal intervals across the Ω_m h² range (from top to bottom, in increasing order). The columns are redshifts z = 0, 1 and 2. N_hidden = 70 for all panels. For each bin, PkANN's predictions are compared to the N-body power spectra and the residual errors (3σ CL) are plotted. Closer to the middle of the Ω_m h² range (middle rows), the prediction errors get smaller. Even near the edges (outer rows), the errors are within ∼±1 per cent for k ≤ 0.9 h Mpc⁻¹.

on Ω_m h². PkANN predictions (dotted) agree well with the cosmic emulator (solid lines). Note that for redshift z = 2, we only show PkANN predictions since the h-fixed version of the cosmic emulator (Lawrence et al. 2010) can make predictions only up to z = 1. In Figs. 10–13, we vary Ω_b h², n_s, w and σ_8, respectively. The power spectra trends from minimum to maximum values are as follows: top to bottom (n_s and w) and bottom to top (Ω_b h² and σ_8). At z = 0, except for σ_8, all other parameters affect the power spectrum predominantly on large scales (k ≲ 0.1 h Mpc⁻¹). Reducing uncertainties in the other parameters using small-scale data would be difficult unless one measures the power spectrum at higher redshifts, where almost all parameters leave discernible imprints. Note that the power spectra converge around k ∼ 0.1 h Mpc⁻¹ in the Ω_m h², Ω_b h², n_s and w plots. This is a direct consequence of our imposing the CMB constraint on the acoustic scale based on WMAP 7-year+BAO data. The acoustic scale is model dependent. Fixing its value to match observations requires adjusting the Hubble parameter h accordingly. As we discussed in Section 2, given a cosmological model I, we compute h to satisfy the constraint πd_ls/r_s = 302.4.
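This h-adjustment can be sketched as a simple bisection. The sketch below is a rough illustration only: it assumes a flat wCDM background, neglects radiation in the distance integral, and treats the sound horizon r_s as a fixed input, whereas in practice r_s itself depends on the model; all function names here are ours, not the PkANN code's.

```python
import math

C_KM_S = 299792.458  # speed of light [km/s]

def comoving_distance(z_max, h, omega_m_h2, w=-1.0, n=4000):
    """Comoving distance to z_max [Mpc] for a flat wCDM background,
    radiation neglected (a rough sketch, not the paper's full treatment)."""
    om = omega_m_h2 / h**2        # Omega_m from the fixed physical density
    ode = 1.0 - om                # flatness
    dz = z_max / n
    total = 0.0
    for i in range(n):            # midpoint-rule integration of dz/E(z)
        z = (i + 0.5) * dz
        E = math.sqrt(om * (1 + z) ** 3 + ode * (1 + z) ** (3 * (1 + w)))
        total += dz / E
    return C_KM_S / (100.0 * h) * total

def solve_h(ell_a_target, omega_m_h2, r_s, z_ls=1090.0, w=-1.0):
    """Bisect on h so that pi * d_ls / r_s matches the target acoustic scale."""
    lo, hi = 0.4, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        ell = math.pi * comoving_distance(z_ls, mid, omega_m_h2, w) / r_s
        if ell > ell_a_target:    # larger h shrinks d_ls, so raise the lower bound
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the fixed physical density Ω_m h², raising h lowers d_ls and hence the acoustic scale, so the bisection is monotone and converges quickly.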
Figure 9. Variations in the power spectrum at redshift z = 0 (top row), z = 1 (middle row) and z = 2 (bottom row). The parameter Ω_m h² is varied between its minimum and maximum values (see the testing set range, Table 1) while Ω_b h², n_s, w and σ_8 are fixed at their central values. Σm_ν = 0 to facilitate comparison with the h-fixed version of the cosmic emulator (Lawrence et al. 2010). The left-hand panels show the natural logarithm of the ratio of the power spectra with different Ω_m h² to the base power spectrum, whose parameters I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν) take their central values, with w = −1 and Σm_ν = 0. The predicted ratios using PkANN (dotted) are within 0.2 per cent of the cosmic emulator's predictions (solid lines). At z = 2, only PkANN predictions are shown since the h-fixed cosmic emulator is valid only for z ≤ 1.

In Fig. 14, we show the ratio of the matter power spectra at z = 0 (upper panel) and z = 1 (lower panel) computed using PkANN and the h-fixed cosmic emulator. The cosmologies considered are all models of Section 4.3 as well as the 150 testing set cosmologies with Σm_ν = 0. At z = 0, PkANN matches the cosmic emulator's predictions to within ±5 per cent for k ≤ 0.9 h Mpc⁻¹. The loss of power due to our use of 512³ unigrid simulations is clearly evident at the highest k. At z = 1, the agreement is similar.

Machine-learning techniques offer an unparalleled advantage in analyses of large datasets of the kind being generated by current and upcoming surveys. A brute-force application of numerical simulations can consume millions of CPU-hours and may not be a feasible solution. Instead, by running a limited number of simulations, one can develop machine-learning schemes. These schemes can then be used to efficiently handle new and previously unseen data.

Figure 10.
Similar to Fig. 9, but for a range of Ω_b h² values. Within each panel, the power spectra from bottom to top correspond to increasing Ω_b h² values.

In this paper, we have introduced PkANN, a non-linear matter power spectrum interpolator. Using a manageable number of N-body simulations, we have successfully developed a neural network-based interpolating scheme that reconstructs the matter power spectrum over the parameter space spanning the 3σ CL around the WMAP 7-year central values. Although PkANN reproduces the input power spectrum at sub-per cent level, its overall accuracy is limited by the accuracy of our N-body simulations. PkANN (i) predicts the matter power spectrum up to redshift z = 2, (ii) allows dark energy models with w ≠ −1, (iii) is accurate at the 5 per cent level up to k ≤ 0.9 h Mpc⁻¹ for models with Σm_ν < 0.5 eV for all redshifts z ≤ 2, and (iv) is accurate at the 5 (10) per cent level for redshifts z > 1 (z ≤ 1) for models with Σm_ν > 0.5 eV.

Baryonic physics, not included in our simulations, is expected to modify the matter power spectrum at the ∼10–
20 per cent level at k ≈ 1 h Mpc⁻¹. van Daalen et al. (2011) have shown that AGN feedback reduces power relative to CDM-only simulations at the per cent level already at k ≈ 0.3 h Mpc⁻¹. While the dark energy component in numerical simulations is usually assumed smooth and implemented only through its effects on the background evolution, Alimi et al. (2010)
Figure 11. Similar to Fig. 9, but for a range of n_s values. Within each panel, the power spectra from top to bottom correspond to increasing n_s values.

find that dark energy clustering leaves distinct imprints on the non-linear matter power spectrum. To extract any meaningful and unbiased information from current and upcoming data, such effects will need to be incorporated in N-body simulations and any fitting functions thereof. Although we did not consider a wider range of cosmological models, such as ones with non-zero spatial curvature, a time-varying equation of state for dark energy or dark energy clustering, our ANN scheme can be readily extended to these cases by running extra simulations. The PkANN package is freely available at http://zuserver2.star.ucl.ac.uk/~fba/PkANN/PkANN.tar.gz .

This work is supported by the National Science Foundation through TeraGrid resources provided by the NCSA. The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013, Grant Agreement no. 279954). FBA acknowledges the support of a Royal Society URF, and OL acknowledges a Royal Society Wolfson Research Merit Award and an Advanced ERC Grant. SA would like to thank Pier Stefano Corasaniti and Yann Rasera for useful discussions. We thank the referee for providing constructive comments and helping improve the contents of this paper.

Figure 12.
Similar to Fig. 9, but for a range of w values. Within each panel, the power spectra from top to bottom correspond to increasing w values.

REFERENCES
Adamson P., Andreopoulos C., Arms K. E., Armstrong R., Auty D. J., Ayres D. S., Baller B., et al., 2008, Phys. Rev. Lett., 101, 131802
Agarwal S., Abdalla F. B., Feldman H. A., Lahav O., Thomas S. A., 2012, MNRAS, 424, 1409
Agarwal S., Feldman H. A., 2011, MNRAS, 410, 1647
Aihara H., Allende Prieto C., An D., Anderson S. F., Aubourg É., Balbinot E., Beers T. C., et al., 2011, ApJS, 193, 29
Alimi J.-M., Füzfa A., Boucher V., Rasera Y., Courtin J., Corasaniti P.-S., 2010, MNRAS, 401, 775
Auld T., Bridges M., Hobson M. P., 2008, MNRAS, 387, 1575
Auld T., Bridges M., Hobson M. P., Gull S. F., 2007, MNRAS, 376, L11
Bird S., Viel M., Haehnelt M. G., 2012, MNRAS, 2175
Bishop C. M., 1995, Neural Networks for Pattern Recognition. Oxford Univ. Press, New York
Brandbyge J., Hannestad S., 2009, JCAP, 5, 2
Brandbyge J., Hannestad S., 2010, JCAP, 1, 21
Brandbyge J., Hannestad S., Haugbølle T., Thomsen B., 2008, JCAP, 8, 20
Casarini L., Macciò A. V., Bonometto S. A., Stinson G. S., 2011, MNRAS, 412, 911
Collister A. A., Lahav O., 2004, PASP, 116, 345
de Putter R., Mena O., Giusarma E., Ho S., Cuesta A., Seo H.-J., Ross A. J., et al., 2012, ApJ, 761, 12
Figure 13. Similar to Fig. 9, but for a range of σ_8 values. Within each panel, the power spectra from bottom to top correspond to increasing σ_8 values.

Eisenstein D. J., Weinberg D. H., Agol E., Aihara H., Allende Prieto C., Anderson S. F., Arns J. A., et al., 2011, AJ, 142, 72
Fendt W. A., Wandelt B. D., 2007, ApJ, 654, 2
Ghosh A., 2011, arXiv:1111.4930
Habib S., Heitmann K., Higdon D., Nakhleh C., Williams B., 2007, PRD, 76, 083503
Heitmann K., Higdon D., Nakhleh C., Habib S., 2006, ApJL, 646, L1
Heitmann K., Higdon D., White M., Habib S., Williams B. J., Lawrence E., Wagner C., 2009, ApJ, 705, 156
Heitmann K., Lawrence E., Kwan J., Habib S., Higdon D., 2013, arXiv:1304.7849
Heitmann K., White M., Wagner C., Habib S., Higdon D., 2010, ApJ, 715, 104
Hinshaw G., Larson D., Komatsu E., Spergel D. N., Bennett C. L., Dunkley J., Nolta M. R., et al., 2013, ApJS, 208, 19
Hornik K., 1991, Neural Networks, 4, 251
Hurwitz E., Marwala T., 2012, arXiv:1208.4429
Ito Y., 1991, Neural Networks, 4, 385
Ivezic Z., Tyson J. A., Acosta E., Allsman R., Anderson S. F., Andrew J., Angel R., et al., for the LSST Collaboration, 2008, arXiv:0805.2366v2
Jenatton R., Gramfort A., Michel V., Obozinski G., Eger E., Bach F., Thirion B., 2011, arXiv:1105.0363
Jeong D., Komatsu E., 2009, ApJ, 691, 569
KamLAND, 2008, Phys. Rev. Lett., 100, 221803
Komatsu E., Smith K. M., Dunkley J., Bennett C. L., Gold B., Hinshaw G., Jarosik N., et al., 2011, ApJS, 192, 18
Lahav O., Kiakotou A., Abdalla F. B., Blake C., 2010, MNRAS, 405, 168
Lawrence E., Heitmann K., White M., Higdon D., Wagner C., Habib S., Williams B., 2010, ApJ, 713, 1322
Levi M., Bebek C., Beers T., Blum R., Cahn R., Eisenstein D., Flaugher B., et al., representing the DESI collaboration, 2013, arXiv:1308.0847
Lewis A., Challinor A., Lasenby A., 2000, ApJ, 538, 473
Nishimichi T., Shirata A., Taruya A., Yahata K., Saito S., Suto Y., Takahashi R., et al., 2009, PASJ, 61, 321
Norman M. L., Bryan G.
L., Harkness R., Bordner J., Reynolds D., O'Shea B., Wagner R., 2007, arXiv:0705.1556
O'Shea B. W., Bryan G., Bordner J., Norman M. L., Abel T., Harkness R., Kritsuk A., 2010, Astrophysics Source Code Library, 10072
Otten E. W., Weinheimer C., 2008, Reports on Progress in Physics, 71, 086201
Pedregosa F., Gramfort A., Varoquaux G., Thirion B., Pallier C., Cauvet E., 2012, arXiv:1207.3520
Planck Collaboration, Ade P. A. R., Aghanim N., Armitage-Caplan C., Arnaud M., Ashdown M., Atrio-Barandela F., Aumont J., Baccigalupi C., Banday A. J., et al., 2013, arXiv:1303.5076
Rudd D. H., Zentner A. R., Kravtsov A. V., 2008, ApJ, 672, 19
Saito S., Takada M., Taruya A., 2008, Phys. Rev. Lett., 100, 191301
Saito S., Takada M., Taruya A., 2009, PRD, 80, 083528
Sarazin C. L., White III R. E., 1987, ApJ, 320, 32
Smith R. E., Peacock J. A., Jenkins A., White S. D. M., Frenk C. S., Pearce F. R., Thomas P. A., et al., 2003, MNRAS, 341, 1311
SNO, 2004, Phys. Rev. Lett., 92, 181301
Springel V., 2005, MNRAS, 364, 1105
Takahashi R., Sato M., Nishimichi T., Taruya A., Oguri M., 2012, ApJ, 761, 152
The Dark Energy Survey Collaboration, 2005, arXiv:astro-ph/0510346
Upadhye A., Biswas R., Pope A., Heitmann K., Habib S., Finkel H., Frontiere N., 2013, arXiv:1309.5872
van Daalen M. P., Schaye J., Booth C. M., Dalla Vecchia C., 2011, MNRAS, 415, 3649
Viel M., Haehnelt M. G., Springel V., 2010, JCAP, 6, 15
Wagner C., Verde L., Jimenez R., 2012, ApJL, 752, L31
APPENDIX A
The following is based on the treatment presented in Bishop (1995).
Figure 14. The ratio of the matter power spectra at z = 0 (upper panel) and z = 1 (lower panel) computed using PkANN and the h-fixed cosmic emulator for all models of Section 4.3, as well as the 150 testing set cosmologies with Σm_ν = 0. The two prediction schemes disagree at the 5 per cent level out to k ≲ 0.9 h Mpc⁻¹. At larger k, PkANN predictions fall off due to our use of unigrid simulations (see Section 4.2 for discussion).
A.1 PkANN Cost Function
PkANN is a single hidden-layer feed-forward network with sigmoid hidden nodes and linear output nodes. For training the PkANN neural network to predict the matter power spectrum, we consider a training set consisting of cosmological models for which we have full information about the non-linear matter power spectra P_nl (computed from N-body simulations) as a function of scale k and redshift z, as well as the underlying cosmological parameters: I ≡ (Ω_m h², Ω_b h², n_s, w, σ_8, Σm_ν). The joint likelihood of getting the set of matter power spectra {P_nl(z; I_t)} for all cosmologies I_t in the training set is

  𝓛[{P_nl(z; I_t)}] = ∏_{t=1}^{T} p[P_nl(z; I_t)] = ∏_{t=1}^{T} p[P_nl(z|I_t)] p[I_t],   (6)

where p[P_nl(z|I_t)] is to be interpreted as the conditional probability of getting spectrum P_nl(z) given cosmology I_t, while p[I_t] is the unconditional probability that the cosmological parameters I take the particular setting I_t. The index t runs over all cosmologies I_t in the training set. We can take the product of the individual probabilities since each model I_t is drawn independently from the cosmological parameter space.

The weights w of the PkANN network are chosen (iteratively during network training) so as to minimize the negative logarithm of the likelihood 𝓛 (which is equivalent to maximizing 𝓛),

  χ² = −ln 𝓛 = −∑_{t=1}^{T} ln p[P_nl(z|I_t)] − ∑_{t=1}^{T} ln p[I_t].   (7)

If the power spectrum is sampled at different values of scale k (the k-modes being represented by the set {k} h Mpc⁻¹), we can write p[P_nl(z|I_t)] as

  p[P_nl(z|I_t)] = ∏_{k_i ∈ {k}} p[P_nl(k_i, z|I_t)],   (8)

where the product is over all the scales k_i that form the set {k} h Mpc⁻¹, and we have assumed that the P_nl(k, z|I_t) have independent distributions.

To suppress sampling uncertainties in the power spectrum P_nl(k, z|I_t), the numerical simulation code is run multiple times with different seeds while keeping the underlying cosmological model I_t fixed.
Assuming P_nl(k, z|I_t) has a Gaussian distribution about the true power spectrum P^Tr_nl(k, z|I_t) with variance σ², we can write the probability that a numerical run would give P_nl(k, z|I_t) as

  p[P_nl(k, z|I_t)] = 1/(2πσ²)^{1/2} exp( −[P^Tr_nl(k, z|I_t) − P_nl(k, z|I_t)]² / 2σ² ).   (9)

N-body codes give a larger variance σ² on scales comparable to the simulation volume, since the density field on these scales can only be sampled fewer times. However, to simplify the PkANN training algorithm, in Eq. 9 we have assumed that the variance σ² is independent of the scale k and model I_t.

Since the aim of developing PkANN is to model the true spectrum P^Tr_nl(k, z|I_t) by making an optimal choice for the network weights w, we replace P^Tr_nl(k, z|I_t) in Eq. 9 by the ANN prediction P^ANN_nl(k, z|w, I_t) to get

  p[P_nl(k, z|I_t)] = 1/(2πσ²)^{1/2} exp( −[P^ANN_nl(k, z|w, I_t) − P_nl(k, z|I_t)]² / 2σ² ).   (10)

Inserting Eq. 10 into Eq. 8, we get

  p[P_nl(z|I_t)] = 1/(2πσ²)^{N_out/2} exp( −∑_{k_i ∈ {k}} [P^ANN_nl(k, z|w, I_t) − P_nl(k, z|I_t)]² / 2σ² ),   (11)

where N_out is the number of k-modes in the set {k}. Remember that, by construction, N_out is also the number of nodes in the output layer of the PkANN network. Using Eq. 11, we can now write Eq. 7 as

  χ²(w) = 1/(2σ²) ∑_{t=1}^{T} ∑_{k_i ∈ {k}} [P^ANN_nl(k, z|w, I_t) − P_nl(k, z|I_t)]² + T ln[(2πσ²)^{N_out/2}] − ∑_{t=1}^{T} ln p[I_t].   (12)

We can drop the terms that do not depend on the weights w, since these terms merely shift χ²(w) without altering its location in the weight space. Thus, the cost function for the purposes of error minimization can be written as

  χ²(w) = (1/2) ∑_{t=1}^{T} ∑_{k_i ∈ {k}} [P^ANN_nl(k, z|w, I_t) − P_nl(k, z|I_t)]².   (13)

Remember that the matter power spectrum is a function of scale k (h Mpc⁻¹). We sample the matter spectrum at discrete values of k and assign the sampled spectrum to the output nodes of the neural network. These discrete values of scale form the set {k} h Mpc⁻¹. In Eq. 13, the sum over k_i is over all the nodes in the output layer, with each node sampling the matter power spectrum at some specific scale k (h Mpc⁻¹). P_nl(k, z|I) is the non-linear matter power spectrum for the specific cosmology I, computed using N-body simulations. Given the weights w, P^ANN_nl(k, z|w, I) is the ANN's predicted power spectrum for the I-th cosmology. In our fitting procedure, we work with the ratio of the non-linear to linear power spectrum, namely R(k, z) ≡ P_nl(k, z)/P_lin(k, z), where P_lin(k, z) is calculated using camb (Lewis et al. 2000). As such, weighting Eq. 13 by P_lin(k, z) gives

  χ²(w) = (1/2) ∑_{t=1}^{T} ∑_{k_i ∈ {k}} [ (P^ANN_nl(k, z|w, I_t) − P_nl(k, z|I_t)) / P_lin(k, z|I_t) ]²
        = (1/2) ∑_{t=1}^{T} ∑_{k_i ∈ {k}} [R^ANN(k, z|w, I_t) − R(k, z|I_t)]².   (14)

The ratio R(k, z) is a flatter function and gives better performance, particularly at higher redshifts where the ratio tends to 1. Given the weights w, R^ANN(k, z|w, I) in Eq. 14 is the network's prediction of the ratio R(k, z|I) for the specific cosmology I. The predicted non-linear spectrum P^ANN_nl(k, z|w, I) is recovered by multiplying R^ANN(k, z|w, I) by the corresponding linear spectrum P_lin(k, z|I).

In Eq. 14, optimizing the weights w so as to minimize χ²(w) generates an ANN that predicts the power spectrum very well for the specific cosmologies in the training set. However, such a network might not make accurate predictions for cosmologies not included in the training set.
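As a toy illustration of the cost functions of Eqs. 13 and 14 (made-up arrays stand in for the N-body spectra, the camb linear spectra and the network outputs; this is our sketch, not the PkANN code):

```python
import random

random.seed(0)
T, N_OUT = 4, 8   # cosmologies in the training set, output nodes (k-modes)

# Toy stand-ins for the spectra (values are arbitrary, not real cosmology):
P_lin  = [[random.uniform(1e3, 1e4) for _ in range(N_OUT)] for _ in range(T)]
R_true = [[1.0 + random.uniform(0.0, 0.5) for _ in range(N_OUT)] for _ in range(T)]
P_nl   = [[r * p for r, p in zip(rr, pp)] for rr, pp in zip(R_true, P_lin)]
# Network output: the ratio R, with a little prediction error.
R_ann  = [[r + random.gauss(0.0, 0.01) for r in rr] for rr in R_true]

def chi2_raw():
    """Eq. 13: half the summed squared residuals of the raw spectra."""
    return 0.5 * sum((ra * pl - pn) ** 2
                     for RR, PP, PN in zip(R_ann, P_lin, P_nl)
                     for ra, pl, pn in zip(RR, PP, PN))

def chi2_ratio():
    """Eq. 14: the same residuals weighted by P_lin, i.e. a cost on R itself."""
    return 0.5 * sum(((ra * pl - pn) / pl) ** 2
                     for RR, PP, PN in zip(R_ann, P_lin, P_nl)
                     for ra, pl, pn in zip(RR, PP, PN))
```

Because the target R stays of order unity at all scales and redshifts, the ratio form keeps the residuals comparably weighted across output nodes, whereas the raw-spectrum cost is dominated by the large-amplitude modes.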
Poor generalization usually indicates (i) an overly simple network architecture (very few hidden-layer nodes), (ii) a very sparsely or poorly sampled parameter space, and/or (iii) a highly complex non-linear mapping that over-fits the noise in the training dataset. In order to generate smoother network mappings that generalize better when presented with new cosmologies that are not part of the training set, a penalty term χ²_Q(w) is added to the cost function χ²(w),

  χ²_Q(w) = (ξ/2) ||w||²,   (15)

where ||w||² is the quadratic sum of all the weights. The penalty term χ²_Q(w) prevents the network weights from becoming too large during the training process by penalizing them in proportion to their squared sum. The regularization parameter ξ controls the degree of regularization (smoothing) of the network's predictions. After ξ has been initialized, its value is re-estimated iteratively during the training process. For the update formula, see Appendix A.5. For its derivation, see Bishop (1995).

Thus, the overall cost function which is presented to the ANN for minimization with respect to the weights w is

  χ²_C(w) = (1/2) ∑_{t=1}^{T} ∑_{k_i ∈ {k}} [R^ANN(k, z|w, I_t) − R(k, z|I_t)]² + (ξ/2) ||w||².   (16)

A.2 Quasi-Newton Method
The quasi-Newton method, used for finding stationary points (local maxima and minima) of a function, assumes that the function can be approximated by a quadratic in the region around a stationary point. Taylor expanding the PkANN cost function χ²_C(w) (see Eq. 16) around some point w₀ in the weight space and retaining terms up to second order, we get

  χ²_C(w) = χ²_C(w₀) + (w − w₀)ᵀ g_{w₀} + (1/2)(w − w₀)ᵀ H_{w₀} (w − w₀),   (17)

where the superscript T stands for the transpose and g_{w₀} is defined to be the gradient of χ²_C evaluated at w₀,

  g_{w₀} ≡ ∇χ²_C |_{w₀}.   (18)

H_{w₀} is a symmetric N_W × N_W Hessian matrix (evaluated at w₀) with elements

  H_ij |_{w₀} ≡ ∂²χ²_C / ∂w_i ∂w_j |_{w₀},   (19)

where N_W (see Eq. 3) is the total number of weights in the network. Note that in Eq. 19, instead of referencing the weights by the nodes they connect, for the sake of clarity we refer to the weights with a single subscript running from 1 to N_W.

Taking the gradient of Eq. 17 gives the local approximation for the gradient itself,

  g_w = g_{w₀} + H_{w₀} (w − w₀).   (20)

To find the stationary point around w₀, one sets g_w in Eq. 20 to zero, thereby giving the Newton step,

  w₁ = w₀ − H⁻¹_{w₀} g_{w₀}.   (21)

Since the cost function χ²_C(w) is not an exact quadratic function, the Newton step of Eq. 21 does not point exactly to the local minimum around w₀. As such, we apply Eq. 21 iteratively, and if the Hessian matrix is positive definite (i.e. all of its eigenvalues are positive), then each successive Newton step moves closer to the local minimum. If the initial choice of the weights w₀ happens to be around a local maximum of χ²_C(w), then the Hessian matrix is not positive definite and the cost function may increase with each Newton step. One can apply some modifications to the Newton method that guarantee convergence towards a local minimum, irrespective of the initial choice of the weights.
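A minimal sketch of the iterated Newton step of Eq. 21, in one dimension on a toy (non-quadratic) cost of our own choosing, where the Hessian is positive everywhere and the iteration converges:

```python
def newton_minimize(grad, hess, w0, steps=20):
    """Iterate the Newton step w <- w - g/H (the 1-D analogue of Eq. 21)."""
    w = w0
    for _ in range(steps):
        w = w - grad(w) / hess(w)
    return w

# Toy cost chi2(w) = (w - 3)^2 + 0.1 (w - 3)^4, minimized at w = 3.
grad = lambda w: 2 * (w - 3) + 0.4 * (w - 3) ** 3
hess = lambda w: 2 + 1.2 * (w - 3) ** 2   # positive definite everywhere here

w_min = newton_minimize(grad, hess, w0=10.0)
print(round(w_min, 6))  # 3.0
```

Were the starting point near a maximum of a less well-behaved cost, hess(w) could be negative and the iteration would climb rather than descend, which is exactly the failure mode the quasi-Newton modification below guards against.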
Instead of taking a step in the Newton direction (−H⁻¹g), one proceeds in a quasi-Newton direction (−Gg),

  w₁ = w₀ − λ_{w₀} G_{w₀} g_{w₀},   (22)

where the matrix G represents an approximation to the inverse of the Hessian, H⁻¹, and λ is the size of the step taken along the quasi-Newton direction −Gg. The step size λ is allowed to vary with each iteration of the weights. Its value is determined by proceeding in the direction −Gg until the minimum of the cost function is found along −Gg. Thus, in Eq. 22, λ_{w₀} is such that the gradient of χ²_C at w₁ (namely, g_{w₁}) vanishes along the direction −G_{w₀} g_{w₀},

  (−G_{w₀} g_{w₀})ᵀ g_{w₁} = 0.   (23)

The quasi-Newton algorithm involves taking a series of steps τ of Eq. 22, which can be written as

  w_{τ+1} = w_τ − λ_{w_τ} G_{w_τ} g_{w_τ},   (24)

with the step size λ_{w_τ} for the τ-th step being such that

  (−G_{w_τ} g_{w_τ})ᵀ g_{w_{τ+1}} = 0.   (25)

At each step of the algorithm, G is constructed to be positive definite, ensuring that the direction −Gg proceeds towards a local minimum of the cost function. To construct G, we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method (see Appendix A.4).

A.3 PkANN Cost Function Gradient
The overall cost function which is presented to the ANN for minimization with respect to the weights $\mathbf{w}$ is given by Eq. 16. We now derive the expression for its derivative with respect to the weights $\mathbf{w}$. PkANN's network architecture is $N_{\rm in} : N : N_{\rm out}$ with two layers of adaptive weights. The first layer of weights $w_{ji}$ connects the input-layer nodes $(x_0, x_1, ..., x_{N_{\rm in}})$ to the hidden nodes $(z_1, ..., z_N)$. Note that the hidden bias node activation $z_0$ is permanently fixed at 1 and therefore does not receive any connections from the input layer. The activation of each hidden node is $z_j \equiv g(a_j)$, taking as its argument

$$a_j = \sum_{i=0}^{N_{\rm in}} w_{ji}\,x_i\,, \qquad (26)$$

where the sum is over all input nodes $i$ (including the input bias) sending connections to the $j$th hidden node (barring the hidden bias node). PkANN's hidden nodes have sigmoidal activations $g(a_j) = 1/[1+\exp(-a_j)]$. The second layer of weights $w_{kj}$ connects the hidden nodes $(z_0, z_1, ..., z_N)$ to the network outputs $(y_1, ..., y_{N_{\rm out}})$. The output nodes have linear activations $y_k = a_k$, with $a_k$ being the weighted sum of all hidden nodes,

$$a_k = \sum_{j=0}^{N} w_{kj}\,z_j\,. \qquad (27)$$

PkANN has two layers of adaptive weights and we will consider the cost function derivatives separately for the two layers.

A.3.1 Gradient w.r.t. First Layer Weights
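Before differentiating, it is worth fixing the forward pass of Eqs. 26 and 27 in code. The sketch below is illustrative only: the layer sizes and weight values are arbitrary, not PkANN's.

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N, N_out = 7, 10, 50        # layer sizes (illustrative, not PkANN's)

# w1[j, i] connects input node i to hidden node j (i = 0 is the input bias);
# w2[k, j] connects hidden node j to output node k (j = 0 is the hidden bias).
w1 = rng.normal(scale=0.5, size=(N, N_in + 1))
w2 = rng.normal(scale=0.5, size=(N_out, N + 1))

def forward(x_in, w1, w2):
    x = np.concatenate(([1.0], x_in))        # input bias x_0 = 1
    a_j = w1 @ x                             # Eq. 26: weighted sums
    z = 1.0 / (1.0 + np.exp(-a_j))           # sigmoidal activations g(a_j)
    z = np.concatenate(([1.0], z))           # hidden bias z_0 fixed at 1
    return w2 @ z                            # Eq. 27: linear outputs y_k = a_k

y = forward(rng.normal(size=N_in), w1, w2)   # one set of network outputs
```

Note that neither bias node receives incoming connections: $x_0$ and $z_0$ are simply clamped to 1 before each matrix product.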
Taking the gradient of Eq. 16 with respect to a first-layer weight $w_{ji}$, we get

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{ji}} = \sum_{t,\{k\}}\Big[R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t) - R(k,z\,|\,\mathbf{I}_t)\Big]\,\frac{\partial R_{\rm ANN}}{\partial w_{ji}} + \xi w_{ji}\,. \qquad (28)$$

Since $R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t) = a(k,z\,|\,\mathbf{w},\mathbf{I}_t)$ (see Eq. 27) for the output nodes, we get

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{ji}} = \sum_{t,\{k\}}\Big[R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t) - R(k,z\,|\,\mathbf{I}_t)\Big]\,\frac{\partial a_{tk}}{\partial w_{ji}} + \xi w_{ji}\,, \qquad (29)$$

where we have introduced the shorthand notation $a_{tk} \equiv a(k,z\,|\,\mathbf{w},\mathbf{I}_t)$. Using Eq. 27 for $a_k$ together with the sigmoidal activation for $z_j$, we get

$$\frac{\partial a_{tk}}{\partial w_{ji}} = \sum_{j'=0}^{N} w_{kj'}\,\frac{\partial z_{tj'}}{\partial w_{ji}} = \sum_{j'=0}^{N} w_{kj'}\,\frac{\partial g(a_{tj'})}{\partial a_{tj'}}\,\frac{\partial a_{tj'}}{\partial w_{ji}}\,. \qquad (30)$$

For sigmoidal activation functions, it is straightforward to show that

$$\frac{\partial g(a_{tj})}{\partial a_{tj}} = g(a_{tj})\,\big(1 - g(a_{tj})\big)\,. \qquad (31)$$

Inserting Eq. 31 into Eq. 30, we get

$$\frac{\partial a_{tk}}{\partial w_{ji}} = \sum_{j'=0}^{N} w_{kj'}\,g_{tj'}\big(1 - g_{tj'}\big)\,\frac{\partial a_{tj'}}{\partial w_{ji}}\,. \qquad (32)$$

Differentiating Eq. 26 with respect to a first-layer weight $w_{ji}$, we get

$$\frac{\partial a_{tj'}}{\partial w_{ji}} = \sum_{i'=0}^{N_{\rm in}} x_{ti'}\,\frac{\partial w_{j'i'}}{\partial w_{ji}} = \sum_{i'=0}^{N_{\rm in}} x_{ti'}\,\delta_{ii'}\,\delta_{jj'} = x_{ti}\,\delta_{jj'}\,. \qquad (33)$$

Inserting Eq. 33 into Eq. 32, we get

$$\frac{\partial a_{tk}}{\partial w_{ji}} = \sum_{j'=0}^{N} w_{kj'}\,g_{tj'}\big(1 - g_{tj'}\big)\,x_{ti}\,\delta_{jj'} = w_{kj}\,g_{tj}\big(1 - g_{tj}\big)\,x_{ti}\,. \qquad (34)$$

From Eqs. 29 and 34, we get our final equation for the derivative of the PkANN cost function with respect to the first layer of adaptive weights $w_{ji}$,

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{ji}} = \sum_{t,\{k\}} R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t)\,w_{kj}\,g_{tj}\big(1-g_{tj}\big)\,x_{ti} - \sum_{t,\{k\}} R(k,z\,|\,\mathbf{I}_t)\,w_{kj}\,g_{tj}\big(1-g_{tj}\big)\,x_{ti} + \xi w_{ji}\,. \qquad (35)$$

A.3.2 Gradient w.r.t. Second Layer Weights
Taking the gradient of Eq. 16 with respect to a second-layer weight $w_{kj}$, we get

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{kj}} = \sum_{t,\{k'\}}\Big[R_{\rm ANN}(k',z\,|\,\mathbf{w},\mathbf{I}_t) - R(k',z\,|\,\mathbf{I}_t)\Big]\,\frac{\partial R_{\rm ANN}}{\partial w_{kj}} + \xi w_{kj}\,. \qquad (36)$$

Since $R_{\rm ANN}(k',z\,|\,\mathbf{w},\mathbf{I}_t) = a(k',z\,|\,\mathbf{w},\mathbf{I}_t)$ (see Eq. 27) for the output nodes, we get

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{kj}} = \sum_{t,\{k'\}}\Big[R_{\rm ANN}(k',z\,|\,\mathbf{w},\mathbf{I}_t) - R(k',z\,|\,\mathbf{I}_t)\Big]\,\frac{\partial a_{tk'}}{\partial w_{kj}} + \xi w_{kj}\,, \qquad (37)$$

where, as before, we use the shorthand notation $a_{tk'} \equiv a(k',z\,|\,\mathbf{w},\mathbf{I}_t)$. From Eq. 27, we get

$$\frac{\partial a_{tk'}}{\partial w_{kj}} = \sum_{j'=0}^{N}\frac{\partial w_{k'j'}}{\partial w_{kj}}\,z_{tj'} = \sum_{j'=0}^{N}\delta_{kk'}\,\delta_{jj'}\,z_{tj'} = \delta_{kk'}\,z_{tj}\,. \qquad (38)$$

Inserting Eq. 38 into Eq. 37, we get our final equation for the derivative of the PkANN cost function with respect to the second layer of adaptive weights $w_{kj}$,

$$\frac{\partial\left[\chi^2_C(\mathbf{w})\right]}{\partial w_{kj}} = \sum_{t,\{k'\}}\Big[R_{\rm ANN}(k',z\,|\,\mathbf{w},\mathbf{I}_t) - R(k',z\,|\,\mathbf{I}_t)\Big]\,\delta_{kk'}\,z_{tj} + \xi w_{kj} = \sum_{t}\Big[R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t) - R(k,z\,|\,\mathbf{I}_t)\Big]\,z_{tj} + \xi w_{kj}\,. \qquad (39)$$

For any choice of weights $\mathbf{w}$, the network output vector $R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t)$ is determined for each cosmology $\mathbf{I}_t$ in the training set by progressing sequentially through the network layers, from inputs to outputs, calculating the activation of each node. Having calculated the activations and network outputs for all cosmologies, it is straightforward to compute the derivatives in Eqs. 35 and 39.

A.4 BFGS Approximation for Inverse-Hessian Matrix
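Before constructing $\mathbf{G}$, the analytic derivatives of Eqs. 35 and 39 can be spot-checked numerically. The sketch below is an assumed implementation for a single training pattern, with the cost normalized as $\frac{1}{2}\sum_k[y_k - R_k]^2 + \frac{1}{2}\xi\,||\mathbf{w}||^2$ so that its gradients take the forms derived above; all sizes and values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N_in, N, N_out = 3, 4, 2
xi = 0.1                                     # regularization parameter
w1 = rng.normal(size=(N, N_in + 1))          # first-layer weights w_ji
w2 = rng.normal(size=(N_out, N + 1))         # second-layer weights w_kj
x = np.concatenate(([1.0], rng.normal(size=N_in)))  # inputs with bias x_0 = 1
R = rng.normal(size=N_out)                   # target outputs R(k, z | I_t)

def forward(w1, w2):
    g = 1.0 / (1.0 + np.exp(-(w1 @ x)))      # hidden sigmoids g(a_j)
    z = np.concatenate(([1.0], g))           # hidden bias z_0 = 1
    return g, z, w2 @ z                      # linear outputs y_k

def cost(w1, w2):                            # 1/2 sum (y-R)^2 + 1/2 xi ||w||^2
    _, _, y = forward(w1, w2)
    return 0.5*np.sum((y - R)**2) + 0.5*xi*(np.sum(w1**2) + np.sum(w2**2))

g, z, y = forward(w1, w2)
delta = y - R                                # [R_ANN - R] for each output k

grad_w2 = np.outer(delta, z) + xi*w2         # Eq. 39
back = (w2[:, 1:].T @ delta) * g * (1.0 - g) # sum_k delta_k w_kj g_j (1-g_j)
grad_w1 = np.outer(back, x) + xi*w1          # Eq. 35

# Central-difference check of one element of each analytic gradient.
eps = 1e-6
w1p, w1m = w1.copy(), w1.copy()
w1p[1, 2] += eps; w1m[1, 2] -= eps
num_w1 = (cost(w1p, w2) - cost(w1m, w2)) / (2.0*eps)
w2p, w2m = w2.copy(), w2.copy()
w2p[0, 1] += eps; w2m[0, 1] -= eps
num_w2 = (cost(w1, w2p) - cost(w1, w2m)) / (2.0*eps)
```

Such a finite-difference check is a standard safeguard when hand-deriving backpropagation formulae; it confirms the weight indexing (including the bias offsets) before any training is attempted.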
In order to minimize the
PkANN cost function $\chi^2_C(\mathbf{w})$ (see Eq. 16) with respect to the weights $\mathbf{w}$, the weights are first randomly initialized to $\mathbf{w}_0$ and then updated iteratively using Eq. 24. Updating the weights involves estimating $\mathbf{G}$, an approximation to the inverse Hessian matrix $\mathbf{H}^{-1}$. The inverse Hessian $\mathbf{H}^{-1}$ evaluated at $\mathbf{w}_0$ is approximated by an $N_W \times N_W$ identity matrix (i.e. $\mathbf{G}_{\mathbf{w}_0} = \mathbf{I}$). Following our discussion in Appendix A.2, the weight vector is updated to $\mathbf{w}_1$ as

$$\mathbf{w}_1 = \mathbf{w}_0 - \lambda_{\mathbf{w}_0}\,\mathbf{g}_{\mathbf{w}_0} \qquad (40)$$

by stepping a distance $\lambda_{\mathbf{w}_0}$ in the quasi-Newton direction $-\mathbf{g}_{\mathbf{w}_0}$. Note that the gradient $\mathbf{g}_{\mathbf{w}_0}$ is computed using Eqs. 35 and 39. The step size $\lambda_{\mathbf{w}_0}$ is such that the gradient of $\chi^2_C$ at $\mathbf{w}_1$ (namely, $\mathbf{g}_{\mathbf{w}_1}$) vanishes along the direction $-\mathbf{g}_{\mathbf{w}_0}$,

$$-\mathbf{g}^{\rm T}_{\mathbf{w}_0}\,\mathbf{g}_{\mathbf{w}_1} = 0\,. \qquad (41)$$

To make any further updates in the weight space, one needs to evaluate $\mathbf{H}^{-1}_{\mathbf{w}_1}$. The inverse Hessian, being an $N_W \times N_W$ matrix, can be computationally expensive to calculate exactly for networks with large $N_W$. As such, we approximate $\mathbf{H}^{-1}_{\mathbf{w}_1}$ by $\mathbf{G}_{\mathbf{w}_1}$. In general, for the $(\tau+1)$th step, the approximation $\mathbf{G}_{\mathbf{w}_{\tau+1}}$ is

$$\mathbf{G}_{\mathbf{w}_{\tau+1}} = \mathbf{G}_{\mathbf{w}_{\tau}} + \frac{1}{S_1}\left[\left(1 + \frac{S_2}{S_1}\right)\mathbf{a}\mathbf{a}^{\rm T} - \mathbf{a}\mathbf{b}^{\rm T}\mathbf{G}_{\mathbf{w}_{\tau}} - \mathbf{G}_{\mathbf{w}_{\tau}}\mathbf{b}\mathbf{a}^{\rm T}\right]\,, \qquad (42)$$

where we use the following definitions for the vectors ($\mathbf{a}$ and $\mathbf{b}$) and the scalars ($S_1$ and $S_2$):

$$\mathbf{a} = \mathbf{w}_{\tau+1} - \mathbf{w}_{\tau}\,,\quad \mathbf{b} = \mathbf{g}_{\mathbf{w}_{\tau+1}} - \mathbf{g}_{\mathbf{w}_{\tau}}\,,\quad S_1 = \mathbf{a}^{\rm T}\mathbf{b}\,,\quad S_2 = \mathbf{b}^{\rm T}\mathbf{G}\mathbf{b}\,. \qquad (43)$$

At each step, the BFGS method makes increasingly more accurate approximations to $\mathbf{G}$. Moreover, since $\mathbf{G}$ is positive definite (by construction), the $\chi^2_C(\mathbf{w})$ minimization algorithm is guaranteed to converge to a local minimum.

A.5 Regularization Parameter $\xi$

In situations where the training data is noisy, controlling the complexity of a network is crucial to avoid overfitting and underfitting issues. An overly complex network may fit the noise in the training data. On the other hand, a very simple network may not be able to capture the signal in a dataset, leading to underfitting.
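Returning briefly to the BFGS construction above: the rank-two update of Eqs. 42 and 43 amounts to a few lines of code. The sketch below is an assumed standalone implementation, not code from the PkANN package, and the test vectors are arbitrary values chosen so that the curvature condition $\mathbf{a}^{\rm T}\mathbf{b} > 0$ holds.

```python
import numpy as np

def bfgs_update(G, a, b):
    # Eq. 42, with S1 = a^T b and S2 = b^T G b as defined in Eq. 43.
    # For symmetric G, the term a b^T G equals outer(a, G b).
    S1 = a @ b
    S2 = b @ (G @ b)
    return (G
            + (1.0 + S2/S1) * np.outer(a, a) / S1
            - (np.outer(a, G @ b) + np.outer(G @ b, a)) / S1)

G = np.eye(3)                     # initial inverse-Hessian approximation
a = np.array([0.5, -0.2, 0.1])    # a = w_{tau+1} - w_tau  (illustrative)
b = np.array([0.4, 0.3, -0.1])    # b = g_{tau+1} - g_tau  (a @ b = 0.13 > 0)
G_new = bfgs_update(G, a, b)
```

The updated matrix satisfies the secant condition $\mathbf{G}_{\rm new}\mathbf{b} = \mathbf{a}$, which is exactly the property that makes $\mathbf{G}$ mimic the inverse Hessian along the most recent step, and it remains symmetric positive definite whenever $\mathbf{a}^{\rm T}\mathbf{b} > 0$.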
Both overfitting and underfitting lead to models with low predictive performance. One of the methods employed to regularize the complexity of a neural network is to train the network by minimizing a cost function that includes a penalty term $\chi^2_Q(\mathbf{w})$ (e.g. see Eq. 15). Small (large) values of the regularization parameter $\xi$ lead to complex (simple) networks. Since the optimum value for $\xi$ is not known a priori, its value is initialized randomly and updated iteratively by the cost-minimization algorithm. Here, we only present the updating rule for $\xi$; for its derivation, refer to Bishop (1995). The PkANN cost function (Eq. 16) can be written as

$$\chi^2_C(\mathbf{w}) = \beta\sum_{t,\{k\}}\Big[R_{\rm ANN}(k,z\,|\,\mathbf{w},\mathbf{I}_t) - R(k,z\,|\,\mathbf{I}_t)\Big]^2 + \alpha\,||\mathbf{w}||^2\,, \qquad (44)$$

where $\alpha$ and $\beta$ are the regularization parameters with $\xi \equiv \alpha/\beta$ and $\beta = 1$. For the purposes of cost minimization, the overall scale factor $\beta$ is irrelevant and the degree of regularization depends only on the ratio $\xi \equiv \alpha/\beta$. For networks where the number of training patterns $N_T$ far exceeds the number of weights $N_W$, Bishop (1995) derives the following updating rules for $\alpha$ and $\beta$:

$$\alpha_{\tau+1} = \frac{N_W}{||\mathbf{w}_{\tau}||^2}\,, \qquad (45)$$

$$\beta_{\tau+1} = \frac{N_T}{\chi^2(\mathbf{w}_{\tau})}\,, \qquad (46)$$

where $\chi^2(\mathbf{w})$ (see Eq. 14) is the sum of squares of residuals for the training data. Thus, we update $\xi$ as

$$\xi_{\tau+1} = \frac{N_W}{N_T}\,\frac{\chi^2(\mathbf{w}_{\tau})}{||\mathbf{w}_{\tau}||^2}\,. \qquad (47)$$

From Eq. 47, we see that for sufficiently complex networks ($N_W \gg 1$) with large amounts of training data ($N_T \gg N_W$), the parameter $\xi \ll 1$. This shows that underfitting and overfitting issues can be avoided by simply choosing network architectures that satisfy the conditions (i) $N_W \gg 1$ and (ii) $N_T \gg N_W$. However, both of these conditions can put a tremendous load on computing resources. In situations where computing time is at a premium, a penalty term is used to achieve a balance between computing load and the desired prediction accuracy of the neural network.
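The scalings in Eq. 47 are easy to verify numerically. In the sketch below, all numerical values ($N_W$, $N_T$, $\chi^2$, $||\mathbf{w}||^2$) are invented for illustration and do not come from PkANN.

```python
# Illustrative numbers only; none of these values are taken from PkANN.
N_W, N_T = 50, 20000          # weights and training patterns, N_T >> N_W >> 1
chi2 = 3.7                    # sum-of-squares residual chi^2(w_tau), assumed
w_norm_sq = 120.0             # ||w_tau||^2, assumed

alpha = N_W / w_norm_sq       # Eq. 45
beta = N_T / chi2             # Eq. 46
xi = alpha / beta             # xi = alpha / beta

# Eq. 47 written out directly, for comparison:
xi_direct = (N_W / N_T) * chi2 / w_norm_sq
```

With $N_T \gg N_W$, the ratio is driven well below unity, which is the sense in which a sufficiently large training set obviates the penalty term.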