Constraints on fNL from Wilkinson Microwave Anisotropy Probe 7-year data using a neural network classifier
B. Casaponsa⋆, M. Bridges, A. Curto, R.B. Barreiro, M.P. Hobson, E. Martínez-González
Instituto de Física de Cantabria, CSIC-Universidad de Cantabria, Avda. de los Castros s/n, 39005 Santander, Spain.
Dpto. de Física Moderna, Universidad de Cantabria, Avda. los Castros s/n, 39005 Santander, Spain.
Astrophysics Group, Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE, U.K.

⋆ e-mail: [email protected]
ABSTRACT
We present a multi-class neural network (NN) classifier as a method to measure non-Gaussianity, characterised by the local non-linear coupling parameter f_NL, in maps of the cosmic microwave background (CMB) radiation. The classifier is trained on simulated non-Gaussian CMB maps with a range of known f_NL values by providing it with wavelet coefficients of the maps; we consider both the HealPix wavelet (HW) and the spherical Mexican hat wavelet (SMHW). When applied to simulated test maps, the NN classifier produces results in very good agreement with those obtained using standard χ² minimisation. The standard deviations of the f_NL estimates for WMAP-like simulations were σ = 22 and σ = 33 for the SMHW and the HW, respectively, which are extremely close to those obtained using classical statistical methods in Curto et al. and Casaponsa et al. Moreover, the NN classifier does not require the inversion of a large covariance matrix, thus avoiding any need to regularise the matrix when it is not directly invertible, and is considerably faster.

Key words: methods: data analysis — cosmic microwave background
1 INTRODUCTION

Artificial intelligence algorithms are being used increasingly to improve the efficiency of computationally intensive data analysis. In particular, neural networks (NNs) have been successfully applied to pattern recognition, classification of objects and parameter estimation in a number of fields, including cosmology (see e.g. Auld et al. 2007).

Cosmological analysis typically involves the use of large datasets and high-precision numerical tools. In particular, the study of deviations from Gaussianity in the distribution of temperature anisotropies in the cosmic microwave background (CMB) requires very demanding computational methods. The simplest way to characterise such a deviation is through third-order moments, as these vanish in the Gaussian case. It is now commonplace in CMB analysis to work in spherical harmonic space, where computing the three-point correlation function or bispectrum can prove difficult, or indeed impossible, due to numerical instability. Some recent efforts have been applied to lessen the computational demand without reducing efficiency; see for example the KSW bispectrum estimator (Komatsu et al. 2005) or the binned estimator (Bucher et al. 2010). Other methods which have also been applied to non-Gaussianity analysis include Minkowski functionals (Hikage et al. 2008; Natoli et al. 2010), wavelet-based methods (Cayón et al. 2001; Mukherjee & Wang 2004; Curto et al. 2009a,b; Pietrobon 2010; Casaponsa et al. 2010), a Bayesian approach (Elsner & Wandelt 2010) and the analysis of the N-dimensional probability density function (Vielva & Sanz 2010).

This paper introduces an approach based on a neural network classifier which, after training on third-order moments of wavelet coefficients derived from simulated Gaussian and non-Gaussian CMB realisations, can be used to estimate the presence and degree of non-Gaussianity in any given data map. We have chosen to estimate the local non-linear coupling parameter f_NL, which parameterises the local non-Gaussianity as a quadratic term in the primordial curvature perturbation. More precisely, f_NL is the amplitude of the second-order corrections to the primordial curvature perturbation (Salopek & Bond 1990; Gangui et al. 1994; Verde et al. 2000; Komatsu & Spergel 2001). This type of non-Gaussianity is predicted even in the simplest slow-roll inflationary scenario, albeit at a very low level, while other inflationary models allow larger f_NL values (for a more complete review see Bartolo et al. 2004, Babich et al. 2004 and Yadav & Wandelt 2010). Estimating the value of f_NL from a given data map using existing methods typically has a high computational cost, and numerical problems (e.g. matrix inversion) usually arise. As we will show, the use of neural networks bypasses these difficulties.

In principle, one could use the pixel temperatures in the CMB map directly, or its spherical harmonic coefficients, as the inputs to the neural network classifier. Nonetheless, we perform a pre-processing step in which we decompose the temperature maps into their wavelet coefficients, which have shown themselves to be sensitive to non-Gaussian signals (e.g. Curto et al. 2009b, 2011a; Casaponsa et al. 2010). In particular, we consider the HealPix wavelet (HW) and a spherical Mexican hat wavelet (SMHW), and compute third-order moments of these wavelet coefficients, the mean value of which is proportional to f_NL.
The network is then trained so that, when presented with these cubic statistics for a new (data) map, it can estimate the f_NL value and its error bar. We apply this method to estimate the degree of non-Gaussianity in the Wilkinson Microwave Anisotropy Probe (WMAP) 7-year data release.

This paper is organized as follows. In Section 2, we give a brief introduction to the wavelet analysis used. An overview of the type of neural network employed and our training procedure follows in Section 3. In Section 4 we explain the generation of our Gaussian and non-Gaussian simulations, and the specific characteristics of our f_NL classification network. We present the results of applying our classifier to simulations and to WMAP 7-year data in Section 5. Our conclusions are summarised in Section 6.

2 WAVELET ANALYSIS

Wavelet methods have seen increasing usage in cosmology. This has been particularly marked in CMB non-Gaussianity analyses, in which competitive results have been obtained using wavelets such as the SMHW (Cayón et al. 2003; Vielva et al. 2004; Cruz et al. 2005; Curto et al. 2011a), directional spherical wavelets (McEwen et al. 2008), the spherical Haar wavelet (SHW) (Tenorio et al. 1999; Barreiro et al. 2000) and, recently, the HealPix wavelet (HW) (Casaponsa et al. 2010). For a review of wavelets applied on the sphere see, for example, McEwen et al. (2007). In essence, decomposing a CMB map into its wavelet coefficients allows one to separate the search for non-Gaussianity on different length-scales, while retaining positional information. In this section we briefly discuss the characteristics of both the HW and the SMHW, and describe how we construct the statistics used in our analysis.
2.1 HealPix wavelet
The HealPix wavelet is very similar to that presented by Shahram et al. (2007). Casaponsa et al. (2010) used a revised version of this wavelet and performed its first cosmological application. In both papers, the central idea is the construction of a fast wavelet method adapted to the HealPix pixelization scheme (Górski et al. 2005). The HW is similar to the SHW in the sense that, at each level of the wavelet transform, one produces both a high- and a low-resolution map. The low-resolution map for the HW is obtained simply by averaging over 4-pixel blocks, and the high-resolution map is just the original map minus the low-resolution map. One begins with the original map at resolution J = 9 (N_side = 512) and performs successive wavelet decompositions until resolution J = 2 (N_side = 2), thereby constructing 7 sets of high- and low-resolution maps. Although the original map is fully represented by the 7 high-resolution maps plus the low-resolution map at J = 2, in our analysis we have used all the high- and low-resolution maps, plus the original map, since this has been shown to improve results (see Casaponsa et al. 2010, for details).

Using all these maps, the third-order moments of the wavelet coefficients are computed as follows:

S_{jkl} = \frac{1}{N_l} \sum_{i=1}^{N_l} \frac{w_{i,j}\, w_{i,k}\, w_{i,l}}{\sigma_j \sigma_k \sigma_l} ,   (1)

where w_{i,j} is the i-th wavelet coefficient of the map at resolution j, σ_j is the dispersion of w_{i,j}, and N_l is the number of pixels in the map at resolution l (since one requires j ≤ k ≤ l). Some of these statistics are redundant (linear dependencies exist between them), so we restrict our analysis to the set of non-redundant statistics, which gives a total of 232 quantities; these are then computed for non-Gaussian simulations with a range of known values of f_NL.

The expected values of these statistics are proportional to the non-linear coupling parameter, and they have previously been used to estimate the best-fitting f_NL value for the data by weighted least-squares parameter estimation (Casaponsa et al. 2010). In that case, the dispersion in the estimated f_NL value for Gaussian simulations is found to be σ(f_NL) = 34, which is slightly larger than the optimal value. The main advantage of the HW is its computational efficiency; for example, constructing the third-order statistics is considerably faster than for the KSW bispectrum estimator (Komatsu et al. 2005), and faster still than for the SMHW (see below). This procedure (for both the HW and the SMHW) does, however, include the estimation and inversion of a correlation matrix, which can be computationally demanding and, in some cases, close to singular. As we will show below, this step is avoided with the use of a NN classifier.

2.2 Spherical Mexican hat wavelet

The spherical Mexican hat wavelet (SMHW) (Antoine & Vandergheynst 1998; Martínez-González et al. 2002) has produced competitive results in constraining primordial non-Gaussianity (Mukherjee & Wang 2004; Curto et al. 2009a,b, 2011a). It is a continuous wavelet that has better frequency localization than the HW, although its computational efficiency is lower. Curto et al. (2011a) use the SMHW to constrain f_NL with an accuracy equivalent to that of the bispectrum estimators (see for example Smith et al. 2009; Fergusson & Shellard 2009; Fergusson et al. 2010; Komatsu et al. 2011; Bucher et al. 2010). The definition of the third-order moments is the same as for the HW. In this case, however, there are more inter-scale combinations, because the scales involved are not restricted by the HealPix pixelization. The total number of non-redundant statistics for the SMHW wavelet coefficients is 680. Using the mean values and covariances of these statistics computed from non-Gaussian simulations, Curto et al. (2011a) applied a χ² minimisation method to obtain optimal uncertainties on the f_NL estimates of σ ≈ 21. However, this method requires a principal component analysis to deal with the degeneracies present in the covariance matrix. As we will see, this problem is avoided with the use of the multi-class neural network classifier.
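For illustration, the following Python sketch (our own, not the code used in the analysis) implements one level of the HW transform and the cubic statistics of eq. (1). It assumes NESTED HEALPix ordering (so that each 4-pixel block consists of the children of one parent pixel), full-sky unmasked maps, and that the coefficient maps passed to hw_statistics have already been brought to a common pixelization, as the cross-scale products require; the function names are illustrative.

```python
import numpy as np

def hw_level(m):
    """One level of the HealPix wavelet transform (NESTED ordering).

    The low-resolution map averages each 4-pixel block (the children of
    one parent pixel); the high-resolution (detail) map is the original
    map minus the low-resolution values replicated back to the children.
    """
    low = m.reshape(-1, 4).mean(axis=1)   # average over 4-pixel blocks
    high = m - np.repeat(low, 4)          # detail coefficients
    return low, high

def hw_statistics(coeff_maps):
    """Third-order statistics S_jkl of eq. (1), for j <= k <= l, from a
    list of coefficient maps already degraded to a common pixelization."""
    norm = [m / m.std() for m in coeff_maps]   # divide by dispersion sigma
    n = len(norm)
    return np.array([np.mean(norm[j] * norm[k] * norm[l])
                     for j in range(n)
                     for k in range(j, n)
                     for l in range(k, n)])

# Usage: successive decompositions starting from an Nside = 512 map,
# keeping each detail map plus the final low-resolution map.
rng = np.random.default_rng(0)
m = rng.normal(size=12 * 512**2)   # stand-in for a NESTED CMB map
maps = []
for _ in range(7):                 # seven successive decompositions
    m, high = hw_level(m)
    maps.append(high)
maps.append(m)                     # final low-resolution map

# Cubic statistics for three maps on a common grid: 10 values of S_jkl.
stats = hw_statistics([rng.normal(size=192) for _ in range(3)])
```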
3 NEURAL NETWORK CLASSIFIERS

Artificial neural networks are a methodology for computing based on massive parallelism and redundancy, features also found in animal brains. They consist of a number of interconnected processors, each of which processes information and passes it to other processors in the network. Well-designed networks are able to 'learn' from a set of training data and to make predictions when presented with new, possibly incomplete, data. These algorithms have been successfully applied in several areas; in particular, we note the following applications in astrophysics: Storrie-Lombardi et al. (1992); Baccigalupi et al. (2000); Vanzella et al. (2004); Auld et al. (2007) and Carballo et al. (2008).

The basic building block of an ANN is the neuron. Information is passed as inputs to the neuron, which processes them and produces an output; the output is typically a simple mathematical function of the inputs. The power of the ANN comes from assembling many neurons into a network, which is then able to model very complex behaviour from input to output. We use a three-layer feed-forward network consisting of a layer of input neurons, a layer of 'hidden' neurons and a layer of output neurons. In such an arrangement each neuron is referred to as a node. Figure 1 shows a schematic design of such a network.

[Figure 1. Schematic of a 3-layer feed-forward neural network.]

The outputs of the hidden layer and the output layer are related to their inputs as follows:

hidden layer: h_j = g^{(1)}(f_j^{(1)}), \quad f_j^{(1)} = \sum_i w_{ji}^{(1)} x_i + b_j^{(1)} ,   (2)

output layer: y_k = g^{(2)}(f_k^{(2)}), \quad f_k^{(2)} = \sum_j w_{kj}^{(2)} h_j + b_k^{(2)} ,   (3)

where the outputs of the hidden layer, h, and of the output layer, y, are given for each hidden node j and each output node k; the index i runs over all input nodes. The functions g^{(1)} and g^{(2)} are called activation functions. The non-linear nature of g^{(1)} is a key ingredient in constructing a viable and practically useful network. This non-linear function must be bounded, smooth and monotonic; we use g^{(1)}(x) = tanh x. For g^{(2)} we simply use g^{(2)}(x) = x. The layout and number of nodes are collectively termed the architecture of the network. For a basic introduction to artificial neural networks the reader is directed to MacKay (2003).

For a given architecture, the weights w and biases b define the operation of the network and are the quantities we wish to determine by some training algorithm; we denote w and b collectively by a. As these parameters vary during training, a very wide range of non-linear mappings between inputs and outputs is possible. In fact, according to a 'universal approximation theorem' (Leshno et al. 1993), a standard three-layer feed-forward network can approximate any continuous function to any degree of accuracy with appropriately chosen activation functions and a sufficient number of hidden nodes.

In our application, we construct a classification network. The aim of any classification method is to place members of a set into groups based on inherent properties or features of the individuals, given some pre-classified training data. Formally, classification can be summarised as finding a classifier C: x → C which maps an object from some (typically multi-dimensional) feature space x to its classification label C, typically taken as one of {1, ..., N}, where N is the number of distinct classes. Thus the problem of classification is to partition feature space into regions (not necessarily contiguous), assigning each region a label corresponding to the appropriate classification. In our context, the aim is to classify sets of third-order statistics of wavelet coefficients of (possibly) non-Gaussian CMB maps (assembled into an input feature vector x) into classes defined by ranges of f_NL; this is discussed in more detail below.

In building a classifier using a neural network, it is convenient to view the problem probabilistically. To this end we consider a 3-layer MLP (multi-layer perceptron) consisting of an input layer (x_i), a hidden layer (h_j) and an output layer (y_k). In classification networks, however, the outputs are transformed according to the softmax procedure

p_k = \frac{e^{y_k}}{\sum_m e^{y_m}} ,   (4)

such that they are all non-negative and sum to unity. In this way p_k can be interpreted as the probability that the input feature vector x belongs to the k-th class. A suitable objective function for the classification problem is then

L(a) = \sum_l \sum_k t_k^{(l)} \ln p_k(x^{(l)}, a) ,   (5)

where the index l runs over the training dataset D = {x^{(l)}, t^{(l)}}, in which the target vector t^{(l)} for the network outputs has unity in the element corresponding to the true class of the l-th feature vector x^{(l)} and zeroes elsewhere. One then wishes to choose the network parameters a so as to maximise this objective function as the training progresses. The advantage of this probabilistic approach is that we gain the ability to make statistical decisions on the appropriate classification in very large feature spaces where a direct linear partition would not be feasible.
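A minimal numpy sketch of eqs (2)-(5) for a single feature vector follows (our own notation, not the authors' code): a tanh hidden layer, a linear output layer, the softmax mapping to class probabilities and the log-likelihood objective over one-hot targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2):
    """Map an input feature vector x to class probabilities p_k."""
    h = np.tanh(W1 @ x + b1)      # eq. (2), g1 = tanh
    y = W2 @ h + b2               # eq. (3), g2 = identity
    e = np.exp(y - y.max())       # eq. (4), max subtracted for stability
    return e / e.sum()

def log_likelihood(data, W1, b1, W2, b2):
    """Objective of eq. (5): sum over the training set of t_k ln p_k,
    with one-hot target vectors t (small epsilon guards log(0))."""
    return sum(t @ np.log(forward(x, W1, b1, W2, b2) + 1e-12)
               for x, t in data)

# Tiny usage example: 232 inputs (the HW statistics), a few hidden
# nodes and 9 output classes, with random weights.
n_in, n_hid, n_class = 232, 4, 9
W1, b1 = 0.01 * rng.normal(size=(n_hid, n_in)), np.zeros(n_hid)
W2, b2 = 0.01 * rng.normal(size=(n_class, n_hid)), np.zeros(n_class)
x, t = rng.normal(size=n_in), np.eye(n_class)[4]
print(forward(x, W1, b1, W2, b2).sum())         # probabilities sum to 1
print(log_likelihood([(x, t)], W1, b1, W2, b2)) # log-prob of class 5
```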
One wishes to choose the network parameters a so as to maximise the objective function L(a) as the training progresses. This is, however, a highly non-linear, multi-modal function in many dimensions, whose optimisation poses a non-trivial problem. We perform this optimisation using the MemSys package (Gull & Skilling 1999). This algorithm considers the parameters a to have prior probabilities proportional to e^{αS(a)}, where S(a) is the positive-negative entropy functional (Hobson & Lasenby 1998) and α is treated as a hyper-parameter of the prior, setting the scale over which variations in a are expected. α is chosen to maximise its marginal posterior probability, and its value is inversely proportional to the standard deviation of the prior. Thus, for a given α, the log-posterior probability is proportional to L(a) + αS(a). For each chosen α there is a solution â that maximises the posterior. As α varies, the set of solutions â is called the maximum-entropy trajectory. We wish to find the solution for which L is maximised, which occurs at the end of the trajectory where α = 0. For practical purposes we start at a large value of α and iterate downwards until α is sufficiently small that the posterior is dominated by the L term. MemSys performs this algorithm using conjugate gradient descent at each step to converge to the maximum-entropy trajectory. The required matrix of second derivatives of L is approximated using vector routines only, thus circumventing the expensive operations required for exact calculations. The application of MemSys to the problem of network training allows the fast, efficient training of relatively large network structures on large datasets that would otherwise be difficult to handle in a reasonable time.

Moreover, the MemSys package also computes the Bayesian evidence for the model (i.e. the network) under consideration (see for example Jaynes 2003, for a review), which provides a powerful model selection tool. In principle, values of the evidence computed for each possible architecture of the network (and training data) provide a mechanism to select the most appropriate architecture, which is simply the one that maximises the evidence (although we will use a more prosaic method below for deciding on the network architecture). The MemSys algorithm is described in greater detail in Gull & Skilling (1999).
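MemSys itself is proprietary, so the sketch below is only an illustrative stand-in: it maximises L(a) + αS(a) by plain gradient ascent while α is stepped downwards along the maximum-entropy trajectory, using the positive-negative entropy functional. MemSys additionally uses conjugate gradients and its evidence-based choice of α, neither of which is reproduced here; all function names are ours.

```python
import numpy as np

def pos_neg_entropy(a, m=1.0):
    """Positive-negative entropy S(a) (Hobson & Lasenby 1998), defined
    for parameters of either sign; m sets the expected scale."""
    psi = np.sqrt(a**2 + 4.0 * m**2)
    return np.sum(psi - 2.0 * m - a * np.log((psi + a) / (2.0 * m)))

def grad_entropy(a, m=1.0):
    """Analytic gradient dS/da_i = -ln[(psi_i + a_i) / (2m)];
    it vanishes at a = 0, where S attains its maximum."""
    psi = np.sqrt(a**2 + 4.0 * m**2)
    return -np.log((psi + a) / (2.0 * m))

def train(a0, grad_L, alphas=(100.0, 10.0, 1.0, 0.1, 0.0),
          steps=500, eta=1e-3):
    """Toy stand-in for the MemSys scheme: gradient ascent on
    L(a) + alpha*S(a), iterating alpha downwards towards alpha = 0."""
    a = a0.copy()
    for alpha in alphas:
        for _ in range(steps):
            a += eta * (grad_L(a) + alpha * grad_entropy(a))
    return a

# Toy demo: L(a) = -||a - 3||^2 / 2 has its unregularised maximum at
# a = 3; as alpha is lowered the solution moves from near 0 towards 3.
a_hat = train(np.zeros(5), grad_L=lambda a: 3.0 - a)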
4 f_NL CLASSIFICATION NETWORK

To train our f_NL classification network we provide it with an ensemble of training data D = {x^{(l)}, t^{(l)}}. The l-th input vector x^{(l)} contains the third-order statistics of the wavelet coefficients of the l-th simulated CMB map. The output classes of our network correspond to contiguous ranges of f_NL values. Thus, the target vector t^{(l)} for the network outputs has zeroes everywhere except for a unit entry in the element corresponding to the class in which the true f_NL value of the l-th simulated CMB map falls.

The output classes of the network were defined by dividing some initial (anticipated) range of f_NL values into equal-width subranges. For example, for a total range of −30 ≤ f_NL < 30 and a network with just 3 output classes, input vectors constructed from maps with −30 ≤ f_NL < −10 would be assigned to class 1 with an associated target vector t = (1, 0, 0), those with −10 ≤ f_NL < 10 to class 2 with t = (0, 1, 0), and those with 10 ≤ f_NL < 30 to class 3 with t = (0, 0, 1). The corresponding network output for an input vector x would be a 3-dimensional vector p = (p_1, p_2, p_3), where Σ_k p_k = 1 and p_k can be interpreted as the probability that the input vector belongs to class k. The discrepancy between the targets and the outputs during training can be measured by the true positive rate, which is simply the fraction of the training input vectors for which the network assigns the maximum probability to the correct class.

From the output values p_k obtained for each map, we define the estimator of the local non-Gaussianity parameter to be simply

\hat{f}_NL = \sum_{k=1}^{n_class} ⟨f_NL⟩_k p_k ,   (6)

where ⟨f_NL⟩_k is the mean value of f_NL in the k-th class. The statistical properties of this estimator, namely its mean and dispersion, determine the accuracy of the method.
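A short sketch of this bookkeeping, with equal-width classes as described above (the naming is our own): it builds the class edges and centres, forms one-hot target vectors, and evaluates the estimator of eq. (6) from a vector of softmax probabilities.

```python
import numpy as np

def make_classes(f_range=(-120.0, 120.0), n_class=9):
    """Equal-width f_NL classes; returns bin edges and class centres
    <f_NL>_k (an odd n_class keeps f_NL = 0 off a class boundary)."""
    edges = np.linspace(f_range[0], f_range[1], n_class + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return edges, centres

def target_vector(f_nl, edges):
    """One-hot target t^(l): unity in the element of the true class."""
    t = np.zeros(len(edges) - 1)
    k = np.searchsorted(edges, f_nl, side='right') - 1
    t[np.clip(k, 0, len(t) - 1)] = 1.0
    return t

def fnl_estimate(p, centres):
    """Estimator of eq. (6): f_NL_hat = sum_k <f_NL>_k p_k."""
    return float(np.dot(centres, p))

edges, centres = make_classes()
print(target_vector(-115.0, edges))            # class 1 of 9
print(fnl_estimate(np.full(9, 1 / 9), centres))  # uniform p -> 0.0
```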
The training input vectors x^{(l)} were generated as follows. We began with a set of 1000 non-Gaussian CMB realisations, from which the a_lm^NG and a_lm^G were generated by Elsner & Wandelt (2009) and normalized to the WMAP 7-year concordance model power spectrum generated by CAMB. These a_lm are publicly available¹.

¹ http://planck.mpa-garching.mpg.de/cmb/fnl-simulations/

The ultimate accuracy of the network classifier is improved, however, by the inclusion of further training data. Given the finite number of available simulations, we thus created a further set by rotating the original maps by 90° about an axis perpendicular to the galactic plane. This rotation exposes roughly 20 per cent of extra map area relative to the original mask; we verified that its inclusion improves the results. Using this procedure we generated a further 1000 non-Gaussian simulations. Of the 2000 non-Gaussian maps, 1800 were used for training and the remainder were set aside for testing of the networks.

For each non-Gaussian simulation used for training, sets of a_lm were then generated with varying f_NL using the following prescription:

a_lm = a_lm^G + f_NL a_lm^NG ,   (7)

with 20 different random f_NL values between −120 and 120 for the HW decomposition, and between −76 and 76 for the SMHW analysis. Thus, for each non-Gaussian simulation, 20 sets of a_lm were generated, and the total number of available training data sets is 36000. Noise-weighted V+W-band WMAP realizations were then constructed as explained in Curto et al. (2009a) and Casaponsa et al. (2010), and the KQ75 mask was applied, which covers roughly 29% of the sky. A wavelet decomposition for both the HW and the SMHW was performed to determine the wavelet coefficients for each a_lm set, and their third-order moments computed. These statistics were provided as inputs to the neural network; each input vector contained 232 values for the HW and 680 for the SMHW.
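As a minimal sketch of this prescription (assuming the healpy package for harmonic synthesis; the function name, and the uniform draw of the f_NL values, are our illustrative choices, since the paper only states that the 20 values in the range are random):

```python
import numpy as np
import healpy as hp  # assumed; any alm -> map synthesis routine would do

def make_training_map(alm_g, alm_ng, f_nl, nside=512):
    """Eq. (7): a_lm = a_lm^G + f_NL * a_lm^NG, rendered as a HEALPix
    map. alm_g / alm_ng are the Gaussian and non-Gaussian harmonic
    coefficients of one Elsner & Wandelt (2009) simulation."""
    return hp.alm2map(alm_g + f_nl * alm_ng, nside=nside)

# 20 maps per simulation, with f_NL drawn at random in the training
# range (here the HW range):
rng = np.random.default_rng(1)
f_nl_values = rng.uniform(-120.0, 120.0, size=20)
```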
The architecture of our 3-layer neural networks is defined by two free parameters: the number of hidden nodes, n_hid, and the number of output classes, n_class, into which the f_NL range is divided. A further parameter, which determines the accuracy of the network classifier, is the quantity of training data, n_data. Variation of these parameters can affect the training efficiency, so it is desirable to explore this training space adequately in order to find an optimal set of parameters.

Although the MemSys algorithm provides routines to determine the optimal number of hidden nodes using the Bayesian evidence (Gull & Skilling 1999), in this application n_hid is determined simply by measuring training times and the accuracy of the trained networks on an independent testing set. We have found that in fact the optimal architecture contains no hidden nodes, resulting in what is effectively a linear classifier. This is not surprising, since we are effectively 'asking' the network to learn the mean value and dispersion of the third-order moments of the wavelet coefficients for each f_NL; since the expectation value depends linearly on f_NL, this network architecture trivially satisfies the requirement. Indeed, networks of this sort provide a simple way of obtaining the (pseudo)inverse of any matrix.

The number of output classes, n_class, is clearly related to the total range of f_NL considered and to the size of the subranges into which this range is divided. Here we consider networks with n_class = 9 (an odd number ensures that f_NL = 0 does not lie on the boundary of a class). The range of f_NL was chosen a priori to correspond to a few times the dispersion σ in the f_NL estimates obtained previously using the standard χ² minimisation method. Thus, the full range was taken to be −120 ≤ f_NL < 120 for the HW and −76 ≤ f_NL < 76 for the SMHW, resulting in subranges per class of width 27 and 17 units, respectively. This combination fulfilled all the requirements of classification over the range of our simulated data.

The quantity of training data, n_data, determines the accuracy of the resulting classification network. Naturally, the network accuracy increases with n_data, but it typically saturates after a given number. We found that the quantity of data required saturated at roughly n_data ∼ 10000 (see Figure 2).
Figure 3 illustrates the training evolution for the classification networks with n_hid = 0 and n_class = 9. In the top two panels we plot the true positive rates (TPR) of the network on the training set and the test set, for the HW and SMHW respectively; in each plot, the TPR on the training set has been multiplied by a factor less than unity to highlight its divergence from the TPR for the test set. We see that this divergence occurs after ∼100 and ∼500 iterations of the MemSys optimiser for the HW and SMHW, respectively. The training was therefore terminated at this point to construct our final classification networks.

[Figure 2. Results of the dispersion of f̂_NL for 1000 Gaussian simulations for different values of n_data.]

A key criterion in determining the quality of our classifiers is the dispersion of the f̂_NL values obtained in the testing set. This is plotted in the bottom two panels of Figure 3 for the HW and SMHW, respectively. We note that, in each case, this dispersion increases noticeably beyond the point where the TPRs on the training and testing sets diverge.

[Figure 3. Evolution of the true positive rate for each iteration of the training process for a neural network with n_hid = 0 and n_data = 10000. Note that the TPR of the training set has been multiplied by a factor less than unity in order to highlight the divergence of the behaviours. The bottom panels show the variation of the dispersion of the estimate f̂_NL during the training. Left panels for HW and right panels for SMHW.]

5 RESULTS

We first applied our classifiers to 1000 WMAP-7yr simulations made from Gaussian CMB maps (f_NL = 0). For the HW classifier, we obtained ⟨f̂_NL⟩ = −1, which indicates that the estimator is essentially unbiased. Moreover, the dispersion of the estimator, σ(f̂_NL) = 33, is extremely similar to that obtained with the weighted least-squares method (σ(f̂_NL) = 34). The full distribution of the estimator is shown in the top panel of Fig. 4. For the SMHW classifier, we again found ⟨f̂_NL⟩ = −1, with a dispersion of σ(f̂_NL) = 22, which is very close to the optimal value of σ(f̂_NL) = 21. The distribution of the estimator for the SMHW is shown in the bottom panel of Fig. 4.

[Figure 4. Distribution of f̂_NL obtained from 1000 Gaussian realizations for HW (top) and SMHW (bottom).]

The histogram bins in Fig. 4 have the same size and central values as those used to define the network classes. We see that the classes at extremal f_NL values are empty, indicating that the network placed no maps in these f_NL ranges. Thus, for estimating f_NL from Gaussian or nearly Gaussian maps, the range in f_NL used is sufficiently wide.

We next applied our estimator to sets of non-Gaussian simulations, each with a different non-zero f_NL value. For each true f_NL value, we analysed the corresponding WMAP simulations and calculated the mean and dispersion of our estimator f̂_NL for both the HW and SMHW classifiers. The results are shown in Fig. 5, in which we plot the mean value of f̂_NL against the true f_NL value. We see that the classifiers are unbiased for |f_NL| ≲ σ, with an almost constant dispersion. For larger |f_NL| values, however, the estimator becomes progressively more biased and its dispersion decreases.

[Figure 5. The mean and dispersion of f̂_NL obtained for a number of representative values of the true f_NL for the HW network (top) and the SMHW network (bottom).]

The latter behaviour is simply understood as an edge effect due to the finite total range of f_NL considered by the networks. This point is illustrated in Fig. 6, in which we plot the full distributions of f̂_NL obtained for a number of representative values of the true f_NL. We see that for |f_NL| ≲ σ we obtain a close-to-symmetric distribution centred on the true f_NL value, with no maps being placed in the extreme classes. For |f_NL| > σ, however, the classifier does begin to place maps in the extreme classes, resulting in the distribution of f̂_NL becoming severely skewed and no longer centred on the true value. Of course, if one were to encounter this behaviour in the analysis of a real data set, one could simply alter the range of f_NL considered by the network and retrain.

In any case, the above results show that both the HW and SMHW network classifiers produce unbiased estimates f̂_NL provided −σ < f_NL < σ. Moreover, the dispersions of these estimators are very similar to those obtained with the classical weighted least squares (WLS) method, indicating that neural networks can produce very accurate results within the limitations described above. In the case of the SMHW this is a particularly important result, since the complexity of the covariance matrix inversion required in the standard approach is bypassed via the use of the neural network classifier; Curto et al. (2011a) used a principal component analysis to reduce the covariance matrix dimension to allow inversion.

Applying the neural network classifiers to real data (the V+W WMAP 7-year data map), we obtain f̂_NL = −12 for the HW and f̂_NL = 19 for the SMHW. Both these values lie well within the corresponding dispersion of the estimator.
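The confidence limits quoted next appear to follow the convention of offsetting the Gaussian-simulation percentiles of the estimator by the data estimate (a reading consistent with Table 1, e.g. −12 + 63 = 51 for the HW); a small sketch, with our own function name:

```python
import numpy as np

def confidence_limits(fnl_data, fnl_gauss_sims, cl=95.0):
    """Limits on f_NL at confidence level cl: the percentiles of the
    estimator over Gaussian simulations, offset by the data estimate."""
    lo, hi = np.percentile(fnl_gauss_sims,
                           [(100 - cl) / 2, (100 + cl) / 2])
    return fnl_data + lo, fnl_data + hi

# e.g. for the HW: a data estimate of -12 with Gaussian-simulation
# percentiles (-66, +63) gives roughly -78 < f_NL < 51, as in Table 1.
```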
From the corresponding f̂_NL distributions obtained on simulated data, we find that the 95% confidence limits are −78 < f_NL < 51 for the HW and −24 < f_NL < 61 for the SMHW. We therefore conclude that the data are consistent with the Gaussian hypothesis. We note that the SMHW confidence limits are very similar to those obtained with the optimal f_NL estimator (Komatsu et al. 2011; Smith et al. 2009).

These results are summarised in Table 1, along with those found via the weighted least squares (WLS) method. The latter results are also consistent with Gaussianity. It is worth mentioning, however, the different values of f̂_NL obtained by the neural network and the WLS methods, for both the HW and the SMHW. Although all four values lie well within their corresponding dispersions, each method returns a different f̂_NL value when applied to the same WMAP-7yr dataset. This behaviour is to be expected, however, since these are four different estimators of f_NL; therefore, in general, they will not be equal, even when applied to the same input data. Only the statistical properties (e.g. mean, dispersion) of their sampling distributions are important. Note that the constraints are not corrected for the contribution of unresolved point sources.

Table 1. Results obtained with neural networks (NN) and weighted least squares (WLS). f̂_NL,data is the best-fitting value for V+W WMAP data; ⟨f̂_NL,gauss⟩ and σ(f̂_NL) are the expected value and the standard deviation for Gaussian simulations; P_2.5 and P_97.5 are the percentile values at the 95% confidence level of f̂_NL for Gaussian realizations.

Method | f̂_NL,data | σ(f̂_NL) | ⟨f̂_NL,gauss⟩ | P_2.5 | P_97.5
SMHW (NN) | 19 | 22 | −1 | −43 | 42
SMHW (WLS; Curto et al. 2011b) | 37 | 21 | 0 | −42 | 46
HW (NN) | −12 | 33 | −1 | −66 | 63
HW (WLS; Casaponsa et al. 2011) | — | 34 | — | −68 | 67

[Figure 6. Distribution of f̂_NL obtained from 200 non-Gaussian realizations with representative true f_NL values, for HW (left) and SMHW (right).]
6 CONCLUSIONS

We have trained a multi-class neural network classifier with third-order moments of the HW and SMHW coefficients of non-Gaussian realizations in order to set constraints on the local non-linear coupling parameter f_NL using WMAP 7-year data. We found that with a very simple network and few iterations (requiring just a few seconds of CPU time) it is possible to produce the same results as those obtained with the weighted least squares method. This is an interesting achievement, as it bypasses all covariance-matrix-related computations and their associated regularisation problems. The estimation of the covariance matrix with both wavelets requires the analysis of at least 10000 Gaussian simulations, which involves a huge demand on CPU time, in particular with the SMHW statistics. The error bars on the estimation of f_NL computed with Gaussian simulations are σ(f̂_NL) = 33 for the HW and σ(f̂_NL) = 22 for the SMHW, which are extremely similar to those obtained in Casaponsa et al. (2010) and Curto et al. (2011a) using the same statistics but a different estimator based on the weighted least squares method (σ = 34 and σ = 21 for the HW and SMHW, respectively). The constraints for WMAP 7-year data were found to be −78 < f_NL < 51 for the HW and −24 < f_NL < 61 for the SMHW, which are compatible with a Gaussian distribution, as found in Smith et al. (2009); Curto et al. (2009b); Komatsu et al. (2011); Casaponsa et al. (2010) and Curto et al. (2011b). The results obtained with the SMHW statistics are similar to those found in Smith et al. (2009) and Komatsu et al. (2011), which are the most stringent ones currently available at the limit of the WMAP resolution. Further analysis, such as the contribution to f_NL of unresolved point sources or foregrounds, can be performed by applying the linear classifier to the statistics of new simulated maps containing these characteristic signals.

ACKNOWLEDGMENTS

We acknowledge partial financial support from the Spanish Ministerio de Ciencia e Innovación project AYA2010-21766-C03-01 and from the CSIC-The Royal Society joint project with reference 2008GB0012. B. Casaponsa thanks the Spanish Ministerio de Ciencia e Innovación for a pre-doctoral fellowship. The authors acknowledge the computer resources, technical expertise and assistance provided by the Spanish Supercomputing Network (RES) node at Universidad de Cantabria. We acknowledge the use of the Legacy Archive for Microwave Background Data Analysis (LAMBDA); support for it is provided by the NASA Office of Space Science. The HealPix package was used throughout the data analysis (Górski et al. 2005).

REFERENCES

Antoine J.-P., Vandergheynst P., 1998, Journal of Mathematical Physics, 39, 3987
Auld T., Bridges M., Hobson M. P., Gull S. F., 2007, MNRAS, 376, L11
Babich D., Creminelli P., Zaldarriaga M., 2004, Journal of Cosmology and Astro-Particle Physics, 8, 9
Baccigalupi C., Bedini L., Burigana C., De Zotti G., Farusi A., Maino D., Maris M., Perrotta F., Salerno E., Toffolatti L., Tonazzini A., 2000, MNRAS, 318, 769
Barreiro R. B., Hobson M. P., Lasenby A. N., Banday A. J., Górski K. M., Hinshaw G., 2000, MNRAS, 318, 475
Bartolo N., Komatsu E., Matarrese S., Riotto A., 2004, Phys. Rep., 402, 103
Bucher M., van Tent B., Carvalho C. S., 2010, MNRAS, 407, 2193
Carballo R., González-Serrano J. I., Benn C. R., Jiménez-Luján F., 2008, MNRAS, 391, 369
Casaponsa B., Barreiro R. B., Curto A., Martínez-González E., Vielva P., 2010, ArXiv e-prints
Cayón L., Martínez-González E., Argüeso F., Banday A. J., Górski K. M., 2003, MNRAS, 339, 1189
Cayón L., Sanz J. L., Martínez-González E., Banday A. J., Argüeso F., Gallegos J. E., Górski K. M., Hinshaw G., 2001, MNRAS, 326, 1243
Cruz M., Martínez-González E., Vielva P., Cayón L., 2005, MNRAS, 356, 29
Curto A., Martínez-González E., Barreiro R. B., 2009b, ApJ, 706, 399
Curto A., Martínez-González E., Barreiro R. B., 2011a, MNRAS, 412, 1023
Curto A., Martínez-González E., Barreiro R. B., Hobson M. P., 2011b, submitted to MNRAS
Curto A., Martínez-González E., Mukherjee P., Barreiro R. B., Hansen F. K., Liguori M., Matarrese S., 2009a, MNRAS, 393, 615
Elsner F., Wandelt B. D., 2009, ApJS, 184, 264
Elsner F., Wandelt B. D., 2010, ApJ, 724, 1262
Fergusson J. R., Liguori M., Shellard E. P. S., 2010, Phys. Rev. D, 82, 023502
Fergusson J. R., Shellard E. P. S., 2009, Phys. Rev. D, 80, 043510
Gangui A., Lucchin F., Matarrese S., Mollerach S., 1994, ApJ, 430, 447
Górski K. M., Hivon E., Banday A. J., Wandelt B. D., Hansen F. K., Reinecke M., Bartelmann M., 2005, ApJ, 622, 759
Gull S. F., Skilling J., 1999, Quantified Maximum Entropy: MemSys5 Users' Manual. Maximum Entropy Data Consultants Ltd, Royston
Hikage C., Matsubara T., Coles P., Liguori M., Hansen F. K., Matarrese S., 2008, MNRAS, 389, 1439
Hobson M. P., Lasenby A. N., 1998, MNRAS, 298, 905
Jaynes E., 2003, Probability Theory: The Logic of Science. Cambridge University Press
Komatsu E., Smith K. M., Dunkley J., Bennett C. L., Gold B., Hinshaw G., Jarosik N., Larson D., Nolta M. R., Page L., Spergel D. N., Halpern M., Wright E. L., 2011, ApJS, 192, 18
Komatsu E., Spergel D. N., 2001, Phys. Rev. D, 63, 063002
Komatsu E., Spergel D. N., Wandelt B. D., 2005, ApJ, 634, 14
Leshno M., Lin V. Y., Pinkus A., Schocken S., 1993, Neural Networks, 6, 861
MacKay D., 2003, Information Theory, Inference and Learning Algorithms. Cambridge University Press
Martínez-González E., Gallegos J. E., Argüeso F., Cayón L., Sanz J. L., 2002, MNRAS, 336, 22
McEwen J. D., Hobson M. P., Lasenby A. N., Mortlock D. J., 2008, MNRAS, 388, 659
McEwen J. D., Vielva P., Wiaux Y., Barreiro R. B., Cayón L., Hobson M. P., Lasenby A. N., Martínez-González E., Sanz J. L., 2007, Journal of Fourier Analysis and Applications, 13, 495
Mukherjee P., Wang Y., 2004, ApJ, 613, 51
Natoli P., de Troia G., Hikage C., Komatsu E., Migliaccio M., Ade P. A. R., Bock J. J., Bond J. R., Borrill J., Boscaleri A., Contaldi C. R., Crill B. P., de Bernardis P., de Gasperis G., de Oliveira-Costa A., di Stefano G., Hivon E., 2010, MNRAS, 408, 1658
Pietrobon D., 2010, Memorie della Società Astronomica Italiana Supplementi, 14, 278
Salopek D. S., Bond J. R., 1990, Phys. Rev. D, 42, 3936
Shahram M., Donoho D., Starck J., 2007, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series Vol. 6701, Multiscale representation for data on the sphere and applications to geopotential data
Smith K. M., Senatore L., Zaldarriaga M., 2009, Journal of Cosmology and Astro-Particle Physics, 9, 6
Storrie-Lombardi M. C., Lahav O., Sodré Jr. L., Storrie-Lombardi L. J., 1992, MNRAS, 259, 8P
Tenorio L., Jaffe A. H., Hanany S., Lineweaver C. H., 1999, MNRAS, 310, 823
Vanzella E., Cristiani S., Fontana A., Nonino M., Arnouts S., Giallongo E., Grazian A., Fasano G., Popesso P., Saracco P., Zaggia S., 2004, A&A, 423, 761
Verde L., Wang L., Heavens A. F., Kamionkowski M., 2000, MNRAS, 313, 141
Vielva P., Martínez-González E., Barreiro R. B., Sanz J. L., Cayón L., 2004, ApJ, 609, 22
Vielva P., Sanz J. L., 2010, MNRAS, 404, 895
Yadav A. P. S., Wandelt B. D., 2010, Advances in Astronomy, 2010