Constraints on fNL from Wilkinson Microwave Anisotropy Probe 7-year data using a neural network classifier
B. Casaponsa⋆, M. Bridges, A. Curto, R.B. Barreiro, M.P. Hobson, E. Martínez-González
Instituto de Física de Cantabria, CSIC-Universidad de Cantabria, Avda. de los Castros s/n, 39005 Santander, Spain.
Dpto. de Física Moderna, Universidad de Cantabria, Avda. los Castros s/n, 39005 Santander, Spain.
Astrophysics Group, Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE, U.K.

⋆ e-mail: [email protected]
ABSTRACT
We present a multi-class neural network (NN) classifier as a method to measure non-Gaussianity, characterised by the local non-linear coupling parameter f_NL, in maps of the cosmic microwave background (CMB) radiation. The classifier is trained on simulated non-Gaussian CMB maps with a range of known f_NL values by providing it with wavelet coefficients of the maps; we consider both the HealPix wavelet (HW) and the spherical Mexican hat wavelet (SMHW). When applied to simulated test maps, the NN classifier produces results in very good agreement with those obtained using standard χ² minimisation. The standard deviations of the f_NL estimates for WMAP-like simulations were σ = 22 and σ = 33 for the SMHW and the HW, respectively, which are extremely close to those obtained using classical statistical methods in Curto et al. and Casaponsa et al. Moreover, the NN classifier does not require the inversion of a large covariance matrix, thus avoiding any need to regularise the matrix when it is not directly invertible, and is considerably faster.

Key words: methods: data analysis — cosmic microwave background
1 INTRODUCTION

Artificial intelligence algorithms are being used increasingly to improve the efficiency of computationally intensive data analysis. In particular, neural networks (NNs) have been successfully applied to pattern recognition, classification of objects and parameter estimation in a number of fields, including cosmology (see e.g. Auld et al. 2007).

Cosmological analysis typically involves the use of large datasets and high-precision numerical tools. In particular, the study of deviations from Gaussianity in the distribution of temperature anisotropies in the cosmic microwave background (CMB) requires very demanding computational methods. The simplest way to characterise such a deviation is through third-order moments, as these vanish in the Gaussian case. It is now commonplace in CMB analysis to work in spherical harmonic space, where computing the three-point correlation function or bispectrum can prove difficult, or indeed impossible, due to numerical instability. Some recent efforts have been applied to lessen the computational demand without reducing efficiency; see for example the KSW bispectrum estimator (Komatsu et al. 2005) or the binned estimator (Bucher et al. 2010). Other methods which have also been applied to non-Gaussianity analysis include Minkowski functionals (Hikage et al. 2008; Natoli et al. 2010), wavelet-based methods (Cayón et al. 2001; Mukherjee & Wang 2004; Curto et al. 2009a,b; Pietrobon 2010; Casaponsa et al. 2010), a Bayesian approach (Elsner & Wandelt 2010) and the analysis of the N-dimensional probability density function (Vielva & Sanz 2010).

This paper introduces an approach based on a neural network classifier which, after training on third-order moments of wavelet coefficients derived from simulated Gaussian and non-Gaussian CMB realisations, can be used to estimate the presence and degree of non-Gaussianity in any given data map. We have chosen to estimate the local non-linear coupling parameter f_NL, which parameterises the local non-Gaussianity as a quadratic term in the primordial curvature perturbation. More precisely, f_NL is the amplitude of the second-order corrections to the primordial curvature perturbation (Salopek & Bond 1990; Gangui et al. 1994; Verde et al. 2000; Komatsu & Spergel 2001). This type of non-Gaussianity is predicted even in the simplest slow-roll inflationary scenario, albeit at a very low level, while other inflationary models allow larger f_NL values (for a more complete review see Bartolo et al. 2004, Babich et al. 2004 and Yadav & Wandelt 2010). Estimating the value of f_NL from a given data map using existing methods typically has a high computational cost, and numerical problems (e.g. matrix inversion) usually arise. As we will show, the use of neural networks bypasses these difficulties.

In principle, one could use the pixel temperatures in the CMB map directly, or its spherical harmonic coefficients, as the inputs to the neural network classifier. Nonetheless, we perform a pre-processing step in which we decompose the temperature maps into their wavelet coefficients, which have shown themselves to be sensitive to non-Gaussian signals (e.g. Curto et al. 2009b, 2011a; Casaponsa et al. 2010). In particular, we consider the HealPix wavelet (HW) and a spherical Mexican hat wavelet (SMHW), and compute third-order moments of these wavelet coefficients, the mean value of which is proportional to f_NL.
The network is then trained so that, when presented with these cubic statistics for a new (data) map, it can estimate the f_NL value and its error bar. We apply this method to estimate the degree of non-Gaussianity in the Wilkinson Microwave Anisotropy Probe (WMAP) 7-year data release.

This paper is organized as follows. In Section 2, we give a brief introduction to the wavelet analysis used. An overview of the type of neural network employed and our training procedure follows in Section 3. In Section 4 we explain the generation of our Gaussian and non-Gaussian simulations, and the specific characteristics of our f_NL classification network. We present the results of applying our classifier to simulations and to WMAP 7-year data in Section 5. Our conclusions are summarised in Section 6.

2 WAVELET ANALYSIS

Wavelet methods have seen increasing usage in cosmology. This has been particularly marked in CMB non-Gaussianity analyses, in which competitive results have been obtained using wavelets such as the SMHW (Cayón et al. 2003; Vielva et al. 2004; Cruz et al. 2005; Curto et al. 2011a), directional spherical wavelets (McEwen et al. 2008), the spherical Haar wavelet (SHW) (Tenorio et al. 1999; Barreiro et al. 2000) and, recently, the HealPix wavelet (HW) (Casaponsa et al. 2010). For a review of wavelets applied on the sphere see, for example, McEwen et al. (2007). In essence, decomposing a CMB map into its wavelet coefficients allows one to separate the search for non-Gaussianity on different length-scales, while retaining positional information. In this section we briefly discuss the characteristics of both the HW and the SMHW, and describe how we construct the statistics used in our analysis.
2.1 HealPix wavelet
The HealPix wavelet is very similar to that presented by Shahram et al. (2007). Casaponsa et al. (2010) used a revised version of this wavelet and performed its first cosmological application. In both papers, the central idea is the construction of a fast wavelet method adapted to the HealPix pixelization scheme (Górski et al. 2005). The HW is similar to the SHW in the sense that, at each level of the wavelet transform, one produces both a high- and a low-resolution map. The low-resolution map for the HW is obtained simply by averaging over 4-pixel blocks, and the high-resolution map is just the original map minus the low-resolution map. One begins with the original map at resolution J = 9 (N_side = 512) and performs successive wavelet decompositions until resolution J = 2 (N_side = 2), thereby constructing 7 sets of high- and low-resolution maps. Although the original map is fully represented by the 7 high-resolution maps plus the low-resolution map at J = 2, in our analysis we have used all the high- and low-resolution maps, plus the original map, since this has been shown to improve results (see Casaponsa et al. 2010, for details).

Using all these maps, the third-order moments of the wavelet coefficients are computed as follows:

S_{jkl} = \frac{1}{N_l} \sum_{i=1}^{N_l} \frac{w_{i,j}\, w_{i,k}\, w_{i,l}}{\sigma_j \sigma_k \sigma_l} ,   (1)

where w_{i,j} is the i-th wavelet coefficient of the map at resolution j, σ_j is the dispersion of w_{i,j}, and N_l is the number of pixels in the map at resolution l (since one requires j ≤ k ≤ l). Some of these statistics are redundant (linear dependencies exist between them), so we restrict our analysis to the set of non-redundant statistics, which gives a total of 232 quantities; these are then computed for non-Gaussian simulations with a range of known values of f_NL.

The expected values of these statistics are proportional to the non-linear coupling parameter, and they have previously been used to estimate the best-fitting f_NL value for the data by weighted least-squares parameter estimation (Casaponsa et al. 2010). In that case, the dispersion in the estimated f_NL value for Gaussian simulations is found to be σ(f_NL) = 34, which is slightly larger than the optimal value. The main advantage of the HW is its computational efficiency; for example, constructing the third-order statistics is considerably faster than for the KSW bispectrum estimator (Komatsu et al. 2005), and faster still than for the SMHW (see below). This procedure (for both the HW and the SMHW) does, however, include the estimation and inversion of a correlation matrix, which can be computationally demanding and, in some cases, close to singular. As we will show below, this step is avoided with the use of a NN classifier.

2.2 Spherical Mexican hat wavelet

The spherical Mexican hat wavelet (SMHW) (Antoine & Vandergheynst 1998; Martínez-González et al. 2002) has produced competitive results in constraining primordial non-Gaussianity (Mukherjee & Wang 2004; Curto et al. 2009a,b, 2011a). It is a continuous wavelet that has better frequency localization than the HW, although its computational efficiency is lower. Curto et al. (2011a) use the SMHW to constrain f_NL with an accuracy equivalent to that of the bispectrum estimators (see for example Smith et al. 2009; Fergusson & Shellard 2009; Fergusson et al. 2010; Komatsu et al. 2011; Bucher et al. 2010). The definition of the third-order moments is the same as for the HW. In this case, however, there are more inter-scale combinations, because the scales involved are not restricted by the HealPix pixelization. The total number of non-redundant statistics for the SMHW wavelet coefficients is 680. Using the mean values and covariances of these statistics computed from non-Gaussian simulations, Curto et al. (2011a) applied a χ² minimisation method to obtain optimal uncertainties on the f_NL estimates of σ ≈ 21. However, this method requires a principal component analysis to deal with the degeneracies present in the covariance matrix. As we will see, this problem is avoided with the use of the multi-class neural network classifier.
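For illustration, the following Python sketch (our own, not the code used in the analysis) implements one level of the HW transform and the cubic statistics of eq. (1). It assumes NESTED HEALPix ordering (so that each 4-pixel block consists of the children of one parent pixel), full-sky unmasked maps, and that the coefficient maps passed to hw_statistics have already been brought to a common pixelization, as the cross-scale products require; the function names are illustrative.

```python
import numpy as np

def hw_level(m):
    """One level of the HealPix wavelet transform (NESTED ordering).

    The low-resolution map averages each 4-pixel block (the children of
    one parent pixel); the high-resolution (detail) map is the original
    map minus the low-resolution values replicated back to the children.
    """
    low = m.reshape(-1, 4).mean(axis=1)   # average over 4-pixel blocks
    high = m - np.repeat(low, 4)          # detail coefficients
    return low, high

def hw_statistics(coeff_maps):
    """Third-order statistics S_jkl of eq. (1), for j <= k <= l, from a
    list of coefficient maps already degraded to a common pixelization."""
    norm = [m / m.std() for m in coeff_maps]   # divide by dispersion sigma
    n = len(norm)
    return np.array([np.mean(norm[j] * norm[k] * norm[l])
                     for j in range(n)
                     for k in range(j, n)
                     for l in range(k, n)])

# Usage: successive decompositions starting from an Nside = 512 map,
# keeping each detail map plus the final low-resolution map.
rng = np.random.default_rng(0)
m = rng.normal(size=12 * 512**2)   # stand-in for a NESTED CMB map
maps = []
for _ in range(7):                 # seven successive decompositions
    m, high = hw_level(m)
    maps.append(high)
maps.append(m)                     # final low-resolution map

# Cubic statistics for three maps on a common grid: 10 values of S_jkl.
stats = hw_statistics([rng.normal(size=192) for _ in range(3)])
```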
3 NEURAL NETWORK CLASSIFIERS

Artificial neural networks are a methodology for computing based on massive parallelism and redundancy, features also found in animal brains. They consist of a number of interconnected processors, each of which processes information and passes it to other processors in the network. Well-designed networks are able to 'learn' from a set of training data and to make predictions when presented with new, possibly incomplete, data. These algorithms have been successfully applied in several areas; in particular, we note the following applications in astrophysics: Storrie-Lombardi et al. (1992); Baccigalupi et al. (2000); Vanzella et al. (2004); Auld et al. (2007) and Carballo et al. (2008).

The basic building block of an ANN is the neuron. Information is passed as inputs to the neuron, which processes them and produces an output; the output is typically a simple mathematical function of the inputs. The power of the ANN comes from assembling many neurons into a network, which is then able to model very complex behaviour from input to output. We use a three-layer feed-forward network consisting of a layer of input neurons, a layer of 'hidden' neurons and a layer of output neurons. In such an arrangement each neuron is referred to as a node. Figure 1 shows a schematic design of such a network.

[Figure 1. Schematic of a 3-layer feed-forward neural network.]

The outputs of the hidden layer and the output layer are related to their inputs as follows:

hidden layer: h_j = g^{(1)}(f_j^{(1)}), \quad f_j^{(1)} = \sum_i w_{ji}^{(1)} x_i + b_j^{(1)} ,   (2)

output layer: y_k = g^{(2)}(f_k^{(2)}), \quad f_k^{(2)} = \sum_j w_{kj}^{(2)} h_j + b_k^{(2)} ,   (3)

where the outputs of the hidden layer, h, and of the output layer, y, are given for each hidden node j and each output node k; the index i runs over all input nodes. The functions g^{(1)} and g^{(2)} are called activation functions. The non-linear nature of g^{(1)} is a key ingredient in constructing a viable and practically useful network. This non-linear function must be bounded, smooth and monotonic; we use g^{(1)}(x) = tanh x. For g^{(2)} we simply use g^{(2)}(x) = x. The layout and number of nodes are collectively termed the architecture of the network. For a basic introduction to artificial neural networks the reader is directed to MacKay (2003).

For a given architecture, the weights w and biases b define the operation of the network and are the quantities we wish to determine by some training algorithm; we denote w and b collectively by a. As these parameters vary during training, a very wide range of non-linear mappings between inputs and outputs is possible. In fact, according to a 'universal approximation theorem' (Leshno et al. 1993), a standard three-layer feed-forward network can approximate any continuous function to any degree of accuracy with appropriately chosen activation functions and a sufficient number of hidden nodes.

In our application, we construct a classification network. The aim of any classification method is to place members of a set into groups based on inherent properties or features of the individuals, given some pre-classified training data. Formally, classification can be summarised as finding a classifier C: x → C which maps an object from some (typically multi-dimensional) feature space x to its classification label C, typically taken as one of {1, ..., N}, where N is the number of distinct classes. Thus the problem of classification is to partition feature space into regions (not necessarily contiguous), assigning each region a label corresponding to the appropriate classification. In our context, the aim is to classify sets of third-order statistics of wavelet coefficients of (possibly) non-Gaussian CMB maps (assembled into an input feature vector x) into classes defined by ranges of f_NL; this is discussed in more detail below.

In building a classifier using a neural network, it is convenient to view the problem probabilistically. To this end we consider a 3-layer MLP (multi-layer perceptron) consisting of an input layer (x_i), a hidden layer (h_j) and an output layer (y_k). In classification networks, however, the outputs are transformed according to the softmax procedure

p_k = \frac{e^{y_k}}{\sum_m e^{y_m}} ,   (4)

such that they are all non-negative and sum to unity. In this way p_k can be interpreted as the probability that the input feature vector x belongs to the k-th class. A suitable objective function for the classification problem is then

L(a) = \sum_l \sum_k t_k^{(l)} \ln p_k(x^{(l)}, a) ,   (5)

where the index l runs over the training dataset D = {x^{(l)}, t^{(l)}}, in which the target vector t^{(l)} for the network outputs has unity in the element corresponding to the true class of the l-th feature vector x^{(l)} and zeroes elsewhere. One then wishes to choose the network parameters a so as to maximise this objective function as the training progresses. The advantage of this probabilistic approach is that we gain the ability to make statistical decisions on the appropriate classification in very large feature spaces where a direct linear partition would not be feasible.
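A minimal numpy sketch of eqs (2)-(5) for a single feature vector follows (our own notation, not the authors' code): a tanh hidden layer, a linear output layer, the softmax mapping to class probabilities and the log-likelihood objective over one-hot targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2):
    """Map an input feature vector x to class probabilities p_k."""
    h = np.tanh(W1 @ x + b1)      # eq. (2), g1 = tanh
    y = W2 @ h + b2               # eq. (3), g2 = identity
    e = np.exp(y - y.max())       # eq. (4), max subtracted for stability
    return e / e.sum()

def log_likelihood(data, W1, b1, W2, b2):
    """Objective of eq. (5): sum over the training set of t_k ln p_k,
    with one-hot target vectors t (small epsilon guards log(0))."""
    return sum(t @ np.log(forward(x, W1, b1, W2, b2) + 1e-12)
               for x, t in data)

# Tiny usage example: 232 inputs (the HW statistics), a few hidden
# nodes and 9 output classes, with random weights.
n_in, n_hid, n_class = 232, 4, 9
W1, b1 = 0.01 * rng.normal(size=(n_hid, n_in)), np.zeros(n_hid)
W2, b2 = 0.01 * rng.normal(size=(n_class, n_hid)), np.zeros(n_class)
x, t = rng.normal(size=n_in), np.eye(n_class)[4]
print(forward(x, W1, b1, W2, b2).sum())         # probabilities sum to 1
print(log_likelihood([(x, t)], W1, b1, W2, b2)) # log-prob of class 5
```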
One wishes to choose the network parameters a so as to maximise the objective function L(a) as the training progresses. This is, however, a highly non-linear, multi-modal function in many dimensions, whose optimisation poses a non-trivial problem. We perform this optimisation using the MemSys package (Gull & Skilling 1999). This algorithm considers the parameters a to have prior probabilities proportional to e^{αS(a)}, where S(a) is the positive-negative entropy functional (Hobson & Lasenby 1998) and α is treated as a hyper-parameter of the prior, setting the scale over which variations in a are expected. α is chosen to maximise its marginal posterior probability, and its value is inversely proportional to the standard deviation of the prior. Thus, for a given α, the log-posterior probability is proportional to L(a) + αS(a). For each chosen α there is a solution â that maximises the posterior. As α varies, the set of solutions â is called the maximum-entropy trajectory. We wish to find the solution for which L is maximised, which occurs at the end of the trajectory where α = 0. For practical purposes we start at a large value of α and iterate downwards until α is sufficiently small that the posterior is dominated by the L term. MemSys performs this algorithm using conjugate gradient descent at each step to converge to the maximum-entropy trajectory. The required matrix of second derivatives of L is approximated using vector routines only, thus circumventing the expensive operations required for exact calculations. The application of MemSys to the problem of network training allows the fast, efficient training of relatively large network structures on large datasets that would otherwise be difficult to handle in a reasonable time.

Moreover, the MemSys package also computes the Bayesian evidence for the model (i.e. the network) under consideration (see for example Jaynes 2003, for a review), which provides a powerful model selection tool. In principle, values of the evidence computed for each possible architecture of the network (and training data) provide a mechanism to select the most appropriate architecture, which is simply the one that maximises the evidence (although we will use a more prosaic method below for deciding on the network architecture). The MemSys algorithm is described in greater detail in Gull & Skilling (1999).
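MemSys itself is proprietary, so the sketch below is only an illustrative stand-in: it maximises L(a) + αS(a) by plain gradient ascent while α is stepped downwards along the maximum-entropy trajectory, using the positive-negative entropy functional. MemSys additionally uses conjugate gradients and its evidence-based choice of α, neither of which is reproduced here; all function names are ours.

```python
import numpy as np

def pos_neg_entropy(a, m=1.0):
    """Positive-negative entropy S(a) (Hobson & Lasenby 1998), defined
    for parameters of either sign; m sets the expected scale."""
    psi = np.sqrt(a**2 + 4.0 * m**2)
    return np.sum(psi - 2.0 * m - a * np.log((psi + a) / (2.0 * m)))

def grad_entropy(a, m=1.0):
    """Analytic gradient dS/da_i = -ln[(psi_i + a_i) / (2m)];
    it vanishes at a = 0, where S attains its maximum."""
    psi = np.sqrt(a**2 + 4.0 * m**2)
    return -np.log((psi + a) / (2.0 * m))

def train(a0, grad_L, alphas=(100.0, 10.0, 1.0, 0.1, 0.0),
          steps=500, eta=1e-3):
    """Toy stand-in for the MemSys scheme: gradient ascent on
    L(a) + alpha*S(a), iterating alpha downwards towards alpha = 0."""
    a = a0.copy()
    for alpha in alphas:
        for _ in range(steps):
            a += eta * (grad_L(a) + alpha * grad_entropy(a))
    return a

# Toy demo: L(a) = -||a - 3||^2 / 2 has its unregularised maximum at
# a = 3; as alpha is lowered the solution moves from near 0 towards 3.
a_hat = train(np.zeros(5), grad_L=lambda a: 3.0 - a)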
4 f_NL CLASSIFICATION NETWORK

To train our f_NL classification network we provide it with an ensemble of training data D = {x^{(l)}, t^{(l)}}. The l-th input vector x^{(l)} contains the third-order statistics of the wavelet coefficients of the l-th simulated CMB map. The output classes of our network correspond to contiguous ranges of f_NL values. Thus, the target vector t^{(l)} for the network outputs has zeroes everywhere except for a unit entry in the element corresponding to the class in which the true f_NL value of the l-th simulated CMB map falls.

The output classes of the network were defined by dividing some initial (anticipated) range of f_NL values into equal-width subranges. For example, for a total range of −30 ≤ f_NL < 30 and a network with just 3 output classes, input vectors constructed from maps with −30 ≤ f_NL < −10 would be assigned to class 1 with an associated target vector t = (1, 0, 0), those with −10 ≤ f_NL < 10 to class 2 with t = (0, 1, 0), and those with 10 ≤ f_NL < 30 to class 3 with t = (0, 0, 1). The corresponding network output for an input vector x would be a 3-dimensional vector p = (p_1, p_2, p_3), where Σ_k p_k = 1 and p_k can be interpreted as the probability that the input vector belongs to class k. The discrepancy between the targets and the outputs during training can be measured by the true positive rate, which is simply the fraction of the training input vectors for which the network assigns the maximum probability to the correct class.

From the output values p_k obtained for each map, we define the estimator of the local non-Gaussianity parameter to be simply

\hat{f}_NL = \sum_{k=1}^{n_class} ⟨f_NL⟩_k p_k ,   (6)

where ⟨f_NL⟩_k is the mean value of f_NL in the k-th class. The statistical properties of this estimator, namely its mean and dispersion, determine the accuracy of the method.
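A short sketch of this bookkeeping, with equal-width classes as described above (the naming is our own): it builds the class edges and centres, forms one-hot target vectors, and evaluates the estimator of eq. (6) from a vector of softmax probabilities.

```python
import numpy as np

def make_classes(f_range=(-120.0, 120.0), n_class=9):
    """Equal-width f_NL classes; returns bin edges and class centres
    <f_NL>_k (an odd n_class keeps f_NL = 0 off a class boundary)."""
    edges = np.linspace(f_range[0], f_range[1], n_class + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return edges, centres

def target_vector(f_nl, edges):
    """One-hot target t^(l): unity in the element of the true class."""
    t = np.zeros(len(edges) - 1)
    k = np.searchsorted(edges, f_nl, side='right') - 1
    t[np.clip(k, 0, len(t) - 1)] = 1.0
    return t

def fnl_estimate(p, centres):
    """Estimator of eq. (6): f_NL_hat = sum_k <f_NL>_k p_k."""
    return float(np.dot(centres, p))

edges, centres = make_classes()
print(target_vector(-115.0, edges))            # class 1 of 9
print(fnl_estimate(np.full(9, 1 / 9), centres))  # uniform p -> 0.0
```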
The training input vectors x^{(l)} were generated as follows. We began with a set of 1000 non-Gaussian CMB realisations, from which the a_lm^NG and a_lm^G were generated by Elsner & Wandelt (2009) and normalized to the WMAP 7-year concordance model power spectrum generated by CAMB. These a_lm are publicly available¹.

¹ http://planck.mpa-garching.mpg.de/cmb/fnl-simulations/

The ultimate accuracy of the network classifier is improved, however, by the inclusion of further training data. Given the finite number of available simulations, we thus created a further set by rotating the original maps by 90° about an axis perpendicular to the galactic plane. This rotation exposes roughly 20 per cent of extra map area relative to the original mask; we verified that its inclusion improves the results. Using this procedure we generated a further 1000 non-Gaussian simulations. Of the 2000 non-Gaussian maps, 1800 were used for training and the remainder were set aside for testing of the networks.

For each non-Gaussian simulation used for training, sets of a_lm were then generated with varying f_NL using the following prescription:

a_lm = a_lm^G + f_NL a_lm^NG ,   (7)

with 20 different random f_NL values between −120 and 120 for the HW decomposition, and between −76 and 76 for the SMHW analysis. Thus, for each non-Gaussian simulation, 20 sets of a_lm were generated, and the total number of available training data sets is 36000. Noise-weighted V+W-band WMAP realizations were then constructed as explained in Curto et al. (2009a) and Casaponsa et al. (2010), and the KQ75 mask was applied, which covers roughly 29% of the sky. A wavelet decomposition for both the HW and the SMHW was performed to determine the wavelet coefficients for each a_lm set, and their third-order moments computed. These statistics were provided as inputs to the neural network; each input vector contained 232 values for the HW and 680 for the SMHW.
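As a minimal sketch of this prescription (assuming the healpy package for harmonic synthesis; the function name, and the uniform draw of the f_NL values, are our illustrative choices, since the paper only states that the 20 values in the range are random):

```python
import numpy as np
import healpy as hp  # assumed; any alm -> map synthesis routine would do

def make_training_map(alm_g, alm_ng, f_nl, nside=512):
    """Eq. (7): a_lm = a_lm^G + f_NL * a_lm^NG, rendered as a HEALPix
    map. alm_g / alm_ng are the Gaussian and non-Gaussian harmonic
    coefficients of one Elsner & Wandelt (2009) simulation."""
    return hp.alm2map(alm_g + f_nl * alm_ng, nside=nside)

# 20 maps per simulation, with f_NL drawn at random in the training
# range (here the HW range):
rng = np.random.default_rng(1)
f_nl_values = rng.uniform(-120.0, 120.0, size=20)
```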
The architecture of our 3-layer neural networks is defined by two free parameters: the number of hidden nodes, n_hid, and the number of output classes, n_class, into which the f_NL range is divided. A further parameter, which determines the accuracy of the network classifier, is the quantity of training data, n_data. Variation of these parameters can affect the training efficiency, so it is desirable to explore this training space adequately in order to find an optimal set of parameters.

Although the MemSys algorithm provides routines to determine the optimal number of hidden nodes using the Bayesian evidence (Gull & Skilling 1999), in this application n_hid is determined simply by measuring training times and the accuracy of the trained networks on an independent testing set. We have found that in fact the optimal architecture contains no hidden nodes, resulting in what is effectively a linear classifier. This is not surprising, since we are effectively 'asking' the network to learn the mean value and dispersion of the third-order moments of the wavelet coefficients for each f_NL; since the expectation value depends linearly on f_NL, this network architecture trivially satisfies the requirement. Indeed, networks of this sort provide a simple way of obtaining the (pseudo)inverse of any matrix.

The number of output classes, n_class, is clearly related to the total range of f_NL considered and to the size of the subranges into which this range is divided. Here we consider networks with n_class = 9 (an odd number ensures that f_NL = 0 does not lie on the boundary of a class). The range of f_NL was chosen a priori to correspond to a few times the dispersion σ in the f_NL estimates obtained previously using the standard χ² minimisation method. Thus, the full range was taken to be −120 ≤ f_NL < 120 for the HW and −76 ≤ f_NL < 76 for the SMHW, resulting in subranges per class of width 27 and 17 units, respectively. This combination fulfilled all the requirements of classification over the range of our simulated data.

The quantity of training data, n_data, determines the accuracy of the resulting classification network. Naturally, the network accuracy increases with n_data, but it typically saturates after a given number. We found that the quantity of data required saturated at roughly n_data ∼ 10000 (see Figure 2).
Figure 3 illustrates the training evolution for the classification networks with n_hid = 0 and n_class = 9. In the top two panels we plot the true positive rates (TPR) of the network on the training set and the test set, for the HW and SMHW respectively; in each plot, the TPR on the training set has been multiplied by a factor less than unity to highlight its divergence from the TPR for the test set. We see that this divergence occurs after ∼100 and ∼500 iterations of the MemSys optimiser for the HW and SMHW, respectively. The training was therefore terminated at this point to construct our final classification networks.

[Figure 2. Results of the dispersion of f̂_NL for 1000 Gaussian simulations for different values of n_data.]

A key criterion in determining the quality of our classifiers is the dispersion of the f̂_NL values obtained in the testing set. This is plotted in the bottom two panels of Figure 3 for the HW and SMHW, respectively. We note that, in each case, this dispersion increases noticeably beyond the point where the TPRs on the training and testing sets diverge.

[Figure 3. Evolution of the true positive rate for each iteration of the training process for a neural network with n_hid = 0 and n_data = 10000. Note that the TPR of the training set has been multiplied by a factor less than unity in order to highlight the divergence of the behaviours. The bottom panels show the variation of the dispersion of the estimate f̂_NL during the training. Left panels for HW and right panels for SMHW.]

5 RESULTS

We first applied our classifiers to 1000 WMAP-7yr simulations made from Gaussian CMB maps (f_NL = 0). For the HW classifier, we obtained ⟨f̂_NL⟩ = −1, which indicates that the estimator is essentially unbiased. Moreover, the dispersion of the estimator, σ(f̂_NL) = 33, is extremely similar to that obtained with the weighted least-squares method (σ(f̂_NL) = 34). The full distribution of the estimator is shown in the top panel of Fig. 4. For the SMHW classifier, we again found ⟨f̂_NL⟩ = −1, with a dispersion of σ(f̂_NL) = 22, which is very close to the optimal value of σ(f̂_NL) = 21. The distribution of the estimator for the SMHW is shown in the bottom panel of Fig. 4.

[Figure 4. Distribution of f̂_NL obtained from 1000 Gaussian realizations for HW (top) and SMHW (bottom).]

The histogram bins in Fig. 4 have the same size and central values as those used to define the network classes. We see that the classes at extremal f_NL values are empty, indicating that the network placed no maps in these f_NL ranges. Thus, for estimating f_NL from Gaussian or nearly Gaussian maps, the range in f_NL used is sufficiently wide.

We next applied our estimator to sets of non-Gaussian simulations, each with a different non-zero f_NL value. For each true f_NL value, we analysed the corresponding WMAP simulations and calculated the mean and dispersion of our estimator f̂_NL for both the HW and SMHW classifiers. The results are shown in Fig. 5, in which we plot the mean value of f̂_NL against the true f_NL value. We see that the classifiers are unbiased for |f_NL| ≲ σ, with an almost constant dispersion. For larger |f_NL| values, however, the estimator becomes progressively more biased and its dispersion decreases.

[Figure 5. The mean and dispersion of f̂_NL obtained for a number of representative values of the true f_NL for the HW network (top) and the SMHW network (bottom).]

The latter behaviour is simply understood as an edge effect due to the finite total range of f_NL considered by the networks. This point is illustrated in Fig. 6, in which we plot the full distributions of f̂_NL obtained for a number of representative values of the true f_NL. We see that for |f_NL| ≲ σ we obtain a close-to-symmetric distribution centred on the true f_NL value, with no maps being placed in the extreme classes. For |f_NL| > σ, however, the classifier does begin to place maps in the extreme classes, resulting in the distribution of f̂_NL becoming severely skewed and no longer centred on the true value. Of course, if one were to encounter this behaviour in the analysis of a real data set, one could simply alter the range of f_NL considered by the network and retrain.

In any case, the above results show that both the HW and SMHW network classifiers produce unbiased estimates f̂_NL provided −σ < f_NL < σ. Moreover, the dispersions of these estimators are very similar to those obtained with the classical weighted least squares (WLS) method, indicating that neural networks can produce very accurate results within the limitations described above. In the case of the SMHW this is a particularly important result, since the complexity of the covariance matrix inversion required in the standard approach is bypassed via the use of the neural network classifier; Curto et al. (2011a) used a principal component analysis to reduce the covariance matrix dimension to allow inversion.

Applying the neural network classifiers to real data (the V+W WMAP 7-year data map), we obtain f̂_NL = −12 for the HW and f̂_NL = 19 for the SMHW. Both these values lie well within the corresponding dispersion of the estimator.
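The confidence limits quoted next appear to follow the convention of offsetting the Gaussian-simulation percentiles of the estimator by the data estimate (a reading consistent with Table 1, e.g. −12 + 63 = 51 for the HW); a small sketch, with our own function name:

```python
import numpy as np

def confidence_limits(fnl_data, fnl_gauss_sims, cl=95.0):
    """Limits on f_NL at confidence level cl: the percentiles of the
    estimator over Gaussian simulations, offset by the data estimate."""
    lo, hi = np.percentile(fnl_gauss_sims,
                           [(100 - cl) / 2, (100 + cl) / 2])
    return fnl_data + lo, fnl_data + hi

# e.g. for the HW: a data estimate of -12 with Gaussian-simulation
# percentiles (-66, +63) gives roughly -78 < f_NL < 51, as in Table 1.
```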
From the corresponding f̂_NL distributions obtained on simulated data, we find that the 95% confidence limits are −78 < f_NL < 51 for the HW and −24 < f_NL < 61 for the SMHW. We therefore conclude that the data are consistent with the Gaussian hypothesis. We note that the SMHW confidence limits are very similar to those obtained with the optimal f_NL estimator (Komatsu et al. 2011; Smith et al. 2009).

These results are summarised in Table 1, along with those found via the weighted least squares (WLS) method. The latter results are also consistent with Gaussianity. It is worth mentioning, however, the different values of f̂_NL obtained by the neural network and the WLS methods, for both the HW and the SMHW. Although all four values lie well within their corresponding dispersions, each method returns a different f̂_NL value when applied to the same WMAP-7yr dataset. This behaviour is to be expected, however, since these are four different estimators of f_NL; therefore, in general, they will not be equal, even when applied to the same input data. Only the statistical properties (e.g. mean, dispersion) of their sampling distributions are important. Note that the constraints are not corrected for the contribution of unresolved point sources.

Table 1. Results obtained with neural networks (NN) and weighted least squares (WLS). f̂_NL,data is the best-fitting value for V+W WMAP data; ⟨f̂_NL,gauss⟩ and σ(f̂_NL) are the expected value and the standard deviation for Gaussian simulations; P_2.5 and P_97.5 are the percentile values at the 95% confidence level of f̂_NL for Gaussian realizations.

Method | f̂_NL,data | σ(f̂_NL) | ⟨f̂_NL,gauss⟩ | P_2.5 | P_97.5
SMHW (NN) | 19 | 22 | −1 | −43 | 42
SMHW (WLS; Curto et al. 2011b) | 37 | 21 | 0 | −42 | 46
HW (NN) | −12 | 33 | −1 | −66 | 63
HW (WLS; Casaponsa et al. 2011) | — | 34 | — | −68 | 67

[Figure 6. Distribution of f̂_NL obtained from 200 non-Gaussian realizations with representative true f_NL values, for HW (left) and SMHW (right).]
6 CONCLUSIONS

We have trained a multi-class neural network classifier with third-order moments of the HW and SMHW coefficients of non-Gaussian realizations in order to set constraints on the local non-linear coupling parameter f_NL using WMAP 7-year data. We found that with a very simple network and few iterations (requiring just a few seconds of CPU time) it is possible to produce the same results as those obtained with the weighted least squares method. This is an interesting achievement, as it bypasses all covariance-matrix-related computations and their associated regularisation problems. The estimation of the covariance matrix with both wavelets requires the analysis of at least 10000 Gaussian simulations, which involves a huge demand on CPU time, in particular with the SMHW statistics. The error bars on the estimation of f_NL computed with Gaussian simulations are σ(f̂_NL) = 33 for the HW and σ(f̂_NL) = 22 for the SMHW, which are extremely similar to those obtained in Casaponsa et al. (2010) and Curto et al. (2011a) using the same statistics but a different estimator based on the weighted least squares method (σ = 34 and σ = 21 for the HW and SMHW, respectively). The constraints for WMAP 7-year data were found to be −78 < f_NL < 51 for the HW and −24 < f_NL < 61 for the SMHW, which are compatible with a Gaussian distribution, as found in Smith et al. (2009); Curto et al. (2009b); Komatsu et al. (2011); Casaponsa et al. (2010) and Curto et al. (2011b). The results obtained with the SMHW statistics are similar to those found in Smith et al. (2009) and Komatsu et al. (2011), which are the most stringent ones currently available at the limit of the WMAP resolution. Further analysis, such as the contribution to f_NL of unresolved point sources or foregrounds, can be performed by applying the linear classifier to the statistics of new simulated maps containing these characteristic signals.

ACKNOWLEDGMENTS

We acknowledge partial financial support from the Spanish Ministerio de Ciencia e Innovación project AYA2010-21766-C03-01 and from the CSIC-The Royal Society joint project with reference 2008GB0012. B. Casaponsa thanks the Spanish Ministerio de Ciencia e Innovación for a pre-doctoral fellowship. The authors acknowledge the computer resources, technical expertise and assistance provided by the Spanish Supercomputing Network (RES) node at Universidad de Cantabria. We acknowledge the use of the Legacy Archive for Microwave Background Data Analysis (LAMBDA); support for it is provided by the NASA Office of Space Science. The HealPix package was used throughout the data analysis (Górski et al. 2005).

REFERENCES

Antoine J.-P., Vandergheynst P., 1998, Journal of Mathematical Physics, 39, 3987
Auld T., Bridges M., Hobson M. P., Gull S. F., 2007, MNRAS, 376, L11
Babich D., Creminelli P., Zaldarriaga M., 2004, Journal of Cosmology and Astro-Particle Physics, 8, 9
Baccigalupi C., Bedini L., Burigana C., De Zotti G., Farusi A., Maino D., Maris M., Perrotta F., Salerno E., Toffolatti L., Tonazzini A., 2000, MNRAS, 318, 769
Barreiro R. B., Hobson M. P., Lasenby A. N., Banday A. J., Górski K. M., Hinshaw G., 2000, MNRAS, 318, 475
Bartolo N., Komatsu E., Matarrese S., Riotto A., 2004, Phys. Rep., 402, 103
Bucher M., van Tent B., Carvalho C. S., 2010, MNRAS, 407, 2193
Carballo R., González-Serrano J. I., Benn C. R., Jiménez-Luján F., 2008, MNRAS, 391, 369
Casaponsa B., Barreiro R. B., Curto A., Martínez-González E., Vielva P., 2010, ArXiv e-prints
Cayón L., Martínez-González E., Argüeso F., Banday A. J., Górski K. M., 2003, MNRAS, 339, 1189
Cayón L., Sanz J. L., Martínez-González E., Banday A. J., Argüeso F., Gallegos J. E., Górski K. M., Hinshaw G., 2001, MNRAS, 326, 1243
Cruz M., Martínez-González E., Vielva P., Cayón L., 2005, MNRAS, 356, 29
Curto A., Martínez-González E., Barreiro R. B., 2009b, ApJ, 706, 399
Curto A., Martínez-González E., Barreiro R. B., 2011a, MNRAS, 412, 1023
Curto A., Martínez-González E., Barreiro R. B., Hobson M. P., 2011b, submitted to MNRAS
Curto A., Martínez-González E., Mukherjee P., Barreiro R. B., Hansen F. K., Liguori M., Matarrese S., 2009a, MNRAS, 393, 615
Elsner F., Wandelt B. D., 2009, ApJS, 184, 264
Elsner F., Wandelt B. D., 2010, ApJ, 724, 1262
Fergusson J. R., Liguori M., Shellard E. P. S., 2010, Phys. Rev. D, 82, 023502
Fergusson J. R., Shellard E. P. S., 2009, Phys. Rev. D, 80, 043510
Gangui A., Lucchin F., Matarrese S., Mollerach S., 1994, ApJ, 430, 447
Górski K. M., Hivon E., Banday A. J., Wandelt B. D., Hansen F. K., Reinecke M., Bartelmann M., 2005, ApJ, 622, 759
Gull S. F., Skilling J., 1999, Quantified Maximum Entropy: MemSys5 Users' Manual. Maximum Entropy Data Consultants Ltd, Royston
Hikage C., Matsubara T., Coles P., Liguori M., Hansen F. K., Matarrese S., 2008, MNRAS, 389, 1439
Hobson M. P., Lasenby A. N., 1998, MNRAS, 298, 905
Jaynes E., 2003, Probability Theory: The Logic of Science. Cambridge University Press
Komatsu E., Smith K. M., Dunkley J., Bennett C. L., Gold B., Hinshaw G., Jarosik N., Larson D., Nolta M. R., Page L., Spergel D. N., Halpern M., Wright E. L., 2011, ApJS, 192, 18
Komatsu E., Spergel D. N., 2001, Phys. Rev. D, 63, 063002
Komatsu E., Spergel D. N., Wandelt B. D., 2005, ApJ, 634, 14
Leshno M., Lin V. Y., Pinkus A., Schocken S., 1993, Neural Networks, 6, 861
MacKay D., 2003, Information Theory, Inference and Learning Algorithms. Cambridge University Press
Martínez-González E., Gallegos J. E., Argüeso F., Cayón L., Sanz J. L., 2002, MNRAS, 336, 22
McEwen J. D., Hobson M. P., Lasenby A. N., Mortlock D. J., 2008, MNRAS, 388, 659
McEwen J. D., Vielva P., Wiaux Y., Barreiro R. B., Cayón L., Hobson M. P., Lasenby A. N., Martínez-González E., Sanz J. L., 2007, Journal of Fourier Analysis and Applications, 13, 495
Mukherjee P., Wang Y., 2004, ApJ, 613, 51
Natoli P., de Troia G., Hikage C., Komatsu E., Migliaccio M., Ade P. A. R., Bock J. J., Bond J. R., Borrill J., Boscaleri A., Contaldi C. R., Crill B. P., de Bernardis P., de Gasperis G., de Oliveira-Costa A., di Stefano G., Hivon E., 2010, MNRAS, 408, 1658
Pietrobon D., 2010, Memorie della Società Astronomica Italiana Supplementi, 14, 278
Salopek D. S., Bond J. R., 1990, Phys. Rev. D, 42, 3936
Shahram M., Donoho D., Starck J., 2007, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series Vol. 6701, Multiscale representation for data on the sphere and applications to geopotential data
Smith K. M., Senatore L., Zaldarriaga M., 2009, Journal of Cosmology and Astro-Particle Physics, 9, 6
Storrie-Lombardi M. C., Lahav O., Sodré Jr. L., Storrie-Lombardi L. J., 1992, MNRAS, 259, 8P
Tenorio L., Jaffe A. H., Hanany S., Lineweaver C. H., 1999, MNRAS, 310, 823
Vanzella E., Cristiani S., Fontana A., Nonino M., Arnouts S., Giallongo E., Grazian A., Fasano G., Popesso P., Saracco P., Zaggia S., 2004, A&A, 423, 761
Verde L., Wang L., Heavens A. F., Kamionkowski M., 2000, MNRAS, 313, 141
Vielva P., Martínez-González E., Barreiro R. B., Sanz J. L., Cayón L., 2004, ApJ, 609, 22
Vielva P., Sanz J. L., 2010, MNRAS, 404, 895
Yadav A. P. S., Wandelt B. D., 2010, Advances in Astronomy, 2010