[PDF] A new approach to observational cosmology using the scattering transform

Abstract

Parameter estimation with non-Gaussian stochastic fields is a common challenge in astrophysics and cosmology. In this paper, we advocate performing this task using the scattering transform, a statistical tool sharing ideas with convolutional neural networks (CNNs) but requiring no training nor tuning. It generates a compact set of coefficients, which can be used as robust summary statistics for non-Gaussian information. It is especially suited for fields presenting localized structures and hierarchical clustering, such as the cosmological density field. To demonstrate its power, we apply this estimator to a cosmological parameter inference problem in the context of weak lensing. On simulated convergence maps with realistic noise, the scattering transform outperforms classic estimators and is on a par with state-of-the-art CNN. It retains the advantages of traditional statistical descriptors, has provable stability properties, allows to check for systematics, and importantly, the scattering coefficients are interpretable. It is a powerful and attractive estimator for observational cosmology and the study of physical fields in general.

Full PDF

DD RAFT VERSION J UNE

16, 2020Typeset using L A TEX twocolumn style in AASTeX63

A new approach to observational cosmologyusing the scattering transform

Sihao Cheng ( 程思浩 ), Yuan-Sen Ting ( 丁源森 ), Brice M´enard ( 梅卜瑞 ), and Joan Bruna ( 步岩岸 ) Department of Physics and Astronomy, The Johns Hopkins University, 3400 N Charles Street, Baltimore, MD 21218, USA Institute for Advanced Study, Princeton, NJ 08540, USA Department of Astrophysical Sciences, Princeton University, Princeton, NJ 08544, USA Observatories of the Carnegie Institution of Washington, 813 Santa Barbara Street, Pasadena, CA 91101, USA Research School of Astronomy & Astrophysics, Australian National University, Cotter Rd., Weston, ACT 2611, Australia Courant Institute of Mathematical Sciences, New York University, New York Center for Data Science, New York University

Abstract

Parameter estimation with non-Gaussian stochastic ﬁelds is a common challenge in astrophysics and cosmol-ogy. In this paper we advocate performing this task using the scattering transform, a statistical tool rooted in themathematical properties of convolutional neural nets. This estimator can characterize a complex ﬁeld, withoutexplicitly computing higher-order statistics, thus avoiding the high variance and dimensionality problems. Itgenerates a compact set of coefﬁcients which can be used as robust summary statistics for non-Gaussian infor-mation. It is especially suited for ﬁelds presenting localized structures and hierarchical clustering, such as thecosmological density ﬁeld.To demonstrate its power, we apply this estimator to the cosmological parameter inference problem in thecontext of weak lensing. Using simulated convergence maps with realistic noise, the scattering transformoutperforms the power spectrum and peak counts, and is on par with the state-of-the-art CNN. It retains theadvantages of traditional statistical descriptors (it does not require any training nor tuning), has provablestability properties, allows to check for systematics, and importantly, the scattering coefﬁcients are inter-pretable. It is a powerful and attractive estimator for observational cosmology and, in general, the study ofphysically-motivated ﬁelds.

1. Introduction

Non-Gaussian ﬁelds are ubiquitous in astrophysics. Ana-lyzing them is challenging as the dimensionality of their de-scription can be arbitrarily high. In addition, there is usuallylittle guidance on which statistical estimator will be most ap-propriate for parameter inference. In this paper, we advocateusing a novel approach, called the scattering transform (Mal-lat 2012) for the analysis of such ﬁelds and, in particular,the matter distribution in the Universe, a highly studied non-gaussian ﬁeld.In many ﬁelds of astrophysics and, in particular, in cosmol-ogy, extracting non-Gaussian information has been attemptedthrough N -point correlation functions. (e.g., Bernardeauet al. 2002; Takada & Jain 2003; Semboloni et al. 2011; Fuet al. 2014) and polyspectra, their Fourier equivalents (e.g.,Sefusatti et al. 2006). However, being high powers of the [email protected] input ﬁeld, these statistics suffer from an increasing varianceand are not robust to outliers in real data, making them gradu-ally less informative (Welling 2005). In addition, the numberof conﬁgurations to consider for N -point functions explodeswith the number of points used. As a result, information ishighly diluted among coefﬁcients, and it becomes a real chal-lenge to efﬁciently extract information with N -point corre-lation functions. Other methods, including using topologi-cal properties such as Minkowski functionals (Mecke et al.1994; Hikage et al. 2003; Shirasaki & Yoshida 2014; Kra-tochvil et al. 2012), and using biasing properties such ascounts of clusters, peaks, and voids (Jain & Van Waerbeke2000; Marian et al. 2009; Kratochvil et al. 2010; Liu et al.2015a,b; Pisani et al. 2019), have also been considered. How-ever, in the cosmological context, these excursions into non-Gaussian signal analyses have had limited impact in improv-ing existing constraints on cosmological parameters so far.Recently, convolutional neural nets (CNNs, e.g., LeCunet al. 1998) have claimed supremacy in a wide variety ofapplications aimed at extracting information from complex a r X i v : . [ a s t r o - ph . C O ] J un C HENG , T

ING , M ´

ENARD & B

RUNA data. They have also shown promises to efﬁciently retrievecosmological information well beyond second-order statis-tics (e.g., Gupta et al. 2018; Ribli et al. 2019a,b). While thepotential of this method is enormous, it also comes with anumber of issues. To precisely and robustly estimate cos-mological parameters, CNNs require a large training set. Inaddition, when applied to real data, systematic errors not in-cluded in the training process of CNN can hardly get checkedand controlled, whereas for traditional statistics, a simple χ test can do so. As such, the use of CNNs in real data comeswith limitations regarding interpretability and validity.In this paper, we advocate using a different approach calledthe scattering transform to efﬁciently and robustly extractstatistical information from non-Gaussian ﬁelds . The setof operations performed by a scattering transform has closesimilarities with those built in CNNs, but the scattering trans-form does not require any training and, like traditional statis-tics, it generates coefﬁcients with proved properties. It cantherefore hopefully overcome the aforementioned limitationsencountered with CNNs. In section 2, we introduce the scat-tering transform, present intuitive understanding of its coef-ﬁcients, and visualize its key properties. In section 3 and 4,we demonstrate the power of the scattering transform to in-fer cosmological parameters ( Ω m and σ ) in the context ofweak lensing using simulated convergence maps. As we willshow, it outperforms the power spectrum and peak counts,and is on par with the state-of-the-art CNN. Finally, we com-ment on the attractive properties of the scattering transformin Section 5 and conclude in Section 6.

2. The scattering transform

In this section, we present the scattering transform and in-tuitive interpretations of its coefﬁcients. The scattering trans-form generates a compact set of coefﬁcients that capture sub-stantial non-Gaussian information beyond the power spec-trum. In contrast to N -point correlation functions, the scat-tering coefﬁcients are all proportional to the input data, anddo not suffer from the increasing variance issue. Thus, thescattering coefﬁcients, which form a representation of the in-put ﬁeld, can be used to extract non-Gaussian informationefﬁciently and robustly. This is particularly attractive from adata analysis point of view. The scattering transform was originally proposed by Mal-lat (2012) as a tool for signal processing to extract informa-tion from high-dimensional data with provable stability prop-erties. The scattering coefﬁcients can be calculated analyti- This work was done simultaneously and independently of that presentedin Allys et al. (2020), where the authors apply a different but related tech-nique, the wavelet phase harmonic, to slices of the matter density ﬁeld. cally and possess many attractive properties including trans-lational invariance, non-expanding variance, and linearizeddependence on deformation (Mallat 2012). Interestingly, thescattering transform has also provided key insights into deci-phering the remarkable behavior and performance of CNNs.A counterintuive feature of CNNs is that the convolution,though restricting the ﬂexibility of the neural net, dramati-cally boosts its performance on many types of data. In ad-dition, a given CNN architecture can often be re-purposedfor very different tasks. These facts suggest that a certainmathematical structure enables efﬁcient information extrac-tion from a wide range of complex data. Understanding thisstructure may dramatically simplify the costly training pro-cess required when using neural networks.The scattering transform has been successfully used inmany areas, including audio signal processing (Anden &Mallat 2011, 2014), image classiﬁcation (Bruna & Mallat2012), texture classiﬁcation (Sifre & Mallat 2013), materialscience (Hirn et al. 2017; Eickenberg et al. 2018; Sinz et al.2020), multifractal analysis in turbulence and ﬁnance (Brunaet al. 2015) or graph-structured data (Gama et al. 2018). Sev-eral of these examples reached state-of-the-art performancecompared to the CNNs in use at the time. In astrophysics, apioneer application has been performed by Allys et al. (2019)to analyze the interstellar medium.The scattering transform can be used with two possiblegoals in mind: representing a speciﬁc realisation of a ﬁeld(with a classiﬁcation goal) or characterising the global sta-tistical properties of the ﬁeld. The narratives in these tworegimes are slightly different (Mallat 2012; Bruna & Mallat2012). We will focus on the latter one, which is relevant tocosmological applications.

Here, we present the formulation of the scattering trans-form in the context of characterizing random ﬁelds (Mallat2012). We focus on the 2-dimensional case in this study,but it can be directly generalized to any other dimensionality.For clarity, we will attach the notation ( x, y ) for the spatialdependence of a ﬁeld only when it is ﬁrst introduced.To extract information from an input ﬁeld, the scatteringtransform ﬁrst generates a series of new ﬁelds by recursivelyapplying two operations: a wavelet convolution and a modu-lus. The expected values of all these ﬁelds are then deﬁned asthe scattering coefﬁcients and used to characterize statisticalproperties of the original ﬁeld. This hierarchical structure,the use of convolution and modulus (a non-linear operator)are all elements found in the architecture of CNNs.Mathematically, given an input ﬁeld I ( x, y ) , the scatter-ing transform generates a set of 1st-order ﬁelds I ( x, y ) byconvolving it with a family of wavelets ψ j,l ( x, y ) and then BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM Figure 1.

Illustration of the scattering transform on a weak lensing map. The azimuthal resolution is set to be L = 4. For clarity, we only showresults using wavelets with orientation indices l = 1 and l = 1, and several selected scale indices j and j . In the top left corner of each panel,we show the mean value of that ﬁeld. They are the scattering coefﬁcients ( S , S , S ) of the input ﬁeld. For a convenient display, the bluenumbers are 10 times the coefﬁcients derived from the lensing map. E.g., the S coefﬁcient of this lensing map is actually 0.00267. The colorbar ranges for I , I , I ﬁelds are adjusted separately for better visualization. C HENG , T

ING , M ´

ENARD & B

RUNA

Figure 2.

Upper panel : proﬁle of a Morlet wavelet ( j = 6, l = 0,image size 512 ×

512 pixels) in the real space and another one ( j =1, l = 1) in Fourier space. The center of the Fourier space representszero frequency. Lower panel : radial frequency proﬁles of a fam-ily of wavelets. Dilating/contracting (by factor of 2) and rotating(by π /L) one wavelet give the whole family of wavelets used in thescattering transform. taking the modulus: I ≡ | I (cid:63) ψ j ,l | , (1)where I represents a group of ﬁelds labelled by the waveletindex j , l . This process is illustrated in the ﬁrst two rows ofFigure 1 with a weak lensing convergence ﬁeld. Wavelets arelocalized oscillations and band-pass ﬁlters. Figure 2 showsthe proﬁle of Morlet wavelets in real space and Fourier space.Morlet wavelets are used in our study and described in Ap-pendix A. In general, a family of wavelets covers the wholeFourier space. They all have the same shape but differentsizes and orientations, labelled by j and l respectively. Theycan all be generated through dilating and rotating a prototypewavelet. In the scattering transform, the convention is to usea dilation factor of 2, such that for a pixelized ﬁeld, the sizeof a wavelet ψ j,l in the real space is around j pixels.Having created a set of 1st-order ﬁelds, one can then iteratethe same process to create 2nd-order ﬁelds I ( x, y ) : I ≡ | I (cid:63) ψ j ,l | = || I (cid:63) ψ j ,l | (cid:63) ψ j ,l | , (2)where I represents a group of ﬁelds labelled by the two setsof wavelet index j , l and j , l . An illustration of theseﬁrst two orders of scattering transform is shown in Figure 1.Higher-order ﬁelds can be created with further iterations, butwe will only use the scattering transform up to 2nd order in this study. We note that the iterations are not commutative,so maintaining the order of wavelets is important.If the input ﬁeld I is homogeneous , then all the gener-ated ﬁelds I n remain homogeneous. Therefore, the expectedvalues of their intensity can be used as translation-invariantdescriptors of the input ﬁeld: S ≡ (cid:104) I (cid:105) (3) S j ,l ≡ (cid:104) I j ,l (cid:105) = (cid:104) (cid:12)(cid:12) I (cid:63) ψ j ,l (cid:12)(cid:12) (cid:105) (4) S j ,l ,j ,l ≡ (cid:104) I j ,l ,j ,l (cid:105) = (cid:104) (cid:12)(cid:12)(cid:12)(cid:12) I (cid:63) ψ j ,l (cid:12)(cid:12) (cid:63) ψ j ,l (cid:12)(cid:12) (cid:105) . (5)These expected values S n are called the n th-order scatteringcoefﬁcients. Due to homogeneity, these expected scatteringcoefﬁcients can be estimated by taking the spatial average ofa single realization: ˆ S n = (cid:104) I n (cid:105) x,y , (6)where ˆ S n is an unbiased estimator of S n and (cid:104)(cid:105) x,y representsthe spatial average of a ﬁeld. The number of scattering coefﬁcients is determined by thenumber of wavelets used. Setting J different scales ( J can-not exceed the side length of the ﬁeld) and L different ori-entations results in J × L different wavelets used in total. Ifall combinations of wavelets are used, then the total numberof coefﬁcients at the n th-order will be J n L n . When consid-ering an isotropic ﬁeld, which is the case of interest in cos-mology, the scattering coefﬁcients can be further reduced. Toconstruct isotropic statistics, we simply average over all ori-entation indices, which reduces the number of coefﬁcients byan order of L n and creates a more compact and robust set ofstatistical descriptors. We thus deﬁne our reduced scatteringcoefﬁcients as s ≡ S (7) s j ≡ (cid:104) S j ,l (cid:105) l (8) s j ,j ≡ (cid:104) S j ,l ,j ,l (cid:105) l ,l , (9)where S n represents the standard scattering coefﬁcients, s n represents our reduced coefﬁcients, and (cid:104)(cid:105) l denotes an aver-age over orientation indices. The reduced coefﬁcients s n canalso be understood as the expected value of some ‘reduced’ﬁelds (cid:104) I n (cid:105) l ,..,l n , which are the average of I n with differentwavelet orientation indices l , .., l n . We illustrate these ‘re-duced’ ﬁelds in Figure 3, where information is condensed, Following the convention in cosmology, we use the word “homogeneous”to mean “stationary”, the latter of which is more commonly used in math-ematics. To follow the convention in cosmology, we use slightly different notationsfrom Mallat (2012): we use S n to represent the expected values, whichcharacterize properties of a random ﬁeld and which Mallat denotes as S n ;we use ˆ S n to represent S n ’s estimators calculated from spatial average,which Mallat directly denotes as S n . BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM J + J coefﬁcients. As a result, probing the fullrange of scales for an image with 512 ×

512 pixels ( J = 8)yields in total 73 reduced scattering coefﬁcients.In general, performing azimutal averages over all l and l leads to information loss. To preserve more isotropicinformation, one could include l − l as an index of thereduced 2nd-order scattering coefﬁcients (Bruna & Mallat2012; Allys et al. 2019) or apply the scattering strategy againto rotation (Sifre & Mallat 2013). In the weak lensing studypresented below, however, this additional information is notimportant as the ﬁelds considered do not present substantialanisotropic structures. We do not take it into account as itdoes not improve the performance of our analysis.Having introduced the mathematical formulation, we willpresent in the next section some intuitive understanding ofthe key operations involved in the scattering transform. The core operation I → | I (cid:63) ψ j,l | employed by the scat-tering transform comprises two steps: a convolution by acomplex-valued wavelet and a modulus operation. In short,the wavelet convolution selects scales, and the modulus con-verts oscillations into their local strength.Let us discuss the wavelet convolution ﬁrst. As a waveletis a band-pass ﬁlter, the wavelet convolution selects Fouriermodes around a central frequency and coarsely separates in-formation of different scales. Due to the locality of wavelets,which is related to their logarithmic spacing and widths inFourier space, the scattering coefﬁcients are Lipschitz con-tinuous to deformation, meaning that similar ﬁelds differingby a small deformation are also similar in the representationformed by scattering coefﬁcients (Mallat 2012), and there-fore the scattering characterization is a stable one. Fouriercoefﬁcients, in contrast, are unstable to deformation in highfrequencies.Another key idea of the scattering transform is to generateﬁrst-order statistics, in contrast to higher-order moments andcorrelation functions, which multiply an increasing numberof ﬁeld intensities and cause instability to outliers. Being alinear operator, the wavelet convolution certainly keeps the“ﬁrst-order” property. However, for a homogeneous randomﬁeld, convolution alone cannot extract information beyondthe mean of the original ﬁeld (cid:104) I (cid:105) , because the expected valueoperator commutes with all linear transformations. Extract-ing more information requires non-linear operations. For ex-ample, N -point correlation functions use multiplication ofﬁeld intensities as the non-linear operation. The scatteringtransform, on the other hand, employs the modulus operation, which is a natural choice to preserve the desired property ofworking with ﬁrst-order statistics (Mallat 2012).As the modulus is taken in the real space and is non-linear,its behavior in Fourier space is not simple. Nevertheless, wecollected some intuitive understandings and present them inAppendix B for interested readers. The scale separation and phase removal in the scatteringtransform remind us of the power spectrum, which can be de-ﬁned in a similar form to the 1st-order scattering coefﬁcients S = (cid:104)| I (cid:63) ψ |(cid:105) , P ( (cid:126)k ) ∝ (cid:104)| I (cid:63) ψ (cid:48) | (cid:105) with ψ (cid:48) = e − i(cid:126)k · (cid:126)x . (10)The differences between the two estimators S and P ( k ) arejust the choice of the norm (L versus L ) and the convo-lution kernel ( ψ or ψ (cid:48) ). Therefore, the 1st-order scatteringcoefﬁcients are qualitatively similar to the power spectrum.Both of them characterize the distribution of ﬂuctuations, orthe strength of clustering, as a function of scale.However, in the case of the power spectrum, the convo-lution kernel (a sine function) is completely de-localized inreal space. Thus, the power spectrum’s version of I ﬁelds( | I (cid:63) ψ (cid:48) | ) lose all spatial information. In contrast, the useof localized wavelets in the scattering transform allows I to preserve spatial information, as shown in Figure 1 and 3.According to the analogy with the power spectrum, the meanof an I ﬁeld characterizes the average amplitude of Fouriermodes selected by the wavelets, whereas the spatial distri-bution of ﬂuctuations in I , missing in the power spectrumanalog, in turn encodes the phase interaction between thoseFourier modes. This information can be extracted by apply-ing the scattering operations once again, I → I = | I (cid:63) ψ | ,and then measuring the 2nd-order scattering coefﬁcients S .Following the power spectrum analogy, S coefﬁcients re-semble the power spectrum of I ﬁelds and measure cluster-ing properties on I . Because I ﬁelds highlight the regionswhere ﬂuctuation around a scale is strong (or where cluster-ing happens around a scale), the 2nd-order coefﬁcients canbe understood as measuring the clustering of structures high-lighted in I , i.e., the “clustering of (clustered) structures”.This leads to an interesting intuition: we need two pointsto describe the scale of one structure and an additional twopoints for another one. Therefore, the second-order scatter-ing coefﬁcients S , which measure the clustering of sizedfeatures, include information up to the classical 4-point cor-relation function. In general, an n th-order scattering co-efﬁcient S n will contain information up to n -point func-tion of the input ﬁeld. This intuition is consistent with themathematical derivation by Mallat (2012). By this “hierar-chical clustering” design, the scattering transform expansionquickly includes information from higher-order statistics. C HENG , T

ING , M ´

ENARD & B

RUNA

Figure 3.

The scattering transform of three ﬁelds ( I ) with indistinguishable power spectra. We have averaged over angular indice l and l .Row 1 shows a realization of convergence maps in cosmology (Ω m , σ ) = (0.292, 0.835), row 2 shows cosmology (Ω m , σ ) = (0.566, 0.520),row 3 is for a Gaussian random ﬁeld with the same (2D) power spectrum as row 1. It can be seen by eye that the intensity of the last column (cid:104) I (cid:105) l ,l , which corresponds to an s coefﬁcient and measures the clustering strength of structures selected by I , is signiﬁcantly different fromeach other, while their power spectra are indistinguishable.

3. Application to weak lensing cosmology

We now show that the scattering transform can be a power-ful tool in observational cosmology to extract non-Gaussianinformation from the matter density ﬁeld. To illustrate thispoint, we consider an application with 2-dimensional ﬁelds:we show how well cosmological parameters can be con-strained using the scattering coefﬁcients of weak lensing con-vergence maps κ ( (cid:126)θ ) or, equivalently, measurements of cos-mic shear. Being projections of the density ﬁeld along theline-of-sight, these maps present an appreciable level of non-gaussianities on scales smaller than a few degrees, reﬂectingthe non-linear growth of matter ﬂuctuations. For the nec-essary background on cosmology with gravitational lensing,we refer the reader to reviews (Kilbinger 2015; Mandelbaum2018).We explore the use of our reduced scattering coefﬁcientson simulated weak lensing convergence maps to infer Ω m and σ and compare their performance with that of the powerspectrum. We also compare our results with that of a state-of-the-art CNNs by Ribli et al. (2019b) and peak count statistics. We use mock convergence maps generated by theColumbia Lensing team and described in Zorrilla Matillaet al. (2016) and Gupta et al. (2018). The maps are pro-duced through ray-tracing to redshift z = 1 in the output ofdark-matter-only N -body simulations for a set of Λ -CDMcosmologies. These cosmologies differ only in two param-eters: the present matter density relative to the critical den-sity Ω m , and a present normalization of the power spectrum σ . Other cosmological parameters are ﬁxed: baryon density Ω b = 0.046, Hubble constant h = 0.72, scalar spectral index n s = 0.96, effective number of relativistic degrees of freedom( n eff = 3.04), and neutrino masses ( m ν = 0.0). The dark en-ergy density is set so that the universe is spatially ﬂat, i.e., Ω Λ = − Ω m .For each cosmology, there are 512 convergence maps with3.5 × ﬁeld of view, allowing us to sample cosmicvariance. These maps were also used by Ribli et al. (2019b).To compare our results to Ribli et al. (2019b), we use the http://columbialensing.org BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM pixelmaps to a 512 resolution with 0.41 arcmin per pixel. In practice, convergence or shear estimates are obtainedfrom measurements of galaxy shapes, with a level of noisethat depends on the galaxy ellipticity distribution and theirnumber density on the sky. To ﬁrst order, background galax-ies used for shear measurements have a wide range of red-shifts and are not correlated. The noise can be well approx-imated as Gaussian white noise. Its contribution to the con-vergence maps can be modelled (van Waerbeke 2000) as σ noise = σ (cid:15) n g A pix , (11)where σ (cid:15) is the intrinsic variance of ellipticity of galaxies,which is taken to be 0.4 , n g is the number density of back-ground galaxies, A pix is the area per pixel, which is 0.1682arcmin . For some existing and on-going surveys such asCFHTLenS, KiDS450, and DES, n g is around 10 arcmin − (Kilbinger et al. 2013; Abbott et al. 2018); for some upcom-ing surveys we expect substantially higher densities: n g ∼

25 arcmin − for LSST , n g >

30 arcmin − for Euclid , and n g ∼ − for the planned WFIRST survey. forfuture optimal case, n g ∼

100 arcmin − .After adding noise, we also smooth the maps. As the powerof Gaussian white noise is distributed mostly at high frequen-cies, smoothing the convergence maps can help to increasethe signal-to-noise of speciﬁc estimators. By default, we donot smooth the noiseless maps, and we perform a σ = 1 ar-cmin (2.44 pixel) Gaussian smoothing on noisy maps. : For each 3.5 × conver-gence ﬁeld in each cosmology, we apply the scattering trans-form up to 2nd order using the ‘kymatio’ python package (Andreux et al. 2020) and then calculate the reduced coef-ﬁcients ( s , s , s ) as deﬁned in Section 2.2. To probe theavailable range of scales, we set J = 8 and L = 4 in thescattering transform, i.e., we use 8 scales spaced logarith-mically with central wavelengths between 1.2 arcmin and 75arcmin and 4 azimuthal orientations, resulting in 32 differentwavelets used in total.By default, the ‘kymatio’ package only calculate the 2nd-order coefﬁcients with j > j , because the coefﬁcients with j ≤ j is mainly determined by the property of wavelets but https://sci.esa.int/euclid https://wﬁrst.gsfc.nasa.gov not the input ﬁeld, as illustrated by the upper-right sketch ofFigure 4. The mathematical reasoning for this property canbe found in Appendix B. To demonstrate these coefﬁcients’behavior, we modiﬁed the ‘kymatio’ code to calculate them,and show them together with the coefﬁcients with j > j in Figure 4. Nevertheless, we checked that they do not con-tribute to constraining cosmological parameters, and there-fore in our inference analysis, we only use 2nd-order coef-ﬁcients with j > j , which yields an even more compactset of 1 + 8 + 28 = 37 scattering coefﬁcients used for ourcosmological inference. Power spectrum : For the same set of input ﬁelds, we alsocompute the power spectrum and peak count statistics us-ing the publicly available ‘LensTools’ python package (Petri2016). The power spectrum is calculated within 20 bins in therange 100 ≤ l ≤ Peak count : In our analysis, a peak is deﬁned as a pixelwith higher convergence ( κ ) than its eight neighbors. Then,peaks are binned by their κ values and counted in each bin.We adopt a binning similar to that in Liu et al. (2015a). Weuse 20 bins in total, including 18 bins linearly spaced be-tween κ = –0.02 and 0.12, one bin for peaks below –0.02,and one bin above 0.12. For reference, κ = 0.12 correspondsto a signiﬁcance of peak ν ≡ κ/σ noise around 7 when n g = 30.Although using more bins for very high peaks ( κ > s , s , and power spectra must be positive for a non-trivialﬁeld, we consider their logarithm to better satisfy a multi-variate Gaussian likelihood. To perform the cosmological in-ference analysis with the three methods introduced above, weuse •

37 scattering coefﬁcients •

20 power spectrum coefﬁcients •

20 peak count coefﬁcients. https://lenstools.readthedocs.io C HENG , T

ING , M ´

ENARD & B

RUNA

Figure 4.

Upper-left Panel : The ﬁducial cosmology (black) and two other cosmologies on the ( Ω m , σ ) plane. Upper-right Panel : Illustrationof reduced scattering coefﬁcients s ( j ) and s ( j , j ) for a single j scale. Lower Panel : The power spectrum and scattering coefﬁcients forthe three cosmologies in noiseless case. The ﬁrst row presents coefﬁcients in the ﬁducial cosmology, and the second row shows changes ofcoefﬁcients ( ∆ coef.) relative to the ﬁducial one when we move to the two other cosmologies. Error bars and gray shaded regions show cosmicvariance, i.e., the intrinsic variability among realizations. The 1st-order scattering coefﬁcients behave similarly to the power spectrum, whilethe 2nd-order scattering coefﬁcients can break the Σ degeneracy, along which non-Gaussianity of weak lensing ﬁeld changes.

4. Results

In this section, we examine the distribution and cosmo-logical sensitivity of scattering coefﬁcients, and present theirconstraining power for two cosmological parameters, Ω m and σ . We show that the scattering coefﬁcients provide sub-stantially more information than the power spectrum and ison a par with CNN. In Figure 4, we present the distributions of reduce scatter-ing transform in the noiseless case together with the powerspectrum. In the ﬁrst row, we show the values for a ﬁducialcosmology that has the Planck cosmology of Ω m = 0.309and σ = 0.816 (Planck Collaboration et al. 2016). The ex-pected values of these descriptors are estimated by averaging over different realizations of a given cosmology. Error bars,which are the sample standard deviation of the realizations,represent the cosmic variance in this noiseless case. We cansee the similarity between the power spectrum and s coefﬁ-cients, as they have similar physical meanings (Section 2.4).We can also see the different behaviors of s coefﬁcients for j < j and j > j , as discussed in Section 3.3.Then, we investigate the cosmological sensitivity of thepower spectrum and scattering coefﬁcients. The power spec-trum is known to be mostly sensitive to one combination ofthe cosmological parameters, namely Σ ≡ σ (cid:18) Ω m . (cid:19) a , (12)with a around 0.6 (e.g., Kilbinger 2015), but can hardlydistinguish cosmologies with the same Σ , as illustrated in BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM Σ degeneracy. Gray ar-eas indicate cosmic variance of the ﬁducial cosmology. Asexpected, the 1st-order scattering coefﬁcients show a cosmo-logical sensitivity similar to that of the power spectrum, be-cause both of them measure the strength of ﬂuctuations as afunction of scale.The 2nd-order scattering coefﬁcients, on the other hand,characterize the spatial distribution of sized ﬂuctuations. Tomake the 2nd-order scattering coefﬁcients less correlatedwith the 1st-order ones, here we present de-correlated 2nd-order coefﬁcients s /s , as each s ( j , j ) is proportionalto the corresponding s ( j ) according to their deﬁnitions(Bruna et al. 2015). These s /s exhibit particularly highsensitivity to cosmological change along the Σ degeneracy.In addition, they are indifferent to the other direction of cos-mological change, which means they provide a piece of in-formation roughly orthogonal to that carried by the 1st-ordercoefﬁcients s or the power spectrum. In noisy cases, thoughthe information from s /s is not orthogonal to s anymore,we have checked that they still provide substantial sensitivityalong the Σ degeneracy. Due to this additional sensitivity,the scattering transform can be used to better constrain cos-mological parameters than the power spectrum. We now present the cosmological constraints set by thescattering coefﬁcients measured from a single 3.5 × ﬁeld. For reference, we note that LSST will generate about2,000 times more data, leading to constraints about timestighter than the numbers presented below. In this study, weonly probe the constraints on Ω m and σ and leave the workof using scattering coefﬁcients to constrain the dark energyequation of state parameter w or neutrino mass M ν to futurestudy. Cosmological inference is just another aspect of thecosmological sensitivity problem examined in the previoussubsection. The Fisher inference formalism we use in thisstudy is described in Appendix C.We ﬁrst present results in the noiseless case. In Figure 5we demonstrate the 1 σ Fisher forecast of Ω m and σ usingall scattering coefﬁcients (red ellipse) and power spectrum(gray ellipse). The scattering coefﬁcients provide a dramat-ically tighter constraint than the power spectrum. We alsoshow a break-down of this constraining power into contri-butions from 1st-order (blue ellipse) and 2nd-order (orangeellipse) coefﬁcients alone. As expected, the 1st-order coefﬁ-cients ( s ) and power spectrum set similar constraints. Theslight difference of ellipse orientation originates from the dif- Figure 5.

The 1 σ Fisher forecast of cosmological parameters froma 3.5 × noiseless convergence map with 0.41 arcmin /pixel resolution. The de-correlated 2nd-order scattering coefﬁcients s /s provide critical information to break the Σ degeneracy alongwhich the power spectrum cannot distinguish, therefore drasticallyimprove the constraint. ference between the L and L norms used by the scatteringtransform and the power spectrum. The de-correlated 2nd-order scattering coefﬁcients ( s /s ) provide a strong con-straint along the Σ degeneracy, consistent with our cosmo-logical sensitivity discussion in Section 4.1.We note that the 0th-order coefﬁcient s (the mean of a3.5 × ﬁeld) has strong correlation with other scatter-ing coefﬁcients, which is a sign of being in the non-linearregime and having non-Gaussianity (e.g., Li et al. 2020).Therefore, although the expected value of s is identicallyzero in all cosmology, combining s with other coefﬁcientshelps substantially to tighten the constraints on cosmologi-cal parameters. However, this piece of information may notscale as fast with the increasing ﬁeld of view as small scaleinformation, because in real data each patch of 3.5 × ﬁelds on the sky are not independent.To be more quantitative, we compare different methods us-ing the reciprocal of the area of their 1 σ Fisher forecast el-lipses on the ( Ω m , σ ) plane as the ﬁgure of merit (FoM). Inthe noiseless case, combining all scattering coefﬁcients ( s , s , s ) leads to a constraint that is 14 times tighter than thatof the power spectrum, 5 times tighter than peak count statis-tics, and 3.3 times tighter than the joint constraint from powerspectrum and peak count.0 C HENG , T

ING , M ´

ENARD & B

RUNA

Table 1.

Comparison of the constraining power for ( Ω m , σ ) between different methods, with a single 3.5 × convergence map.Methods Ω m – σ Figure of Merit a n g = 10 arcmin − n g = 30 arcmin − n g = 100 arcmin − noiseless noiseless (no smoothing)scattering transform: s + s + s

50 140 329 1053 3367scattering transform: s + s

21 55 133 492 565scattering transform: s + s

39 91 181 446 1720power spectrum P( l )

20 40 67 104 253peak count 30 89 162 170 667CNN (Ribli et al. 2019b) (44) (121) (292) (1201) ( - ) a The ﬁgure of merit is deﬁned as the reciprocal of the 1 σ conﬁdent area based on Fisher matrix (or the 68% posterior contour, in parentheses)on the ( Ω m , σ ) plane. The convergence maps are smoothed with σ = 1’ Gaussian ﬁlter except for the case shown in the last column with nosmoothing. Figure 6.

The 1 σ Fisher forecast of cosmological parameters ( Ω m and σ ) from different descriptors of a 3.5 × convergence mapsmoothed with σ = 1’ Gaussian ﬁlter. The scattering coefﬁcients have comparable performance as a state-of-the-art CNN (Ribli et al. 2019b) atall noise levels, and 3–5 times better than the power spectrum depending on the noise level. We then compare the performance of the scattering trans-form to a state-of-the-art CNN analysis by Ribli et al.(2019b). To perform a meaningful comparison, as done bythese authors, we use noiseless convergence maps convolvedwith a σ = 1 arcmin Gaussian smoothing. Interestingly, weﬁnd that the scattering coefﬁcients extract a similar amountof cosmological information to the CNN trained in Ribli et al.(2019b). The corresponding ﬁgures of merit are shown in Ta-ble 1.We now consider convergence ﬁelds in the presence ofgalaxy shape noise. As the noise level increases, small-scalestructures, which carry plenty of cosmological information,get erased. As a result, the constraining power of the scat-tering coefﬁcients (as well as other methods) degrades. In Figure 6 we show the Fisher forecast of Ω m and σ from a3.5 × convergence map under three noise levels, us-ing the scattering coefﬁcients and the power spectrum. Wealso show the posterior constraints from CNNs trained by(Ribli et al. 2019b) on the same simulations. The ﬁguresof merit for these methods, together with the peak countmethod, are listed in Table 1. Again, we ﬁnd that the scat-tering transform not only outperforms the power spectrumand peak count, but also provides cosmological constraintson a par with state-of-the-art CNNs.To summarize: we have demonstrated the power of thescattering transform for cosmological parameter inferencewith weak lensing data. For simplicity, we focused on theconvergence ﬁeld but a similar analysis can also be per- BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM

5. Discussion

When performing inference, we usually need to estimatea set of physical parameters underlying a ﬁeld from a ﬁnitenumber of realizations. Ideally, one would like to use thelikelihood function of the ﬁeld itself but this is often out ofreach. It is typically not analytically tractable or too costlyto sample through numerical simulations. Therefore, for theinference problem to be feasible, it is reduced towards a setof statistical descriptors, rather than the ﬁeld itself. In otherwords, it requires a ‘statistical’ compression of the data.In the case of Gaussian random ﬁelds, we know that thepower spectrum provides a sufﬁcient statistic, i.e. a loss-less compression of information. However, when there isno known parametric statistical model of the ﬁeld, estimat-ing valid probability distributions in the high-dimensionalregime is a challenging task. A powerful framework is givenby maximum-entropy regularised estimation, which looks forthe most ‘non-committal’ statistical model under the con-straints of a ‘feature vector’ of sufﬁcient statistics (Jaynes1957). There is thus a tension in the design of such vectorof sufﬁcient statistics (Bruna & Mallat 2019): On the onehand, one would like the features to be efﬁciently estimatedfrom the available samples, so that the corresponding statisti-cal model is robust under resampling. In other words, typicalsamples from the true distribution should remain typical un-der the estimated statistical model. On the other hand, thefeatures should be descriptive enough so that they introduceenough constraints, i.e., typical samples from the estimatedmodel should also be typical under the true distribution.Viewed in this direction, modern overparametrised CNNssearch for the most informative compression and representa-tion to the inference problem through a training optimization,at the expense of deﬁning brittle statistical models (Bruna &Mallat 2019). Traditional statistical approaches, on the otherhand, design the compression/representation based on someproperties of the ﬁeld expected from domain knowledge. Forexample, the peak count statistic is used in weak lensing asconvergence maps display distinct halos. N -point correlationfunctions, closely related to perturbation theory and conve-nient for analytical prediction, include/exclude dimensionsof variations by truncating a series expansion, which makesthem good descriptors for ﬁelds slightly deviating from a Figure 7.

Dependence of the ( Ω m , σ ) constraints with differentmethods on galaxy shape noise. The ﬁgure of merit (FoM) is deﬁnedas the 1 σ conﬁdent area on the ( Ω m , σ ) plane. Note that the CNNresult (Ribli et al. 2019b) is reported in terms of posterior, whileothers are Fisher forecast. We have checked that for the scatteringtransform, the posterior FoM is only slightly lower than the Fisherforecast reported here and almost identical to the CNN result. Gaussian one. However, a highly non-Gaussian ﬁeld requiresusing larger N . As the number of coefﬁcients and the com-plexity of conﬁgurations increase rapidly with N , N -pointcorrelation functions quickly become an inefﬁcient or non-robust representation of the original ﬁeld.In physics, we are actually not interested in performing in-ference with any non-Gaussian ﬁeld, but only the subset ofﬁelds that are generated by physical interactions. Such ﬁeldstypically display localized, coherent structures. These struc-tures also act as bricks and then form larger structures. So, inorder to characterize them, it is important to take these prop-erties into account. One needs an appropriate or “tailored”representation rather than a fully generic one. As we willshow in the next section, the design and operations of thescattering transform leads to an efﬁcient and robust represen-tation for such ﬁelds. Efﬁciency: two binning strategies make the scatteringtransform efﬁcient.(i) Spatial/frequency binning: the use of wavelets balancesthe resolution in real and frequency space. As a result, thescattering transform can capture localized information froma large range of scales with a compact set of coefﬁcients. Forexample, in our case, the weak lensing information is com-2 C

HENG , T

ING , M ´

ENARD & B

RUNA pressed into 37 coefﬁcients, a number that is much smallerthan typical bi-spectrum descriptors, while achieving state-of-the-art CNN-like constraint on cosmological parameters.(ii) High-order information binning: after selecting struc-tures of given sizes through wavelet convolution and modulusoperations in the 1st order, the scattering transform then se-lects structures ‘assembled’ by these 1st-order structures inthe next order. This hierarchical design allows the n th or-der scattering coefﬁcients to quickly include information of n -point functions.These two binnings do not lose much information because(i) interesting phenomena in physics often appear in multiplescales, not all squeezed at one scale, and (ii) in physical worldstructures do often form hierarchically. Robustness: (i) As opposed to the N -point correlation function ap-proach, which requires multiplying an increasing number ofﬁeld ﬂuctuations and causes high variance, the scatteringtransform is a ﬁrst-order statistic which avoids amplifyingthe process variability. As a result, it yields low variance de-scriptors and is less sensitive to outliers.(ii) The locality of wavelets, which is related to their loga-rithmic spacing and widths in frequency space, introduce sta-bility to deformations, a desired property of robust descrip-tors which classical N -point functions do not have. Interpretability: as discussed in Section 2.4, the scatter-ing coefﬁcients have an intuitive interpretation. They de-scribe clustering properties of the ﬁeld in the following way:(i) The 1st-order scattering coefﬁcients are similar to acoarsely binned power spectrum, which characterize the clus-tering strength at different scales j . As the scattering trans-form uses an L norm as opposed to an L norm, the ra-tio between s coefﬁcients and the power spectrum providesa measure of sparsity of the ﬁeld. This explains why inFigure 5 the constraints from 1st-order coefﬁcients and thepower spectrum are slightly different, and just combiningthese two can also provide a stronger constraint on cosmol-ogy than using power spectrum alone.(ii) The 2nd-order scattering coefﬁcients characterize theclustering strength of j -scale structures separated by j -scales. In other words, these coefﬁcients characterize the“clustering of structures”, selected over a given frequencyrange. Their departure from their Gaussian counterparts isa robust measure of the strength of non-gaussianities. The scattering transform and CNNs share a number ofproperties. Both of them have hierarchical layers with con-volutions and non-linear operations. The ﬂexibility of CNNsleads to a higher performance for ﬁelds with much highercomplexity, such as images from the world around us, but atthe expense of generating less robust statistical models. The cosmological density ﬁeld is not as complex as random im-ages of rabbits. The simplicity of the structures encounteredin weak lensing maps and their regularity as a function ofscale, imply that a descriptor using convolutions by a sim-ple wavelet can extract a substantial amount of information.As shown by Ribli et al. (2019b), a CNN trained on conver-gence maps internally generates convolution ﬁlters similar tothe Morlet wavelet.While the generation of these CNN ﬁlters requires a largenumber of simulations as a training set, the scattering trans-form which has preset ﬁlters does not require any learningphase. In addition, the choice of CNN architecture can mod-ify the results substantially, as can be seen for example inthe comparison between results of Ribli et al. (2019b) andGupta et al. (2018). As such, CNNs usually require much,and often ad-hoc, ﬁne-tuning. The scattering transform, onthe other hand, is not subject to these sources of variability.It requires the use of simulations only to probe the cosmicvariance of the descriptors.Another advantage of the scattering transform over CNNsis the possibility to investigate systematic effects in the data.Analyzing the scattering coefﬁcients not only returns thebest-ﬁtting cosmological parameters but also allows for san-ity checks and the evaluation of goodness of ﬁt. In contrast,CNNs currently provide little opportunity to verify the ro-bustness of their outcomes. As it is most of the time virtuallyimpossible to precisely simulate real data, the ﬁdelity of sim-ulated training sets is limited and the impact of systematiceffects can be difﬁcult to assess.

The non-linear gravitational evolution of density ﬂuctua-tions in the universe gives rise to halos, which are virializedsystems locally bound by gravity. As highlighted by Ribliet al. (2019b), a substantial amount of non-Gaussian cosmo-logical information can be extracted from these features. Thepeak count method makes use of this domain knowledge andis designed to capture the abundance of halos. However, itdoes not characterize the spatial distribution of these halos,which is also sensitive to cosmological parameters. On theother hand, the scattering transform extracts a comprehen-sive information of the abundance, proﬁle, and distribution ofhalos by highlighting structures of particular scales and thencharacterizing their clustering at other scales, as describedin Section 2.4. In the limit of small j and large j , the2nd-order scattering coefﬁcients can be simply understoodas a measure of the “two-halo term” in the halo model. Thehalo abundance and proﬁle also leave their imprint on s co-efﬁcients at different j and j scales. In this manner, thescattering transform provides a non-parametric description inone-halo, two-halo, and the transitional regime where halosoverlap and are forming larger halos. BSERVATIONAL COSMOLOGY WITH THE SCATTERING TRANSFORM

6. Conclusion

Characterizing arbitrary non-Gaussian ﬁelds is challeng-ing as the dimensionality of their description can be arbitrar-ily high. The subset of ﬁelds relevant in physics, however,tends to be more constrained as they typically display local-ized, coherent structures. In the cosmological context, thematter density ﬁeld presents another characteristic property:hierarchical clustering. An efﬁcient statistical descriptor ofthe cosmological density ﬁeld would ideally be able to natu-rally capture these two properties.In this paper, we advocate the use of the scattering trans-form, a statistical estimator designed to extract informationfrom a complex ﬁeld with provable stability properties.It involves operations similar to those found in convolu-tional neural nets (CNNs): it uses wavelets, which balancethe spatial and frequency resolution, as its convolutional ker-nel and the modulus operation as its non-linear operation.However, in contrast to CNNs, the scattering transform does not require any training. This statistical estimator can gener-ate a compact set of robust coefﬁcients that forms a represen-tation of the input ﬁeld and can be used as efﬁcient summarystatistics for non-Gaussian information.We applied the scattering transform to the cosmologicalparameter inference problem in the context of weak lensingdata. For simplicity, we focused on the convergence ﬁeldbut a similar analysis can also be performed on the shearﬁeld. We used simulated convergence maps generated byray-tracing N -body simulation results (Zorrilla Matilla et al.2016; Gupta et al. 2018) and measure their scattering coefﬁ-cients to infer the cosmological parameters Ω and σ . Forsimulated data with galaxy shape noise, the scattering trans-form outperforms the power spectrum, peak counts and is onpar with state-of-the-art CNNs.As described in section 5.2, the scattering transform pos-sesses a series of attractive properties for parameter estima-tion, especially: it is efﬁcient, robust, and interpretable. Aswe have shown, the 1st-order scattering coefﬁcients charac-terize the clustering strength at different scales and are sim-ilar to the power spectrum; the 2nd-order scattering coefﬁ-cients efﬁciently extract non-Gaussian information by char-acterizing the clustering of structures of different sizes. Ingeneral, the n th-order scattering coefﬁcients include infor-mation of up to n points functions, conveniently summa-rized in a compact set of coefﬁcients, whose distribution istypically Gaussian (a desired property for likelihood esti-mation). Similarly to classical statistical estimators (not re-quiring training nor tuning), the scattering transform offersthe possibility to investigate for systematic errors potentiallypresent with real data.In this paper we focused on applications of the scatteringtransform for cosmological parameter inference with weaklensing data. Using it with existing and upcoming surveys (e.g. DES, LSST, Euclid , WFIRST ) can be great interest toimprove constraints and/or provide consistency checks. Nat-urally, the scattering transform is an attractive estimator formany other applications: in observational cosmology, astro-physics and beyond.

Acknowledgments

We thank the Columbia Lensing group for making theirsuite of simulated maps available, and NSF for supportingthe creation of those maps through grant AST-1210877 andXSEDE allocation AST-140041. We also thank Dezs˝o Riblifor useful discussions. YST is supported by the NASA Hub-ble Fellowship grant HST-HF2-51425.001 awarded by theSpace Telescope Science Institute. This work is partially sup-ported by the Alfred P. Sloan Foundation, NSF RI-1816753,NSF CAREER CIF 1845360, NSF CHS-1901091, SamsungElectronics, and the Institute for Advanced Study.

Appendix A Morlet Wavelets

Wavelets are localized oscillations. If we simply use aGaussian envelop to modulate a plane wave, then we obtaina Gabor function in arbitrary dimensions G ( (cid:126)x ) = 1 (cid:112) | Σ | e − (cid:126)x T Σ − (cid:126)x/ e i (cid:126)k · (cid:126)x , (A1)where Σ is the covariance matrix describing the size andshape of the Gaussian envelop, and (cid:126)k determines the fre-quency of the modulated oscillation. To keep maximum sym-metry, usually Σ is selected to have only 1 eigen-value differ-ent from the others, and (cid:126)k to be along that eigen-direction.Thus we denote the eigen-value along (cid:126)k by σ and the othereigen-value by σ /s . The parameter s is also the ratio oftransverse to radial width of the wavelet in Fourier space.The Fourier transform of a Gabor function is simply aGaussian ﬁlter centered at (cid:126)k , ˜ G ( (cid:126)k ) = e − ( (cid:126)k − (cid:126)k ) T Σ ( (cid:126)k − (cid:126)k ) / . (A2)Wider envelop in real space makes narrower ﬁlter in Fourierspace. Note that the product k σ determines the number ofoscillations within ± π ≈ standard deviation of the Gaus-sian envelop and allows for a trade off between spatial andfrequency resolution.Unfortunately, a Gaussian proﬁle in Fourier space does notgo to zero at 0 frequency. This contradicts the admissibilityof wavelet which requires wavelets to strictly be band-passﬁlters, not low-pass ﬁlters. Therefore, a small correction isrequired. A simple solution is to introduce an offset, β , be-fore the modulation. In Fourier space this is equivalent to4 C HENG , T

ING , M ´

ENARD & B

RUNA subtracting another Gaussian proﬁle centered at 0 to cancelout the 0-frequency contribution. Families of wavelets cre-ated in this way are called Morlet wavelets. Formally, ψ ( (cid:126)x ) = 1 (cid:112) | Σ | e − (cid:126)x T Σ − (cid:126)x/ (cid:16) e i (cid:126)k · (cid:126)x − β (cid:17) , (A3)where β = e − (cid:126)k T Σ (cid:126)k / is determined by the admissibilitycriterion. Its Fourier transform is ˜ ψ ( (cid:126)k ) = ˜ G ( (cid:126)k ) − βe − (cid:126)k T Σ (cid:126)k/ . (A4)In our study, which is a 2-dimension case, we follow thesettings used in the ‘kymatio’ package mentioned in Sec-tion 3.3, σ = 0 . × j k = 3 π × j s = 4 /L , (A5)where σ is in unit of pixels, j is an integer starting from 0,and k is always between 0 and 1. This choice allow a familyof Morlet wavelets best covers the whole Fourier space, witha dyadic sequence of scales ( j ). Within the wavelet envelop,there are about 2 cycles of oscillations, because k σ ≈ . Appendix B Scattering transform in Fourier space

The modulus and expectation operations of the scatteringtransform are deﬁned in the real space and do not have simpleFourier counterparts, so in general it is easier to understandit in the real space. However, as the Morlet wavelets looksimpler in Fourier space (they are band-pass ﬁlters), it is alsoenlightening to collect some intuitions of the scattering trans-form from the Fourier side.In general, the modulus operation will mix Fourier modesand change the frequency distribution of a ﬁeld. In par-ticular, if the ﬁeld in question has been convolved with acomplex-valued wavelet that has a single peak in Fourierspace, then the energy will be redistributed mainly towardslower frequencies. I.e., the typical frequency of | I (cid:63) ψ | islower than I (cid:63) ψ . This can be revealed by ﬁrst writing | I (cid:63) ψ | as (cid:112) ( I (cid:63) ψ )( I (cid:63) ψ ) ∗ , where ∗ stands for complex conjugate,and then Taylor expanding this square root with powers of ( I (cid:63) ψ )( I (cid:63) ψ ) ∗ (Mallat 2010). For the Morlet wavelets usedin scattering transform, the central frequency (in terms ofwavenumber) of the wavelet ψ is roughly k , and the half-width of the wavelet in Fourier space (in the radial direction)is around /σ . So, the leading term of the Taylor expansion,which is proportional to ( I (cid:63) ψ )( I (cid:63) ψ ) ∗ , will have an half-width around √ /σ in Fourier space and its centroid shiftedback to 0 from k , as a multiplication in the real space cor-responds to a convolution in Fourier space. This means that the typical frequency of | I (cid:63) ψ | is lower than I (cid:63) ψ , because √ /σ ≈ . k < k , according to Equation A5.In other words, the core operation I → | I (cid:63) ψ | expresseshigh frequency information of I n in terms of lower fre-quency components including the 0-frequency component inthe next-order ﬁelds I n +1 . As the 0-frequency componentis translation invariant, it can be directly used as a statisticaldescriptor of the original ﬁeld.In addition to the qualitative understandings we have pre-sented above, it will be valuable to understand more quanti-tatively the Fourier behavior of the scattering coefﬁcients. Appendix C Cosmological inference framework

In this appendix we describe the Fisher forecast formal-ism used to infer the cosmological parameters in this study.According to the Cram´er–Rao inequality, the variance ofany unbiased estimator ˆ θ for model parameters θ cannotbe smaller than the inverse of the Fisher information matrix I ( θ ) of the model: cov ( ˆ θ ) ≥ I ( θ ) − . (C6)Elements of the Fisher matrix is deﬁned as I m,n ( θ ) ≡ − (cid:28) ∂ ln p ( x | θ ) ∂θ m ∂ ln p ( x | θ ) ∂θ n (cid:29) , (C7)where x is the observable, p is the likelihood function, (cid:104)·(cid:105) is the expectation over x . In our cosmological case, θ rep-resents cosmological parameters, θ = (Ω m , σ ) , and x rep-resents the statistical descriptors such as the scattering coef-ﬁcients. The function p ( x | θ ) is called the likelihood of θ when x is ﬁxed, and is called the probability density func-tion (PDF) of x when θ is ﬁxed. In our study, we assumethat given any cosmology θ , the PDF of statistical descrip-tors x is Gaussian: p ( x | θ ) ∝ (cid:112) | C | exp [ −