[PDF] Nearest Neighbor distributions: new statistical measures for cosmological clustering

Abstract

The use of summary statistics beyond the two-point correlation function to analyze the non-Gaussian clustering on small scales is an active field of research in cosmology. In this paper, we explore a set of new summary statistics -- the k -Nearest Neighbor Cumulative Distribution Functions ( kNN - CDF ). This is the empirical cumulative distribution function of distances from a set of volume-filling, Poisson distributed random points to the k -nearest data points, and is sensitive to all connected N -point correlations in the data. The kNN - CDF can be used to measure counts in cell, void probability distributions and higher N -point correlation functions, all using the same formalism exploiting fast searches with spatial tree data structures. We demonstrate how it can be computed efficiently from various data sets - both discrete points, and the generalization for continuous fields. We use data from a large suite of N -body simulations to explore the sensitivity of this new statistic to various cosmological parameters, compared to the two-point correlation function, while using the same range of scales. We demonstrate that the use of kNN - CDF improves the constraints on the cosmological parameters by more than a factor of 2 when applied to the clustering of dark matter in the range of scales between 10 h −1 Mpc and 40 h −1 Mpc . We also show that relative improvement is even greater when applied on the same scales to the clustering of halos in the simulations at a fixed number density, both in real space, as well as in redshift space. Since the kNN - CDF are sensitive to all higher order connected correlation functions in the data, the gains over traditional two-point analyses are expected to grow as progressively smaller scales are included in the analysis of cosmological data.

Full PDF

MMNRAS , 1–21 (2020) Preprint 28 July 2020 Compiled using MNRAS L A TEX style ﬁle v3.0

Nearest Neighbor distributions: new statistical measures forcosmological clustering

Arka Banerjee (cid:63) and Tom Abel † Kavli Institute for Particle Astrophysics and Cosmology, Stanford University, 452 Lomita Mall, Stanford, CA 94305, USADepartment of Physics, Stanford University, 382 Via Pueblo Mall, Stanford, CA 94305, USASLAC National Accelerator Laboratory, 2575 Sand Hill Road, Menlo Park, CA 94025, USA

Accepted XXX. Received YYY; in original form ZZZ

ABSTRACT

The use of summary statistics beyond the two-point correlation function to analyze the non-Gaussian clustering on small scales, and thereby, increasing the sensitivity to the underlyingcosmological parameters, is an active ﬁeld of research in cosmology. In this paper, we explorea set of new summary statistics — the k -Nearest Neighbor Cumulative Distribution Functions( k NN-CDF). This is the empirical cumulative distribution function of distances from a set ofvolume-ﬁlling, Poisson distributed random points to the k –nearest data points, and is sensitiveto all connected N -point correlations in the data. The k NN-CDF can be used to measure countsin cell, void probability distributions and higher N –point correlation functions, all using thesame formalism exploiting fast searches with spatial tree data structures. We demonstrate how itcan be computed eﬃciently from various data sets - both discrete points, and the generalizationfor continuous ﬁelds. We use data from a large suite of N -body simulations to explore thesensitivity of this new statistic to various cosmological parameters, compared to the two-pointcorrelation function, while using the same range of scales. We demonstrate that the use of k NN-CDF improves the constraints on the cosmological parameters by more than a factor of2 when applied to the clustering of dark matter in the range of scales between 10 h − Mpc and40 h − Mpc. We also show that relative improvement is even greater when applied on the samescales to the clustering of halos in the simulations at a ﬁxed number density, both in real space,as well as in redshift space. Since the k NN-CDF are sensitive to all higher order connectedcorrelation functions in the data, the gains over traditional two-point analyses are expected togrow as progressively smaller scales are included in the analysis of cosmological data.

Key words: cosmology – cosmological parameter constraints

Over the past three decades, a large part of the progress in cosmol-ogy — especially in the pursuit of increasingly precise and accurateconstraints on the standard cosmological paramemeters — has beenpredicated on the use of two-point statistics in the analysis of cos-mological datasets, either in real space, or in Fourier space. Thisincludes analysis of the Cosmic Microwave Background (PlanckCollaboration et al. 2018; Hinshaw et al. 2013) as well as LargeScale Structure analyses at low redshifts (Alam et al. 2017; Ivanovet al. 2020; d’Amico et al. 2020; Hildebrandt et al. 2017; Abbottet al. 2018). The latter include the analysis of both galaxy clustering— discrete data points — and weak lensing measurements — in theform of continuous maps. These studies employ a wide range of the-oretical approaches in their modeling, including linear perturbationtheory, higher order perturbation theory, and N -body simulations, (cid:63) E-mail: [email protected] † E-mail: [email protected] but the two-point correlation function, or the power spectrum isusually the summary statistic of choice to compare theoretical pre-dictions and data. Given its widespread use, a number of tools havebeen developed over the years for fast and eﬃcient calculation oftwo-point statistics from cosmological data, or from simulations.The two-point statistics provides a complete statistical descrip-tion of a Gaussian random ﬁeld. Perturbations in the early Universeare believed to closely follow the statistics of a Gaussian randomﬁeld ( e.g.

Peacock 1998), though attempts have also been madeto quantify any departures from Gaussianity in the early Universe(Planck Collaboration et al. 2019). As long as the evolution of theearly Universe perturbations remain well described by linear per-turbation theory, the Gaussian nature of the ﬁeld is not aﬀected.High redshift CMB analyses ( z ∼ ) and the analyses of clus-tering on extremely large scales at low redshift fall in this regime.Therefore, current analyses, utilizing two-point statistics, are able toextract maximal information about cosmological parameters fromthese redshifts and scales. On the other hand, at low redshifts, theﬁeld develops non-Gaussian features sourced by continued gravita- © 2020 The Authors a r X i v : . [ a s t r o - ph . C O ] J u l Banerjee & Abel tional collapse — smaller the scale, more non-Gaussian the under-lying ﬁeld. Higher connected N -point correlation functions, whichencode the more complicated nature of the ﬁeld, start to be statis-tically important. The two-point statistics themselves can still beaccurately modeled on relatively small scales, either through higherorder perturbation theories (Carrasco et al. 2012; Taruya et al. 2012;Vlah et al. 2015; d’Amico et al. 2020; Ivanov et al. 2020), or using N -body simulations (Mead et al. 2015; Euclid Collaboration et al.2019; Lawrence et al. 2017; Nishimichi et al. 2017). Nonetheless,an analysis using only the two-point statistic is insuﬃcient to probeall the eﬀects of diﬀerent cosmological parameters on the evolu-tion of the overall cosmological ﬁeld. This motivates the need forconsidering other summary statistics in the analysis of clusteringon smaller scales. Since the total information available in a surveyscales with the number of independent modes (Tegmark 1997), andgiven that there are many more independent modes on small scales,it is important to develop these new statistical methods, that betterextract information from small scales, to make optimal use of datafrom ongoing and future cosmological surveys.Various approaches toward harnessing information beyond thatcontained in the two-point statistics have been explored in the liter-ature. One method is consider the higher N -point functions in theanalysis — the three-point function and its Fourier transform, thebispectrum ( e.g. Scoccimarro et al. 1998; Takada & Jain 2004; Se-fusatti et al. 2006), or the four-point function and its Fourier trans-form, the trispectrum ( e.g.

Verde & Heavens 2001). While theseadditional statistics can be highly informative about small scaleclustering, and promise to yield signiﬁcantly tighter constraints onsome cosmological parameters (Hahn et al. 2020; Coulton et al.2019), one drawback is that they are generally entail much highercomputational costs to measure either from simulations, or fromdata. The computational complexity rises with N - the order of thehighest connected correlation function considered, while keepingthe data size ﬁxed. Further, the noise properties for these higherorder estimators often make it diﬃcult to obtain good signal-to-noise ratio (SNR) over a wide range of scales. In spite of theseissues, analyses including higher order correlations, especially thebispectrum, have successfully been applied to certain cosmologicaldatasets (Gil-Marín et al. 2015; Gualdi et al. 2019; Slepian et al.2017).In this context, there also exist approaches which attempt toundo the nonlinear eﬀects of gravitational clustering on variouscosmological ﬁelds. Once this procedure is applied, the analysisproceeds with the measurement of the two-point function of the lin-earized ﬁeld. This approach has already been applied speciﬁcally tothe reconstruction of the Baryon Acoustic Oscillation (BAO) signal(Eisenstein et al. 2007; Padmanabhan et al. 2009; Padmanabhanet al. 2012). Schmittfull et al. (2015) showed that the reconstructionprocess can be interpreted as the transfer of information from thehigher connected N -point functions in the observed ﬁeld into the re-constructed two-point function. More recently, methods have beenproposed for the full reconstruction of the initial linear modes fromnonlinear observables at low redshift (Baldauf et al. 2010; Seljaket al. 2017; Horowitz et al. 2019; Modi et al. 2018).There is also a large body of literature investigating the onepoint Probability Distribution Function (PDF) of matter densityin the Universe, and the related counts-in-cell (CIC) statistics fordiscrete tracers like dark matter halos or galaxies (Coles & Jones1991; Colombi 1994; Kofman et al. 1994; Gaztañaga et al. 2000;Lam & Sheth 2008; Bernardeau et al. 2014; Uhlemann et al. 2016;Klypin et al. 2018). While the PDF or CIC statistics are close toGaussian when evaluated on large scales, and therefore contain the same information as the two-point function, on small scales the PDFcaptures information about all higher moments of the distribution,and therefore can be used to place stronger constraints on variouscosmological parameters (Uhlemann et al. 2019). Variations of thistype of analysis have also been applied already to diﬀerent cosmo-logical datasets (Petri et al. 2015; Gruen et al. 2015, 2018; Friedrichet al. 2018; Repp & Szapudi 2020). While this approach is extremelyattractive in terms of its sensitivity to all higher order correlations,calculating the PDF from data involves multiple steps of smoothingand averaging, and the calculations have to be done separately forevery radius bin used in the analysis.Other statistical measures that have been employed to extractnon-Gaussian information from cosmological ﬁelds, especially inthe context of weak lensing, include peak counts (Peel et al. 2017;Fluri et al. 2018) and Minkowski functionals (Matsubara 2010;Munshi et al. 2012; Petri et al. 2013). Yet another set of studies haveattempted to use special properties of nonlinear regions, such ashalos and voids, in speciﬁc cosmologies, to enhance the constraintson some of the cosmological parameters. These include searches forscale dependent bias on large scales in the context of massive neu-trinos (Villaescusa-Navarro et al. 2014; LoVerde 2016; Banerjee& Dalal 2016; Chiang et al. 2019; Banerjee et al. 2019) and pri-mordial non-Gaussianity (Dalal et al. 2008; Desjacques et al. 2009;Castorina et al. 2018), and the use of marks, or density dependentweights, for correlation functions in the context of modiﬁed gravity(White 2016; Hernández-Aguayo et al. 2018; Armijo et al. 2018)and massive neutrinos (Massara et al. 2020). While these methodsattempt to use additional information from the nonlinear densityﬁeld, the statistic considered is usually the two-point function.Finally, there are studies exploring the clustering of halos orgalaxies in terms of the Void Probability Function (VPF), whichwas also shown to be the generating function for the full distributionof the clustering, and sensitive to all connected N -point functionsWhite (1979); Fry (1986); Fry & Colombi (2013). A related ap-proach is explored in Paranjape & Alam (2020). VPF and relatedmeasurements have already been applied to data (Sharp 1981; Wayet al. 2011; Walsh & Tinker 2019). Over the years, the concept of thegenerating function, beyond just probabilities of ﬁnding completelyempty volumes, as captured in the VPF, was further developed inthe context of cosmological clustering (Balian & Schaeﬀer 1989;Szapudi & Szalay 1993; Bernardeau 1994), and provides an overar-ching theoretical framework to connect the parallel approaches ofusing higher N -point functions, and the use of one point PDF anal-ysis, toward the extraction of non-Gaussian information on smallscales.In this paper, we introduce the k -nearest neighbor CumulativeDistribution Functions ( k NN-CDF), i.e. , the empirical cumulativedistribution function of distances from a set of volume-ﬁlling Pois-son distributed random points to the k –nearest data points. The k NN-CDF are a set of new summary statistics that can be appliedto the clustering analysis of cosmological datasets - both discretetracers, and continuous ﬁelds, where the latter can be sampled by aset of tracers. We set out the connections between these new statis-tics and the generating function formalism. Through the latter, wedescribe the relationship between the k NN-CDF statistics and thestatistics of higher N -point functions, as well as the density PDFover a range of scales. Importantly, we demonstrate how k NN-CDFstatistics can be computed eﬃciently on a given dataset - and howa single measurement step is suﬃcient to gain information aboutall N -point correlation functions present in the data over a rela-tively broad range of scales. We apply these statistics in the contextof familiar distributions, and ﬁnally, quantify the improvements in MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions cosmological parameter constraints, compared to two-point func-tion analyses over the same range of scales.The layout of the paper is as follows: in Sec. 2, we introducethe mathematical framework relevant for the k NN-CDF statistics,and outline how they can be computed for a given dataset. In Sec.3, we apply the k NN-CDF statistics to various underlying ﬁelds toillustrate its novel features. In Sec. 4, the constraints on cosmologicalparameters using the k NN-CDF statistics is explored. Finally, in Sec.5, we summarize our main ﬁndings, and discuss possible directionsin which this study can be extended.

In this section, we introduce the concept of the Nearest NeighborCumulative Distribution Function, and explore its connections toother statistical measures for clustering used in the literature. Wealso outline the method by which NN-CDF can be computed quicklyfor a given dataset.

We consider a set of tracers of a underlying continuous ﬁeld, withmean number density ¯ n and connected N -point correlation functionsdenoted by ξ ( N ) . ξ ( ) = ξ ( ) = P ( z | V ) , of thedistribution of the counts of data points enclosed in volume V canbe written as (White 1979; Balian & Schaeﬀer 1989; Szapudi &Szalay 1993): P ( z | V ) = ∞ (cid:213) k = P k | V z k = exp (cid:34) ∞ (cid:213) k = ¯ n k ( z − ) k k ! × ∫ V ... ∫ V d r ... d r k ξ ( k ) ( r , ..., r k ) (cid:35) . (1)The derivation of Eq. 1 starting from the statistics of an underlyingcontinuous ﬁeld is sketched out in more detail in Appendix A. Theshape of the volume V can, in general, be arbitrary. In this paper,we will only consider the volumes to be associated with spheresof radius r . As shown in Appendix A, this is a natural choicefor describing statistics of a top-hat smoothed ﬁeld. In terms ofnotation, we will switch between r and V throughout the paper,under the implicit assumption V = / π r .The probability of ﬁnding a count of k ∈ { , , , ... } datapoints in a volume V can be computed from the generating functionby computing various derivatives, P k | V = k ! (cid:34)(cid:18) ddz (cid:19) k P ( z | V ) (cid:35) z = . (2)The quantity P ( z | V )| z = or, its mathematical equivalent, P | V isreferred to as the Void Probability Function (VPF) (White 1979),and represents the probability of ﬁnding no data points within avolume V . Note that the expression for the VPF still contains all N -point correlation functions, and White (1979) showed, using aslightly diﬀerent formalism from that used here, the VPF itself canbe considered as the generating function of the full distribution.In the literature, P k | V corresponds directly to the CIC statistics for tracer particles. For scales much larger than the mean inter-particle separation of the tracers, where the mean number of datapoints per volume, (cid:104) k V (cid:105) (cid:29) P k | V corresponds to the densityPDF of the underlying continuous ﬁeld ( e.g. Klypin et al. 2018;Uhlemann et al. 2019). Using a very similar formalism, it is possibleto write down the generating function for various cumulants of thedistribution, which are directly related to the N -point connectedcorrelation functions. This is demonstrated in Appendix B.Next, we consider the statistics of volumes which have morethan k data points, P > k | V , where, once again, k ∈ { , , , ... } . Wewill ﬁrst write down the generating function for this statistic, C ( z | V ) in terms of the P ( z | V ) . We follow the same deﬁnition as Eq. 1 towrite C ( z | V ) = ∞ (cid:213) k = P > k | V z k = ∞ (cid:213) k = ∞ (cid:213) m = k + P m | V z k = ( P | V + P | V + ... ) + ( P | V + P | V + ... ) z + ( P | V + P | V + ... ) z + ... = − P | V + ( P | V + P | V + P | V + ... ) + − ( P | V + P | V ) z + ( P | V + P | V + P | V + ... ) z − ( P | V + P | V + P | V ) z + ( P | V + P | V + ... ) z + ... = − P ( z | V )( + z + z + ... ) + ( + z + z + ... ) = − P ( z | V ) − z , (3)where we have used the fact (cid:205) ∞ k = P k | V =

1, and 1 /( − z ) = ( + z + z + ... ) . Therefore, the generating function for the distributionof P > k | V is fully speciﬁed by the generating function of P k | V . Notethat, by deﬁnition P k | V = P > k − | V − P > k | V ∀ k ≥ . (4)From Eqs. 1, 3, and 4, it becomes clear there are three equiv-alent approaches to characterizing the clustering of a set of datapoints through the generating function: 1) by measuring all the con-nected N -point correlation functions ξ ( N ) , i.e. , the second line inEq. 1, 2) by measuring the distribution of the counts in cell, P k | V , i.e. the ﬁrst line in Eq. 1, and 3) by measuring the cumulative counts, P > k | V , and connecting them to P k | V using Eq. 4. To fully charac-terize the generating function, each of these statistical measureshas to be measured over the full range of scales of interest. In thenext subsection, we present the method for eﬃciently calculating P > k | V concurrently over a large range in V for a set of data points.This is done by exploiting the connection between P > k | V and the k -Nearest Neighbor distributions of the data points from a set ofvolume-ﬁlling random points. P > k | V in data using k NN - CDFFor a set of N d data points distributed over a total volume V tot ,we start by generating a volume-ﬁlling sample of random points.Typically, the total number of randoms, N R is chosen to be largerthan the number of data points, N d , and using more randoms allowsfor better characterization of the tails of the distributions discussedbelow. Once the set of randoms have been generated, for each ran-dom, we compute the distance to, say, the nearest data point. Thiscan be done extremely eﬃciently by constructing a k -d tree on thedata (e.g. Wald & Havran 2006) in N log N operations. Using thistree structure allows to search for the k nearest neighbors in log N MNRAS000

Banerjee & Abel r ( h − Mpc) − − P r o b a b ili t y CDF1 − CDFPeaked CDF

Figure 1.

Peaked CDF (described in the text) of the nearest neighbor dis-tribution (as deﬁned in the text) for 10 random points distributed over a ( h − Gpc ) volume (solid curve). The dot-dashed curve is the analytic pre-diction for the distribution. The empirical CDF measured from the data isplotted using the dashed curve, while the Void Probability Function (VPF)is plotted using the dotted line. time for each point. Most scientiﬁc software has eﬃcient built–infunctions or libraries such as e.g. scipy’s cKDTree implementa-tion, or Julia’s NearestNeighbor.jl library. These typically willreturn an ordered list of the distances to the nearest k neighbors.Sorting the computed distances for each k , immediately gives theEmpirical Cumulative Distribution Function (see e.g. Vaart (1998))of the distribution of distances to the k -nearest data point fromthe set of volume-ﬁlling randoms. In the limit of large N R , thisconverges to the true underlying Cumulative Distribution Function(CDF) of the distance to the k -nearest data point from any arbitrarypoint in the box. Notice that since the randoms are volume-ﬁlling,all regions are sampled equally, irrespective of whether the regionis overdense or underdense in terms of the data points. This is espe-cially relevant to applications in cosmology, where at late times, thevolume is dominated by underdense regions, while the data pointsare usually concentrated in overdense regions. Measures such asthe two-point correlation functions are typically dominated by theoverdense regions, and have little information about the large un-derdense regions (e.g Oort 1983).To understand the connection between the CDF of the distribu-tion of distances to the nearest data point from a random point in thevolume, and the probability of P > k | V , consider all possible spheresof volume V = / π r (Kerscher et al. 1999). These spheres canbe centered on any point in the total volume under consideration.The fraction of spheres with > < r . The nearest-neighbor CDF at some radius r is the precisemeasure of the fraction of points for which the nearest data point is https://github.com/KristoﬀerC/NearestNeighbors.jl at a distance < r . Therefore,CDF ( r ) = P > | V (cid:12)(cid:12) V = / π r . (5)As mentioned previously, we will switch notations between radius r and the volume V throughout this paper, and we note that noneof our results depend on the distinction in notation. Given the con-nection between nearest neighbor distances and ﬁnding a certainnumber of data points in a given volume, 1 − CDF ( r ) thenrepresents the probability of ﬁnding a completely empty volume V = / π r . That is, a randomly placed sphere of Volume V isempty with a probability of 1 − CDF ( V ) , which is known asthe Void Probability Function (VPF, White (1979)). Interestingly,this latter interpretation is what is customarily used (e.g. Corrfunccode Sinha & Garrison (2019); Sinha & Garrison (2020)) to mea-sure the VPF using large numbers of randomly placed spheres. Thisapproach is much slower, however, as compared to the kNN–CDF.It also only provides measurements at typically a small number ofchosen volumes. So even if one were only interested in computingthe VPF, using the nearest neighbor approach discussed here wouldbe advisable since the measurements are desired at multiple vol-umes. Note that the empirical CDF directly measured from the datacontains as many points as random points we chose to cover thefull volume. Given the monotonic nature of the CDF, using linearor higher order interpolation between measured points allows oneto carry out operations on these CDFs as if they were continuousfunctions.We illustrate the shapes of these functions for a Poisson distri-bution of points in Fig. 1. While the Empirical CDF is the quantitywhich is directly computed from the simulations, in our plots wewill usually display the Peaked CDF (PCDF) which is deﬁned asPCDF ( r ) = (cid:40) CDF ( r ) CDF ( r ) ≤ . − CDF ( r ) CDF ( r ) > . . (6)The use of the PCDF allows for better visual representation of bothtails of the CDF. This point is illustrated in Fig. 1 for a set of10 points distributed according to a Poisson distribution over a ( h − Gpc ) volume. The dashed line represents the Empirical Cu-mulative Distribution Function measured from the set of particles.The solid line represents the Peaked CDF computed from the samedata. The right hand tail of the distribution is diﬃcult to diﬀerentiateusing the Empirical CDF, since, by its very nature, it asymptotes to1 smoothly. The Peaked CDF, on the other hand, illustrates clearlythe behavior in the tails, especially when comparing to the analyticexpectation, plotted using the dot-dashed line. We also plot the VPF( = − CDF ) using the dotted line for reference.Importantly, apart from the distance to the 1st nearest neighbordata point, the same tree is used to ﬁnd the distances to the k -thnearest data point for each random point in the volume. Once again,these distances can be sorted to produce the Empirical CDF ofthe k -th neighbor distances, and in the limit of large N R , the trueunderlying CDF of the k -th neighbor distances. We can generalizethe arguments presented above to connect the probability of ﬁndingspheres of volume V = / π r enclosing > k − k -th nearest data point within radius r :CDF k NN ( r ) = P > k − | V (cid:12)(cid:12) V = / π r . (7)Eq. 4 can be recast as P k | V = CDF k NN ( r ) − CDF ( k + ) NN ( r ) ∀ k ≥ , (8) https://github.com/manodeep/Corrfunc MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions r ( h − Mpc) − − P e a k e d C D F r ( h − Mpc) − − P ( N ) N = 0 N = 1 N = 2 N = 3 Figure 2.

Left : Peaked CDF as a function of scale for 1st (1NN), 2nd (2NN), 3rd (3NN), and 4th (4NN) nearest neighbor distributions (solid lines) for aset of 10 Poisson distributed data points distributed over a ( h − Gpc ) volume from a set of volume ﬁlling randoms (see text for details). The dashed linesrepresent the analytic expectations for the distribution. Right : Probability of ﬁnding N points in a sphere with radius r given 10 Poisson distributed data pointsover a ( h − Gpc ) volume. Solid lined represent the probabilities computed using the CDFs from the left panel, while the dashed lines represent the analyticexpectation. P | V = − CDF ( r ) . (9)Therefore, computing the k nearest neighbor distributions using themethod outlined here is completely equivalent to measuring P > k | V .Additionally, this procedure allows us to compute the probabilitiesover a range of scales, and for multiple values of k , in a single oper-ation. To provide a sense of the time spent on a typical calculation,computing the CDFs for up to 8 nearest neighbors for 10 randomsand 10 data points distributed in a ( h − Gpc ) volume takes ∼ e.g. when using points placed ona ﬁnely-spaced grid, that is, with grid separations much smallerthan the mean inter-particle separation of the data. Once the setof volume ﬁlling points are generated, whether using a randomprocedure, or on the regular grid, the calculation of the distancesto the nearest data point, and the computation of the EmpiricalCumulative Distribution Function proceeds exactly the same way. In this section, we apply the NN-CDF formalism to tracers followingdiﬀerent distributions, and point out various relevant features. Westart with the simplest example of a Poisson sampling of a uniformﬁeld in Sec. 3.1. We then explore the Gaussian distribution in thisframework in Sec. 3.2. Finally, in Sec. 3.3, we apply it to datafrom cosmological simulations, both simulation particles, and darkmatter halos. k NN - CDF for Poisson sampling of uniform ﬁeld

For a sample of Poisson tracers of a uniform ﬁeld, ξ ( ) =

1, and allhigher order correlation functions are 0. In this case, Eq. 1 simpliﬁesto P ( z | V ) = exp (cid:20) ¯ n ( z − ) V (cid:21) . (10)As can be anticipated for a pure Poisson process on a uniform ﬁeld,the expression for the distribution of counts, P k | V , in Eq. 2 becomes P k | V = k ! (cid:34)(cid:18) ddz (cid:19) k P ( z | V ) (cid:35) z = = ( ¯ nV ) k k ! exp (− ¯ nV ) . (11)The distribution of P > k | V can similarly be worked out by consider-ing the derivatives of C ( z | V ) from Eq. 3: P > k | V = k ! (cid:34)(cid:18) ddz (cid:19) k C ( z | V ) (cid:35) z = = k ! (cid:34)(cid:18) ddz (cid:19) k (cid:32) − exp (cid:2) ¯ n ( z − ) V (cid:3) − z (cid:33)(cid:35) z = = k ! (cid:34) k (cid:213) m = k ! m ! ( k − m ) ! (cid:18) ddz (cid:19) m (cid:18) − exp (cid:2) ¯ n ( z − ) V (cid:3)(cid:19)(cid:18) ddz (cid:19) k − m − z (cid:35) z = = − k (cid:213) m = ( ¯ nV ) m m ! exp (− ¯ nV ) , (12)where we use the fact that ( d / dz ) m ( /( − z )) = m ! /( − z ) m + . Theform of P > k | V derived in Eq. 12 can also be anticipated by simply MNRAS000

Banerjee & Abel noting that P > k | V = − P < = k | V , and using Eq. 11. The form ofEq. 12 is known in the literature as the Cumulative DistributionFunction of the Erlang distribution (M Evans & Peacock 2000).Since we consider the volumes V to be associated with spheres ofradius r , all the equations above can also be trivially written in termsof r . Eq. 12 can also be derived in a diﬀerent way in the language ofdistances to nearest neighbors — by considering the distributionsof successive nearest neighbors. Let us consider the CDF of thenearest neighbor distribution:CDF ( V ) = P > | V = − exp (− ¯ nV ) . (13)Therefore, at ﬁxed ¯ n , the PDF of the distribution of distances to, orequivalently the distribution of volumes enclosed within, the nearestdata point from a random point is given byPDF ( V ) = d (cid:0) CDF ( V ) (cid:1) dV = ¯ n exp (− ¯ nV ) . (14)Since there are no higher order correlations in the underlying contin-uous uniform ﬁeld, the PDF of the distribution of volumes enclosedwithin the second nearest neighbor is a convolution of PDF ( V ) with itself:PDF ( V ) = ∫ V PDF ( V (cid:48) ) PDF ( V − V (cid:48) ) dV (cid:48) = ¯ n V exp (− ¯ nV ) , (15)andCDF ( V ) = ∫ V PDF ( V (cid:48) ) dV (cid:48) = − exp (− ¯ nV ) − ( ¯ nV ) exp (− ¯ nV ) . (16)Comparing with the last line in Eq. 12, the expression above is in-deed equivalent to P > | V at ﬁxed V . A similar result can be demon-strated for higher k NN CDFs of the Poisson distribution. For tracersof a Poisson distribution, therefore, computing the distribution ofdistances to the nearest neighbor data point from any arbitrary pointin the volume under consideration is a complete description of theoverall distribution of points - distributions of distances to all otherneighbors can be generated from the former. Using the connectionbetween nearest neighbor distributions and the probabilities of ﬁnd-ing more than k data points in a volume V , the above result impliesthat, for the Poisson distribution, determining P > | V or the VPFalone allows us to extract maximal information about distribution.This is consistent with the fact that the only variable for a Poissondistribution is the rate ( = ¯ nV here), and that the full distribution canbe generated once the rate is known, from say, the nearest neighbordistribution. Expressed in another way, there is no new informationin any of the higher k -th neighbor distributions, once the nearestneighbor distribution is known.We now compare the analytic predictions to the actual mea-surements from a set of 10 points distributed randomly over a ( h − Gpc ) volume. The results are shown in Fig. 2, where the ﬁrstfour nearest neighbor distributions are plotted in the left panel. Thefour lowest counts-in-cell distributions, computed from the nearestneighbor distributions on the left, are plotted in the right hand panel.In both panels, the solid lines represent the measurements, while thedashed line represent the analytic expectations. The measurementsin the left panel agree extremely well with the analytic expectationsfor scales below, and comparable to, the mean inter-particle sepa-ration ( ∼ h − Mpc). The right panel, where the solid curves havebeen computed using Eq. 8, clearly demonstrates that the nearestneighbor distributions and the counts-in-cells at a given radius are truly equivalent descriptions of the underlying data, and that onecan easily be computed once the other is known.On scales much larger than the mean inter-particle separation,the results begin to diverge for two reasons. The ﬁrst is due to thelimitations on the sampling of the total volume arising from the usea ﬁnite number of random particles. In the method outlined in Sec.2.2, the number of randoms determines how well the Empirical CDFis sampled, and ﬁnite sampling can lead to discrepancies, especiallyin tails, where the true CDF is being approximated from a smallnumber of measurements. For this plot, we use 8 × randomparticles, and we have checked that using more randoms extendsthe agreement out to larger scales. The second eﬀect is that ofsample variance in the data itself. This arises because we evaluate theEmpirical CDF on one speciﬁc realization of 10 points drawn froma Poisson distribution, and once again leads to departures from theanalytic value in the tails. This eﬀect can be reduced by appropriatelyaveraging over many realizations of the data keeping the underlyingdistribution ﬁxed. We note that amongst the distributions we haveplotted, the nearest neighbor distribution most closely traces theanalytic expectation in the tails, at ﬁxed number of randoms, andis therefore expected to be the most robust measurement when wegeneralize to other distributions. In general, we will usually restrictour measurements and analysis to scales which are a few times themean inter-particle separation of the tracers, and ensure that themeasurements are not aﬀected by the lack of sampling in the tail. k NN - CDF for Gaussian Fields

For completely Gaussian continuous ﬁelds, the distribution is com-pletely determined by the variance as a function of scale. Thereforethe power spectrum, P ( k ) , or the two-point correlation function, ξ ( r ) , are complete descriptions of the underlying ﬁeld, as well asfor any set of tracers of the underlying ﬁeld. As discussed in Sec.1, these summary statistics have been employed extensively in thestudy of the CMB, as well as of Large Scale Structure on scaleswhere non-linearities in the density ﬁeld play a minor role. There-fore, a description of the clustering of a set of tracers of a Gaussianﬁeld in the language of NN-CDFs serves as a useful exercise to buildup intuition about the connections between the two formalisms be-fore examining a fully non-linear cosmological ﬁeld.For a Gaussian ﬁeld, the generating function for P k | V can bewritten as follows: P ( z | V ) = exp (cid:20) ¯ n ( z − ) V + ¯ n ( z − ) ∫ V ∫ V d r d r ξ ( ) ( r , r ) (cid:21) = exp (cid:20) ¯ n ( z − ) V +

12 ¯ n ( z − ) V σ V (cid:21) , (17)where σ V is the variance as a function of scale, deﬁned in Eq. B7.Notice that the generating function for P k > V , given by C ( z | V ) = ( − P ( z | V ))/( − z ) , therefore, contains only two possible unknowns- the mean number density ¯ n , and the variance as a function of scale σ V . The individual P > k | V distributions can be obtained by takingderivatives of C ( z | V ) . While there is no closed form expression of P > k | V for a general value of k , the individual terms are easy to MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions r (Mpc /h )10 − − P e a k e d C D F r (Mpc /h )0 . . . . . σ ( r ) Figure 3.

Left : Peaked CDF as a function of radius for 1st (1NN), 2nd (2NN), 3rd (3NN), and 4th (4NN) nearest neighbor distributions (solid lines) for a setof 10 points tracing an underlying (almost) Gaussian ﬁeld distributed over a ( h − Gpc ) volume. The dash-dotted lines for the 1NN and 2NN represent theanalytic expectations for distribution given the linear theory variance. The dotted lines represent the predictions for the 3NN and 4NN distributions from theactual measurements of the 1NN and 2NN distributions. Right : The solid line represents the variance of the Gaussian ﬁeld as measured from the 1NN and 2NNCDFs. The dash-dotted line represents the theoretically computed variance from linear perturbation theory. compute, especially for low values of k . For example, P > | V = − exp (cid:20) − ¯ nV +

12 ¯ n V σ V (cid:21) , (18) P > | V = − P > | V − (cid:18) ¯ nV − ¯ n V σ V (cid:19) exp (cid:20) − ¯ nV +

12 ¯ n V σ V (cid:21) , (19)and so on. Note that just by measuring the ﬁrst two cumulativedistributions, P > | V and P > | V , one can constrain ¯ n and σ V . Con-cretely,¯ nV = − (cid:32) log (cid:0) − P > | V (cid:1) + P > | V − P > | V − P > | V (cid:33) , (20)and σ V = − (cid:32) log (cid:0) − P > | V (cid:1) + P > | V − P > | V − P > | V (cid:33) (cid:30) ( ¯ nV ) . (21)Using the relationship between P > k | V and CDF ( k + ) NN ( r ) from Eq.7, Eqs. 20 and 21 can also be expressed in terms of the 1st and 2ndnearest neighbor distributions. Once the relevant parameters havebeen uniquely deﬁned in Eqs. 20 and 21, all higher k distributionscan easily be derived in terms of the measured mean density andvariance.To compare measurements from a set of tracers of a Gaus-sian ﬁeld with the theoretical predictions presented above, we con-sider a very coarse cosmological simulation of 128 particles in a ( h − Gpc ) box at the Planck best-ﬁt cosmology run up to z = . × particles from the simulation particles and compute the four nearest neighbor distributions. We use the Colossus code togenerate the predictions for σ V at this cosmology from linear per-turbation theory. The results of the comparison are plotted in the leftpanel of Fig. 3. The solid lines represent the measurements of thenearest neighbor distributions from the data, while the dot-dashedline represent the predictions for the ﬁrst and second nearest neigh-bor distributions using the theoretical σ V . Once again, we ﬁnd goodagreement between the measurement and the predictions out to a fewtimes the mean inter-particle separation. The dotted lines in the leftpanel represent the predictions for the 3rd and 4th nearest neighbordistributions, using ¯ n and σ V measured from the data using Eqs. 20and 21. As anticipated from the arguments presented above, mea-surements of just the ﬁrst and second nearest neighbor distributionsallow us predict all other nearest neighbor distributions to a high de-gree of accuracy. In the right panel of Fig. 3, we plot the value of σ V that we recover from the measurements of the two nearest neighborCDFs (solid line), and show that it agrees with the linear theory pre-diction for the continuous ﬁeld from Colossus (dot-dashed line),once again on scales comparable to the mean inter-particle separa-tion. It should be noted that the measurements allow us to correctlyinfer the variance of the underlying ﬁeld even on scales smaller thanthe mean inter-particle separation ( ∼ h − Mpc) — naively, thesescales are expected to be dominated by the Poisson-like ¯ nV term inEq. 17.We conclude that, for tracers of a Gaussian density ﬁeld, nearestneighbor distributions beyond the 2 nd nearest neighbor distributiondo not add any new statistical information about the underlying ﬁeld.All higher neighbor distributions can be built up from convolutionsof the two nearest neighbor distributions. In fact, these properties , 1–21 (2020) Banerjee & Abel of the NN-CDF for a formally Gaussian ﬁeld can be turned into atest for the Gaussianity of a given ﬁeld, or a set of tracers. If themean density and the variance are computed from the data, thenthe ﬁeld is completely Gaussian if and only if the nearest neighbordistributions are consistent with Eq. 18. Any departures from theseexpressions can be interpreted as evidence for non-Gaussianity inthe ﬁeld.In general, therefore, for tracers of a ﬁeld that is completelycharacterized by the ﬁrst m connected N -point functions (or cumu-lants), measurements of only the lowest m nearest neighbor distri-butions are suﬃcient to capture the full statistical information ofthe underlying ﬁeld. There is no independent information in thehigher NN distributions. We note that there are certain distributionsrelevant to cosmological applications, most notably the log-normalPDF, which cannot be uniquely described by its cumulants. Further,at any given scale, we do not know a priori how many cumulantsare needed to describe the distribution of tracers. However, evenin these cases, measuring the ﬁrst few NN distributions, capturesa large fraction of total statistical information in the ﬁeld. We willdiscuss these points further in Sections 3.3 and 4. For a discussionon the information in the VPF for a lognormal ﬁeld, see Coles &Jones (1991).Another aspect to note here is that our choice of downsamplingthe simulation particles to 1 . × for our measurements wasarbitrary, and is not related to the number of particles with whichwe run the simulation. The choice was guided only the range ofscales over which we wish to obtain robust measurements of theCDF. We can also recover the variance of the underlying continuousﬁeld for other choices of the mean number density, or equivalently,inter-particle separations. As shown previously, our measurementsare most robust on scales comparable to the mean inter-particleseparation, so a diﬀerent choice of the mean number density allowsus to accurately measure the variance on a diﬀerent set of scales.However, the range of scales on which the linear theory variance canbe reliably estimated from these measurements is limited on bothlarge and small scales due to numerical and practical considerations.These limitations set the choice of scales displayed in the right panelof Fig. 3.First consider the case when the inter-particle separation ismuch larger than considered here. In principle, this choice shouldallow us to measure the variance at large scales. However, the shapeof the cosmological power spectrum is such that the variance de-creases on large scales. If the variance is small on scales which canbe well measured for a speciﬁc choice of the mean density of trac-ers, the distributions are dominated by ¯ nV term in the exponent onthe RHS of Eq. 20, instead of the ¯ n V σ V term. For small enough σ V , this will be true even when considering scales above the meaninter-particle separation ( ¯ nV > i.e. with many more random pointsthan typically used here, it is, in principle, possible to recover thefull information about the variance, even at large scales.At the other end, the problem arises from the fact that in lineartheory, σ V increases as we consider smaller scales, until σ V ∼

10 15 20 30 r ( h − Mpc)10 − − P e a k e d C D F Poisson 1NNParticles 1NNHalos 1NN

Figure 4.

Comparison of the Peaked CDF for nearest neighbor (1NN) distri-butions of a) a Poisson distribution, b) particles from the z = N -body simulation, and c) the most massive halos from the same simula-tion. In each case, 10 points were selected over a ( h − Gpc ) volume. Forthe particles, these were randomly selected from all the simulation particles,while for the halos, a cut was made on the 10 most massive halos in thebox. more detail in Sec. 3.3. However, in this case, a measurement ofthe two nearest neighbor distributions will contain information notjust about the mean and the variance, but all higher order momentsthat are present. This implies that the full clustering statistics can nolonger be uniquely determined from just these measurements, andhigher NN distributions need to be measured for a full statisticaldescription of the ﬁeld. k NN - CDF for Large Scale Structure

We now move to analyzing realistic cosmological density ﬁeldsusing the nearest-neighbor formalism that we have set up. At lowredshifts, the density ﬁeld for matter is highly nonlinear, and thedistribution of densities on small scales, especially, cannot be ap-proximated by a Gaussian. The clustering of massive virialized darkmatter halos, which host the visible galaxies, is usually even morenon-Gaussian. In general, the distribution of Dark Matter at lowredshifts form a cosmic web - with large empty regions (voids),and high density ﬁlaments and knots. In Fig. 4, we plot the nearestneighbor distributions for three diﬀerent sets of points with the sametotal number of points, 10 , spread over a ( h − Gpc ) volume. Thecorresponding mean inter-particle separation is ∼ h − Mpc. Therange of scales represents those where the distributions are bestmeasured for the choice of particle number and the number of ran-doms that are used to characterize the full volume. The ﬁrst set ofpoints are distributed randomly over the full volume, i.e. followinga Poisson distribution. The second set of points are downsampledfrom the simulation particles in a cosmological simulation with512 particles at z =

0. The third set of points are the halo centersof the 10 most massive halos in the simulation at z =

0. Even

MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions

10 15 20 30 40 r ( h − Mpc) − − P e a k e d C D F Figure 5.

The Peaked CDFs for the ﬁrst, second, third, and fourth nearest-neighbor distributions for 10 simulation particles in a ( h − Gpc ) volume.The solid lines represent these distributions at z =

0, while the dotted linesrepresent the distributions computed at z = . though the mean number density is the same, it is clear that thenearest neighbor distributions look quite diﬀerent. The NN-CDF ofthe simulation particles, and especially the halo positions, extendsto larger scales. This happens due to the presence of large voids inthese datasets - this implies that for a large faction of the volumeﬁlling set of randoms, the nearest data point is much further awaythan it would be for a Poisson distribution. We note that while thediﬀerences are the most pronounced on larger scales when plottedin terms of the Peaked CDF, the distributions are diﬀerent even onscales smaller than the mean inter-particle separation.It is also useful to plot the change in the nearest neighbordistributions with redshift. Fig. 5 shows the ﬁrst four NN distribu-tions for 10 simulation particles at z=0 (solid lines) and z = . z = z = . z = z = .

5, as a result of the collapse ofthe overdense regions. b − σ degeneracy In the analysis of the clustering of halos using the two-point cor-relation function, there is a known degeneracy between the biasof the halos being considered and the amplitude of clustering ofthe underlying matter ﬁeld, sometimes represented by the width ofthe density PDF at 8Mpc, σ . This is related to the fact that halosform at the peaks of the initial Gaussian random ﬁeld (Bardeenet al. 1986). In other words, two halo populations can have the sametwo-point clustering signal, even when they are produced by very diﬀerent underlying matter ﬁelds, just by appropriately choosing thebias of each sample. Since the bias is primarily dependent on thehalo mass, this is equivalent to making choices about the mass cutfor the samples. As a result, it is diﬃcult to individually constrainthe value of the bias b , and the amplitude of the clustering of theunderlying ﬁeld σ , given a two-point measurement of a sample ofhalos.This degeneracy is illustrated in the left panel of Fig. 6. Thedark solid line represents the ratio of the ξ ( r ) at z = . ξ ( r ) at z = most massive halos for a single realization at aﬁxed cosmology. The halos are identiﬁed at the diﬀerent redshiftsseparately, and so the actual objects that make the cut at diﬀerentredshifts need not be the same. The dotted lines represent the ratioof ξ ( r ) for 15 diﬀerent realizations at z = ξ ( r ) over the 15 realizations. These curvesserve as a measure of the sample variance for this measurement at z =

0. We can conclude that the two-point functions of the 10 most massive halos in the box at z = z = . e.g. in Pan & Szapudi (2005) and Sefusatti et al. (2006),by including the bispectrum or three point correlation function inthe analysis. We illustrate this in the middle and right panels ofFig. 6. In the middle panel, the dark solid line represents the ra-tio of the nearest neighbor CDF of the 10 most massive halos at z = . most massive halosat z =

0, for a single realization at a ﬁxed cosmology. The lighterdotted lines represent the ratio of the nearest neighbor CDFs of the10 most massive halos at z = second nearest neighbor distribution. It easy to distinguish be-tween the distributions at the two redshifts, either by consideringthe nearest neighbor distribution or the second nearest neighbor dis-tribution, even though the clustering of the two samples cannot bedistinguished by measurements of the two-point correlation func-tion. This result can also be interpreted as proof that at z = . z =

0, the clustering is quite non-Gaussian on the scales considered.If the halo ﬁeld was completely Gaussian, Eq. 17 would imply thattwo ﬁelds - one at z = . z = k NN statistics. Thefact that the two ﬁelds have diﬀerent k NN statistics, despite havingthe same variance, implies the non-Gaussianity of the clustering.We note that the fact that the same number density cut - 10 in a ( h − Gpc ) volume - yields the same two-point correlationmeasurement is a coincidence. The results above would hold even ifthe number density cuts were diﬀerent. However, it is also importantto note that for the NN-CDF analysis, care should be taken to use thesame number of data points. The distributions depend sensitivelyon the mean number density, and a change in the mean numberdensity can be misinterpreted as a change in the clustering signal.It is easy to ensure the two datasets have the same number of pointsby randomly downsampling the larger dataset. We will use thisstrategy of keeping the number density ﬁxed when using the NNdistributions to analyze diﬀerent cosmologies and obtain constraintsin Sec. 4. MNRAS000

10 15 20 30 r ( h − Mpc)0 . . . . . . . ξ hh / ξ hh ﬁd

10 15 20 30 r ( h − Mpc)0 . . . . . . . C D F NN / C D F ﬁd NN

10 15 20 30 r ( h − Mpc)0 . . . . . . . C D F NN / C D F ﬁd NN Figure 6.

Left : The darker line represents the ratio of correlation function of the 10 most massive halos in a ( h − Gpc ) box at redshifts z =

0, and z = . z = Center : The darker line represents the ratio of the nearest neighbor CDF of the 10 most massive halos in a ( h − Gpc ) box at redshifts z =

0, and z = .

5. The lighter shaded lines represent the ratio of the nearest neighbor CDFs at z = Right : Same measurements as the center panel, except with second nearest neighbordistances instead of the ﬁrst. Even though the correlation function of the two samples at diﬀerent redshifts are almost indistinguishable within sample varianceuncertainties, the NN CDFs are clearly separated.

In this section, we explore the degree to which constraints on var-ious cosmological parameters improve when the same simulationdatasets, using the same scale cuts, are analyzed using the near-est neighbor framework developed above, compared to the tradi-tional two-point analysis. We focus on the following range of scales,10 h − Mpc to 40 h − Mpc, i.e. , scales smaller than those where thelinear, Gaussian approximation is valid, but larger than the typicalsizes of even the largest virialized structures in the universe. Weavoid using smaller scales, where the gains from using the nearestneighbor statistics are potentially even larger, in the analysis to avoidany possible systematics related to the resolution of the simulationsused in the analysis.

The Fisher matrix formalism has been widely used to estimate theconstraints on cosmological parameters given a set of summarystatistics, from which the relevant “data vector” is constructed, andthe expected error bars on the measurement of these summary statis-tics. For a cosmological survey, the error bars depend on speciﬁ-cations such as the sky area covered by the survey, the depth, andthe number density of tracers. When the summary statistics underconsideration are two-point correlation functions, or power spectra,the error bars are relatively easy to compute once the survey spec-iﬁcations are known. Therefore, the Fisher formalism can be usedto estimate constraints on cosmological parameters around someﬁducial cosmology, even before the survey starts collecting data.For summary statistics other the two-point correlation function,it is often not possible to analytically compute the error bars, or moregenerally, the covariance matrix between various entries in the datavector. The Fisher framework has been applied in such situationsto estimate the parameter constraints from mock datasets. Thesedatasets are often generated from cosmological simulations, wherethese non-trivial summary statistics can computed directly from thesimulation outputs. This is the spirit in which we use the Fisherformalism in this work. Formally, the elements of the Fisher matrix ( F ) is deﬁned as F αβ = (cid:213) i , j ∂ D i ∂ p α (cid:20) C − (cid:21) ij ∂ D j ∂ p β , (22)where D i are the entries of the data vector, p α represent variouscosmological parameters, and C is the covariance matrix for the datavector, evaluated at some ﬁducial cosmology. The Fisher matrixcan then be inverted to determine the constraints on individualparameters, while marginalizing over the uncertainties in all otherparameters. as well as the covariances between diﬀerent parameters.In particular, σ α = (cid:113)(cid:0) F − (cid:1) αα , (23)where σ α represents the 1- σ constraint on parameter α .To construct both the derivatives of the data vector with respectto the cosmological parameters, as well as the covariance matrix,we use the Quijote simulations (Villaescusa-Navarro et al. 2019).These N -body simulations were run on ( h − Gpc ) volumes with512 CDM particles in cosmologies with no massive neutrinos,and with 512 CDM and 512 neutrino particles in cosmologieswith massive neutrinos. The mean inter-particle separation for theparticles is therefore ∼ h − Mpc, and to be conservative in our anal-ysis, we only use measurements above 10 h − Mpc in our analysis.The cosmological parameters that are included in the analysis are { Ω m , Ω b , σ , n s , h , M ν , w } . The Quijote simulations have been runin a way that the derivatives with respect to each of these parameterscan be easily computed around a ﬁducial cosmology. These simula-tions have already been used to estimate the information content ofvarious non-trivial statistics of the cosmological ﬁeld (Hahn et al.2020; Uhlemann et al. 2019).In general, this is done by runningsimulations with one parameter larger (and smaller) than the ﬁdu-cial cosmology, while all other parameters are held at their ﬁducialvalue. Special care has to be taken to compute the derivatives withrespect to the total neutrino mass M ν , and this is discussed in detailin Villaescusa-Navarro et al. (2019). https://github.com/franciscovillaescusa/Quijote-simulationsMNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions The data vector for the analysis is constructed from the mea-surement of the k NN-CDF for k = { , , , } . Each distributionhas a diﬀerent functional dependence on all the n -point functionspresent in the data. It also worth reiterating that we do not a priori know which n -point functions are relevant for the clustering. Usingmultiple values of k , spread over a relatively wide range, as withour particular choice, ensures that the data vector has fewer degen-eracies. We have checked that using slightly diﬀerent combinationsof k do not have a major eﬀect on the cosmological constraints. Wediscuss this further below. We use 10 data points (simulation parti-cles or halos), and 10 volume-ﬁlling random points to generate theEmpirical CDF as outlined in Sec. 2.2. Each CDF is interpolated todetermine its value at 16 logarithmically spaced values of r . Sincethe analysis is focused on scales from 10 h − Mpc to 40 h − Mpc, thevalues of r for all the CDFs lie within this range. However, to ensurethat we do not go too deep into the tails where the ﬁnite numberof random points starts to aﬀect the ability to accurately measurethe distributions, we impose stricter scale cuts for each NN-CDF.For each k , we determine the range of scales for which the ErlangCDF distribution (see Eq. 12) for that k , and the same mean numberdensity, lies between 0 .

005 and 0 . r for each k overthe range of scales that are allowed after taking into account boththe cuts of 10 h − Mpc to 40 h − Mpc, and the cuts implied by theErlang distribution tails. We then append the 16 measurements foreach k into a single data vector with 64 entries. In most of analysis,we combine measurements from z = z = .

5, by once againcombining the 64 entry data vector computed for each redshift intoa single data vector with 128 entries.Once the data vector is deﬁned and computed on every simu-lation, the derivative term for each parameter in Eq. 22 is computedby averaging over 100 realizations at the relevant cosmologies. Thishelps reduce the noise from both sample variance, and the fact thatthese derivatives are computed numerically. For Fisher matrix anal-ysis, care needs to be taken that the derivatives are smooth, sincenumerical noise can lead to spurious features in derivatives whichare then interpreted as artiﬁcially tight constraints. We inspect thederivatives to ensure that no such pathological features exist overthe range of scales we use in the analysis. Some of the derivativesare discussed further in Sec. 4.2.The covariance matrix is computed from 1000 realizations atthe ﬁducial cosmology. The entries of the raw covariance matrix aregiven by C (cid:48) ij = (cid:42)(cid:18) D i − (cid:104) D i (cid:105) (cid:19) (cid:18) D i − (cid:104) D i (cid:105) (cid:19)(cid:43) , (24)where the labels i , j represent various rows in the data vector and (cid:104) ... (cid:105) represents an average over realizations. Note that we set the oﬀ-diagonal terms of data vector entries corresponding to two diﬀerentredshifts to 0. This is done to avoid spurious covariances arisingfrom the use of the same realizations of the cosmology at z = z = .

5. To compute the correct inverse covariance matrix thatgoes into the Fisher analysis, we use the Hartlap correction factor(Hartlap et al. 2007): C − = n − p − n − (cid:0) C (cid:48) (cid:1) − , (25)where n is total number of entries in the data vector, and p is thenumber of simulation realizations used in the evaluation of C (cid:48) . No-

10 15 20 30 r ( h − Mpc) − . − . − . . . ∂ D / ∂ σ Figure 7.

The derivative of the data vector with respect to the cosmologicalparameter σ . The diﬀerent colored curves represent the portions of the datavectors coming from k = { , , , } nearest neighbor CDF distributions.The solid lines represent the derivative at z =

0, while the dotted linesrepresent the derivative at z = . tice that this constrains the allowed length of the data vector, giventhe number of simulations available. For accurate error forecasts,the Hartlap factor should be as close to unity as possible. Sinceour aim is to only demonstrate the gain in constraining power whenusing k NN analysis over ξ ( r ) analysis for the same analysis choices,and not the actual values of the constraints, we use the native simu-lation volume of ( h − Gpc ) throughout our analysis. Current andfuture cosmological surveys typically have much larger volumes,and consequently can lead to much tighter absolute constraints thanthe ones presented here, even after accounting for possible system-atics arising in the data. While inverting the covariance matrix, wecheck that the condition number is within acceptable limits, for eachof the analyses presented below. We also explicitly check that thedistribution of deviations of the data vector around the mean areroughly Gaussian for diﬀerent entries in the data vector, implyingthat the Fisher error estimates should be roughly valid.We use exactly the same framework to compute the constraintson cosmological parameters from the two-point correlation functionover the range 10 h − Mpc to 40 h − Mpc. At each redshift, we com-pute ξ ( r ) in 30 logarithmically spaced bins. We then combine thedata vectors at z = z = . k NN-CDF results.

In this section, we present the constraints on the cosmological pa-rameters when considering the 3-dimensional Dark Matter density

MNRAS000

MNRAS000 , 1–21 (2020) Banerjee & Abel .

82 0 .

84 0 . σ . . . . . . . P ( σ ) k = { } k = { , } k = { , , } k = { , , , } Figure 8.

The posterior distribution for σ , marginalized over all otherparameters. The diﬀerent colors represent diﬀerent k NN combinations fromwhich the constraint was obtained. The constraints improve as more nearestneighbor distributions are added, but the gain saturates by the time we addall four CDFs that are computed from the data. ﬁeld, where the simulation particles are tracers of the ﬁeld. For thenearest neighbor analysis, we downsample the simulation particlesto 10 before computing the diﬀerent distributions. As shown in Sec.3.2, the downsampling determines the range of scales over whichwe can accurately measure the features of the underlying continu-ous ﬁeld. The speciﬁc choice we make here is to ensure that themeasurements are robust in the range of scales that enter the Fisheranalysis, i.e. 10 h − Mpc to 40 h − Mpc. To reduce the sampling vari-ance caused by downsampling, for every cosmology, we create 16distinct downsamplings of 10 particles each from the original 512 particles in the simulation, and then compute the Empirical CDFby combining the nearest neighbor measurements from individualdownsampled datasets. For the correlation functions, we use theCorrfunc code (Sinha & Garrison 2019; Sinha & Garrison 2020)to compute ξ ( r ) using all 512 particles in the simulation. Oncethe data vector is deﬁned in this way, we compute the covariancematrix and the derivatives of the data vector with respect to thecosmological parameters. Note that we use only the CDM parti-cles, for both the nearest neighbor analysis, and the ξ ( r ) analysis,even for cosmologies with massive neutrinos. In massive neutrinocosmologies, the CDM density ﬁeld, traced by the CDM particles,and the total matter ﬁeld, which determines quantities relevant forgravitational lensing, for example, are diﬀerent. The total matterﬁeld includes the contribution from the clustering of neutrinos, andis known to be more constraining on the total neutrino mass M ν .For this work, we only present the constraints on the cosmologicalparameters, including the neutrino mass, from the CDM ﬁeld only.We will explore the possible stronger constraints when using thetotal matter ﬁeld in a future work. https://github.com/manodeep/Corrfunc In Fig. 7, we plot the derivative of the data vector with respectto σ . The diﬀerent colors represent the parts of the data vector thatcome from diﬀerent k NN distributions. The solid lines representthe part of the data vector computed from particles at z =

0, andthe dashed lines represent the parts of the data vector computedfrom particles at z = .

5. The sign of the derivative for the part ofthe data vector coming from nearest neighbor distribution can beunderstood in the following way — when the amplitude of clusteringat 8Mpc, σ is higher, there are more empty void-like regions onlarger scales. In other words, on these scales, the probability ofﬁnding 0 particles (VPF) in a sphere of radius r is higher when σ is higher. Since CDF ( r ) = P > | r = − VPF ( r ) , a higher σ implies a lower value for the data vector entry at these scales ifall other cosmological parameters are held ﬁxed. This leads to thenegative sign of the derivative seen in Fig. 7. The derivative becomeslarger in magnitude as the redshift decreases, as is expected from thegrowth of structure with time. We check that all the other derivativesare smooth, and do not suﬀer from numerical artifacts on scales thatenter the analysis.Before moving to the combined constraints on all cosmolog-ical parameters, we ﬁrst show how the constraints on one of theparameters, σ , changes as we add information from diﬀerent k NNdistributions at z =

0. For this example, we ﬁrst use a data vec-tor constructed only from the nearest neighbor CDF at z=0, andcalculate the constraints on σ from the Fisher analysis on thisdata vector. Next, we use the measurements from k = { , } near-est neighbors, and repeat the calculation. We repeat this until weuse all four k = { , , , } nearest neighbor CDFs in our analysis.The posterior distribution for σ from each of these calculationsis plotted in Fig. 8. Note that this posterior is marginalized overall other cosmological parameters even though they are not shownhere. We ﬁnd that the constraints improve as more NN distributionsare added to the data vector. However, the gain from adding newNN distributions diminishes by the time we use all four of the com-puted CDFs, k = { , , , } . A similar trend is observed for othercosmological parameters as well. We conclude that up to the lowestscales in the analysis, the choice of k = { , , , } in our nearestneighbor analysis is suﬃcient to extract most of the information onthe cosmological parameters. If smaller scales are to be includedin the analysis, higher k neighbors may have to be considered toensure maximal constraints on the parameters down to those scales.The joint constraints on all the cosmological parameters fromthe simulation particles, using the k NN analysis and the ξ ( r ) analysisis presented in Fig. 9. Here we use the full data vector outlined inSec. 4.1. Since the ﬁducial cosmology has M ν =

0, and negative M ν values are unphysical, we follow the example in Uhlemannet al. (2019), and plot constraints on ∆ M ν instead, where ∆ M ν isthe change in the total neutrino mass from the ﬁducial value. The 1- σ constraints on individual parameters is listed in Table 1. For all thecosmological parameters, we ﬁnd that the k NN analysis improvesthe constraints by almost a factor of 2 over the ξ ( r ) analysis. Whilesome of the degeneracy directions between pairs of parameters aresomewhat diﬀerent between the two analyses, we ﬁnd that both haveare aﬀected by a stronge M ν − σ degeneracy, as can be expectedwhen utilizing information only from the relatively small scalesused in these analyses. Note that our choice of leaving out the largerscales, including the BAO peak from the analysis also aﬀects theshape and size of the other contours and the degeneracy directionscompared to other works. We conclude that on the scales between10 h − Mpc and 40 h − Mpc, using only a two-point function analysisfails to capture at least half the total information about cosmologicalparameters that is available on these scales. The k NN analysis, on the

MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions .

04 0 . Ω b − . − . − . w − . . . ∆ M ν . . σ . . . . Ω m . . . . h . . . . n s . . n s . . h .

30 0 . Ω m .

80 0 . σ − . . ∆ M ν − . − . − . wξ ( r ) analysis k NN analysis

Figure 9.

Constraints on the cosmological parameters derived from the Fisher analysis on simulation particles, combining information from z = z = . h − Mpc to 40 h − Mpc. The constraints from the nearest neighbor analysis (using the same set of scales) are tighter by more than afactor of 2 for most of the parameters. The values of the constraints on individual parameters are listed in Table 1. other hand, is sensitive to higher order clustering, and is much moresensitive to changes in the cosmological parameters. In AppendixC, we show that the results presented here are consistent with thoseobtained from a slightly diﬀerent formalism where only the nearestneighbor distribution is used, but with diﬀerent mean densities forthe data points.

We now turn to the parameter constraints from the analysis of halosin the simulation.Throughout this section, we will consider ﬁxednumber density samples across diﬀerent cosmologies. Speciﬁcally,we focus on the clustering statistics of the 10 most massive halosin each simulation volume. The halos are identiﬁed using an FoFalgorithm from the simulation particles. Even for comsologies with neutrino particles in the simulation, the halo ﬁnding is run only onthe CDM particles. At z =

0, this number cut corresponds to masscuts around 5 × M (cid:12) / h , and at z = .

5, corresponds to mass cutsaround 3 × M (cid:12) / h . As discussed previously, the k NN statisticsis sensitive to the mean number density, and using a ﬁxed numberdensity ensures that diﬀerences in the k NN statistics at diﬀerentcosmologies arise only from changes in the underlying clustering.In the language of the Fisher formalism, this analysis quantiﬁes theresponse of the clustering observables — k NN distributions or ξ ( r ) measurements — constructed from the 10 most massive halos ina ( h − Gpc ) volume, to a change in the cosmological parameters,and then converts the amplitude of the response into parameterconstraints. Once again, we only use scales in the range 10 h − Mpcto 40 h − Mpc in our analysis.If the two-point function is used as the summary statistic, pa-

MNRAS000

MNRAS000 , 1–21 (2020) Banerjee & Abel .

03 0 . Ω b − − w − ∆ M ν . . . σ . . . Ω m . . . h . . . . n s . . n s . . h . . . Ω m . . σ − . . ∆ M ν − . − . wξ ( r )analysis k NN analysis

Figure 10.

Constraints on the cosmological parameters derived from the Fisher analysis of the clustering of the 10 most massive halos at diﬀerent cosmologies,combining information from z = z = .

5, and using scales in the range 10 h − Mpc to 40 h − Mpc. The constraints from the k NN analysis are much tighterthan those from the ξ ( r ) analysis. The values of the constraints on individual parameters are listed in Table 2. rameter constraints degrade signiﬁcantly when analyzing halo clus-tering compared to matter clustering. This is due to halo bias, which,in general, is unknown. Since the eﬀect of the linear bias term on thetwo-point function, especially on large scales, is degenerate with anumber of cosmological parameters, like σ and M ν , marginalizingover the unknown bias term relaxes the constraints on these. It isonly possible to break the degeneracy by either using other observ-ables (usually lensing) which have a diﬀerent dependence on bias,or by using information from smaller scales, where the linear biasapproximation breaks down. Note that, in general, it is not necessaryfor the linear bias approximation to break down at the same scaleas where the underlying clustering becomes non-Gaussian (Modiet al. 2020).We have already demonstrated in Sec. 3.3.1 that the k NN statis-tics can break the degeneracy between linear bias and the amplitude of clustering of the underlying ﬁeld. The k NN statistics were ableto distinguish between two halo samples where the underlying clus-tering of the matter ﬁeld were diﬀerent, but the bias (or mass cut)adjusted such that they have the same two-point clustering signal.It is therefore, possible, that more information about the cosmolog-ical parameters is retained in the k NN analysis of massive halos,compared to a ξ ( r ) analysis.This is indeed what we ﬁnd from the Fisher analysis on theclustering of halos in real space, and the results are presented inFig. 10. The values of the 1- σ constraints on individual parametersare tabulated in Table 2. Unsurprisingly, we ﬁnd that for ξ ( r ) , theconstraints are signiﬁcantly relaxed when compared to the matterclustering case in Fig. 9 and Table 1. This is most signiﬁcantlyapparent for σ , where the constraints degrade by almost a factor of8. On the other hand, the constraints yielded by the k NN analysis

MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions .

03 0 . Ω b − − w − ∆ M ν . . . σ . . . Ω m . . . h . . . . n s . . n s . . h . . Ω m . . σ − . . ∆ M ν − . − . wξ ( s ) analysis k NN analysis

Figure 11.

Constraints on the cosmological parameters derived from the Fisher analysis of the monopole of clustering of the 10 most massive halos in redshiftspace, combining information from z = z = .

5, and using scales in the range 10 h − Mpc to 40 h − Mpc. Similar to Fig. 10, the constraints from the k NNanalysis are much tighter than those from the ξ ( r ) analysis. are much closer to those from the analysis of the matter ﬁeld. Eventhough the constraint on σ is less stringent than for the matter ﬁeldanalysis, it is still tighter than the ξ ( r ) constraint by a factor of ∼ k NN statistics, and the im-provement over the two-point analysis using the same scales is largerwhen considering the 10 most massive halos than when consid-ering the clustering of the underlying matter ﬁeld. This has manypotential implications for the analysis of galaxy clustering in cos-mological surveys, both photometric, and spectroscopic.While the analysis presented above considers the clusteringof halos in real space, spectroscopic surveys generally measure theclustering of galaxies in redshift space. It is, therefore, worthwhileto understand, if the projection to redshift space aﬀects the relativeimprovements on parameter constraints from the k NN analysis over the ξ ( r ) analysis. To convert the real space positions r of the halosin the simulation volume to redshift space positions s , we simplyuse s = r + + zH ( z ) v x ˆ x , (26)where H ( z ) is the Hubble parameter at redshift z . In the aboveequation, we have assumed that the ˆ x direction is the line-of-sightdirection.Because the line of sight projection breaks the rotational sym-metry, clustering in redshift is no longer isotropic, and it is commonto include the quadrupole and the hexadecapole of the correlationfunction in addition to the monopole. To keep the analysis simple,we focus here only on the monopole for the two point correlationfunction, and compare it to the k NN statistics in redshift space. Toconstruct the CDFs for the k NN distributions, we proceed exactly

MNRAS000

MNRAS000 , 1–21 (2020) Banerjee & Abel

Table 1. σ constraints on cosmological parameters, from the k NN and ξ ( r ) analysis of simulation particles using scales in the range 10 h − Mpc to40 h − Mpc. Parameter σ k NN σ ξ ( r ) Ω b n s h Ω m σ ∆ M ν w Table 2. σ constraints on cosmological parameters, from the k NN and ξ ( r ) analysis of the real space clustering of the 10 most massive halos inthe simulation volume of ( h − Gpc ) , using scales in the range 10 h − Mpcto 40 h − Mpc. Parameter σ k NN σ ξ ( r ) Ω b n s h Ω m σ ∆ M ν w as for the real space calculations, once all the halo positions havebeen transformed according to Eq. 26.The results of the analysis are plotted in Fig. 11. Once again,we ﬁnd that using the k NN statistics yields much tighter constraintson the cosmological parameters compared to the two-point functionanalysis. The gain is very similar to those that were obtained in thereal space analysis above. Adding in the quadrupole and hexade-capole two-point functions should add more constraining power tothe two-point analysis, but it should be noted that the k NN statis-tics can also be expanded in the presence of a special line-of-sightdirection. For example, the nearest neighbor distributions along theLOS direction and the perpendicular directions can be computedseparately - the diﬀerences in the distribution along individual di-rections arise exactly from the redshift projection. We will explorethese issues in detail in future work.

In this paper we have presented the nearest neighbor distributions( k NN-CDF), the empirical cumulative distribution function of dis-tances from a set of volume-ﬁlling Poisson distributed randompoints to the k –nearest data points, as a set of new summary statis-tics that can be used in cosmological clustering analysis. We haveshown how these are related to the counts-in-cell distributions, andto the various N -point correlation functions of the clustering of aset of data points. We have outlined how these distributions can beeﬃciently computed over a range of scales using the data points, ora subsample of them, and a set of volume ﬁlling randoms. We haveapplied the k NN-CDF formalism to the clustering of tracers drawnfrom two diﬀerent continuous distributions, i.e. the Poisson sam-pling of an uniform ﬁeld, and a Gaussian ﬁeld. We also demonstrateits application to data from a cosmological simulation - both for simulation particles, and dark matter halos identiﬁed in the simula-tions. We have explored how the nearest neighbor distributions canbreak the degeneracy between linear halo bias and the amplitude ofclustering of the underlying matter distribution, which is a majorlimitation of the two-point statistics on large scales. Then, employ-ing the Fisher matrix formalism, we have quantiﬁed the constraintsthat can be placed on various cosmological parameters by using the k NN-CDF as the summary statistics. Using simulation particles,and scales between 10 h − Mpc and 40 h − Mpc, we demonstrate thatthe use of k NN-CDF distributions improve the constraints by afactor of ∼ k NN-CDF statistics tochanges in the underlying cosmology, the fact that they can com-puted extremely eﬃciently, is highly advantageous. Other meth-ods which aim to harness the extra constraining power, beyond thetwo-point information, usually are computationally expensive. Forexample, measuring the bispectrum is much more expensive thanthe power spectrum, while the trispectrum is even more so. Otherapproaches, such as measuring the density PDF at diﬀerent scalesusually require multiple Fourier transforms, at least two for eachscale at which the PDF is to be computed. For the k NN-CDFs onthe other hand, search times for the k -th nearest neighbor does notscale with k once the tree has been constructed, and so, informa-tion about higher order cumulants can be accessed at negligibleadditional computational cost. Further, a single computation is suf-ﬁcient to provide information on a wide range of scales, and a rangeof values for k . While we have focused in this paper on cluster-ing in 3 dimensions, the formalism presented can easily be appliedto non-Gaussian data in 2 dimensions, where the calculations arecomputationally even cheaper due to the reduction in the dimen-sionality. This framework is therefore also well suited to the studythe tomographic clustering of galaxies in a photometric survey.For distributions with only a ﬁnite number of connected N -point functions, say m , measuring only the m nearest neighbordistributions is suﬃcient to capture the full clustering. For cos-mological applications, one does not a priori know the value of k ,or if k is even ﬁnite, but as has been demonstrated, we exhaust mostof the independent information by the time, e.g , the 8-th nearestneighbor distribution is used in the analysis. This value is of coursedependent on the choice of scale cuts used in the analysis, and ifsmaller scales are used, it is likely that a few more nearest neighbormeasurements will be needed to capture all the information. As wehave pointed out, we have been conservative in our choice of scalecuts based on the resolution of the simulations used in the analysis,and including smaller scales will yield even greater improvementsin parameter constraints over the two-point analysis.While the NN-CDF formalism lends itself most naturally tothe analysis of clustering of discrete points, we have demonstratedthat it can also be used to study statistical properties of continuousﬁelds. The matter density ﬁeld has been analyzed by downsamplingthe simulation particles. Similarly, statistical properties of othercontinuous ﬁelds, such as the convergence ﬁelds for weak lensingmeasurements, can also be analyzed in the same way by appropri-ately sampling the ﬁeld. The sampling rate is determined by thescales of interest, such that diﬀerent distribution functions are ro-bustly measured, and have to be tailored for the application at hand.However, once the sampling rate has been ﬁxed, all other aspects MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions of the analysis presented here can be carried through without majorchanges in the workﬂow.In the current work, we have used the Fisher formalism todemonstrate the improvement in constraining power when NN-CDFs are used, the Fisher formalism is only reliable when certainapproximations are valid. Further, note that the empirical CDF iscomposed of ∼ measurements, but in the Fisher analysis, weonly use the CDFs interpolated to 16 radial scales. This is neededsuch that the Hartlap factor in Eq. 25, given the number of simu-lations, remains close to unity, and that the Fisher constraints arereliable. Any analysis of actual data will likely demand more so-phisticated statistical techniques. Since the actual measurements inthe nearest neighbor calculations are those of Cumulative Distribu-tion Functions, the Kolmogorov–Smirnov (KS) test (Smirnov 1948)lends itself quite naturally as an alternative. Another advantage ofthe KS test, when a large number of measurements is at hand is thatthe test has more statistical power when the empirical CDF is sam-pled at higher rates. Other measures such as the Kullback–Leiblerdivergence (Kullback & Leibler 1951) can also potentially be ap-plied in conjunction with the NN-CDF measurements, and theseissues will be explored in more detail in future work.We have focused here on the improvement in parameter con-straints for samples where the number densities are high enough forthe two-point function to be computed using standard methods with-out being strongly aﬀected by shot noise considerations. However,as demonstrated in Sec.3.2, the NN-CDF method also allows forthe measurement of the clustering signal even on scales where shotnoise is expected to dominate, i.e. on scales smaller than the meaninter-particle separations. The stronger the clustering, the easier itis to robustly detect the clustering at a ﬁxed mean number density.In cosmology, interesting samples with low number density, suchas the most massive clusters, also tend to be highly clustered withrespect to the underlying matter ﬁeld, implying that the NN-CDFmethod can be applied to the study of the clustering of extremelyrare objects such as the most massive galaxy clusters, which are usu-ally completely noise-dominated when computed using the standardmethods.Another possible application of nearest neighbor statistics is asa test of non-Gaussianity in the clustering of objects. If the numberdensity of a sample is well known, and the variance of the distribu-tion can be computed as a function of scale, the nearest neighbormeasurements can be used to check if the clustering matches that of afully Gaussian distribution, as discussed in Sec. 3.2. Any departuresfrom the expressions for the nearest neighbor distributions derivedin Sec. 3.2 can be interpreted as the presence of non-Gaussian terms.On small scales, this can help test the range of validity of analysismethods that rely on assumptions of Gaussianity, while on largerscales such measures could be relevant for the search for primordialnon-Gaussianity in Large Scale Structure.At a ﬁxed cosmology, the NN-CDF measurements can be usedto study the clustering of diﬀerent halo samples. One such ap-plication is in the context of halo and galaxy assembly bias. Forexample, diﬀerent galaxy samples targeted in spectroscopic surveysare known to occupy diﬀerent parts of the cosmic web (Orsi &Angulo 2018; Alam et al. 2019). Under these conditions, a singlelinear bias parameter in the measured two-point correlation func-tion does not reﬂect the full diﬀerence in the clustering (Wang et al.2019). The nearest neighbor framework oﬀers a way to capturethese diﬀerences more clearly. On a related note, adding NN-CDFmeasurements to the data vectors used in studies of the galaxy-haloconnection can add to the power of these models, and can improve constraints on cosmological parameters even after marginalizingover all the galaxy-halo connection parameters. ACKNOWLEDGEMENTS

This work was supported by the U.S. Department of Energy SLACContract No. DE-AC02-76SF00515. The authors would like tothank Susmita Adhikari and Michael Kopp for useful discussions,and Daniel Gruen, Michael Kopp, Johannes Lange, Jeﬀ Scargle, Ist-van Szapudi, Cora Uhlemann, and Francisco Villaescusa-Navarrofor comments on an earlier version of the manuscript. Some of thecomputing for this project was performed on the Sherlock clus-ter. The authors would like to thank Stanford University and theStanford Research Computing Center for providing computationalresources and support that contributed to these research results. ThePylians3 analysis library was used extensively in this paper. Weacknowledge the use of the GetDist (Lewis 2019) software forplotting. REFERENCES

Abbott T. M. C., et al., 2018, Phys. Rev. D, 98, 043526Alam S., et al., 2017, MNRAS, 470, 2617Alam S., Zu Y., Peacock J. A., Mand elbaum R., 2019, MNRAS, 483, 4501Armijo J., Cai Y.-C., Padilla N., Li B., Peacock J. A., 2018, Monthly Noticesof the Royal Astronomical Society, 478, 3627Baldauf T., Smith R. E., Seljak U. c. v., Mandelbaum R., 2010, Phys. Rev.D, 81, 063531Balian R., Schaeﬀer R., 1989, A&A, 220, 1Banerjee A., Dalal N., 2016, J. Cosmology Astropart. Phys., 2016, 015Banerjee A., Castorina E., Villaescusa-Navarro F., Court T., Viel M., 2019,arXiv e-prints, p. arXiv:1907.06598Bardeen J. M., Bond J. R., Kaiser N., Szalay A. S., 1986, ApJ, 304, 15Bernardeau F., 1994, A&A, 291, 697Bernardeau F., Pichon C., Codis S., 2014, Phys. Rev. D, 90, 103519Carrasco J. J. M., Hertzberg M. P., Senatore L., 2012, Journal of HighEnergy Physics, 2012, 82Castorina E., Feng Y., Seljak U., Villaescusa-Navarro F., 2018, Phys.Rev. Lett., 121, 101301Chiang C.-T., LoVerde M., Villaescusa-Navarro F., 2019, Phys. Rev. Lett.,122, 041302Coles P., Jones B., 1991, MNRAS, 248, 1Colombi S., 1994, ApJ, 435, 536Coulton W. R., Liu J., Madhavacheril M. S., Böhm V., Spergel D. N., 2019,J. Cosmology Astropart. Phys., 2019, 043Dalal N., Doré O., Huterer D., Shirokov A., 2008, Phys. Rev. D, 77, 123514Desjacques V., Seljak U., Iliev I. T., 2009, MNRAS, 396, 85Eisenstein D. J., Seo H.-J., Sirko E., Spergel D. N., 2007, The AstrophysicalJournal, 664, 675Euclid Collaboration et al., 2019, MNRAS, 484, 5509Fluri J., Kacprzak T., Sgier R., Refregier A., Amara A., 2018, J. CosmologyAstropart. Phys., 2018, 051Friedrich O., et al., 2018, Phys. Rev. D, 98, 023508Fry J. N., 1986, ApJ, 306, 358Fry J. N., Colombi S., 2013, Monthly Notices of the Royal AstronomicalSociety, 433, 581Gaztañaga E., Fosalba P., Elizalde E., 2000, ApJ, 539, 522Gil-Marín H., Noreña J., Verde L., Percival W. J., Wagner C., Manera M.,Schneider D. P., 2015, MNRAS, 451, 539 https://github.com/franciscovillaescusa/Pylians3 https://getdist.readthedocs.io/en/latest/MNRAS000

Gruen D., et al., 2015, Monthly Notices of the Royal Astronomical Society,455, 3367Gruen D., et al., 2018, Phys. Rev. D, 98, 023507Gualdi D., Gil-Marín H., Schuhmann R. L., Manera M., Joachimi B., LahavO., 2019, MNRAS, 484, 3713Hahn C., Villaescusa-Navarro F., Castorina E., Scoccimarro R., 2020, J. Cos-mology Astropart. Phys., 2020, 040Hartlap J., Simon P., Schneider P., 2007, A&A, 464, 399Hernández-Aguayo C., Baugh C. M., Li B., 2018, MNRAS, 479, 4824Hildebrandt H., et al., 2017, MNRAS, 465, 1454Hinshaw G., et al., 2013, ApJS, 208, 19Horowitz B., Seljak U., Aslanyan G., 2019, J. Cosmology Astropart. Phys.,2019, 035Ivanov M. M., Simonović M., Zaldarriaga M., 2020, J. Cosmology Astropart.Phys., 2020, 042Kerscher M., Pons-Bordería M. J., Schmalzing J., Trasarti-Battistoni R.,Buchert T., Martínez V. J., Valdarnini R., 1999, ApJ, 513, 543Klypin A., Prada F., Betancort-Rijo J., Albareti F. D., 2018, MNRAS, 481,4588Kofman L., Bertschinger E., Gelb J. M., Nusser A., Dekel A., 1994, ApJ,420, 44Kullback S., Leibler R. A., 1951, Ann. Math. Statist., 22, 79Lam T. Y., Sheth R. K., 2008, MNRAS, 386, 407Lawrence E., et al., 2017, ApJ, 847, 50Lewis A., 2019, arXiv e-prints, p. arXiv:1910.13970LoVerde M., 2016, Phys. Rev. D, 93, 103526M Evans N. H., Peacock B., 2000, Measurement Science and Technology,12, 117Massara E., Villaescusa-Navarro F., Ho S., Dalal N., Spergel D. N., 2020,arXiv e-prints, p. arXiv:2001.11024Matsubara T., 2010, Phys. Rev. D, 81, 083505Mead A. J., Peacock J. A., Heymans C., Joudaki S., Heavens A. F., 2015,MNRAS, 454, 1958Modi C., Feng Y., Seljak U., 2018, J. Cosmology Astropart. Phys., 2018,028Modi C., Chen S.-F., White M., 2020, MNRAS, 492, 5754Munshi D., van Waerbeke L., Smidt J., Coles P., 2012, MNRAS, 419, 536Nishimichi T., Bernardeau F., Taruya A., 2017, Phys. Rev. D, 96, 123515Oort J. H., 1983, ARA&A, 21, 373Orsi Á. A., Angulo R. E., 2018, MNRAS, 475, 2530Padmanabhan N., White M., Cohn J. D., 2009, Phys. Rev. D, 79, 063523Padmanabhan N., Xu X., Eisenstein D. J., Scalzo R., Cuesta A. J., MehtaK. T., Kazin E., 2012, MNRAS, 427, 2132Pan J., Szapudi I., 2005, MNRAS, 362, 1363Paranjape A., Alam S., 2020, MNRAS, 495, 3233Peacock J. A., 1998, Cosmological Physics. Cambridge University Press,doi:10.1017/CBO9780511804533Peel A., Lin C.-A., Lanusse F., Leonard A., Starck J.-L., Kilbinger M., 2017,A&A, 599, A79Petri A., Haiman Z., Hui L., May M., Kratochvil J. M., 2013, Phys. Rev. D,88, 123002Petri A., Liu J., Haiman Z., May M., Hui L., Kratochvil J. M., 2015, Phys.Rev. D, 91, 103511Planck Collaboration et al., 2018, arXiv e-prints, p. arXiv:1807.06209Planck Collaboration et al., 2019, arXiv e-prints, p. arXiv:1905.05697Repp A., Szapudi I., 2020, arXiv e-prints, p. arXiv:2006.01146Schmittfull M., Feng Y., Beutler F., Sherwin B., Chu M. Y., 2015, Phys.Rev. D, 92, 123522Scoccimarro R., Colombi S., Fry J. N., Frieman J. A., Hivon E., Melott A.,1998, ApJ, 496, 586Sefusatti E., Crocce M., Pueblas S., Scoccimarro R., 2006, Phys. Rev. D,74, 023522Seljak U., Aslanyan G., Feng Y., Modi C., 2017, J. Cosmology Astropart.Phys., 2017, 009Sharp N. A., 1981, MNRAS, 195, 857Sinha M., Garrison L., 2019, in Majumdar A., Arora R., eds, SoftwareChallenges to Exascale Computing. Springer Singapore, Singapore, pp3–20, https://doi.org/10.1007/978-981-13-7729-7_1

Sinha M., Garrison L. H., 2020, MNRAS, 491, 3022Slepian Z., et al., 2017, MNRAS, 469, 1738Smirnov N., 1948, Ann. Math. Statist., 19, 279Szapudi I., Szalay A. S., 1993, ApJ, 408, 43Takada M., Jain B., 2004, MNRAS, 348, 897Taruya A., Bernardeau F., Nishimichi T., Codis S., 2012, Phys. Rev. D, 86,103528Tegmark M., 1997, Phys. Rev. Lett., 79, 3806Uhlemann C., Codis S., Pichon C., Bernardeau F., Reimberg P., 2016,Monthly Notices of the Royal Astronomical Society, 460, 1529Uhlemann C., Friedrich O., Villaescusa-Navarro F., Banerjee A., Codis S. r.,2019, arXiv e-prints, p. arXiv:1911.11158Vaart A. W. v. d., 1998, Asymptotic Statistics. Cambridge Series in Sta-tistical and Probabilistic Mathematics, Cambridge University Press,doi:10.1017/CBO9780511802256Verde L., Heavens A. F., 2001, ApJ, 553, 14Villaescusa-Navarro F., Marulli F., Viel M., Branchini E., Castorina E.,Sefusatti E., Saito S., 2014, J. Cosmology Astropart. Phys., 2014, 011Villaescusa-Navarro F., et al., 2019, arXiv e-prints, p. arXiv:1909.05273Vlah Z., Seljak U., Baldauf T., 2015, Phys. Rev. D, 91, 023508Wald I., Havran V., 2006, in 2006 IEEE Symposium on Interactive RayTracing. pp 61–69Walsh K., Tinker J., 2019, MNRAS, 488, 470Wang K., et al., 2019, MNRAS, 488, 3541Way M. J., Gazis P. R., Scargle J. D., 2011, ApJ, 727, 48White S. D. M., 1979, MNRAS, 186, 145White M., 2016, J. Cosmology Astropart. Phys., 2016, 057d’Amico G., Gleyzes J., Kokron N., Markovic K., Senatore L., Zhang P.,Beutler F., Gil-Marín H., 2020, J. Cosmology Astropart. Phys., 2020,005

APPENDIX A: DERIVATION OF THE GENERATINGFUNCTION FOR DISCRETE COUNTS

We derive the expression for Eq. 1 by summarizing the argumentspresented in Szapudi & Szalay (1993). The generating functionalfor a continuous ﬁeld ρ ( r ) can be written as an integral over all ﬁeldconﬁgurations of ρ : Z[ J ] = ∫ (cid:2) D ρ ( r ) (cid:3) P (cid:2) ρ ( r ) (cid:3) exp (cid:34) i ∫ d r ρ ( r ) J ( r ) (cid:35) (A1)where (cid:2) D ρ ( r ) (cid:3) represents the functional integral over ﬁeld conﬁg-urations, and P (cid:2) ρ ( r ) (cid:3) represents the probability of a speciﬁc ﬁeldconﬁguration. For a mean density ¯ ρ , the cumulants of the distribu-tion are related to the generating functional through the followingequation:¯ ρ k ξ ( k ) ( r ... r k ) = i k δ k (cid:0) ln Z[ J ] (cid:1) δ J ( r ) ...δ J ( r k ) . (A2)Eq. A2 can be inverted to express the generating functional itself interms of the cumulants: Z[ J ] = exp (cid:34) ∞ (cid:213) k = ( i ¯ ρ ) k k ! ∫ d r ... d r k ξ ( k ) ( r ... r k ) × J ( r ) ... J ( r k ) (cid:35) . (A3)We now consider a set of points generated by a local Poissonprocess where the number of points generated in a volume V around r depends only on the local integrated density over that volume, ρ V ( r ) : M V ( r ) = ∫ V d r (cid:48) ρ ( r (cid:48) ) W ( r , r (cid:48) ) , (A4) MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions where W ( r , r (cid:48) ) deﬁnes the window function for the smoothing pro-cedure. Therefore, at a given point r with integrated density M V ( r ) ,the probability of ﬁnding k points within radius r is P k |M V = (M V / m ) k k ! exp (cid:2) − M V / m (cid:3) , (A5)where m is the “mass" associated with each particle. While this hasa relatively straightforward meaning when applied to simulationparticles, for halos, this can be thought of as a normalization factor.The overall probability of ﬁnding k points within volume V of r needs to average over all possible ﬁeld conﬁgurations: P k | V = (cid:42) (M V / m ) k k ! exp (cid:2) − M V / m (cid:3)(cid:43) . (A6)The generating function for the discrete distribution is thendeﬁned as P ( z | V ) = ∞ (cid:213) k = P k | V z k = (cid:42) (M V / m ) k k ! exp (cid:2) − M V / m (cid:3)(cid:43) z k = (cid:68) exp (cid:2) M V ( z − )/ m (cid:3)(cid:69) . (A7)Note that in the last line of Eq. A7, all quantities are in terms ofthe continuous variables. Therefore this expectation value can becomputed through the functional integral over all conﬁgurations ofthe continuous ﬁeld ρ : P ( z | V ) = ∫ (cid:2) D ρ ( r ) (cid:3) P (cid:2) ρ ( r ) (cid:3) × exp (cid:34) ( z − ) m ∫ d r ρ ( r ) W ( r , r (cid:48) ) (cid:35) . (A8)Therefore, Eqs. A1 and A8 match each other when J ( r ) = W ( r , r (cid:48) )( z − )/( im ) . We now use the fact that the RHS of Eqs.A1 and A3 are equivalent, to write P ( z | V ) = exp (cid:34) ∞ (cid:213) k = (cid:0) ¯ n ( z − ) (cid:1) k k ! ∫ d r ... d r k ξ ( k ) (cid:16) r (cid:48) , ..., r (cid:48) k (cid:17) × W ( r , r (cid:48) ) ... W ( r k , r (cid:48) k ) (cid:35) , (A9)where ¯ n = ¯ ρ / m . For the special case W ( r , r (cid:48) ) = P ( z | V ) = exp (cid:34) ∞ (cid:213) k = ¯ n k ( z − ) k k ! × ∫ V ... ∫ V d r ... d r k ξ ( k ) ( r , ..., r k ) (cid:35) . (A10) APPENDIX B: CUMULANTS

The moment generating function for the discrete distribution is givenby M ( t | V ) = ∞ (cid:213) k = exp (cid:2) tk (cid:3) P k | V = exp (cid:34) ∞ (cid:213) k = ¯ n k ( e t − ) k k ! × ∫ V ... ∫ V d r ... d r k ξ ( k ) ( r , ..., r k ) (cid:35) . (B1) The k th cumulant C k is given by C k = (cid:34)(cid:18) ddt (cid:19) k ln (cid:18) M ( t | V ) (cid:19)(cid:35) t = . (B2)The ﬁrst and second cumulants therefore work out to be C = ¯ nV (B3) C = ¯ nV + ¯ n ∫ V ∫ V d r d r ξ ( ) ( r , r ) , (B4)where ξ ( ) is the usual two-point correlation function. Note that ξ ( ) ( r , r ) = ξ ( ) ( r = | r − r |) due to isotropy. For the specialcase when the volume V is associated with a sphere of radius r , wecan write C as a one dimensional integral over ξ ( ) ( r ) with someweight function W ( r | r ) : C = ¯ nV + ¯ n ∫ r dr r ξ ( ) ( r ) W ( r | r ) , (B5)where W ( r | r ) = π (cid:18) ( r − r ) + r ( r − r ) (cid:19) . (B6)The second moment or variance can also be expressed in terms ofthe power spectrum, P ( k ) , smoothed on scale r corresponding tovolume V : C = ¯ nV + ( ¯ nV ) π ∫ dk k P ( k ) W ( kr ) = ¯ nV + ¯ n V σ V . (B7) W ( kr ) represents the Fourier transform of the tophat window func-tion for a ﬁlter with radius r : W ( kr ) = ( kr ) (cid:18) sin (cid:0) kr (cid:1) − (cid:0) kr (cid:1) cos (cid:0) kr (cid:1)(cid:19) . (B8)Note that Eq. B7 is true for any distribution - the distribution neednot be a Gaussian. Therefore, ξ ( r ) or P ( k ) , along with the knowledgeof the mean number density, encodes information about the secondcumulant of the distribution, even in the nonlinear regime.The second cumulant can also be computed from the CICdistributions: C ( V ) = (cid:205) ∞ k = k P k | V (cid:205) ∞ k = P k | V − (cid:32) (cid:205) ∞ k = kP k | V (cid:205) ∞ k = P k | V (cid:33) . (B9)While the sum formally runs from 0 to ∞ , in practice, for anyrealistic distribution function, a ﬁnite number of terms is suﬃcientto determine C accurately. Fig. B1 shows the measurement of C ( V ) for a set of 10 simulation particles in a ( h − Gpc ) volumeat z =

0. The solid line represents the measurement from the datausing Eq. B9. The CIC measurements themselves are derived fromthe k NN measurements through Eq. 8. The dotted line representsthe measurement of C using Eq. B5, where ξ ( r ) was computedfrom all 512 particles in the simulation, but ¯ n was set to matchthe number density considered for the CIC measurement. The twomeasurements produce consistent results at the ∼

1% level, eventhough we use only 200 nearest neighbor distributions to computethe solid curve - where, formally, the sums in Eq. B9 run from 0 to ∞ . Similarly, higher cumulants can also be obtained directly fromthe NN-CDF by taking higher moments of the P k | V , as in Eq. B9.These cumulants are directly related to the connected higher N -pointfunctions and their Fourier equivalents, just as the second cumulantis related to the two-point function or P ( k ) . We plot the skewness andexcess kurtosis, which are related to the third and fourth cumulants, MNRAS000

Top : Second cumulant C , as a function of scale, for the distri-bution of 10 simulation particles in a ( h − Gpc ) volume at z =

0. Thesolid line represents the result for C computed using the CIC distributions,according to Eq. B9. The CIC distributions themselves have been computedfrom the k NN distributions, according to Eq.8. The dashed line represents C computed from the measured ξ ( r ) of the full set of simulation particles,using Eq. B5. Bottom : Ratio of the two measurements from the top panel. of the same sample of particles from the simulations using thesolid lines in Fig. B2. The dot-dashed lines are the expected skewand kurtosis for a set of points with the same number density,and the same two-point function, but with no higher connected n − point functions. The dashed lines of the same color representthe measurements of the the same quantities from a set of Poissondistributed points with the same mean density. The dotted linesrepresent the analytic predictions for the skew and excess kurtosis forthe Poisson distribution. We ﬁnd very good agreement between themeasurements and the predicted values for the Poisson distribution,while the measurements from the cosmological distribution (solidlines) are clearly diﬀerent from that of the Poisson distribution onall scales displayed on the plot. This is true even in scales below themean inter-particle separation, where shot noise could potentiallybe important. Further, comparing the solid lines and the dot-dashedlines on the plot illustrates the fact that this measurement is indeedsensitive to the presence of higher cumulants in the data. Onceagain, we have terminated the calculation at 200 nearest neighbors,just as we did for the calculation of the second cumulant. It is alsoworth noting that the time taken to compute quantities related tothe second, third, fourth, and potentially higher cumulants is thesame - i.e. once the NN-CDF distributions for the 200 neighborshas been computed using the tree, the rest of the calculation iscomputationally trivial, irrespective of the order of the cumulant.Of course, sample variance limitations of the nature discussed inthe text imply that the lower cumulants are more robustly measuredover a larger range of scales. This can be seen in the departure of r ( h − Mpc)10 − SkewExcess Kurtosis

Figure B2.

Solid lines represent the skew and excess kurtosis (as a functionof scale) of the distribution of 10 simulation particles in a ( h − Gpc ) vol-ume at z =

0. The dot-dashed lines represent the expected skew and kurtosisvalues for a set of discrete points with the same number density, and the sametwo point function, but no higher connected n − point functions. The dashedlines represent the measurements for a set of Poisson distributed pointswith the same number density. The thin dotted lines represent the analyticpredictions for the skew and excess kurtosis of the Poisson distribution. the measurement and analytic expectation of the excess kurtosis, orthe fourth cumulant, of the Poisson distribution on large scales inFig. B2. APPENDIX C: DATA DOWNSAMPLING

In the paper introducing the Void Probability Function, White(1979) showed that the VPF itself can serve as the generating func-tion for the P k | V distribution. In this formalism, however, the deriva-tives had to be taken with respect to the mean number density ¯ n ,unlike Eq. 2, where the derivatives are taken with respect to thedummy variable z . As we have shown earlier,VPF ( r ) = − CDF ( r ) . (C1)In theory, therefore, the information in the k -th neighbor distributioncan also be accessed by computing the nearest neighbor distributionat a diﬀerent mean number density.We now compare the actual parameter constraints from theFisher analysis of k NN distributions for simulation particles, pre-sented in Sec. 4.2, with those that can be obtained by comput-ing the VPF at diﬀerent number densities. In order to do this, wecompute the VPF for subsamples of the simulation particles with¯ n = { × − , . × − , . × − , . × − }( h − Mpc ) − ,using Eq. C1. Note that the 1 × − ( h − Mpc ) − is the mean numberdensity used in Sec. 4.2. The other values of ¯ n are also chosen in away that they correspond to the k NN measurements presented there.The data vector is created by appending the VPF measurements at16 logarithmically spaced scales for each ¯ n in the scale range of10 h − Mpc to 40 h − Mpc. We once again use the relevant Erlang

MNRAS , 1–21 (2020) osmology with Nearest Neighbor Distributions .

046 0 . Ω b − . − . − . w − . . . ∆ M ν . . . . σ . . . Ω m . . . h . . . n s .

95 1 . n s .

65 0 . h .

30 0 . Ω m .

825 0 . σ − . . ∆ M ν − . − . w VPF analysis k NN analysis

Figure C1.

Fisher constraints on cosmological parameters from the k NN analysis in Fig. 9 and the VPF analysis outlined above. The two formalisms yieldvery similar constraints on all the cosmological parameters. distribution to ensure that the scales do not include measurementsof the VPF deep into the tails. As earlier, we combine measurementsfrom z = z = . k NN analysis presented in Fig. 9and Table 1. The ﬁnal constraints on individual parameters arevery similar, though some of the degeneracy directions are slightlydiﬀerent. The agreement in the constraints hold true even though thediﬀerent derivatives and the covariance matrices between the twoanalyses look very diﬀerent. This is a useful check on the robustnessof the constraints obtained from the k NN distributions, as a well asa test of the understanding of the underlying statistical methods.

This paper has been typeset from a TEX/L A TEX ﬁle prepared by the author.MNRAS000