Graph Database Solution for Higher Order Spatial Statistics in the Era of Big Data
Draft version January 3, 2019
Typeset using LaTeX twocolumn style in AASTeX61
Cristiano G. Sabiu,1 Ben Hoyle,2,3 Juhan Kim,4 and Xiao-Dong Li5

1 Department of Astronomy, Yonsei University, 50 Yonsei-ro, Seoul 03722, Korea
2 Universitaets-Sternwarte, Fakultaet fur Physik, Ludwig-Maximilians Universitaet Muenchen, Scheinerstr. 1, D-81679 Muenchen, Germany
3 Max Planck Institute fur Extraterrestrial Physics, Giessenbachstr. 1, D-85748 Garching, Germany
4 Center for Advanced Computation, Korea Institute for Advanced Study, 85 Hoegi-ro, Dongdaemun-gu, Seoul 02455, Korea
5 School of Physics and Astronomy, Sun Yat-Sen University, Guangzhou 510297, People's Republic of China

(Received; Revised; Accepted)
Submitted to ApJS

ABSTRACT

We present an algorithm for the fast computation of the general N-point spatial correlation functions of any discrete point set embedded within a Euclidean space of R^n. Utilizing the concepts of kd-trees and graph databases, we describe how to count all possible N-tuples in binned configurations within a given length scale, e.g. all pairs of points or all triplets of points with side lengths < r_max. Through benchmarking we show the computational advantage of our new graph-based algorithm over more traditional methods. We show that all 3-point configurations up to and beyond the Baryon Acoustic Oscillation scale (∼200 Mpc in physical units) can be performed on current SDSS data in reasonable time. Finally, we present the first measurements of the 4-point correlation function of ∼500,000 SDSS CMASS galaxies.

Keywords: cosmology: theory; methods: data analysis
Corresponding author: Juhan Kim ([email protected])

1. INTRODUCTION

Cosmology is now entering an era of big data. From large optical surveys like LSST, Euclid and DESI to all-sky 21cm radio surveys like SKA, CHIME and Tianlai, where we will reach petabytes of data, it is clear that the future of cosmology will require fast and efficient algorithms for extracting scientifically meaningful information from the wealth of data collected.

A common statistical tool for compressing the spatial information of the galaxy distribution is the N-point correlation functions (or their Fourier counterparts, the poly-spectra). Even at 3rd-order statistics we have seen their usefulness in constraining the statistical bias between the distribution of galaxies and dark matter (e.g., Fry & Gaztanaga 1993; Jing & Börner 1998; Frieman & Gaztañaga 1999; Szapudi et al. 2000; Scoccimarro et al. 2001; Verde et al. 2002) as well as in placing constraints on the amount of primordial non-Gaussianity in the cosmic microwave background (CMB; Komatsu et al. 2003; Planck Collaboration, et al. 2016).

The first configuration-space galaxy 3-point correlation function (3pCF) measurements were made by Groth & Peebles (1977) and subsequently by Gott et al. (1991), where they probed the hierarchical clustering ansatz (Peebles 1980) with O(1000) galaxies. More recently, however, the 3pCF has been measured using hundreds of thousands of galaxies and used to place constraints on the Halo Occupation Distribution (HOD) by breaking degeneracies between galaxy bias and the amplitude of density fluctuations (Guo et al. 2015); it has also shown promise at discriminating different gravity models (Sabiu et al. 2016), helping to constrain non-Gaussianity in the galaxy distribution (Nishimichi et al. 2010) and characterizing the turbulence of the interstellar medium (Portillo et al.
2018). Regarding large-scale cosmology, Slepian et al. (2017a,b) made a detection of the Baryon Acoustic Oscillations (BAO) in the 3rd-order spatial clustering statistics of the Sloan Digital Sky Survey (SDSS) galaxies, although an earlier tentative detection of the 3pCF BAO was claimed by Gaztañaga et al. (2009), who used the SDSS DR7 sample of Luminous Red Galaxies.

Naively, the CFs require computing the distance from every galaxy to every other galaxy, an O(N_gal^2) operation for the 2pCF or O(N_gal^3) for the 3pCF. However, there are some algorithmic methods that can speed up this kind of computation.

https://chime-experiment.ca
http://tianlai.bao.ac.cn

Arranging the data in a hierarchical structure known as a 'tree' allows fast distance matching to be performed. In particular, k-d trees have been utilized for CFs in codes such as Ntropy (Moore et al. 2001; Gardner et al. 2007) and
KSTAT (Sabiu 2018; Sabiu et al. 2016). Other novel methods have been developed to quickly measure the higher-order statistics, including the position-dependent power spectrum (Chiang et al. 2014) and multipole expansions (Szapudi 2004). More recently, Slepian & Eisenstein (2015, 2016) have developed a method based on Fourier transforms and spherical harmonics that can be combined to form the multipole coefficients of the 3pCF.

In this work we present a new algorithm for computing spatial correlations that is based on the concept of a graph database. A graph database is a type of NoSQL database, i.e. it does not rely on the data being described in a tabular, relational format. Rather, a graph database, or more specifically a graph-oriented database, uses graph theory to store, map and query relationships of the data. The algorithm that we propose makes no approximation or data compression and is designed to measure all triplet configurations up to large cosmic scales beyond the BAO distance.

The outline of this paper is as follows. In §2 we define the correlation functions; in §3 we describe the algorithm design; in §4 we present benchmarking tests and example measurements; and in §5 we conclude.
2. CORRELATION FUNCTIONS

The spatial distribution of galaxies encodes a wealth of cosmological information. In an effort to condense the information of millions of 3D galaxy positions into a manageable form we rely on correlation functions. The two-point correlation function (2pCF) is defined as

    \xi(\vec{r}) = \langle \delta(\vec{x})\,\delta(\vec{x}+\vec{r}) \rangle_{\vec{x}} ,    (1)

where δ is the density contrast, related to the density as δ ≡ ρ/ρ̄ − 1, and ⟨..⟩ denotes spatial averaging over \vec{x}. In practice the 2pCF is calculated using estimators, the most popular of which being the "Landy–Szalay" estimator (Landy & Szalay 1993):

    \xi(\vec{r}) = \frac{DD(\vec{r}) - 2\,DR(\vec{r}) + RR(\vec{r})}{RR(\vec{r})} ,    (2)

Figure 1.
This illustration shows the parameterization of a data triplet in anisotropic coordinates. Triangles are defined by 5 parameters: (r_1, r_2, r_3, θ_1, θ_2).

where DD is the number of data–data pairs, DR the number of data–random pairs, and RR is the number of random–random pairs, all separated by a displacement vector \vec{r} and properly normalised. The number of random particles used to define the unclustered reference sample is typically 20 times the number of data particles. This is done to reduce statistical fluctuation due to Poisson noise in the random pair counting. In the anisotropic analysis we decompose the vector \vec{r} into a length component s and the angle θ between the pair vector and the line-of-sight direction.

2.1. Three-point correlation function
The 3pCF is defined as the joint probability of there being a galaxy in each of the volume elements dV_1, dV_2, and dV_3, given that these elements are arranged in a configuration defined by the sides of the triangle, r_1, r_2, and r_3. The joint probability can be written as

    dP_{1,2,3} = \bar{n}^3 \left[ 1 + \xi(r_1) + \xi(r_2) + \xi(r_3) + \zeta(r_1, r_2, r_3) \right] dV_1\, dV_2\, dV_3 .    (3)

The expression above consists of several parts: the sum of 2pCFs for each side of the triangle; ζ, the full three-point correlation function; and n̄, the mean density of data points. We utilise the 3pCF estimator of Szapudi & Szalay (1998),

    \zeta = \frac{DDD - 3\,DDR + 3\,DRR - RRR}{RRR} ,    (4)

where each term represents the normalised triplet counts in the data (D) and random (R) fields that satisfy a particular triangular configuration of our choice.

The 3pCF is a function of the three sides of the triangle (r_1, r_2, r_3), and additionally it may be computed in the anisotropic case, thus introducing two angles relative to the line of sight, θ_1 and θ_2. The triangular configuration parameters can be seen in Figure 1. We could imagine, for example, performing an analysis in the anisotropic 5-parameter space (r_1, r_2, r_3, θ_1, θ_2) with corresponding (20, 20, 20, 10, 10) equally spaced bins. Considering only legal triangles and symmetries, the number of possible bins is reduced well below the naive 20 × 20 × 20 × 10 × 10 = 8 × 10^5.

3. ALGORITHM DESIGN

3.1. Graph Database
A graph database does not rely on the data being described in a tabular or relational format, as with more traditional database structures. Rather, a graph database uses graph theory to store, map and query relationships of the data. Each data point is called a node and has a number of associated properties. On the right side of Figure 2 we see the main elements of the graph. In our work, the important properties of a node are: 1) whether the point is a galaxy or random; 2) any weight associated with the data point, e.g. FKP weights (Feldman, Kaiser & Peacock 1994), angular systematic weights (Ross, et al. 2012), etc.; 3) the number of neighbour points within a fixed radius; and 4) whether the data point is within a buffer region. The buffer region (4) is only required if the data has been decomposed into multiple domains, which will be discussed later. The number of neighbours (3) is required solely to facilitate dynamic memory allocation.

The graph is constructed by visiting each data point and building a list of relationships to its neighbors within a distance r_max; see the left panel of Figure 2. The list of neighbors can be obtained quickly using a kd-tree. As we can see in the right panel of Figure 2, each node relationship contains the distance information and, optionally, the angle relative to the line-of-sight direction, if anisotropic correlation analysis is required. It also contains the unique ID of the data point. In an effort to be memory efficient, we bin the distance and angular information immediately, thus turning a single- or double-precision floating-point number into an integer.

We have used the open source solution called kdtree2 from Kennel (2004): https://github.com/jmhodges/kdtree2
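As a minimal sketch of this construction (in Python rather than the paper's Fortran, with a brute-force neighbour search standing in for the kd-tree, and `build_graph` a hypothetical name), the in-memory graph might look like:

```python
import math

def build_graph(points, r_max=150.0, n_bins=100):
    """Build an in-memory graph: one entry per node, holding the unique
    IDs of all neighbours within r_max plus the *binned* pair distance.

    With ~100 bins the bin index fits in a single byte (2^8 = 256
    values), which is the memory saving described in the text.  A real
    implementation would use a kd-tree instead of this O(N^2) search.
    """
    graph = []
    for i, p in enumerate(points):
        ids, bins = [], []
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)                   # Euclidean pair distance
            if d < r_max:
                ids.append(j)                     # neighbour unique ID
                bins.append(int(d / r_max * n_bins))  # binned distance
        graph.append({"ids": ids, "bins": bins})
    return graph
```

Because the inner loop visits IDs in ascending order, each neighbour list comes out already sorted by unique ID, which the later list-matching step relies on.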
Figure 2. Left: Our definition of a node. A node is a data point (e.g. galaxy or random) that contains a list of relationships between itself and its nearest neighbors within a predefined length scale, r_max. In this example Node A has 5 relationships to other data points, but has no relationship to data points beyond r_max. Right: The structure of the graph database. Each data point is a Node containing certain properties about itself. It also contains a list of relationships; in our case this is a list of neighbor points within a distance of r_max from Node A. Each neighbor also has properties: these include the distance information to Node A and, optionally, the angular information relative to the line of sight. They also contain the unique ID of the data point.

Since the number of bins is typically of order 100, we can save it as an 8-bit integer (2^8 = 256 unique values), reducing the required memory significantly.

3.2. Querying the database: specific configurations
Unlike for more traditional databases, graph database queries work on relationships and properties thereof. As a simple example, if we want to compute the 2-point correlation function, it can be obtained immediately by counting all possible relationships where the distance property of the relationship matches our desired scale. This list of relationships can be queried further to find which ones correspond to the pairs of data (DD), the pairs of random points (RR) and the mixture of both (DR).

Specific n-tuples can be computed by querying the relationships, and relationships of relationships, such that the distances satisfy r_1, r_2, ..., r_n and that the initial and final data points have the same unique ID, thus closing the tuple.

3.3. Querying the database: all configurations
In some circumstances we may wish to compute all binned configurations of an n-tuple correlation function. As an example, in the 3pCF we would have to count all possible triplets of points with the sole criterion that r_1, r_2, r_3 < r_max. Thankfully this calculation can be performed rather efficiently by making use of a nice geometrical property of our database. Between a node and one of its relationship data points there can be defined a region containing all possible triplets associated to the original pair with configuration r_1, r_2, r_3 < r_max. This is a simple geometrical property and can be more easily visualized in the left panel of Figure 3.

In the right panel of Figure 3 we can see illustratively how the triplets are counted. We first visit each Node, e.g. Node A, and open its list of relationships. We then open the Node corresponding to the first relationship (neighboring point). This Node (e.g. Node B) also contains a list of relationships. If we match each of these lists on the unique ID of each data point, then we will have a new list containing all possible triplets, where two corners are Nodes A and B and the side lengths satisfy the constraint r_1, r_2, r_3 < r_max. The intersection of two ordered sets can be performed very quickly, thus we initially sort the neighbor lists by their unique ID. We can see why this is computationally efficient: the length of the 3rd side of the triangle does not need to be calculated, since all distances between data pairs have already been measured. This is the main feature of this algorithm: all distances are precomputed and saved into memory.

Due to the precomputation of the distances and the specific design of the graph database, it becomes trivial to go beyond the 2- and 3pCF to arbitrary order in the N-point statistics.
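The matching step above can be sketched in Python (illustrative, not the paper's Fortran implementation; `count_triplets` is a hypothetical name, and each per-node dictionary follows the ID-sorted neighbour-list layout described in the graph-construction section):

```python
def count_triplets(graph):
    """Histogram all triplets whose three sides are all below r_max.

    graph[a] = {"ids": ID-sorted neighbour list, "bins": matching binned
    distances}.  For an open pair (A, B) the third corners are exactly
    the common entries of the two neighbour lists, so the third side is
    looked up from the graph, never recomputed.
    """
    counts = {}
    for a, node_a in enumerate(graph):
        for ab, b in enumerate(node_a["ids"]):
            if b <= a:
                continue                      # visit each pair once (a < b)
            node_b = graph[b]
            ids_a, ids_b = node_a["ids"], node_b["ids"]
            i = j = 0                         # merge-walk the two sorted lists
            while i < len(ids_a) and j < len(ids_b):
                if ids_a[i] == ids_b[j]:
                    if ids_a[i] > b:          # enforce a < b < c
                        key = (node_a["bins"][ab],   # side A-B
                               node_a["bins"][i],    # side A-C
                               node_b["bins"][j])    # side B-C
                        counts[key] = counts.get(key, 0) + 1
                    i += 1
                    j += 1
                elif ids_a[i] < ids_b[j]:
                    i += 1
                else:
                    j += 1
    return counts
```

Because the triplet (a, b, c) with a < b < c is only ever opened from the pair (a, b), each triangle enters the histogram exactly once, and the bin of the third side is read from Node B's list rather than recomputed.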
This may allow us to estimate directly the covariance matrices of lower-order statistics since, for example, the covariance of the 2pCF depends on the 4-point function.
Figure 3. Left: The intersection (shaded region) between two spatially overlapping nodes, A and B, contains a list of points satisfying our triangular configuration constraints, e.g. node C. Right: The computational structure of the graph query. Node A is opened and then recursively its relations are opened. The first neighboring point is identified as Node B, which is then opened. These two open Nodes contain a list of neighbor points, which are then matched according to their unique ID. The resulting intersection contains all possible triplets satisfying the condition r_1, r_2, r_3 < r_max.

3.4. Domain Decomposition
The graph database can become memory intensive: for 10 million points and an analysis scale of r_max = 150 Mpc/h, it could reach almost 1 TB of memory. Thus if we want to query such a database quickly, it would ideally be loaded into RAM. We could imagine constructing a large database (e.g. SQL, etc.) that can be externally called, or large HDF5 files that can be optimally constructed for fast random-access calls. However, we decided to opt for a distributed memory scheme that allows different parts of the graph database to reside on many compute nodes separately. Since no compute node can hold all of the database, we use a domain decomposition to spatially partition the data nodes over N compute nodes. This can be seen in Figure 4. Each domain must also include a buffer region, since nodes outside the domain may be required for the correlation function analysis. Therefore the buffer region extends a distance r_max in each spatial direction. With this scheme we can be sure that all relevant data nodes are available in memory for the required domain correlation analysis.

3.5. Computational Structure
In Figure 5 we show the computational structure of the algorithm. The N precomputed domains are read by N compute nodes, i.e. one domain per compute node. This can be achieved using MPI as shown in the figure; or, since each domain is independent, the code could be run many times with 1 MPI thread and the user then

Figure 4. The domain decomposition scheme for our algorithm. In this 2D example the space is initially split into 25 areas containing an equal number of data points. In accounting for the range of the correlation analysis we add a buffer region (red dashed line) around the initial domain (blue region). The black circle denotes the maximum scale of our analysis, r_max.

gathers the results individually for each of the N domains. For now we will consider the MPI case. Each MPI task loads the particle data into memory and proceeds to construct a kd-tree. Once the kd-tree has been constructed, we spawn a number of OpenMP threads for parallel computation. The kd-tree is then queried for every point, and its neighbors within a distance r_max have their distances computed and, optionally, the angle between the connecting vector and the line-of-sight direction. The distance and angle can
Figure 5. The overall computational structure of our algorithm (from top to bottom). The data is initially split into N files according to our domain decomposition scheme. Each domain is then loaded into memory on a separate compute node. Each node then opens several OpenMP threads for distributing the tasks of graph database construction and graph querying. The OpenMP threads are reduced, collecting the local pair and triplet counts; then the MPI processes are collected on the master node for the final calculation and result output.

be saved for each neighbor of each data point as an 8-bit integer in a structure, as shown in the right panel of Figure 2. This is the construction of the graph database. Once the graph database has been constructed, the code proceeds to query the graph and measure all the triplet configurations, as discussed above. Again we invoke a number of OpenMP threads to speed up the calculation on each compute node. The MPI tasks are then reduced to one master node where the
DDD, DDR, DRR, RRR counts are each combined together and normalized by the total possible number of triplet counts (including weights).

In this work we present results from our own custom graph database written in modern FORTRAN, implementing both OpenMP and MPI protocols. However, we have also tested the popular open source graph database Neo4j (https://neo4j.com), finding similar performance.

4. TESTING AND BENCHMARKING

For the purposes of benchmarking we focus on small scales and count all possible triangular configurations within r_1, r_2, r_3 <
30 Mpc. We use 4 random data samples with varying numbers of data points, N_data, ranging up to several million points, and varying densities, while keeping the volume constant at 1 Gpc^3. In Figure 6 we compare our new algorithm (black points) with the kd-tree algorithm, KSTAT (red points). The new algorithm is significantly faster than KSTAT, and the scaling also shows a gentler slope with the number of data points, following an approximate power-law relation in N_data.

The reason for the discrepancy between the two algorithms is the efficient information handling of the graph database. In the case of KSTAT, the algorithm must compute the anisotropic angles for every pair and triplet of data. The graph database, however, only requires this calculation to be performed on pairs of data points; to compute 3-point statistics, it simply retrieves the relevant information from the database of relationships, as described in §3.

4.1. Scalability
We investigate the performance of the algorithm with increasing parallelisation. Adopting a hybrid MPI + OpenMP scheme, we perform a test using multiple Intel Xeon Phi 7250 nodes, each comprising 68 computational cores and 96 GB of RAM. Thus it is natural to set the number of OpenMP threads to 68 and create (and query) the graph database for each domain on separate compute nodes, as illustrated in Figure 5.

In the right panel of Figure 6 we show how the wall clock time scales with the number of processors. We find reasonable scaling up to several thousand processors. The departure from 1-to-1 scaling is due to imperfect load-balancing, i.e. the distribution of work over the individual processors. In galaxy clustering it is impossible to know a priori how much computational time will be required for different parts of the data, because of the complex survey geometry and the spatial clustering of the data itself. Thus, if we want to avoid significant interprocessor communication, we rely on an initial domain decomposition scheme, which was described earlier in §3.4.

4.2. Example case I: 3-point BAO
We now investigate the performance of the algorithm on large scales. One of the goals of modern cosmology is to determine the expansion history of the Universe through distance measurements, which can be obtained by identifying the BAO 'bump' in the 2pCF (Eisenstein, et al. 2005; Anderson et al. 2012). It is also expected that this feature should be present in the higher-order statistics. We consider Data Release 12 of the Sloan Digital Sky Survey's (SDSS; Eisenstein et al. 2011) Baryon Oscillation Spectroscopic Survey Constant Stellar Mass

Figure 6.
Left: The computational wall clock time for a single call of the 3pCF is shown as a function of the number of data points. The red points are computed using 4 MPI threads with KSTAT, while the black points are from the graph database code. The blue dashed line corresponds to a power-law relation in N. Right: The computational speedup scaling with the number of processors. An ideal case with perfect load-balancing would be expected to give 1-to-1 scaling (blue dashed). The graph database produces reasonable scaling up to several thousands of processors (red stars).
The computational speedup scaling with the number ofprocessors. An ideal case with perfect load-balancing would expect to give 1 to 1 scaling (blue dashed). The graph databaseproduces reasonable scaling up to several thousands of processors (red stars). (CMASS; Bolton et al. 2012; Dawson et al. 2013; Alamet al. 2015) sample . Specifically we use their pub-licly released observational catalogues for the North-ern Galactic Patch ( i.e. galaxy DR12v5 CMASS North )and 150 Quick-Particle-Mesh (QPM; White, Tinker &McBride 2014) mock galaxy catalogues that mimic theobservational geometry and selection effects. In the ob-servational sample and in each mock catalogue there areapproximately 500,000 galaxies and 2 million randompoints over the redshift range 0 . < z < .
7. In trans-forming from redshift space to comoving Cartesian co-ordinates we adopt a flat ΛCDM cosmology model with h = 0 . m = 0 . r , r , r <
200 Mpc. This information is difficult to display in two dimensions, so for clarity we can look at slices or projections through this space. In the right panel of Figure 7 we display the equilateral configurations for each mock individually (red lines), for the mean of the mock catalogues (black dashed line) and for the observational sample (black circles). Although there is significant scatter among the mocks, the BAO 'bump' is clearly visible in the mean of the mock samples and also in the observational sample.

The catalogues are available at https://data.sdss.org/sas/dr12/boss/lss/

Each galaxy catalogue DDD count took just 515 seconds on a 68-core Intel Xeon Phi processor, although the full calculation of the RRR counts took considerably longer.

4.3. Example case II: 4-point function
Following hot on the heels of Fry & Peebles (1978), we compute the 4-point correlation function from the mocks and observational catalogues presented in the previous section. For simplicity we adopt the following estimator for the 4pCF,

    \eta(r_1, r_2, r_3, r_4) = \frac{DDDD}{RRRR} - 1 ,

with r_1, r_2, r_3, r_4 for each side of the quadrilateral. In general the sides may be vectors, and the 4 points may not occupy a single plane in 3D, in which case the most general quadruplets have skewed quadrilateral configurations. In this example we restrict ourselves to general equi-sided quadrilateral configurations, varying the side length from 0 < r < 30 Mpc in 10 equally spaced bins. We proceed to measure the 4pCF in the observational catalogue and in each of the 150 mock galaxy catalogues.
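Evaluating such an estimator from raw quadruplet counts can be sketched as follows (illustrative Python; `eta_4pcf` is a hypothetical name, and the normalisation by the number of ordered quadruplets in each catalogue is our assumption, since the paper's exact normalisation convention is not stated):

```python
def eta_4pcf(dddd, rrrr, n_data, n_rand):
    """4pCF estimator eta = DDDD/RRRR - 1 for one binned configuration.

    dddd, rrrr : raw quadruplet counts from the data and random fields.
    Each count is divided by the total number of ordered quadruplets in
    its catalogue, so catalogues of different sizes are comparable.
    """
    norm_d = n_data * (n_data - 1) * (n_data - 2) * (n_data - 3)
    norm_r = n_rand * (n_rand - 1) * (n_rand - 2) * (n_rand - 3)
    return (dddd / norm_d) / (rrrr / norm_r) - 1.0
```

An unclustered field gives η = 0, since the normalised data and random counts then agree; any excess of data quadruplets over the random expectation yields η > 0.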
Figure 7. The large-scale 3pCF is shown for 150 mock SDSS DR12 CMASS samples. Left: The full space of r_1, r_2, r_3 < 200 Mpc with 10 Mpc binning. The colored points denote the mean 3pCF value of the mocks for specific binned triplet configurations. Right: The equilateral 3pCF for each mock catalogue (red) and for the mean of the 150 mock catalogues (black). Although there is significant scatter among the mock samples, the mean is very smooth and exhibits a clear BAO 'bump' at ∼150 Mpc.

Figure 8.
The equilateral 3pCF (ζ) and the box 4pCF (η) are shown on scales 0 < r < 30 Mpc.

In Figure 8 we show the equilateral 3pCF and the box 4pCF up to 30 Mpc. It is clearly noticeable that the 4pCF of the mocks and the observational galaxies are in tension, while the 3pCF shows much better agreement. This is not too surprising given the approximate nature of the QPM method, which is already known to break down for 3rd-order statistics on mildly non-linear scales (Kitaura, et al. 2016). The computational time required for one catalogue was approximately 1 minute on 4 OMP threads, which included 1.8 s to construct the graph, 9 s to query the required triplet counts and a further 40 s to obtain the quadruplets for the 4pCF.

5. CONCLUSIONS

We present a fast algorithm for the computation of all possible triplet configurations with r_1, r_2, r_3 < r_max of a discrete point set. Through benchmarking we demonstrate that the algorithm scales well with increasing numbers of data points, up to millions, and can easily handle current cosmological data sets. We show reasonable parallel scalability through initial domain decomposition and load-balancing, although there may be room for improvement with a dynamical load-balancing scheme.

The BAO at 3rd order is presented, showing the visual BAO peak structure in both mocks and observational data. As it was not the primary aim of this work, we save interpretation of the BAO signal for future work. We also show for the first time the 4pCF of SDSS galaxies. Again, as it was not our primary goal to make cosmological inferences, we will leave the interpretation of the 4pCF to forthcoming work.

We optimized the code to run on the Cray CS500 system Nurion at the Korea Institute of Science and Technology Information (KISTI), which comprises 570,020 compute cores.
The code, which we have run on single catalogues of the SDSS BOSS DR12 sample containing ∼500,000 galaxies, is publicly available as GRAMSCI (GRAph Made Statistics for Cosmological Information; https://bitbucket.org/csabiu/gramsci), under a GNU General Public License.

ACKNOWLEDGEMENTS

We thank Mijin Yoon for useful discussion and comments on the manuscript. We use the Nurion supercomputing cluster at the Korea Institute of Science and Technology Information (KISTI). CGS acknowledges financial support from the National Research Foundation (NRF).