Finding Wombling Boundaries in LHC Data with Voronoi and Delaunay Tessellations
Konstantin T. Matchev, Alexander Roman, Prasanth Shyamsundar
PPrepared for submission to JHEP
Finding Wombling Boundaries in LHC Data withVoronoi and Delaunay Tessellations
Konstantin T. Matchev, Alexander Roman, Prasanth Shyamsundar
Institute for Fundamental Theory, Physics Department, University of Florida, Gainesville, FL32611, USA
E-mail: [email protected] , [email protected] , [email protected] Abstract:
We address the problem of finding a wombling boundary in point data gen-erated by a general Poisson point process, a specific example of which is an LHC eventsample distributed in the phase space of a final state signature, with the wombling bound-ary created by some new physics. We discuss the use of Voronoi and Delaunay tessellationsof the point data for estimating the local gradients and investigate methods for sharpeningthe boundaries by reducing the statistical noise. The outcome from traditional womblingalgorithms is a set of boundary cell candidates with relatively large gradients, whose spa-tial properties must then be scrutinized in order to construct the boundary and evaluateits significance. Here we propose an alternative approach where we simultaneously formand evaluate the significance of all possible boundaries in terms of the total gradient flux.We illustrate our method with several toy examples of both straight and curved boundarieswith varying amounts of signal present in the data.
Dedicated to B. P. Delaunay. a r X i v : . [ h e p - ph ] J un ontents ρ = 5 ρ = 1 . – 1 – Introduction
Collider data in high energy physics can be viewed, at least at the parton level, as a collectionof points { (cid:126) x , (cid:126) x , . . . , (cid:126) x N } in the relevant phase space P of the final state signature. Theultimate goal of a high-energy physics experiment is then to test whether the distribution ofthose N points (commonly referred to as “events”) in P follows the probability distributionpredicted in some theory model by the fully differential cross-section dσd(cid:126) x ≡ f ( (cid:126) x , { α } ) , (1.1)where (cid:126) x ∈ P is a particular phase space point and { α } is a set of input model parameterssuch as particle masses, widths, couplings, etc. More specifically, in searches for new physics(NP) beyond the standard model (SM), eq. (1.1) can be split into respective SM and NPcontributions f ( (cid:126) x , { α } ) ≡ f SM ( (cid:126) x , { α SM } ) + f NP ( (cid:126) x , { α SM } , { α NP } ) , (1.2)where { α SM } and { α NP } label respectively the set of SM parameters and the set of addi-tional NP parameters. Let the corresponding regions of phase space populated by SM andNP events be P SM and P NP , then the respective total cross-sections are given by σ SM = (cid:90) P SM f SM ( (cid:126) x , { α SM } ) d(cid:126) x , (1.3a) σ NP = (cid:90) P NP f NP ( (cid:126) x , { α SM } , { α NP } ) d(cid:126) x . (1.3b)Since the standard model is well known, its distribution f SM , a.k.a. “the background”, iscalculable theoretically (up to some fixed order in perturbation theory). However, once weaccount for the experimental realities, the result may suffer from non-negligible systematicuncertainties, particularly in the case of challenging signatures involving QCD and/or re-ducible backgrounds. This is the main roadblock in NP searches via counting experiments,where one focuses on a suitably chosen “signal region” P SR ⊂ P NP and looks for an excessover the SM expectation (cid:82) P SR f SM ( (cid:126) x , { α SM } ) d(cid:126) x . Instead, in this paper we shall consider methods which could allow us to infer, atleast in principle, the existence of the NP contribution f NP without any prior knowledge of the SM prediction f SM . Recently there has been hightened interest in such “blind” or“background-independent” searches for NP, particularly using machine learning techniques[1–15]. Here, instead of looking for an excess in the signal region P SR , we shall follow upon the idea of Refs. [16, 17] to target directly the boundary ∂ P NP of the NP phase spaceregion P NP , by using the fact that the combined distribution (1.2) is non-differentiableanywhere on ∂ P NP where f NP is non-vanishing. As it turns out, the latter is a very safe In this paper, we have in mind the typical NP scenarios where P NP is a subset of P SM , P NP ⊆ P SM .This is certainly the case for signatures where P SM consists of the full phase space, P SM = P . If by anychance the signature happens to be such that P SM ⊂ P NP , the selection of the signal region is trivial: P SR = P NP − ( P NP ∩ P SM ) , and a counting experiment should already be good enough. – 2 –ssumption — if anything, f NP is not only non-vanishing, but often enhanced and even singular on the boundary ∂ P NP [18–27]. Since the background distribution f SM is a smoothfunction across ∂ P NP , the presence of f NP creates a discontinuous “jump” in the combinedevent density (1.2), precisely at the location of the boundary ∂ P NP [16, 17]. We can thusreformulate the original problem of finding evidence of NP in the collider data as follows:Given a collection of N points { (cid:126) x , (cid:126) x , . . . , (cid:126) x N } in the phase space P , iden-tify (the locations of) any candidate wombling boundaries and estimate theirstatistical significance.The detection of such difference boundaries (or wombling boundaries, named after a pioneerin the field, W. H. Womble [28]) is a well-known problem in spatial statistics, see, e.g., [29].Broadly speaking, wombling is any of a number of techniques used for identifying zones ofrapid change, typically in some quantity as it varies across some geographical or Euclideanspace. Wombling techniques are being applied in a wide variety of disciplines, includingcomputational ecology, anthropology, linguistics, geography and many others. In our casehere, we shall be interested in identifying the phase space region in the vicinity of ∂ P NP where the density of points is changing significantly.Before going through the typical steps of a wombling analysis, several comments arein order. First, the discontinuous jump of the combined distribution (1.2) across ∂ P NP inpractice will be smoothed out to some extent by the detector resolution and finite widtheffects, leading to a well-defined, finite, density gradient everywhere in P . This precludesus from using methods specifically designed to detect image discontinuities such as ridgesand cliffs but do not admit gradients [40]. Second, a good wombling method should also beable to pick up any boundaries created within P SM by interesting SM subprocesses, e.g.,top, Higgs or heavy gauge boson production. This would guarantee an opportunity forthe LHC experiments to start testing and validating the method with existing real data,before any NP discovery. Third, while most applications of wombling in the literature havebeen limited to two-dimensional data, the method shoud be readily generalizable to higherdimensions, if it is to be of any interest to the high-energy physics community where thedimensionality of the relevant phase space P is typically much higher (although in somespecial cases it can be reduced to 2 or even 1 through suitable projections preserving theboundary). Along those lines, it is also important to choose a good parametrization of P , sothat the dimensionality can be reduced by projecting out uninteresting degrees of freedomwithout washing out the wombling boundary.The main steps of a typical wombling analysis are the following [41]. • Data preparation and preprocessing.
The starting point in wombling is a spatiallyreferenced dataset ( (cid:126)x i , f i ) , i = 1 , , . . . , N, (1.4) For example, wombling has been used to identify genetic boundaries in Eurasian human populations[30, 31], language boundaries in Europe [32], transition zones in genetic, morphometric and physiologicalcharacteristics [33], boundaries of different types of vegetation [34, 35], hospital admission rates for respira-tory conditions [36], cancer rates [37], metal concentrations in the Swiss Jura [38] and other environmentaldata [39]. – 3 –here a set of values { f i } for the function of interest f ( (cid:126)x ) are obtained at some finitenumber N of point locations { (cid:126)x i } . In some applications, e.g., for aerial and remotelysensed images, the locations { (cid:126)x i } can be chosen by the experimenter — then it maybe convenient to arrange them in some kind of a regular lattice, as required by somewombling algorithms, including the original proposal in [28]. Alternatively, whenthe data are gathered by an irregular or random design, eq. (1.4) is known as point-referenced or geostatistical data. While in many other fields of science the functionalvalues { f i } for geostatistical data are obtained directly from field observations, forthe Monte Carlo simulations used in high-energy physics the situation is more subtle— each f i is supposed to be a measure of the local point density at (cid:126)x i and needsto be evaluated as a preprocessing step. For this purpose, Refs. [16, 17] proposed toconsider the Voronoi tessellation in P of the dataset (1.4), since the geometric volume v i of the Voronoi cell containing (cid:126)x i provides a natural local estimator of the pointdensity at (cid:126)x i [42]: f i ∼ v i . (1.5)While Voronoi tessellations have been widely used in many other fields of science,they seem to be underutilized in high energy physics where their application has beenlimited to jet clustering [43] and the partitioning of the signature phase space intosearch regions, as implemented in the SLEUTH algorithm [44–46] which was used toperform model-independent new physics searches at D0 [47–49], HERA [50] and CDF[51–53]. Yet, the Voronoi approach is ideally suited for finding interesting (e.g.,singular) features in f ( (cid:126)x ) , since it preserves the maximum spatial resolution in thedata [54]. For example, the standard approach of binning the data in order to obtain alocal density estimate necessarily throws away a certain amount of useful information,and is associated with some arbitrariness in the exact choice of binning [16]. • Gradient estimation.
Once we have the point-referenced dataset (1.4), the next step isusually to estimate the magnitude of the local gradient (cid:126) ∇ f of the function f ( (cid:126)x ) , sinceany zone of rapid change is necessarily associated with large values for | (cid:126) ∇ f | . In cal-culating the gradient, one has to overcome the fact that the data is a) discrete and b)irregularly sampled. One possible approach is to obtain a continuous approximationfor f ( (cid:126)x ) via some spatial interpolation method [42, 55]. Unfortunately, interpolationtechniques tend to smooth out not only the noise but also small local discontinuities,which can result in masking some true boundaries [34]. For this reason, Refs. [16, 17]explored several boundary detection techniques (further developed and illustrated in[22, 23, 56]) which continued to use the Voronoi tessellation of the data and the funda-mental relation (1.5). Among the different options studied in [16, 56], the normalizedstandard deviation (sometimes also called the coefficient of variance) of the volumes ofthe neighboring cells emerged as a viable measure of the magnitude of the local gradi- In the context of treating high energy collider data as a point dataset (1.4), it is worth mentioningthe recent idea of Ref. [14] to consider { (cid:126)x i } as a graph network of weighted nodes, which is somewhatorthogonal, but similar in spirit to the Voronoi approach. – 4 –nt within a given Voronoi cell. In this paper, we shall pursue a somewhat orthogonaland more traditional approach, known in the literature as “triangulation wombling”[34, 39], which makes use of the dual Delaunay tessellation of the data (1.4). Eventhough the two types of tessellations are dual to each other , the Delaunay versionseems more natural for the specific problem at hand of calculating gradients — thiswill be further discussed and illustrated in Section 2 below. • Tagging selection.
Having constructed the Delaunay or Voronoi tessellation and ob-tained estimates for the local gradients, the next task is to identify elements of thetessellation (edges or vertices) which are likely to be located on or near a womblingboundary. This is typically accomplished with a cut on the ranked values of | (cid:126) ∇ f | for all elements in the tessellation, e.g., selecting elements whose calculated gradientsare in a certain upper percentile . However, using such a simple threshold for tag-ging boundary elements has been viewed as somewhat arbitrary and rather subjective[41] — in the absence of any robust guidelines, typical values used in the literaturerange in the th to th percentile [30, 31, 36, 39]. Furthermore, for any given valueof the threshold, there will always be a certain number of elements passing the cut,including elements “in the bulk”, i.e., away from any wombling boundaries, and onehas to design a prescription on how to deal with such false positives. Obviously, avalue for the cut which is too stringent will miss many true boundary elements, whilea value which is too generous will bring about a lot of false positives. Finally, giventhat the dimensionalities of P NP and ∂ P NP necessarily differ by one, the concept of a“boundary element” can be open to interpretation — how close to the boundary doesan element have to be in order to be considered a “boundary element”? • Agglomeration.
The previous step results in a collection of tagged boundary ele-ments scattered throughout P , so now the question is how to use that informationto reconstruct the complete boundary. As a first step, one can start forming sub-boundaries by linking adjacent tagged boundary elements, possibly subject to someadditional criteria, e.g., that the directions of their gradients are within ◦ of eachother [31]. This agglomeration procedure will result in a graph whose nodes are thetagged boundary elements [57]. The properties of this graph can then be studied todetermine its statistical significance [32, 34] (see the next item) and to get some ideaabout the shape of the boundary. In graph theory, the dual graph of a plane graph G is a graph that has a vertex for each face of G .Correspondingly, each edge e of G has a corresponding dual edge, whose endpoints are the dual verticescorresponding to the faces on either side of e . Some care must be exercised for the Delaunay edges alongthe convex hull of the point set — their Voronoi duals are infinite rays which can be turned into finite linesegments by adding an artificial point at infinity which serves as a common endpoint for all the rays. Inour analysis, this complication will not arise since we will perform our analysis in the interior of P . In NP scenarios where the signal density is additionally enhanced on the boundary ∂ P NP , Ref. [22]proposed a two-dimensional cut, simultaneously targeting both large gradients and large values of thefunction. The two requirements — that the gradients are large and that their directions are correlated — can beconveniently encoded in the scalar (dot) product of the gradient vectors of neighboring elements [16]. – 5 –t this point it is worth mentioning that in collider physics applications there canbe situations where the shape of the boundary is parametrically known. This isprecisely the case with “simplified model” NP searches at the LHC [58] — once theevent topology is assumed, the geometry of the NP final state phase space P NP isalso fixed. As an example, consider a sequence of three two-body decays, which isthe classic squark signature in supersymmetry (SUSY) [59]. The relevant phase space P NP is three-dimensional and can be parametrized by the invariant masses of thethree pairs of visible decay products. The equation for the boundary ∂ P NP is knownanalytically [21, 60, 61] in terms of just four parameters — the masses of the SUSYparticles participating in the decay chain. In that situation, Ref. [23] proposed tobypass both of the last two steps (the tagging and the agglomeration) altogether andinstead fit the equation for the surface ∂ P NP to the full tessellation. Operationallythis was done by computing a quantity inspired by Bayesian wombling (see nextbullet), namely, a two-dimensional surface integral of the gradient magnitude overthe boundary surface, normalized to the total area of the surface: (cid:82) ∂ P NP da | (cid:126) ∇ f | (cid:82) ∂ P NP da . (1.6)In Ref. [23], it was demonstrated that this quantity is maximized for the true valuesof the SUSY masses, resulting in a novel method for SUSY mass measurements. • Significance estimation.
Any wombling algorithm as described so far will produce nu-merical results regardless of whether a true pattern exists or not. The crucial questionnow is to assess the likelihood that the observed pattern could have been producedfrom random fluctuations instead of a true boundary. One possible approach, knownas sub-boundary statistics [32, 34], is to analyze the properties of the graph mentionedin the previous step, formed out of the tagged boundary elements. Strictly speak-ing, sub-boundary statistics tests whether the different components of the graph aresufficiently contiguous (and not whether the rates of change are sufficiently large)[29, 41]. To this end, one looks at (distributions of) quantities which would charac-terize coherent boundaries formed from connected boundary elements, such as: thetotal number of subgraphs, the number of single-node subgraphs, the maximum andthe mean of the length and/or the diameter of the subgraphs, the superfluity, etc. Analternative approach, named Bayesian wombling, starts with an ansatz for the shapeof the wombling line (in two spatial dimensions) and then computes the average fluxof the two-dimensional gradient field through all possible such lines [40, 62, 63]. Theidea is that the average flux will be maximized when the ansatz matches the truewombling boundary. As already mentioned in relation to eq. (1.6), the advantage ofthis approach is that it avoids the subjectiveness associated with the steps of taggingand agglomeration. With either approach, one has to specify a null hypothesis, inorder to quantify the confidence level. Unlike other fields of science, where the nullhypothesis may not be immediately obvious and one typically has to rely on a ran-– 6 –omization scheme [29], in high-energy physics the null hypothesis is well defined —it is the SM.In this paper we further develop and refine the Voronoi boundary detection methodsfrom Refs. [16, 22, 23, 56]. As before, the main goal will be to outline a method for discov-ering new physics in LHC collider data by identifying wombling boundaries in phase space.The paper is structured around the five typical steps of algorithmic wombling describedabove. The novel elements in the analysis presented here are the following: • In addition to the Voronoi tessellations utilized in [16, 22, 23, 56], here we also con-sider Delaunay tessellations. In Section 2 we shall briefly review the two types oftessellations and outline the range of new possibilities offered by the use of a Delau-nay tessellation for our purposes. In particular, in Section 2.3 we shall illustrate thefour possible types of boundary elements, discuss their relations to each other, andhow each can be potentially targeted in the tagging step of the wombling algorithm. • Unlike previous work in high-energy physics, here our main tool for calculating thelocal gradients will be the Delaunay tessellation (often referred to as “triangulation”since in two dimensions the Delaunay polygons are triangles). Section 3 is devoted tothe topic of gradient estimation — after a brief review in Section 3.1 of previous workon estimating gradients from a Voronoi tessellation, in Section 3.2 we shall describethe gradient calculation from the Delaunay triangulation. In the process, we shall payspecial attention to techniques for reducing the random fluctuations in the obtainedgradient values — three such procedures and their interplay and optimization arediscussed in Section 3.3. • Different techniques for tagging boundary elements will be discussed in Section 4. Inaddition to tagging Voronoi cells [16, 22], here we shall also be interested in taggingVoronoi edges, Delaunay cells and Delaunay edges as well. Although this is not ourmain focus here, in Section 5 we shall illustrate how these tagging methods can beused for agglomeration. • The main results of the paper are presented in Sections 6-8. We follow the approachof Bayesian wombling [40, 62, 63], which lets us avoid the intricacies and uncertaintiesof the tagging and agglomeration steps. To gain some intuition, in Section 6 we firstgo over a toy example where the function f ( (cid:126)x ) can be sampled continuously froma distribution which resembles real data including finite width effects and detectorresolution. Then in Section 7 we study point-referenced data of the type (1.4) with astraight (Section 7.1) or circular (Section 7.2) boundary. The corresponding estimatesof the statistical significance of the obtained wombling boundaries are performed inSection 8. – 7 – Voronoi and Delaunay tessellations of point data
Virtually all wombling studies in the literature have been concerned with two-dimensional point data generated, e.g., from field samples taken within a certain geographical area, fromremotely sensed images, etc. In this paper, we shall continue to work in two-dimensions,but this will be done only for clarity of the presentation, since it is difficult to visualizeVoronoi and Delaunay tessellations in more than two dimensions; the methods which weshall describe will be applicable to higher dimensional data as well. To be specific, weshall consider the Cartesian plane where the data points are specified by their coordinates ( x i , y i ) , so that the dataset (1.4) reduces to ( x i , y i , f i ) , i = 1 , , . . . , N. (2.1)For concreteness, we shall choose our field of view to be the unit square, ≤ x ≤ , ≤ y ≤ , although data will be generated beyond the boundaries of the unit square — thiswill eliminate any spurious boundary effects like clipping which would modify the statisticalproperties of the Voronoi cells near the boundaries [64]. Following Refs. [16, 17, 22], thedatasets will be generated according to (1.2) with the following assumptions for f SM and f NP : • Background.
As in previous work [16, 17, 22], our proxy for the SM backgrounddistribution f SM will be the uniform distribution f SM = constant. (2.2)The exact value of the constant will depend on the normalization: for pure-backgroundsamples within the unit square the constant is 1, while for background plus signalsamples, it will depend on the relative strength of the signal. Strictly speaking, theassumption (2.2) is unrealistic from the point of view of a high energy physicist, since f SM is in general a function of the kinematic variables parametrizing the phase space P . Nevertheless, it is good enough for our purposes here — the important point isthat any realistic SM distribution is very weakly varying across the boundary ∂ P NP ,which justifies the use of (2.2) for our model-independent toy examples below. Atypical background distribution of N = 500 points within the unit square is shown inthe left panels of Figs. 1 and 2. It is evident that such a distribution does not haveany obvious features and any wombling boundary would have to be created purely bychance. • Signal with a flat boundary.
As our first example of a hypothesized NP signal we shallconsider a distribution f NP populating a region P NP with a flat boundary. Againfollowing Refs. [16, 17, 22], we shall take the boundary to be the vertical line at x = 0 . and the corresponding signal distribution f line to be flat and non-zero onlyto the left of the boundary: f line = 2 H (0 . − x ) , (2.3)– 8 – igure 1 . Typical simulated point data sets used in our studies of straight-line boundaries. Herewe show N = 500 points within the unit square, distributed according to: background only (leftpanel); background with an additionally injected signal (2.3) in the left half-plane with ρ = 1 . (middle panel) or ρ = 5 (right panel). where H is the Heavyside step function. When adding this signal to the backgrounddistribution, it is important to specify the mixing ratio. For this purpose, Refs. [16,17, 22] introduced a parameter ρ which measures the ratio of the event densities dnda on the two sides of the boundary : ρ ≡ lim x → . − (cid:18) dnda (cid:19) lim x → . + (cid:18) dnda (cid:19) . (2.4)Combining (2.2) and (2.3), the unit-normalized total distribution (1.2) on the unitsquare for a signal with a flat boundary reads [16] f = 2 ρ + 1 (cid:104) ρ H (0 . − x ) + H ( x − . (cid:105) . (2.5)In the middle and right panel of Fig. 1 we show distributions of N = 500 pointsaccording to (2.5) with ρ = 1 . and ρ = 5 , respectively. In the latter case, the valueof ρ is sufficiently large that the boundary is clearly visible with the naked eye. As inRefs. [16, 17, 22] (which considered an even more extreme value of ρ = 6 ), the case ofrelatively large ρ is meant mostly for illustration — it makes it easier to visualize thebenefits from the various wombling and denoising techniques introduced below. Ourreal target will be the case of relatively low values of ρ as shown in the middle panelof Fig. 1, where it is rather difficult to discern any apparent wombling boundary. • Signal with a circular boundary.
For completeness, we shall also consider an exampleof a signal in a domain bounded by a curvilinear boundary. Following [22], we shall Since we are interested in detecting a wombling boundary, the parameter ρ as defined here is moresuitable than the more familiar ratio S/B of signal to background inside the signal region P SR . Note thatwith our setup, the two are related as S/B = ρ − . In order to declutter the notation, in what follows we shall omit the prefactor of H ( x ) H (1 − x ) H ( y ) H (1 − y ) which confines us to the unit square. – 9 – igure 2 . The same as Fig. 1, but for the circular signal (2.6) with R = 0 . . take the signal distribution f circle to be confined to a circular region of radius R < . centered at ( x, y ) = (0 . , . : f circle = 1 πR H (cid:16) R − (cid:112) ( x − . + ( y − . (cid:17) , (2.6)so that the combined total distribution (1.2) becomes f = 1( ρ − πR + 1 (cid:104) ρ H (cid:16) R − (cid:112) ( x − . + ( y − . (cid:17) + H (cid:16)(cid:112) ( x − . + ( y − . − R (cid:17)(cid:105) , (2.7)with the ρ parameter suitably defined in analogy to (2.4) as the ratio of point densitiesacross the circular boundary. The middle and right plots in Fig. 2 depict typicaldistributions of N = 500 points according to (2.5) with ρ = 1 . and ρ = 5 , respectively.As before, the boundary for ρ = 5 is clearly visible, but the case of ρ = 1 . appearsmuch more challenging. The Voronoi and Delaunay tessellations of the point data sample in the right panel of Fig. 1are illustrated in Fig. 3. In the left panels, which show the Voronoi tessellation, the datapoints appear as dots, while in the right panels, which illustrate the Delaunay tessellation,the data points are located at the vertices of the Delaunay triangles and are not explicitlyshown.As illustrated in the left panels of Fig. 3, a Voronoi tessellation of N points (oftenreferred to as generators ) in the plane is constructed as follows (see, e.g., [42]). Everylocation in the plane is assigned to the closest member of the point set. If a locationhappens to be equally close to two (or more) generator points, it is assigned to all ofthose points; all such eqidistant locations form the edges of the Voronoi graph. The set oflocations assigned to a given member of the point set forms the Voronoi cell correspondingto that generator point; as seen in Fig. 3, the Voronoi cells in the plane are polygons, withthe corresponding generator point located somewhere in the polygon’s interior, but notnecessarily at its geometric center. Note that Voronoi polygons have different shapes and– 10 – igure 3 . The Voronoi and Delaunay tessellations of a typical point data distributed according toeq. (2.5) with N = 500 and ρ = 4 . The left (right) panels show the Voronoi (Delaunay) tessellation.The vertical blue dashed line marks the theoretical boundary line at x = 0 . . The yellow-shadedcells in the top panels illustrate the boundary cells defined in Sec. 2.3 for the Voronoi and Delaunaycase, respectively. In the bottom panels, the tessellations are outlined with dotted lines while thesolid lines represent the boundary Voronoi edges (red lines) and their dual Delaunay edges (greenlines) as defined in Sec. 2.3. sizes; in particular, the number of edges of a polygon varies greatly, the average numberbeing no more than six [42]. An endpoint of a Voronoi edge is called a Voronoi vertex ;alternatively, a vertex may be defined as a point shared by three (or more) Voronoi edges.When each vertex belongs to three and only three edges, the Voronoi tessellation is non-degenerate; as seen in Fig. 3 this will be our case as well, since the probability of generatinga degenerate vertex in Monte Carlo sampled data is vanishingly small.The right panels in Fig. 3 depict the corresponding Delaunay tessellation of the samedata. Since the Voronoi and Delaunay tessellations are dual to each other, one way toconstruct the Delaunay tessellation is to start from the Voronoi diagram and join all pairsof generator points whose Voronoi polygons share a common Voronoi edge. If the Voronoitessellation is non-degenerate, each Voronoi vertex belongs to exactly three Voronoi edges,– 11 –hich in turn define a triangular polygon in the Delaunay tessellation (see Fig. 3); forthis reason the Delaunay tessellation is often referred to as a triangulation . The describedprocedure also manifestly pairs up all Voronoi edges with their corresponding dual edgesin the Delaunay tessellation; if each such pair of dual edges from the two tessellations hasa common point, i.e., the dual edges cross each other, the Delaunay triangulation is knownas Pitteway triangulation [65, 66]. As we shall see explicitly below in Fig. 4, our datasetswill generally not lead to Pitteway triangulations; so it will be important to keep in mindthat some dual pairs of edges may be slightly offset and not intersect each other.In what follows, our discussion will often switch back and forth between the two types oftessellations, so at this point it may be useful to build some intuition by recapping some ofthe relationships between the constituent objects of the Voronoi and Delaunay tessellationsfor a dataset consisting of N points: • Each data point defines both a Voronoi polygon and a corresponding Delaunay vertex;the total number of Voronoi polygons or Delaunay vertices is thus N . The numberof sides of a Voronoi polygon is equal to the number of Delaunay edges joining at thecorresponding Delaunay vertex. • Each Voronoi edge has a dual Delaunay edge; the total number of Voronoi edges istherefore equal to the total number of Delaunay edges and is on the order of, butslightly less than, N [42]. • Each Voronoi vertex defines a corresponding Delaunay triangle; the total number ofsuch objects is on the order of, but slightly less than, N [42].In summary, we have the following duality relations between the elements of the Voronoiand Delaunay tessellations: Voronoi cell ←→ Delaunay vertex , (2.8a) Voronoi edge ←→ Delaunay edge , (2.8b) Voronoi vertex ←→ Delaunay triangle . (2.8c)We can formalize these relations by introducing some notation. Let us use Latin indicesto label elements from the Voronoi tessellation and Greek indices to label elements fromthe Delaunay triangulation. Then let { P i } be the set of generator points, { V i } be the setof Voronoi cells, { V ij } be the set of Voronoi edges and { V ijk } be the set of Voronoi vertices.Note that each Voronoi edge can be uniquely identified by the labels i and j of the pair ofVoronoi cells which it separates, and similarly, for non-degenerate tessellations, a Voronoivertex V ijk is labelled by exactly three indices since it is the meeting point of the edges V ij , V jk and V ki . Also let N i be the number of edges (or equivalently, neighboring polygons)of the i th Voronoi cell V i . With regards to the Delaunay triangulation, let { D α } be the There are alternative methods to construct the Delaunay triangulation, e.g., using the property thatthe interiors of the circumcircles of Delaunay triangles are empty circles, i.e., contain no points from thedataset. – 12 –et of Delaunay triangles, { D αβ } be the set of Delaunay edges and { D α α α ...α Ni } be theset of Delaunay vertices. As before, each edge D αβ can be identified by the labels α and β of the Delaunay triangles which it separates, and each vertex can be identified by thelabels { α , α , ..., α N i } of the N i Delaunay triangles which are sharing it. Then the dualityrelations (2.8) can be written as P i ←→ V i ←→ D α α α ...α Ni , (2.9a) V ij ←→ D αβ , (2.9b) V ijk ←→ D α , (2.9c) N i ←→ N α = 3 . (2.9d)Note the trade-off in complexity — in the Voronoi case, vertices are labelled with exactly 3indices, but the number of polygon edges N i varies, while in the Delaunay case, the numberof polygon edges is always 3, but vertices are labelled with a varying number of indices. Having described the Voronoi and Delaunay tessellations of the data, before proceeding tothe next two stages of gradient computation and tagging, it is worth pausing for a momentto discuss which elements of the tessellation are best suited for describing a womblingboundary. In Fig. 3 the theoretical boundary at x = 0 . was marked with a vertical bluedashed line, but this was done only to guide the eye, since in reality both the existenceand location of the boundary will be a priori unknown. In practice, we need to identifyindividual elements of the tessellations located at (or close to) the boundary which couldbe targeted by the wombling analysis.By including the Delaunay tessellation into our discussion, we obtain three new pos-sibilities in addition to the approach of Refs. [16, 17]. The four panels in Fig. 3 illustratethese four options: • Boundary Voronoi Cells (BVCs) , shown in the upper left panel in Fig. 3. Whenworking with the Voronoi tessellation, this is the most natural and perhaps onlyoption. A Boundary Voronoi Cell was defined as any Voronoi cell which is crossedby the theoretical boundary [16, 17]. In the upper left panel of Fig. 3 and in the leftpanel of Fig. 4, the BVCs are shaded in yellow. • Boundary Delaunay Triangles (BDTs) , shown in the upper right panel in Fig. 3. Whenworking with the Delaunay triangulation, in analogy we can now define a BoundaryDelaunay Triangle to be any Delaunay triangle which is crossed by the theoreticalboundary. In the upper right panel of Fig. 3 and in the right panel of Fig. 4, theBDTs are shaded in yellow. • Boundary Delaunay Edges (BDEs) , shown in the lower right panel in Fig. 3. Similarly,we can define a Boundary Delaunay Edge to be any Delaunay edge which is crossedby the theoretical boundary. In the lower right panel of Fig. 3 and in the two panelsof Fig. 4, the BDEs are indicated with green lines.– 13 – igure 4 . Left: The Voronoi tessellation of the data shown in Fig. 3 illustrating the relationshipsbetween Boundary Voronoi Cells (yellow-shaded), Boundary Voronoi Edges (red solid lines) andBoundary Delaunay Edges (green dashed lines). Right: Delaunay tessellation of the data shown inFig. 3 illustrating the relationships between Boundary Delaunay Triangles (yellow-shaded), Bound-ary Voronoi Edges (red solid lines) and Boundary Delaunay Edges (green solid lines). • Boundary Voronoi Edges (BVEs) , shown in the lower left panel in Fig. 3. Finally,using the duality relation (2.8b) we can define a Boundary Voronoi Edge to be anyVoronoi edge which is dual to a Boundary Delaunay Edge. In the lower left panel ofFig. 3 and in the two panels of Fig. 4, the BVEs are indicated with red lines.Note that the last three options all rely on the Delaunay triangulation and would not havebeen possible if we were only considering the Voronoi tessellation of the data.Given the duality relations (2.8) between the Voronoi and Delaunay tessellations, thedifferent categories of boundary objects defined above are related to each other. Theserelationships are exhibited in Fig. 4, where we simply superimpose some of the results fromFig. 3 in order to better see the existing correlations. Figs. 3 and 4 demonstrate that allfour definitions lead to a contiguous set of boundary objects strung along the theoreticalboundary. Now the question becomes how to tag these boundary objects with a suitablealgorithm using their geometric properties.
As discussed in the Introduction, the Voronoi tessellation provides a natural estimate (1.5)for the values of the function f at the location of each generator point. In our case, sincewe are dealing with a two-dimensional dataset (2.1), eq. (1.5) reduces to f i ∼ a i , (3.1)where a i is the area of the i th Voronoi cell V i . The identification (3.1) is pictorially illustratedin Fig. 5 for the point data examples from Fig. 1. Unfortunately, the area a i by itself does– 14 – igure 5 . Pictorial illustration of the identification (3.1) for the case of the three point datasetsshown in Fig. 1. The Voronoi cells are color-coded by their inverse areas with warm (cold) colorsindicating large (small) values of f . not tell us anything about the gradient of the function f ( x, y ) — for this purpose, we needto compare a i to the areas of the surrounding Voronoi cells. Let N i = { j , j , . . . , j N i } bethe set of indices labelling the neighboring Voronoi cells, i.e., the Voronoi cells sharing anedge with V i . By taking the neighboring cells one at a time, j ∈ N i , one can computedirectional derivatives ( ∇ ˆ n ij f ) i in the direction of the j th neighboring cell, i.e., along theunit vector ˆ n ij = 1 (cid:112) ( x j − x i ) + ( y j − y i ) ( x j − x i , y j − y i ) , as [16] ( ∇ ˆ n ij f ) i = ( a i a j ) f j − f i (cid:112) ( x j − x i ) + ( y j − y i ) , (3.2)where the prefactor of ( a i a j ) was included to make the directional derivative dimensionless.Since each Voronoi cell V i has a varying number of edges N i ≡ |N i | , there will be a differentnumber of directional derivatives available at each point i , but they can all be fitted to theexpected distribution from the true gradient, thus producing an estimate (cid:126)G i of the gradientvector at the i th generator point P i [16]. Another variable explored in Ref. [16] was therelative standard deviation ¯ σ i of the areas of the neighboring cells, ¯ σ i ≡ σ i ¯ a i ≡ a i (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) j ∈N i ( a j − ¯ a i ) N i − , (3.3)where ¯ a i ≡ N i (cid:88) j ∈N i a j is the mean area of the neighbors of the i th Voronoi cell. As demonstrated in Refs. [16,22, 56], among the different possibilities, the relative standard deviation (3.3) performedrather well in tagging the BVCs. Of course, those studies were utilizing only the Voronoitessellation, which is not ideal for computing gradients. The Delaunay triangulation, on– 15 –he other hand, provides a more natural framework for the gradient estimation , as willbe discussed in the following subsection. The Delaunay triangulation leads to a natural method for computing the gradient knownas “triangulation wombling” [34, 39]. The starting point is the observation that for eachDelaunay triangle D α , the functional values at its three vertices are known from (3.1).Three points are enough to fit a plane, whose slope will provide an estimate of the gradientvector (cid:126)G α to be associated with the Delaunay triangle D α . Recall the duality relation(2.9c) which maps the triangle D α to its three vertices which carry indices i , j and k in theVoronoi tessellation. We can then parametrize the plane defined by D α as f = G αx x + G αy y + C, (3.4)where C is some constant. Applying this relation at each vertex, we obtain three indepen-dent equations f i = G αx x i + G αy y i + C, (3.5a) f j = G αx x j + G αy y j + C, (3.5b) f k = G αx x k + G αy y k + C, (3.5c)which can be solved for the gradient (cid:126)G α and the constant C as [34, 39] G αx G αy C = x i y i x j y j x k y k − f i f j f k . (3.6)From here, a wombling analysis would typically focus on the magnitude of the gradient G α = (cid:113) G αx + G αy (3.7)and proceed to select Delaunay triangles D α with relatively large values of G α , typically inthe top th percentile of all cells, as candidates for boundary elements.However, the straightforward application of this procedure leads to a problem whichis illustrated in Fig. 6. In the left panel we show the distributions of gradient magnitudes G α as calculated by the triangulation wombling method just described. We see that thedistribution is a very steeply falling function with a long tail — most Delaunay cells haverelatively small gradients and only a small fraction populates the large G α tail. Now, if thecells on the tail were predominantly BDTs, the method would have succeeded and therewould be no problem, but unfortunately, that is not the case. In the middle panel of Fig. 6 A purist might say that, since the two tessellations are dual to each other, strictly speaking there isnothing more to be gained from the Delaunay tessellation that could not have already be obtained fromthe Voronoi tessellation. While this may be technically correct, we found the Delaunay tessellation usefulin hinting at some new techniques and ideas as discussed below. – 16 – G r a d i e n t Figure 6 . Left panel: Distribution of the magnitudes G α of the gradient vectors (cid:126)G α computedin the process of triangulation wombling. Middle panel: A scatter plot of G α versus the horizontalposition of the centroid of the corresponding Delaunay triangle D α . Right panel: the Delaunaytriangles (red-shaded) within the top th percentile as ranked by gradient magnitudes. we show a scatter plot of the calculated gradient magnitudes G α versus the horizontalposition of the corresponding Delaunay triangle D α as given by its centroid. We noticethat the gradients computed for cells in the dense region ( x < . ) tend to be much largerthan the gradients in the sparse region ( x > . ) (this is to be expected, since statisticalfluctuations scale as √ f ). As a result, the cells with large gradients will tend to be more orless uniformly distributed in the dense region, with no relation to the boundary at x = 0 . .This is confirmed in the right panel of Fig. 6, where we identify with red shading theDelaunay triangles whose gradients are in the top th percentile of gradient magnitudes.As anticipated from the result in the middle panel, the red-shaded Delaunay triangles arelocated almost entirely in the dense region and there is no apparent clustering near theboundary. This means that in order to properly tag the BDTs, we must first pre-processthe computed gradients (cid:126)G α in order to mitigate the effect of the statistical noise. In this subsection we outline three different procedures for denoising the computed gradientvectors (cid:126)G α . As we shall demonstrate, each of them has the desired effect, and the optimalapproach will be some combination of the three, although finding out the exact proportionsis beyond the scope of this paper. The first approach is to rescale each calculated gradient magnitude G α as G α → ˜ G α ≡ G α √ a i a j a k , (3.8)where a i , a j and a k are the areas of the Voronoi cells centered on the three vertices ofthe Delaunay triangle D α (recall the duality relation (2.9c)). Fig. 7 demonstrates that therescaling (3.8) does have the desired effect. In the left panel, the distribution of rescaledgradient magnitudes ˜ G α still follows the same trend as in Fig. 6, but the numerical valueshave been reduced by several orders of magnitude. More importantly, the middle panel of– 17 – G r a d i e n t Figure 7 . The same as Fig. 6, but after rescaling the calculated gradients as in eq. (3.8).
Fig. 7 confirms that the rescaled gradient magnitudes are now uniformly consistent acrossthe two regions, with a cluster of points with relatively large ˜ G α beginning to emerge nearthe theoretical boundary. The right panel of Fig. 7 shows the updated plot of the Delaunaytriangles falling in the top th percentile of ˜ G α . We observe that, while the Delaunay cellstagged as BDTs are still scattered throughout the field of view, a significant fraction of themappears at the location of the boundary, which further validates the rescaling procedure(3.8). That is why from now on, unless specified otherwise, we shall always work withgradient vectors which have been rescaled as in (3.8). Another method for removing unwanted noise fluctuations in the data was explored in [16]and involved the so called Voronoi relaxation of the data by means of Lloyd’s algorithm [67].A closer inspection of the Voronoi tessellations depicted in Figs. 3 and 4 reveals that afterthe tessellation is constructed, the generator points can be found pretty much anywherewithin the Voronoi polygon — near the center of the cell, close to an edge, or somewherein between. The idea of the Voronoi relaxation is to make the whole Voronoi structuremore uniform by performing several steps (or iterations) of Lloyd’s algorithm, where ateach iteration, the generator point is moved to the centroid of the corresponding Voronoicell and the tessellation is redone.The effect of performing such Lloyd step uniformization (LSU) on our data is shown inFigs. 8 and 9. Each figure has 8 panels, depicting the Delaunay tessellation of the dataafter a certain number of Lloyd iterations, starting with 0 (no Lloyd steps) in the upperleft panels and going up to 20 Lloyd steps in the lower right panels. In addition, in Fig. 8each Delaunay triangle D α is color-coded by the rescaled magnitude ˜ G α of the respectivegradient (the gradients are recalculated after each step). Note that the color bar extendsonly up to of the largest gradient magnitude found in the data ; any cell with agradient magnitude above that threshold is colored yellow, essentially creating an overflowcolor bin. This was done in order to minimize the effect of outliers and better visualize For similar plots illustrating the effect of LSU on the Voronoi tessellation, see [16]. For example, the left panel in Fig. 7 reveals that the largest ˜ G α value found in the data is around 75,while the color bar in the upper left panel of Fig. 8 only goes up to ×
60% = 45 . – 18 – Figure 8 . Delaunay triangulations with individual cells color-coded by their rescaled gradientmagnitudes ˜ G α , after a certain number of Lloyd steps (from top left to bottom right): 0, 1, 2, 3, 5,7, 10, 20. The color bar extends up to of the largest ˜ G α value found in the data; any cell witha gradient magnitude above that threshold is colored yellow, essentially creating an overflow colorbin. This was done in order to minimize the effect of outliers and better visualize the bulk of thecells with typical values. Figure 9 . The same as the right panel in Fig. 7, but showing the effect of applying the Lloydalgorithm with varying number of steps as in Fig. 8. the bulk of the cells with the more typical values. Fig. 9, on the other hand, marks theDelaunay triangles falling in the top th percentile of ˜ G α values, just like the plot in theright panel of Fig. 7.There are several lessons that can be drawn from Figs. 8 and 9. First, as expected,the Lloyd relaxation causes the Delaunay triangles to become more regularly shaped. Forexample, notice that a large fraction of the triangles in the original Delaunay tessellation are– 19 – igure 10 . Left panel: Rescaled gradient vectors ˜ (cid:126)G α computed via triangulation wombling, afterapplying one Lloyd iteration to the original data. Middle panel: the result (cid:104) (cid:126)G (cid:105) i from averaging thegradient vectors ˜ (cid:126)G α shown in the left panel at each data point P i according to eq. (3.9). Rightpanel: the result (cid:104) (cid:126)G (cid:105) α from further averaging the Voronoi-averaged gradient vectors (cid:104) (cid:126)G (cid:105) i shown inthe middle panel at each Delaunay triangle according to eq. (3.10). obtuse (see the right panels in Figs. 3-7). However, within a few Lloyd steps, the fractionof obtuse triangles drops significantly and obtuse triangles are rather rare in the plots inthe lower rows of Figs. 8 and 9. More importantly, as shown in Figs. 8, the LSU procedurealso tends to wash out the noisy fluctuations in the calculated rescaled gradient magnitudes ˜ G α within the bulk regions away from the boundary (note the decreasing range of ˜ G α onthe color bars). This further sharpens the contrast between the Delaunay triangles situatednear the boundary versus those in the bulk. In particular, notice the gradual emergenceof the boundary, which becomes quite pronounced and unmistakable after 5-7 Lloyd steps.However, the figures also show that the number of Lloyd steps should be chosen with care— applying too few may not optimally showcase the boundary, while applying too manymay cause the boundary to start disintegrating, as evidenced in the lower right panels after20 iterations. A third approach for smoothing out the local statistical fluctuations in the data is to performsome type of averaging procedure over a region extending beyond the individual Voronoi orDelaunay cells and their immediate neighbors. For example, Ref. [16] considered extendingthe calculation of ¯ σ i in (3.3) over several tiers of Voronoi neighbors (up to 5) and showedthat this indeed produces the desired effect of reducing the fluctuations and sharpening theboundary identified by means of ¯ σ i .In our case here, we are dealing with the Delaunay tessellation instead, where theprocedures of triangulation wombling (3.6) and rescaling (3.8) allow us to directly computethe rescaled gradient vector ˜ (cid:126)G α associated with each Delaunay triangle D α . The result(after one Lloyd iteration) is shown in the left panel of Fig. 10, where we plot each vector ˜ (cid:126)G α at the centroid of the corresponding Delaunay triangle. This plot is the vector fieldanalogue of the color map shown in the second panel of the upper row of Fig. 8, which Furthermore, the largest angle of any remaining obtuse triangle is typically not too far above ◦ . – 20 – .250.500.751.001.251.501.752.00 Figure 11 . Delaunay (left and right panels) or Voronoi (middle panel) triangulations of the dataafter applying one Lloyd iteration, with individual cells color-coded by the magnitudes of theirrespective rescaled gradient vectors computed from triangulation wombling (left panel), Voronoi-averaged according to eq. (3.9) (middle panel), or Delaunay-averaged according to eq. (3.10) (rightpanel). for ease of comparison we reproduce again in the left panel of Fig. 11. The difference isthat the color map plots in Figs. 8 and 11 identify cells only by the magnitude ˜ G α of thegradient, while the vector field plots of Fig. 10 include the directional information as well,which is useful in visualizing the spatial patterns and correlations of the gradient vectors.Now, given the rescaled gradient vectors ˜ (cid:126)G α shown in the left panel of Fig. 10, thereare two ways to perform local averaging of these vectors, depending on whether we wantto associate the result from the averaging with a Voronoi cell (i.e., a Delaunay vertex) orwith a Delaunay triangle: • Voronoi cell averaging.
Recall that according to the duality relation (2.9a), eachVoronoi cell V i corresponds to a Delaunay vertex D α α α ...α Ni , which is the commonvertex of N i Delaunay triangles D α , D α , D α , . . . , D α Ni . Thus we can simply definethe average gradient (cid:104) (cid:126)G (cid:105) i at any Voronoi cell V i to be (cid:104) (cid:126)G (cid:105) i = 1 N i N i (cid:88) k =1 ˜ (cid:126)G α k . (3.9) • Delaunay cell averaging.
Once we have the averaged vectors (3.9) at our disposal,we can go back to each Delaunay triangle and further average the three vectors (3.9)associated with its three vertices. (cid:104) (cid:126)G (cid:105) α = 13 (cid:16) (cid:104) (cid:126)G (cid:105) i + (cid:104) (cid:126)G (cid:105) j + (cid:104) (cid:126)G (cid:105) k (cid:17) , (3.10)where the indices i , j and k label the data points at the vertices of D α (recall theduality relation (2.9c)).The vector fields resulting from the averaging prescriptions in eqs. (3.9) and (3.10) areshown in the middle and right panels of Fig. 10, respectively. One can see that the statistical As an alternative to a simple sum as in (3.9), one could assign a weight for each vector ˜ (cid:126)G α k , for example,the angular size of the D α k triangle as seen from the point P i . – 21 –uctuations are indeed getting suppressed as a result of the averaging, and furthermore, thedirections of the gradient vectors near the boundary are becoming better correlated witheach successive iteration. This directional correlation will become important in the nexttwo sections where we shall compare the properties of neighboring cells in the tessellation.For now, in order to demonstrate the benefits from the averaging procedures (3.9) and(3.10) for the purposes of boundary detection, it is sufficient to update the color maps fromFig. 8 using the magnitudes of the averaged gradients instead. This is done in the middleand right panels of Fig. 11, where the individual cells in the tessellation have been color-coded by the magnitudes of the Voronoi-averaged gradient (3.9) and the Delaunay-averagedgradient (3.10), respectively. By comparing Fig. 11 to Fig. 8, we see that the local averagingprocedures produce comparable benefits to Voronoi relaxation, so that we can view the twoprocedures as alternatives to the other. More specifically, local averaging seems to be atleast equivalent to (if not better than) running on the order of 5-7 Lloyd iterations, whichseemed to be the optimal choice in Figs. 8 and 9. Of course, the two methods can also beapplied simultaneously, so that their benefits can be optimally exploited. In our analysisof Sections 7 and 8, unless specified otherwise, we shall choose to employ the Delaunay-averaged gradient vectors (3.10). Armed with the various estimates of the local gradient vectors discussed in the previoussection, we are now ready for the next step of the wombling algorithm, namely, the taggingof Voronoi or Delaunay cells as boundary candidates. The standard approach is to placea lower cut on the relevant variable (typically the magnitude) which measures the size ofthe gradient. This selection singles out a certain set of candidate boundary cells as shownin Fig. 9. The purpose of this section is to study how effective this selection is andto suggest a potential improvement of the standard approach by utilizing the correlationsbetween gradient vectors computed in neighboring cells. The idea will be to place a premiumnot just on cells whose own gradient ˜ (cid:126)G α has a large magnitude, but on cells where theneighboring gradients ˜ (cid:126)G β have both a) large magnitudes and b) correlated directions with ˜ (cid:126)G α . A convenient variable which captures the desired correlations between two vectors ˜ (cid:126)G α and ˜ (cid:126)G β is their dot product, ˜ (cid:126)G α · ˜ (cid:126)G β [16].Fig. 12 shows scatter plots of such dot products of neighboring vectors for the threetypes of gradients introduced in the previous section: ˜ (cid:126)G α · ˜ (cid:126)G β (upper right panel), (cid:104) (cid:126)G (cid:105) i ·(cid:104) (cid:126)G (cid:105) j (lower left panel) and (cid:104) (cid:126)G (cid:105) α · (cid:104) (cid:126)G (cid:105) β (lower right panel). In each case, the result is plottedversus the horizontal position ( x i + x j ) / of the midpoint of the respective Delaunay edge Selection efficiency is typically illustrated with ROC curves, where one varies the cut on the selectionvariable and plots the fraction of signal versus the fraction of background surviving the cut. This was alsothe procedure used in Refs. [16, 22]. Here, however, we prefer to simply show scatter plots of the taggingvariable versus the distance to the boundary. In this way, we avoid the need to define what exactly is meantby a “boundary” object versus a “bulk” object. In the case of ˜ (cid:126)G α · ˜ (cid:126)G β and (cid:104) (cid:126)G (cid:105) α ·(cid:104) (cid:126)G (cid:105) β , the relevant Delaunay edge is simply D αβ , i.e., the edge separatingthe Delaunay triangles D α and D β , while in the case of (cid:104) (cid:126)G (cid:105) i · (cid:104) (cid:126)G (cid:105) j the relevant Delaunay edge is the one – 22 – .0 0.2 0.4 0.6 0.8 1.0Horizontal Position of Data Point0.00.20.40.60.81.0 R e l a t i v e S t a n d a r d D e v i a t i o n D o t P r o d u c t o f N e i g h b o r i n g G r a d i e n t s D o t P r o d u c t o f N e i g h b o r i n g G r a d i e n t s D o t P r o d u c t o f N e i g h b o r i n g G r a d i e n t s Figure 12 . Scatter plots of different tagging variables discussed in the text versus the horizontalposition of the corresponding element in the tessellation. The upper left panel shows the relativestandard deviation ¯ σ i defined in (3.3) versus the horizontal position x i of the generator point P i .The other three panels show dot products of different versions of neighboring gradient vectorsversus the horizontal position ( x i + x j ) / of the midpoint of the respective Delaunay edge D αβ : therescaled gradients ˜ (cid:126)G α · ˜ (cid:126)G β (upper right panel), the Voronoi-averaged gradients (cid:104) (cid:126)G (cid:105) i · (cid:104) (cid:126)G (cid:105) j (lowerleft panel) and the Delaunay-averaged gradients (cid:104) (cid:126)G (cid:105) α · (cid:104) (cid:126)G (cid:105) β (lower right panel). D αβ . For comparison, in the upper left panel we show a scatter plot of the relative standarddeviation ¯ σ i defined in (3.3) versus the horizontal position x i of the corresponding generatorpoint P i . As explained in Sec. 3.1, the relative standard deviation ¯ σ i is constructed from theVoronoi tessellation and was found to perform best among several other Voronoi-constructedalternatives [16]. The upper left panel in Fig. 12 confirms that the highest values of ¯ σ i areindeed found for cells near the boundary; in fact, the top 12 highest ¯ σ i values belong tosuch cells. At the same time, we also observe a significant variation in the ¯ σ i values for cells dual to the Voronoi edge V ij , see eq. (2.9b). – 23 –n the bulk; for ¯ σ i < . this starts introducing a certain number of false positives.The remaining three panels of Fig. 12 demonstrate that the corresponding dot prod-ucts of gradients computed from the Delaunay tessellation are also efficient in identifyingboundary objects (in this case, Delaunay edges). Among the three options illustrated inthe plot, the dot product of the Delaunay-averaged gradients seems to perform the best— the top 45 highest dot products of neighboring (cid:104) (cid:126)G (cid:105) α vectors belong to Delaunay edgesnear the boundary; and there is a well defined cluster of points with large y values in theboundary region . < x < . . Note the different y -axis range on these three plots — thevariation of dot product values is largest for ˜ (cid:126)G α · ˜ (cid:126)G β and smallest for (cid:104) (cid:126)G (cid:105) α · (cid:104) (cid:126)G (cid:105) β , furtherdemonstrating the beneficial effects from the averaging procedures discussed in Sec. 3.3.3.Comparing the top left panel in Fig. 12 to the other three panels, we conclude thatthe gradient dot products, which take advantage of the correlations between neighboringgradient vectors in terms of both direction and magnitude, are able to identify the boundarybetter than ¯ σ i and similar variables. As a byproduct of this new method of tagging, wehave also automatically built up a network of associations among the Delaunay cells in thetriangulation, which is readily available for use in the next step (agglomeration), where oneattempts to construct the actual boundary. This is the subject of the next section. The previous step (the tagging of boundary elements) typically fails to result in a continuousboundary, especially in case of weak signals. Instead, the algorithm produces a collectionof scattered “islands” of tagged cells, as seen in the right panels of Figs. 6 and 7 and inthe top panels of Fig. 9. This necessitates the next step of agglomerating the individualtagged cells into subgraphs and evaluating whether the resulting pattern is consistent witha linear boundary [29]. The downside of this approach is that it does not take into furtherconsideration the cells which have failed the tagging cut — those cells are simply ignoredfrom this point on. Another potential drawback is that the tagging and agglomeration stepsare done independently from each other, so that the existence of any spatial correlationsamong neighboring cells is not being used during the tagging. As already mentioned inthe previous section, both of these problems are avoided when we use the dot products ofneighboring gradients as tagging variables. As illustrated in the last three panels of Fig. 12,each dot product of gradients can be uniquely associated with a Delaunay edge D αβ orwith its dual Voronoi edge V ij , see eq. (2.9b). We can then treat the original Voronoi andDelaunay tessellations as weighted networks , where each edge is assigned a weight equal tothe dot products of the corresponding two gradients. This weighted network representationis illustrated in Fig. 13, where we superimpose the Voronoi tessellation (red lines) and theDelaunay triangulation (blue lines). The weight of an edge is indicated by the line thickness— thicker lines imply higher weights and vice versa. The three panels show three differentways to compute the weights from the gradient dot products, depending on which set ofgradient vectors from Fig. 10 we choose to use: ˜ (cid:126)G α · ˜ (cid:126)G β (left panel), (cid:104) (cid:126)G (cid:105) i · (cid:104) (cid:126)G (cid:105) j (middlepanel) or (cid:104) (cid:126)G (cid:105) α · (cid:104) (cid:126)G (cid:105) β (right panel). As seen in Fig. 12, a certain fraction of dot productsare negative; if that is the case, the corresponding edge is not plotted.– 24 – igure 13 . Weighted network representations of the Voronoi tessellation (red lines) and theDelaunay triangulation (blue lines). The line thickness is proportional to the weight, which is givenby ˜ (cid:126)G α · ˜ (cid:126)G β (left panel), (cid:104) (cid:126)G (cid:105) i ·(cid:104) (cid:126)G (cid:105) j (middle panel) and (cid:104) (cid:126)G (cid:105) α ·(cid:104) (cid:126)G (cid:105) β (right panel). Edges with negativeweights are not shown. The three panels in Fig. 13 can be contrasted with the corresponding results in Fig. 11,where we used just the magnitudes of the individual gradient vectors, without any referenceto their neighbors. The boundary seems to be better outlined in Fig. 13, particularly whenwe make use of the averaging procedures from Sec. 3.3.3. We also note the benefit of plottingthe Voronoi and Delaunay tessellations simultaneously — the orientation of the edges withrespect to the boundary is random, so whenever a given edge happens to be orthogonalto the boundary, its dual tends to be parallel to it, so taken together, they trace out thecorrect shape of the boundary.Fig. 13 also elucidates the results from Fig. 12, where we observed that while manyedges situated close to the boundary enjoyed relatively large values of their gradient dotproducts, there was also a non-negligible fraction of edges near the boundary with ratherlow values of the gradient dot products. Fig. 13 now reveals that these two populations arespatially correlated — note how the edges with large values of the gradient dot productsare linked together, as are their counterparts. This confirms that using the gradient dotproducts as tagging variables automatically also takes care of the agglomeration.Until now we have been following the standard steps of a wombling analysis. As alreadymentioned, the last remaining step is to evaluate the statistical significance of the observedpattern of tagged boundary candidates. Note that the last three figures illustrate taggingprocedures for each of the four types of boundary candidate objects defined in Sec. 2.3:Voronoi cells (middle panel in Fig. 11 and upper left panel in Fig. 12), Delaunay cells (leftand right panels in Fig. 11), Delaunay edges (upper right and lower panels in Fig. 12 andall three panels in Fig. 13) and Voronoi edges (Fig. 13). Of course, since the Voronoi andDelaunay edges are dual to each other, any procedure which can tag one edge type can alsobe applied to tag the other.In the next three sections we shall outline an alternative approach originally proposedin [40], which allows us to perform the tagging, agglomeration and statistical evaluationsteps in one go. In order to introduce and illustrate the method, in the next section Sec. 6we shall start with the case of a continuously defined function f ( x, y ) and then proceed to– 25 –nalyze the case of point data in Secs. 7 and 8. In order to bypass the tagging and agglomeration steps, Ref. [40] proposed to directlyconsider various curves C in the ( x, y ) plane, and to associate a “wombling measure” Γ witheach one, so that true wombling boundaries can be identified by their large values of Γ .Since a wombling boundary is supposed to represent a zone of rapid change in the function f , it is natural to define the wombling measure in terms of the local gradient (cid:126) ∇ f suitablyintegrated along C . In particular, Ref. [40] defined Γ to be the total gradient flux through C Γ[ C ] ≡ (cid:90) C (cid:16) (cid:126) ∇ f · ˆ n C (cid:17) d(cid:96), (6.1)where ˆ n C is a unit vector normal to the curve C and d(cid:96) is the infinitesimal length along C .Additionally, Ref. [40] also considered the average gradient flux ¯Γ[ C ] ≡ (cid:82) C (cid:16) (cid:126) ∇ f · ˆ n C (cid:17) d(cid:96) (cid:82) C d(cid:96) , (6.2)where using the total length of the curve as a normalization factor eliminates the unfairadvantage of curves which happen to be too long. For this reason, in what follows we shallmake use of (6.2) and not (6.1). Without any further constraints on the type of curves C that we are allowed to con-sider, this method would be rather impractical. To make further progress, two approachesare possible. The first one is the model-dependent route — if we specify exactly what kindof new physics model generates the wombling boundary, then C can be specified by onlya handful of parameters (typically the masses of the new particles). Then the problem ofmaximizing the functional (6.2) over all possible curves C reduces to a simple global maxi-mization problem in the parameter space describing C [23]. Here, however, we would like tostay as model-independent as possible, so we shall not assume any specific parametrizationof the boundary. At the same time, we do not want to consider arbitrarily general curves C either.An intermediate compromise approach is the following. Note that any curve C canlocally be approximated by a straight line segment. Therefore, we can perform a scan ofthe ( x, y ) plane where at each point P we try a line segment (centered on P ) of fixed length L and arbitrary angular orientation ϕ . Each point in the so-defined 4-dimensional parameterspace ( x, y, L, ϕ ) corresponds to a well-defined line segment, for which the wombling measure An additional variation mentioned in [40] was to consider integrating the flux in absolute value, e.g., Γ[ C ] ≡ (cid:90) C (cid:12)(cid:12)(cid:12) (cid:126) ∇ f · ˆ n C (cid:12)(cid:12)(cid:12) d(cid:96), in order to avoid cases where a large positive flux over one section of the curve C is cancelled by a largenegative flux over another section. However, in our case such cancellations are welcome since the noisefluctuations are random and we would like to allow them to cancel out each other as much as possible. Wehave confirmed numerically in our examples that the absolute value alternative leads to lower sensitivity. – 26 –
101 0 1 2 3 401234 p in p in p o u t p o u t Diagram Figure 14 . Our parametrization of all possible straight lines intersecting the unit square in termsof the coordinate p in of the entry point and the coordinate p out of the exit point, where p in ( p out )is measured counterclockwise (clockwise) along the perimeter. The right panel shows the complete ( p in , p out ) parameter space, which is a square of side length 4. The blue dot represents the blueexample line shown in the left panel and its partner (when the line is traversed in reverse). (6.2) can be calculated. The regions in ( x, y, L, ϕ ) parameter space with large values for ¯Γ[ C ] would then identify (segments of) the wombling boundary. Since this procedure involvesoptimization in a 4-dimensional parameter space, it will be difficult to illustrate here. Thisis why from now on we choose to focus on the ( L, ϕ ) subspace — one can think of this asfirst zooming in on an interesting region of the ( x, y ) plane and then testing for the presenceof a linear wombling line segment.Our reparametrization of the remaining two degrees of freedom describing the linesegment is illustrated in Fig. 14. As before, we retain the unit square as our field of view.We then consider all possible straight lines crossing the unit square — each such line can beidentified by the point where it enters the square and the point where it exits the square.We shall identify the locations of those two points by their respective coordinates p in and p out measured along the perimeteter, as shown in the left panel of Fig. 14. Since theperimeter of a unit square is equal to 4, the parameter space ( p in , p out ) spans the × square shown in the right panel of Fig. 14 — any point within that × square can beuniquely associated with a straight line crossing the field of view in one of the two possibledirections. For example, ( p in , p out ) = (0 , describes a diagonal line traversed from thelower left corner to the upper right corner, while ( p in , p out ) = (2 , describes the samediagonal line covered in reverse. In our previous examples, the true boundary was locatedat x = 0 . , and corresponds to either ( p in , p out ) = (0 . , . or ( p in , p out ) = (2 . , . .In the remainder of this section and in the next Sec. 7, our main goal will be tocompute the wombling measure ¯Γ[ C ] in the ( p in , p out ) parameter space and identify therelevant wombling boundary segment(s). First we shall illustrate this procedure with theexample of a continuous function f ( x, y ) before tackling the case of point datasets in thenext section. In our conventions, reversing the direction of a given line ( p in , p out ) implies ( p in , p out ) → (4 − p out , − p in ) . – 27 – . . . . . . x . . . . . . . . f unsmearedfit, ξ = 17 , k = 5 smeared, σ = 0 . . . . . . . x . . . . . . . . f unsmearedfit, ξ = 1 . , k = 5 smeared, σ = 0 . Figure 15 . The black dotted line shows the original distribution (2.5) for ρ = 5 (left panel) and ρ = 1 . (right panel) before any smearing. The red solid histogram is the result from applyingGaussian smearing with σ = 0 . . The blue dashed line is the result from fitting to the ansatz (6.4). In this subsection we shall revisit the vertical straight line boundary example from theprevious sections. However, we shall not use the original distribution (2.5), for two reasons:first, the discontinuity at x = 0 . would generate an infinite gradient when computed ana-lytically, and second, the distribution (2.5) corresponds to an idealized situation where theeffects of particle widths and detector smearing are ignored. In any realistic experimentalanalysis the sharp step at x = 0 . will be smeared and the boundary will be characterizedby a large but finite gradient.This is illustrated in Fig. 15, where the black dotted lines show (the x -dependence of)the original distribution (2.5) before any smearing, for ρ = 5 (left panel) and ρ = 1 . (rightpanel). We then apply Gaussian smearing with σ = 0 . , resulting in the red histograms,which have the typical shapes expected in a realistic experimental analysis. In particular,notice how the gradient at the boundary is significantly reduced as a result of the smearing,making the task of finding the wombling boundary quite challenging. At this point we fitan analytical function to the so obtained smeared distributions. For the fit, we choose toutilize the (unit-normalized) ArcTan sigmoid function g res ( x ; a, ξ, k ) = 1 − π ξ − ξ + 1 arctan (cid:16) k ( x − a ) (cid:17) , (6.3)whose derivative is maximal (in absolute value) at x = a , the sharpness of the transitionbeing controlled by the parameter k . The remaining parameter ξ is analogous to ρ in thesense that (compare to (2.4)) ξ = g ( x = −∞ ) g ( x = ∞ ) . Since our field of view is limited to x ∈ [0 , and the boundary is at x = 0 . , for the actualfit we choose the parametrization f ( x, y ) = g res ( x ; a = 0 . , ξ, k ) (6.4)– 28 – . . . . . . . . . . . . Figure 16 . The result from the linear wombling procedure described in the text on a data setwith ρ = 5 (left) and ρ = 1 . (right). The heat map is color coded according to the value of | ¯Γ[ C ] | and exhibits two degenerate maxima since each line is traversed twice (once in each direction). and then adjust ξ and k to match the smeared distributions shown by the red histograms.As seen in Fig. 15, the fit reproduces the effects of smearing rather well, so in the rest ofthis subsection we shall use (6.4) as our analytically defined distribution.Using the respective fit (6.4) as our proxy, we can now compute the wombling measure ¯Γ[ C ] in the ( p in , p out ) space of all lines intersecting our unit square. The result for ρ = 5 ( ρ = 1 . ) is shown in the left (right) panel of Fig. 16. We choose to plot the absolute valueof ¯Γ[ C ] , since the sign of ¯Γ[ C ] is determined by the direction in which we traverse the line,and does not have any bearing on whether the line is a wombling boundary or not. Fig. 16reveals that, as expected, there are two locations with maximal wombling measures: at ( p in , p out ) = (0 . , . and at ( p in , p out ) = (2 . , . . Both of those correspond to the samevertical line at x = 0 . which was the true boundary. This demonstrates that the methodis indeed able to find the correct boundary. The significance of these findings, however, issensitive to the amount of signal present — in the left panel, where ρ = 5 , the two winninganswers are very clearly identified, while in the right panel (using the same color scale) theyappear to be less noticeable, which is a hint that the effect might be in danger of beingwashed out once we include the statistical fluctuations present in the point data — thisissue will be investigated in detail in Sec. 7 below. The example in the previous Sec. 6.1 might appear somewhat contrived since the trueboundary was a straight line and at the same time, we also used straight lines in computingthe wombling measure ¯Γ[ C ] . Since the shapes of the lines match, it was inevitable to finda unique best match, as shown in Fig. 16. To be fair, we shall now consider a less trivialexample where the true boundary has a different shape from the line segments which weuse to test for the presence of a wombling boundary. In particular, we shall revisit the caseof a circular boundary introduced in Sec. 2.1, where the probability distribution was given– 29 – . . . . . . . . . . . . . . . . . . . Figure 17 . Color maps of the probability distribution (6.5) used for the circular boundary example(left panel) and of the corresponding | ¯Γ[ C ] | in the ( p in , p out ) parameter space (right panel). by (2.7). Once again, we shall not rely directly on (2.7), but in order to account for thedetector resolution, we shall sample the events according to f ( r, ϕ ) ∼ rg res ( r ; a = 0 . , ξ = 2 , k = 30) , (6.5)where r and ϕ are the polar coordinates in the plane, measured from an origin at the centerof the circle, and the smearing function g res was already defined in (6.3). The resultingprobability distribution is plotted in the left panel of Fig. 17. Note that the densities onthe two sides of the circular boundary differ by no more than a factor of 2, so in this sensethis example is analogous to ρ ∼ in our usual notation. The corresponding heat map of | ¯Γ[ C ] | in the ( p in , p out ) parameter space is shown in the right panel of Fig. 17. The moststriking difference from the previous results in Fig. 16 is that now we find not just a singlewombling boundary candidate, but a whole class of wombling boundaries, identified by thetwo bright yellow stripes running diagonally across the plot. A careful inspection of theright panel in Fig. 17 reveals that each of the identified wombling boundary candidates istangential to the circular boundary, which suggests that the dominant contribution to theintegral comes from the region in the vicinity of the circular boundary, where the gradientis largest. The slight difference in the brightness along the stripes can then be attributed tothe different orientation of the lines, which leads to differences in the (sub-dominant) fluxcontributions in the regions away from the true boundary. Having illustrated the basic idea of Refs. [40, 63] with the continuous examples from theprevious section, we shall now apply it to point data. Following the outline of Sec. 6, we We obtain two stripes because of the double counting ( p in , p out ) ←→ (4 − p out , − p in ) due to thepossibility to traverse a line segment in each of the two opposite directions, as shown in the right panel ofFig. 14. – 30 –hall first consider the case of a straight line boundary in Sec. 7.1 and then the case of acircular boundary in Sec. 7.2. ρ = 5 Our first straight-line boundary example will be the same point data example which we havebeen using so far throughout the paper to illustrate the various methods and techniques ofa wombling analysis, see Figs. 3, 4, 6, 7, 8, 9, 10, 11, 12 and 13. (As a reminder, we used N = 500 points generated according to the distribution (2.5) with ρ = 5 .) In particular,we shall repeat the procedure from Sec. 6 and compute a wombling measure ¯Γ[ C ] for eachpossible straight line C crossing the field of view. However, since we are now dealing withdiscretely sampled point data instead of a continuous function f ( x, y ) , we need to adaptthe definition (6.2) as follows ¯Γ[ C ] ≡ (cid:80) α (cid:16) (cid:104) (cid:126)G (cid:105) α · ˆ n C (cid:17) ∆ (cid:96) α (cid:80) α ∆ (cid:96) α , (7.1)where each sum runs over all Delaunay triangles D α which are crossed by the straight line C and, as before, (cid:104) (cid:126)G (cid:105) α is the Delaunay-averaged gradient vector (3.10) associated with D α . Since the Delaunay tessellation gives complete coverage of the field of view, the line C necessarily gets fragmented into individual line segments of length ∆ (cid:96) α , defined so thateach segment is contained within a single Delaunay cell D α . Finally, ˆ n C is a unit vectororthogonal to the straight line C and therefore, to each individual line segment ∆ (cid:96) α as well,so that an index α on it is unnecessary.Fig. 18 shows a heat map of the wombling measure ¯Γ[ C ] computed from eq. (7.1)throughout the parameter space ( p in , p out ) of all possible straight lines passing through ourdata. As a pre-processing step, we applied one Lloyd iteration and then used the Delaunay-averaged gradient vectors illustrated in the right panel of Fig. 10. The heat map in Fig. 18contains four dark squares situated along the diagonal from the upper left corner to thelower right corner. All points within those four squares define line segments C which donot enter the field of view at all, and instead run along one of the edges of the field of view.Clearly, such line segments are irrelevant for our wombling boundary analysis, so they havebeen assigned ¯Γ[ C ] = 0 by default and are excluded from further consideration.Fig. 18 reveals that there is a unique line with the largest possible wombling measure— let us denote this winning line with C w : C w ≡ arg max C (cid:0) ¯Γ[ C ] (cid:1) . (7.2)By construction, the winning line C w is the best wombling boundary candidate among theset of all straight-line boundary candidates. Is it the correct wombling boundary though? In principle, we can define the wombling measure (7.1) in terms of the original gradient vectors ˜ (cid:126)G α orin terms of the Voronoi-averaged gradient vectors (cid:104) (cid:126)G (cid:105) i defined in (3.10). However, as argued in Secs. 4 and5, the Delaunay-averaged vectors offer the best option for our purposes. – 31 – igure 18 . Results from applying the wombling procedure of Sec. 6 to the illustrative pointdata example used in the previous sections ( N = 500 points generated from (2.5) with ρ = 5 ).For concreteness, at the pre-processing stage we applied one Lloyd iteration and then used theDelaunay-averaged gradient vectors which were defined in (3.10). In constructing the heat map, wesampled the ( p in , p out ) parameter space on a 200 by 200 grid with step size 0.02 and then computed ¯Γ[ C ] from eq. (7.1). The four dark squares in the plot correspond to line segments which overlapwith one of the edges of the unit square which is our field of view - those segments were assigned ¯Γ[ C ] = 0 by default and excluded from further consideration. For reference, the true boundary islocated at ( p in , p out ) = (0 . , . , or equivalently, at ( p in , p out ) = (2 . , . . According to the result from Fig. 18, the answer in this example is yes: C w is found atthe exact location (0 . , . (or equivalently, (2 . , . ) of the true theoretical boundary x = 0 . . This can be verified explicitly in Fig. 19, where we plot C w overlaid on top of theDelaunay-averaged gradient vectors (cid:104) (cid:126)G (cid:105) α from the right panel of Fig. 10. Fig. 19 not onlyconfirms that C w is the correct wombling boundary, but also helps us understand why C w was chosen by the algorithm. Note that for any given line C , the calculation of its womblingmeasure ¯Γ[ C ] depends only on the Delaunay cells D α which happen to be crossed by theline C — in Fig. 19 those cells are shaded in red. A careful inspection of Fig. 19 revealsthat in the large majority of red-shaded Delaunay cells the gradient vectors are both largeand (roughly) orthogonal to the line C w , thus maximizing the average flux (7.1) throughit. To better visualize this, we have color-coded the individual line segments ∆ (cid:96) α of C w according to their individual contributions (cid:104) (cid:126)G (cid:105) α · ˆ n C to the total flux, with warm (cold)colors corresponding to large (small) values. We see that C w is predominantly colored withwarm colors, indicating large fluxes all the way throughout. On the other hand, it is not– 32 – igure 19 . The line C w with the largest wombling measure ¯Γ[ C ] in Fig. 18, plotted over theDelaunay-averaged gradient vector field from the right panel of Fig. 10. The individual line segments ∆ (cid:96) α have been color-coded by their respective individual contributions (cid:104) (cid:126)G (cid:105) α · ˆ n C to the total flux.The red-shaded Delaunay cells are those which are crossed by the line C w and are therefore includedin the two summations in (7.1). difficult to convince oneself that this will not be true for any other randomly chosen line C crossing the field of view — the gradient vectors may happen to be relatively large andperhaps even roughly orthogonal to it purely by chance in some restricted region, but thiswill not occur consistently along the full length of the line as was the case with C w . Inshort, the wombling procedure applied here automatically takes into account the spatialcorrelations of the gradient vectors along a wombling boundary. ρ = 1 . We are now ready to tackle a more difficult case, with a smaller signal to background ratio.In this subsection we consider N = 1000 data points, still distributed according to (2.5),but with the much smaller value of ρ = 1 . . This data is shown in the left panel of Fig. 20.Unlike the previous example, this time the wombling boundary is not as easy to identifyvisually in the data. As already discussed in Ref. [16], the weaker the signal, the more Lloydsteps are needed for optimal results. Correspondingly, here we apply 5 Lloyd iterations atthe preprocessing stage and obtain the data shown in the right panel of Fig. 20, on whichthe subsequent wombling analysis is done.The end result from our wombling procedure is shown in Fig. 21, which is the analogueof Fig. 18. We notice that the typical values obtained for ¯Γ[ C ] are now much lower than whatwe saw in Fig. 18. This is to be expected due to the smaller value of ρ — the boundaryis less pronounced, and the magnitudes of the gradient vectors are generally reduced aswell. Despite these difficulties, the boundary is still correctly identified — notice the twobright spots in the heatmap located near (0 . , . and (2 . , . , which is the right answer.– 33 – igure 20 . The point data for the straight line boundary example considered in Sec. 7.1.2. Theleft panel shows the original N = 1000 points distributed according to (2.5) with ρ = 1 . . The rightpanel shows the same data after applying 5 Lloyd iterations. Figure 21 . The same as Fig. 18 but for the dataset with ρ = 1 . shown in the right panel ofFig. 20. Here we exclude from the analysis not only lines along the perimeter of the field of viewbut also any line of length less than 0.5; in the heatmap such lines are assigned ¯Γ[ C ] = 0 by default. The corresponding winning line C w has ¯Γ[ C w ] = 0 . and is plotted in Fig. 22, where forillustration we also show a second, rather generic, line C at ( p in , p out ) = (0 . , . whichhas a more typical value of the wombling parameter, ¯Γ[ C ] = 0 . . The individual linesegments of each of the two lines are color-coded similar to Fig. 19, i.e., proportional totheir individual contributions to the total flux. We see that the winner C w is again colored In order to highlight the differences between the lines segments of the two lines, we additionally applythe scaling (7.3) motivated below in Sec. 7.2.2. – 34 – igure 22 . The same as Fig. 19 but for the dataset with ρ = 1 . shown in the right panel inFig. 20. In addition to the line C w with the largest value of the wombling measure ¯Γ , for comparisonwe also show a generic line with a more typical value of ¯Γ[ C ] ; for concreteness, we chose the line at ( p in , p out ) = (0 . , . which happens to have ¯Γ[ C ] = 0 . . with mostly warm colors, indicating large flux contributions everywhere (except for justa few spots where the gradient vectors happen to be either too small or oriented alongthe line), while the generic line C is colored with mostly cold colors, confirming that itswombling measure ¯Γ is indeed rather small.By comparing Figs. 18 and 21, one can notice that in Fig. 21 we have excluded fromconsideration not only lines which run purely along the perimeter of the field of view butalso any line of length less than 0.5; this additional constraint results in the eliminationof several quadrant sectors on the plot — one at the lower left corner, one at the upperright corner, and six along the diagonal running from the upper left corner to the lowerright corner. Since our wombling measure is the average flux, short lines which literally “cuta corner” of the square field of view, can potentially pick up large gradients due to localfluctuations in the bulk, without an opportunity to cancel those fluctuations elsewhere. Inother words, if we encounter two candidate lines with the same value of ¯Γ , and one is muchlonger than the other, then we would treat the longer line as the more likely womblingboundary. Thus eliminating very short lines from consideration early on would go a longway in simplifying the significance estimation procedure, see Sec. 8 below. We now proceed with the discrete version of the circular boundary example considered inSec. 6.2. The point dataset is shown in the left panel of Fig. 23 and consists of N = 1000 points distributed according to the probability distribution with a circular boundary (2.7)with ρ = 5 . As a preprocessing step, we then apply a single Lloyd iteration, obtaining the– 35 – igure 23 . The point data for the circular boundary example considered in Sec. 7.2.1. The leftpanel shows the original N = 1000 points distributed according to (2.7) with ρ = 5 . The rightpanel shows the same data after 1 Lloyd iteration. Figure 24 . The same as Fig. 21 but for the dataset with a circular boundary shown in the rightpanel of Fig. 23. data shown in the right panel of Fig. 23, on which the subsequent wombling analysis isperformed.The results from the wombling procedure are displayed in Figs. 24 and 25. Fig. 24 isthe analogue of Figs. 18 and 21, but for the circular boundary example considered in thissubsection. Comparing to those previous figures, we notice that the largest absolute valuesfor ¯Γ obtained here are lower than those in Fig. 18 but higher than those in Fig. 21. Theformer is due to the fact that we are using straight line segments to test for a curvilinear– 36 – .0 0.2 0.4 0.6 0.8 1.00.00.20.40.60.81.0 0.00.20.40.60.81.01.2 Figure 25 . The line C w with the largest wombling measure in the circular boundary example ofSec. 7.2, overlaid on the Delaunay-averaged gradient vectors (cid:104) (cid:126)G (cid:105) α (left panel) or on the heatmapof their magnitudes (right panel). wombling boundary, and no single line segment can capture the full extent of a circularboundary, while the latter is due to the fact that the value of ρ is higher for the datasetused to produce Fig. 24.Fig. 24 exhibits the typical pattern observed in the continuous version of this example(the right panel of Fig. 17). The locations with the largest values of | ¯Γ | trace out the twostripes seen in Fig. 17, which indicates that in the discrete version of the example it is stillthe line segments tangential to the circular boundary which tend to have large values of | ¯Γ | .The line C w with the largest wombling measure happens to be at ( p in , p out ) = (0 . , . andis plotted in Fig. 25, together with the set of Delaunay-averaged gradient vectors (cid:104) (cid:126)G (cid:105) α (leftpanel) or the heatmap of their magnitudes (right panel). As anticipated, C w is tangentialto the circular boundary, and the largest contributions to its wombling measure indeedcome from the (warm-colored) segments in the vicinity of the boundary. One should keepin mind that the particular line C w shown in Fig. 25 is a very close winner among severalother worthy challengers with similar values for the wombling measure — as Fig. 24 showed,there are several locations along the two bright stripes with similarly large values of | ¯Γ | . The analysis from the previous subsection 7.2.1 demonstrated that even in the case of acurvilinear boundary, our wombling procedure produces reasonable results — it is ableto identify a class of line segments, each of which already contains a portion of the trueboundary. Unfortunately, none of the identified line segments is able to reproduce the fullboundary all by itself. In this subsection, we shall therefore address the question of beingable to globally reconstruct the wombling boundary, regardless of its shape, from the resultspresented so far.In principle, there can be several different approaches to this problem. • Algorithmic wombling.
The standard approach is the algorithmic wombling proce-– 37 –ure outlined in the introduction [29]. One applies a lower cut (tagging) on themagnitudes of the locally estimated gradient vectors and then suitably connects them(agglomeration). Although this approach has been subject to criticism [40], its mod-ern implementation can perhaps benefit from some of the improvements which wehave introduced here, in particular gradient rescaling (Sec. 3.3.1), Voronoi relaxation(Sec. 3.3.2), local averaging of gradient vectors (Sec. 3.3.3), utilizing improved tag-ging variables which account for the spatial correlations among neighboring gradientvectors (Sec. 4), etc. • Use the correct ansatz for the shape of the boundary.
At the cost of giving up model-independence, one could focus on a particular theory model, derive the parametricform of the expected boundary shape, and then use that parametrization to test forthe presence of such boundaries in the data. This approach was proved successful inspecific event topologies motivated by supersymmetry [22, 23, 56], but relies on theexperimenter being able to make the correct theory model assumption. • Construct the envelope of the line segments with the largest wombling measures.
Asshown by the results in Figs. 17 and 24, when probing a curvilinear boundary withstraight line segments, we obtain a whole family of wombling boundary candidateswhich can be tagged with their relatively large values of ¯Γ[ C ] . We saw that each ofthese candidates is tangential to the true boundary, therefore, the task of constructingthe true boundary reduces to the task of finding a planar curve which is tangentialto each of the tagged straight line candidate segments at some point. The answer tothis problem is precisely the envelope curve [68]. • Identify and agglomerate “the best” line segments ∆ (cid:96) α . A specific realization of anapproximate piece-wise reconstruction of the envelope curve mentioned above is of-fered by the following procedure. Note that our analysis not only tags straight lines C with large values of the wombling measure, but it also identifies which individualportions ∆ (cid:96) α of those lines are most likely to be tangential to a true boundary, asindicated by the rainbow-color coding of the line C w in Figs. 19, 22 and 25. Thissuggests that instead of working with the full line C tagged by the algorithm, we caninstead focus our attention on the individual elements ∆ (cid:96) α from it with the largestcontributions to the average flux through C . The straightforward application of thismethod, however, will reintroduce sensitivity to local gradient fluctuations. In orderto avoid this, we propose to rank the individual elements ∆ (cid:96) α not by their averageflux (cid:104) (cid:126)G (cid:105) α · ˆ n C , as was done in Figs. 19, 22 and 25, but by the rescaled average flux (cid:16) (cid:104) (cid:126)G (cid:105) α · ˆ n C (cid:17) × (cid:12)(cid:12) ¯Γ[ C ] (cid:12)(cid:12) γ , (7.3)where the power γ is a suitably chosen positive parameter, which interpolates smoothlybetween “complete locality” ( γ = 0 ) and “complete globality” ( γ → ∞ ). The scalingby a power of ¯Γ[ C ] in eq. (7.3) ensures that a given element ∆ (cid:96) α is judged not only bythe local flux going through it, but also by its association with a suitable womblingboundary candidate line C . By increasing the power γ , we can suppress the effects– 38 – igure 26 . The individual segment selection procedure described in Sec. 7.2.2. After scanningthe ( p in , p out ) parameter space on an × grid and computing the wombling measures ¯Γ[ C ] , theaverage flux for each individual line segment ∆ (cid:96) α has been rescaled according to (7.3) with γ = 4 and the plot shows (in red) the segments with rescaled flux values in the top 1 percentile. of local statistical fluctuations, and in the limit of γ → ∞ we eventually recover ourprevious results where one would select all the segments ∆ (cid:96) α belonging to the bestwombling candidate line C w . However, using finite values of γ allows us to 1) let in“the best” individual segments ∆ (cid:96) α from other candidate lines C which are not C w ,but have comparably large values of ¯Γ[ C ] , and 2) eliminate from consideration thoseindividual segments ∆ (cid:96) α from the winning line C w whose local flux values are toolow — presumably such elements belong to C w only because we have used the wrongansatz for the shape of the boundary. This approach strikes the right balance betweenthe “locality” of the flux through an individual element ∆ (cid:96) α and the “globality” (in themethod of calculation) of the wombling measure ¯Γ[ C ] . The procedure is illustrated inFig. 26, where we show the individual line segments ∆ (cid:96) α whose rescaled flux values(7.3) are in the top 1 percentile, after scanning the ( p in , p out ) parameter space on an × grid. We see that, despite using the wrong ansatz (straight lines), the circularboundary is reconstructed quite well, with only a few stragglers showing up in thebulk. In the previous sections we discussed different techniques for identifying wombling bound-aries in point datasets. In applications to collider event data in high-energy physics, thepresence of a wombling boundary could be indicative of new physics, if its location is in aregion of phase space which is unremarkable from the point of view of the SM background.– 39 – igure 27 . A heat map of wombling measures | ¯Γ[ C ] | analogous to Figs. 18, 21 and 24, but for atypical background pseudo-experiment where the point data is generated from the pure-backgrounddistribution (2.2) and then a single Lloyd iteration is applied. However, before claiming a discovery, one must be confident that such wombling boundariescannot be accidentally generated by SM data alone. For this purpose, it is necessary to sup-plement any proposed wombling technique with a corresponding prescription for assessingthe significance of any reconstructed wombling boundary. Since previous work [16, 22, 23]did not address this issue, we shall now do so using a frequentist approach.The end result of the wombling method described in Sec. 7 was the selection of the bestpossible wombling line candidate C w , together with its corresponding wombling measure ¯Γ w ≡ ¯Γ[ C w ] . (8.1)In order to use | ¯Γ w | as our test statistic, we need to know its distribution under the pure-background hypothesis. For this purpose, we generate a number of pseudo-experimentswhere the point data is generated from the pure-background distribution (2.2), and foreach pseudo-experiment, we repeat the wombling analysis from the previous section. Atypical result from one such pseudo-experiment is shown in Fig. 27, where, in analogy toFigs. 18, 21 and 24, we show a heat map of | ¯Γ[ C ] | in the ( p in , p out ) parameter space. We seethat, as expected, in the absence of a real signal the typical values for the wombling measureare relatively low almost everywhere in the ( p in , p out ) parameter space, except at one veryspecial location near (1 , . It is not difficult to realize that this location corresponds to veryshort candidate lines C which “cut” the lower right corner of our field of view. We alreadyalluded to this problem at the very end of Sec. 7.1.2, and our proposed solution was simplyto exclude such very short lines from consideration . Therefore, when we derive the | ¯Γ w | Another possibility could be to use the unnormalized wombling measure (6.1). We shall leave this – 40 – .06 0.08 0.10 0.12 0.14 0.16| w |0510152025303540 C o un t Figure 28 . Distributions of | ¯Γ w | for two populations of 100 pseudo-experiments each, and with N = 1000 points per pseudo-experiment. The blue histogram corresponds to data generated fromthe pure-background distribution (2.2) as in the left panel of Fig. 1. The orange histogram representspseudo-experiments with added signal with a straight-line boundary as in the middle panel of Fig. 1,with the points sampled from (2.5) with ρ = 1 . . In either case, the wombling measure ¯Γ[ C ] wascomputed throughout the ( p in , p out ) parameter space on an × grid and the best wombling line C w of minimum length √ / was identified following the procedure of Sec. 7. distributions below, we shall apply a minimum cut on the allowed length of any candidateline C . In order to make sure this problem does not reappear, we shall conservativelyincrease our previous minimum length cut from . to √ / , which is the length of the lineconnecting the midpoints of any two neighboring edges of our field of view.With those preliminaries, we are ready to compare the distributions of our test statistic | ¯Γ w | for signal and background. The blue histogram in Fig. 28 shows the | ¯Γ w | distributionfor 100 pure-background pseudo-experiments with N = 1000 points each, where the datawas generated from the pure-background distribution (2.2) as in the left panel of Fig. 1. Atthe preprocessing stage, we increased the number of Lloyd iterations to 10, since we shallbe interested in the case of relatively weak signals (see the related discussion in Sec. 7.1.2and Ref. [16]). Following the procedure of Sec. 7, we then computed the wombling measure ¯Γ[ C ] on an × grid in the ( p in , p out ) parameter space and the best wombling line C w (ofminimum length √ / ) was identified and its wombling measure (8.1) was entered in thehistogram. The resulting distribution is relatively concentrated around a mean of 0.08, andextends up to 0.11, which sets the lower limit on the target range for signal detection. Forillustration, in Fig. 29 we show results for one typical pure-background pseudo-experimentwhose value for ¯Γ w is equal to the mean of the distribution shown in Fig. 28.Given the background distribution from Fig. 28, we can now assess what types of signals option open for a future investigation [69]. – 41 – p in p o u t Figure 29 . The point data (left panel) and the | ¯Γ[ C ] | heatmap (right panel) for a representativepure-background pseudo-experiment entering the blue histogram in Fig. 28. might be discoverable. Obviously, the larger the signal component, the more pronouncedthe wombling boundary. In our conventions, the signal strength was parametrized by the ρ parameter. For example, in Sec. 7.1.1 we saw that for ρ = 5 we obtained ¯Γ w ∼ . ,while the weaker signal with ρ = 1 . in Sec. 7.1.2 resulted in only ¯Γ w ∼ . (note thatin Sec. 7.1.2 the data was preprocessed with only 5 Lloyd steps; adding 5 additional stepsas was done in Fig. 28 would further reduce the value of ¯Γ w slightly). Given the pure-background distribution in Fig. 28, it is clear that in both of those examples the observedeffect could not have been attributed to a background fluctuation and would represent adiscovery. At the same time, a careful inspection of the left panel in Fig. 20 shows that the ρ = 1 . example of Sec. 7.1.2 was rather “lucky” due to fortuitous fluctuations in the datanear the theoretical boundary. In order to estimate the prospects for a more typical signalscenario, we simulate 100 pseudo-experiments with N = 1000 data points each, generatedfrom the distribution (2.5) with ρ = 1 . . The corresponding distribution of the test statistic | ¯Γ w | for those signal pseudo-experiments is shown with the orange histogram in Fig. 28. Wesee that, on average, the values of | ¯Γ w | are larger in the presence of a signal — the meanof the orange histogram is shifted to 0.10. Comparing the tails of the two distributions,we find that in 40% of the cases, the signal is discoverable at 2 sigma and in 16% of thecases it is discoverable at 3 sigma. These prospects can probably be further improved byoptimizing the different aspects of our wombling algorithm, but such an optimization isoutside the scope of this paper. In this paper we reviewed and refined the existing procedures for identifying womblingboundaries in point datasets. Our interest in this topic stems from the fact that highenergy physics collider data can be viewed as point data in the relevant phase space of thefinal state signature. For better visual illustration, we considered point data examples in twodimensions, but our technique can be readily generalized to higher dimensional data. We– 42 –roposed several modifications to the standard algorithm which lead to improved detectionefficiency and significance: • We advocated the use of the Delaunay triangulation of the data instead of (or perhapsin addition to) the Voronoi tessellation of the data. We argued that the Delaunaytessellation is the natural framework for computing the local gradient vectors whichis the first and most important step of any wombling algorithm. • We considered three different techniques for reducing the effect of statistical fluctua-tions:1. Rescaling the local gradient vectors as in (3.8), see Section 3.3.1.2. Applying Voronoi relaxation of the data via several Lloyd iterations as a prepro-cessing step, see Section 3.3.2.3. Local averaging of the gradient vectors, which can be performed either at thelevel of a Voronoi cell (3.9) or at the level of a Delaunay triangle (3.10), seeSection 3.3.3. • We studied new tagging variables (dot products of neighboring averaged gradientvectors) for selecting elements of the tessellation marking the location of a womblingboundary. In Section 4 we showed that the new variables have improved selectionefficiency, since they take into account the spatial correlations among neighboringgradient vectors along the wombling boundary. In Section 5 we pointed out that thenew variables additionally can be used to naturally connect the tagged elements intocontinuous boundaries. • In Secs. 6 and 7 we explored the idea of Refs. [40, 63] to rank wombling boundarycandidates C by a global wombling measure, e.g., Γ[ C ] from (6.1) or ¯Γ[ C ] from (6.2).On the basis of several toy examples we showed that this approach is successful inidentifying the correct boundary, and with a slight modification (7.3) can be used evenwhen the shape of the boundary is different from the assumed ansatz, see Sec. 7.2.2. • In Sec. 8 we showed how one can estimate the statistical significance of any detectedwombling boundary using a frequentist approach.The present study complements and further expands the work of Refs. [16, 22, 23, 25, 56]in an interesting direction which, while popular in other fields of science, is still rather new tothe field of high energy physics. We believe that our investigations here are only scratchingthe surface of what could be a very promising research thrust. In particular, the approachof treating high energy collider data as point data and studying its geometric properties iscomplementary to the existing binning techniques and in the long run could prove to bemore suitable to the application of modern machine learning techniques [70, 71].– 43 – cknowledgments
We thank D. Debnath, J. Gainer, C. Kilic, D. Kim and Y.-P. Yang for useful discussions. PSis grateful to the LHC Physics Center at Fermilab for hospitality and financial support aspart of the Guests and Visitors Program in the summer of 2019. The work of PS is supportedby the University of Florida CLAS Dissertation Fellowship funded by the Charles Vincentand Heidi Cole McLaughlin Endowment. This work was supported in part by the UnitedStates Department of Energy under Grant No. DE-SC0010296.
References [1] E. M. Metodiev, B. Nachman and J. Thaler, “Classification without labels: Learning frommixed samples in high energy physics,” JHEP , 174 (2017) [arXiv:1708.02949 [hep-ph]].[2] J. A. Aguilar-Saavedra, J. H. Collins and R. K. Mishra, “A generic anti-QCD jet tagger,”JHEP , 163 (2017) [arXiv:1709.01087 [hep-ph]].[3] J. H. Collins, K. Howe and B. Nachman, “Anomaly Detection for Resonant New Physics withMachine Learning,” Phys. Rev. Lett. , no. 24, 241803 (2018) [arXiv:1805.02664 [hep-ph]].[4] A. De Simone and T. Jacques, “Guiding New Physics Searches with Unsupervised Learning,”Eur. Phys. J. C , no. 4, 289 (2019) [arXiv:1807.06038 [hep-ph]].[5] J. Hajer, Y. Y. Li, T. Liu and H. Wang, “Novelty Detection Meets Collider Physics,”arXiv:1807.10261 [hep-ph].[6] T. Heimel, G. Kasieczka, T. Plehn and J. M. Thompson, “QCD or What?,” SciPost Phys. ,no. 3, 030 (2019) [arXiv:1808.08979 [hep-ph]].[7] M. Farina, Y. Nakai and D. Shih, “Searching for New Physics with Deep Autoencoders,”arXiv:1808.08992 [hep-ph].[8] A. Casa and G. Menardi, “Nonparametric semisupervised classification for signal detection inhigh energy physics,” arXiv:1809.02977 [stat.AP].[9] O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu and J. R. Vlimant, “VariationalAutoencoders for New Physics Mining at the Large Hadron Collider,” JHEP , 036(2019) [arXiv:1811.10276 [hep-ex]].[10] J. H. Collins, K. Howe and B. Nachman, “Extending the search for new resonances withmachine learning,” Phys. Rev. D , no. 1, 014038 (2019) [arXiv:1902.02634 [hep-ph]].[11] T. S. Roy and A. H. Vijay, “A robust anomaly finder based on autoencoder,”arXiv:1903.02032 [hep-ph].[12] B. M. Dillon, D. A. Faroughy and J. F. Kamenik, “Uncovering latent jet substructure,” Phys.Rev. D , no. 5, 056002 (2019) [arXiv:1904.04200 [hep-ph]].[13] A. Blance, M. Spannowsky and P. Waite, “Adversarially-trained autoencoders for robustunsupervised new physics searches,” JHEP , 047 (2019) [arXiv:1905.10384 [hep-ph]].[14] A. Mullin, H. Pacey, M. Parker, M. White and S. Williams, “Does SUSY have friends? Anew approach for LHC event analysis,” arXiv:1912.10625 [hep-ph].[15] B. Nachman and D. Shih, “Anomaly Detection with Density Estimation,” arXiv:2001.04990[hep-ph]. – 44 –
16] D. Debnath, J. S. Gainer, D. Kim and K. T. Matchev, “Edge Detecting New Physics theVoronoi Way,” EPL , no.4, 41001 (2016) [arXiv:1506.04141 [hep-ph]].[17] D. Debnath, J. S. Gainer, D. Kim and K. T. Matchev, “Discovering New Physics withVoronoi Tessellations,” [arXiv:1511.02724 [hep-ph]].[18] I. Kim, “Algebraic Singularity Method for Mass Measurement with Missing Energy,” Phys.Rev. Lett. , 081601 (2010) [arXiv:0910.1149 [hep-ph]]. LaTeX (US)[19] A. Rujula and A. Galindo, “Measuring the W-Boson mass at a hadron collider: a study ofphase-space singularity methods,” JHEP , 023 (2011) [arXiv:1106.0396 [hep-ph]].[20] A. De Rujula and A. Galindo, “Singular ways to search for the Higgs boson,” JHEP , 091(2012) [arXiv:1202.2552 [hep-ph]].[21] P. Agrawal, C. Kilic, C. White and J. Yu, “Improved Mass Measurement Using the Boundaryof Many-Body Phase Space,” Phys. Rev. D , no.1, 015021 (2014) [arXiv:1308.6560[hep-ph]]. LaTeX (US)[22] D. Debnath, J. S. Gainer, C. Kilic, D. Kim, K. T. Matchev and Y. Yang, “Identifying PhaseSpace Boundaries with Voronoi Tessellations,” Eur. Phys. J. C , no.11, 645 (2016)[arXiv:1606.02721 [hep-ph]].[23] D. Debnath, J. S. Gainer, C. Kilic, D. Kim, K. T. Matchev and Y. Yang, “Detectingkinematic boundary surfaces in phase space: particle mass measurements in SUSY-likeevents,” JHEP , 092 (2017) [arXiv:1611.04487 [hep-ph]].[24] B. Altunkaynak, C. Kilic and M. D. Klimek, “Multidimensional phase space methods formass measurements and decay topology determination,” Eur. Phys. J. C , no.2, 61 (2017)[arXiv:1611.09764 [hep-ph]].[25] D. Debnath, J. S. Gainer, C. Kilic, D. Kim, K. T. Matchev and Y. Yang, “Enhancing thediscovery prospects for SUSY-like decays with a forgotten kinematic variable,” JHEP , 008(2019) [arXiv:1809.04517 [hep-ph]].[26] D. Kim, K. T. Matchev and P. Shyamsundar, “Kinematic Focus Point Method for ParticleMass Measurements in Missing Energy Events,” JHEP , 154 (2019) [arXiv:1906.02821[hep-ph]].[27] K. T. Matchev and P. Shyamsundar, “Singularity Variables for Missing Energy EventKinematics,” JHEP , 027 (2020) [arXiv:1911.01913 [hep-ph]].[28] W. H. Womble, “Differential Systematics," Science , No. 2961, 315 (1951).[29] M. R. T. Dale and M.-J. Fortin, “Spatial Analysis: A Guide for Ecologists”, CambridgeUniversity Press, 2014.[30] G. Barbujani, N. L. Oden and R. R. Sokal, “Detecting Regions of Abrupt Change in Maps ofBiological Variables,” Systematic Biology , No 4, 376 (1989).[31] G. Barbujani, G. M. Jacquez and L. Ligi, “Diversity of some gene frequencies in Europeanand Asian populations. V. Steep multilocus clines,” American Journal of Human Genetics , 867 (1990).[32] N. L. Oden R. R. Sokal, M.-J. Fortin and H. Goebl, “Categorical Wombling: DetectingRegions of Significant Change in Spatially Located Categorical Variables,” GeographicalAnalysis , 315 (1993). – 45 –
33] J. P. Bocquet-Appel and J. N. Bacro, “Generalized Wombling,” Systematic Biology , No 3,442 (1994).[34] M.-J. Fortin, “Edge Detection Algorithms for Two-Dimensional Ecological data,” Ecology ,No 4, 956 (1994).[35] M.-J. Fortin, “Effects of Data Types on Vegetation Boundary Delineation,” Canadian Journalof Forest Research , 1851 (1997).[36] G. M. Jacquez, “The map comparison problem: tests for the overlap of geographicboundaries.” Statistics in medicine , 2343 (1995).[37] G. M. Jacquez and D. A. Greiling, “Geographic boundaries in breast, lung and colorectalcancers in relation to exposure to air toxics in Long Island, New York,” International Journalof Health Geographics , 323 (1995).[40] S. Banerjee and A. E. Gelfand, “Bayesian Wombling,” Journal of the American StatisticalAssociation, , No 476, 1487 (2006).[41] G. M. Jacquez, S. Maruca and M.-J. Fortin, “From fields to objects: A review of geographicboundary analysis,” Journal of Geographical Systems , 221 (2000).[42] S. Okabe, B. Boots and K. Sugihara, “Spatial Tessellations: Concepts and Applications ofVoronoi Diagrams,” John Wiley & Sons, 1992.[43] M. Cacciari, G. P. Salam and G. Soyez, “FastJet User Manual,” Eur. Phys. J. C , 1896(2012) [arXiv:1111.6097 [hep-ph]].[44] B. Knuteson and B. Padley, “Statistical challenges with massive data sets in particlephysics,” [arXiv:hep-ex/0305064 [hep-ex]].[45] B. Knuteson, “Systematic analysis of high-energy collider data,” Nucl. Instrum. Meth. A , 7-14 (2004) [arXiv:hep-ex/0402029 [hep-ex]].[46] B. Knuteson, ‘Systematic analysis of frontier energy collider data,” [arXiv:hep-ex/0504041[hep-ex]].[47] B. Abbott et al. [D0], “A quasi-model-independent search for new high p T physics at DØ,”Phys. Rev. Lett. , 3712-3717 (2001) [arXiv:hep-ex/0011071 [hep-ex]].[48] B. Abbott et al. [D0], “Search for new physics in eµX data at DØ using Sherlock: A quasimodel independent search strategy for new physics,” Phys. Rev. D , 092004 (2000)[arXiv:hep-ex/0006011 [hep-ex]].[49] V. Abazov et al. [D0], “A Quasi model independent search for new physics at large transversemomentum,” Phys. Rev. D , 012004 (2001) [arXiv:hep-ex/0011067 [hep-ex]].[50] A. Aktas et al. [H1], “A General search for new phenomena in ep scattering at HERA,” Phys.Lett. B , 14-30 (2004) [arXiv:hep-ex/0408044 [hep-ex]].[51] T. Aaltonen et al. [CDF], “Model-Independent and Quasi-Model-Independent Search for NewPhysics at CDF,” Phys. Rev. D , 012002 (2008) [arXiv:0712.1311 [hep-ex]]. – 46 –
52] T. Aaltonen et al. [CDF], “Model-Independent Global Search for New High-p(T) Physics atCDF,” [arXiv:0712.2534 [hep-ex]].[53] T. Aaltonen et al. [CDF], “Global Search for New Physics with 2.0 fb − at CDF,” Phys. Rev.D , 011101 (2009) [arXiv:0809.3781 [hep-ex]].[54] M. Cappellari, “Voronoi binning: Optimal adaptive tessellations of multi-dimensional data,”Invited review for the volume “Tessellations in the Sciences: Virtues, Techniques andApplications of Geometric Tilings”, eds. R. van de Weijgaert, G. Vegter, J. Ritzerveld andV. Icke, Kluwer/Springer, Berlin (2009) [arXiv:0912.1303 [astro-ph.IM]].[55] S. E. Stead, “Estimation of gradients from scattered data,” Rocky Mountain J. Math. , no.1, 265 (1984).[56] D. Debnath, “Generic and Sensitive Searches for New Physics,” PhD Thesis, University ofFlorida (2018).[57] M. R. T. Dale and M.-J. Fortin, “From Graphs to Spatial Graphs,” Annual Review ofEcology, Evolution, and Systematics , 21 (2010).[58] D. Alves et al. [LHC New Physics Working Group], “Simplified Models for LHC New PhysicsSearches,” J. Phys. G , 105005 (2012) [arXiv:1105.2838 [hep-ph]].[59] S. P. Martin, “A Supersymmetry primer,” Adv. Ser. Direct. High Energy Phys. , 1-153(2010) [arXiv:hep-ph/9709356 [hep-ph]].[60] D. Costanzo and D. R. Tovey, “Supersymmetric particle mass measurement with invariantmass correlations,” JHEP , 084 (2009) [arXiv:0902.2331 [hep-ph]].[61] D. Kim, K. T. Matchev and M. Park, “Using sorted invariant mass variables to evadecombinatorial ambiguities in cascade decays,” JHEP , 129 (2016) [arXiv:1512.02222[hep-ph]].[62] S. Banerjee, “Spatial gradients and wombling,” In Handbook of Spatial Statistics. Ed(s) P.Diggle, M. Fuentes, A. E. Gelfand and P. Guttorp, Taylor and Francis, Boca Raton, FL(2010).[63] A. E. Gelfand and S. Banerjee, “Bayesian wombling: finding rapid change in spatial maps,”WIREs Comput. Stat. , 307 (2015).[64] K. Koufos and C. P. Dettmann, “Distribution of Cell Area in Bounded Poisson VoronoiTessellations with Application to Secure Local Connectivity,” J. Stat. Phys. , 1296 (2019)[arXiv:1612.02375 [cs.NI]][65] M. L. V. Pitteway, “Computer graphics research in an academic environment,” Datafair ‘73(1973).[66] D. H. McLain, “Two dimensional interpolation from random data,” The Computer Journal , 178 (1976).[67] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory , (2) 129, (1982).[68] “Envelope (Mathematics),” n.d., https://en.wikipedia.org/wiki/Envelope_(mathematics) .[69] K. T. Matchev, A. Roman and P. Shyamsundar, in preparation.[70] K. Albertsson et al., “Machine Learning in High Energy Physics Community White Paper,”J. Phys. Conf. Ser. , no.2, 022008 (2018) [arXiv:1807.02876 [physics.comp-ph]]. – 47 –
71] D. Bourilkov, “Machine and Deep Learning Applications in Particle Physics,” Int. J. Mod.Phys. A , no.35, 1930019 (2020) [arXiv:1912.08245 [physics.data-an]]., no.35, 1930019 (2020) [arXiv:1912.08245 [physics.data-an]].