Nonparametric estimation of directional highest density regions
NNonparametric estimation of directional highestdensity regions
Paula Saavedra-Nieves Rosa M. CrujeirasDepartment of Statistics, Mathematical Analysis and OptimizationUniversidade de Santiago de Compostela {paula.saavedra, rosa.crujeiras}@usc.es
Abstract
Reconstruction of sets from a random sample of points intimately related to them is thegoal of set estimation theory. Within this context, a particular problem is the one related withthe reconstruction of density level sets and specifically, those ones with a high probabilitycontent, namely highest density regions.We define highest density regions for directional data and provide a plug-in estimator,based on kernel smoothing. A suitable bootstrap bandwidth selector is provided for the prac-tical implementation of the proposal. An extensive simulation study shows the performanceof the plug-in estimator proposed with the bootstrap bandwidth selector and with other band-width selectors specifically designed for circular and spherical kernel density estimation. Themethodology is applied to analyze two real data sets in animal orientation and seismology.
Keywords: bootstrap, directional data, highest density regions, kernel density estimation,level sets.
Contents a r X i v : . [ s t a t . M E ] N ov .2 A suitable bootstrap bandwidth selector . . . . . . . . . . . . . . . . . . . . . . . . 13 A.1 Levels to the estimated HDRs disaggregating the sandhoppers variables . . . . . 35A.2 Interactive representation of HDRs for eathquakes on Earth . . . . . . . . . . . . 35
B Simulated spherical models 36C Some details on the directional bandwidth selectors 37
Set estimation is focused on the reconstruction of a set (or the approximation of any of its char-acteristic features such as its boundary or its volume) from a random sample of points. One ofthe specific topics in this area is concerned with the estimation of sets directly related to densityfunctions such as level sets. Mathematically, for a given level t >
0, the goal is to reconstructthe unknown set G g ( t ) = { x ∈ R d : g ( x ) ≥ t } (1)from random sample of points of a density function g on R d . This topic has received consider-able attention in the statistical literature, specially since the notion of population clusters wasestablished in [34] as the connected components of the set in (1). This cluster definition reliesclearly on the user-specified level t , so for addressing this problem, an algorithm for estimatingthe smallest level with more than a single connected component was proposed in [65]. Further-more, interesting applications of this clustering approach have emerged into different fieldssuch as astronomy in [37]; cytometry in [57]; detection of mine fields in [35]; detection of out-liers in [27] or [44] and quality control in [21], [8] and [7]. For a general review on clustering,see [3], [30], [18] and [56].The rationale for establishing this definition of cluster is quite related to the notion of mode.In fact, several cluster algorithms are based on the detection of modes (see, for example, [64])noting that the number of modes (local maxima of f) is not usually smaller than the number2f clusters. Nevertheless, the concept of cluster is easier to handle, since it has a global andgeometrical nature, whereas the local maxima depend on analytical properties. There existseveral works in literature dealing with the issue of inference on the number of modes with anapproach based on density estimates (see [63], [40], [47], [39], [33] and [25]), but restricted up todimension two. More recent contributions on this perspective are analyzed in [4], [66] and [12].There are also contributions from a testing perspective, with extensions for circular data (see [1]and [2], and references therein).The number of clusters is a basic feature for a statistical population. However, the problemof its estimation is not always taken into account in cluster analysis where it is usually chosenby the practitioner as a first step. Since the number of clusters is equal to the number of con-nected components of a level set, a very natural estimator for this populational parameter isthe number of the connected components of the level set reconstruction. This perspective thatsolves the problem of selecting this unknown population parameter is considered, for instance,in [16], [17] and [10].Although initially established for a density supported on an Euclidean space such as inequation (1), some generalizations of level set estimation theory have been already introducedin the literature. For instance, estimation of level sets for general functions (not necessarilya density) is considered in [19]. As an illustration, the authors show a first approach for theestimation of density level sets from data on a sphere. More recently, the reconstruction ofdensity level sets on manifolds is studied in [14]. Through some simulations, the behavior ofthe proposed method is analyzed on the torus and on the sphere. Despite the introduction ofthese specific extensions, the general problem of reconstructing density level sets from randomsamples in the directional setting (that is, on a d -dimensional sphere) has not been formallyestablished yet.Unfortunately, for most practical purposes, the specific value of the level t in (3) is fully un-known by the practitioner. In addition, areas of the distribution support where f is close to zero( non-effective support ) are usually of limited interest for applications. Therefore, in this work, weaim to introduce an alternative definition of directional level set where the practitioner estab-lishes the probability content instead of the level t . This type of regions is known as highestdensity regions (HDRs) in the Euclidean space, so an appropriate definition of HDRs for thedirectional setting as well as a procedure for their estimation in practice (based on plug-in ideasconsidering a kernel density estimator) is presented in this work. For the practical computationof the proposed plug-in estimator, a bootstrap bandwidth designed for reconstructing direc-tional HDRs is also introduced. Its performance is analyzed through an extensive simulationstudy and compared with other bandwidth selectors specifically devised for density estimation.One may argue that such an absence of a general and effective proposal for directional levelset estimation may be due to a lack of practical interest, but this is far from the truth, so letus present two application examples that motivate the developments in this work. The firstone concerns a problem from animal orientation studies and the second one is related to earth-3uakes occurrences. Animal orientation example.
Behavioral plasticity is considered by biologists as a feature of adap-tation to changing beach environments. In particular, orientation is an adaptation characteristicthat can not be modified by a single factor. Nonetheless, experts found some regularities in theorientation of sandhoppers and other animals from beach environments by changing one factorat a time under other controlled conditions. (cid:113)
Zouara beach
Figure 1: Geographical location of Zouara beach (right). Talorchestia brito (center) and Talitrus saltator(left).
For instance, the orientation of two sandhoppers species (Talitrus saltator and Talorchestiabrito) is analyzed in [61]. The experiment was carried out on the exposed non-tidal sand ofZouara beach located in the Tunisian northwestern coast. Both species are shown in Figure 1.Bottom pictures can be found in [20]. Apart from the specie and the orientation angles, thisdataset contains information about other variables such as sex (male, female), month (April,October) and moment of the day (morning, afternoon and noon) when the experiment wasdone. We refer to [61] and [45] for further details on the dataset and the experimental design.Comparing the two species through regresion procedures, [61] conclude that Talitrus salta-tor showed more differentiated orientations, depending on the time of day, period of the yearand sex, with respect to Talorchestia brito. Moreover, it seems that Talitrus saltator shows ahigher flexibility (variation) of orientation than Talorchestia brito under the same environmen-tal conditions, supporting the hypothesis that the former has a higher level of terrestrialization.As an illustration, Figure 2 (left panel) shows the 36 orientation points (slightly jittered) corre-sponding to males of the specie Talitrus saltator measurements during the noon in April. It alsocontains the 77 angles (slightly jittered) when the measures are taken in October (Figure 2, right4anel). Differences in the distribution on the circle of these two samples can be easily observed.Therefore, the month of the year seems to play a significant role in sandhoppers behavior. Inparticular, two clusters for October measurements can be detected around the angle π but theyare not present for the April sample. Similar comments could be done for the situation regis-tered around the angles π /2. Therefore, cluster identification under the established conditionscan be considered as an useful alternative to analyze sandhoppers orientation. p p p lllll ll l ll l l lllllllll l l llll llll ll lll p p p llll l ll l ll l l ll llll ll lllllll l l lll lllll l ll l lllllllll l lll lll ll l l ll ll llll ll l l lll l Figure 2: Orientation data (slightly jittered) corresponding to males of the specie Talitrus saltator registeredin the noon in April (left) and October (right).
Earthquakes occurrences.
The European-Mediterranean Seismological Centre (EMSC) is anon-governmental and non-profit organisation that has been established in 1975 at the requestof the European Seismological Commission. Since the European-Mediterranean region has suf-fered several destructive earthquakes, there was a need for a scientific organisation to be incharge of the determination, as quickly as possible (within one hour of the earthquake occur-rence), of the characteristics of such earthquakes. These predictions are based on the seismo-logical data received from more than 65 national seismological agencies, mostly in the Euro-Med region. Figure 3 (left) shows the geographical coordinates (red points), downloaded fromEMSC website, of a total of 272 medium and strong world earthquakes registered between1 th October 2004 and 9 th April 2020. The magnitude of all these events is at least 2.5 degreeson the Richter scale. Of course, these planar points correspond to spherical coordinates onEarth. Due to the important damages that earthquakes cause, cluster detection could be usefulto identify, from a real dataset, where earthquakes are specially likely. This information is keyfor decision-making, for example, to update construction codes guaranteeing a better buildingseismic-resistance. An interactive representation of the sphere can be seen in Appendix C. European-Mediterranean Seismological Centre: . igure 3: Distribution of earthquakes around the world between October 2004 and April 2020 (left). Den-sity level set contour obtained from the sample of world earthquakes registered between October 2004 andApril 2020 (right). This paper is organized as follows. Section 2 contains some background on level set estimationin the Euclidean setting, extending the definition for directional data and proposing a plug-inestimator. HDRs are the topic of Section 3, where a proper definition, jointly with a plug-in es-timator are presented. This plug-in estimator is based on a directional kernel density estimator,which requires a smoothing parameter (bandwidth) for practical implementation. An appro-priate bootstrap bandwidth selector is also introduced in this section. Section 4 presents an ex-tensive simulation study illustrating the performance of the plug-in estimator for the HDRs (forcircular and spherical domains) with the proposed bandwidth selector, comparing the resultswith those provided when other directional bandwidth selector criteria are considered. Theproposed methodology is applied to the two real data examples presented in the Introduction.Finally, some conclusions and ideas for further research are presented in Section 6. This work iscompleted with some supplementary material. Appendix A includes further information on thedatasets. Appendix B specifies the parameters taken for the construction of the spherical den-sities in the simulation study. Appendix C collects the description of the bandwidth selectorsconsidered in the simulation study.
The specific problem of reconstructing density level sets in the directional setting is addressedin this section: a definition of directional level set is provided jointly with a plug-in estimator.Based on the real data and simulated examples, some discussion about how to measure theestimation error is also included. 6 .1 On directional level sets
Consider a random vector X taking values on a d -dimensional unit sphere S d − with density f .Given a level t >
0, the directional level set is defined as: G f ( t ) = { x ∈ S d − : f ( x ) ≥ t } . (2)Note that each x ∈ S d − fully characterizes a point in θ ∈ [
0, 2 π ) d − . Therefore, definition in (2)could have been also equivalently established as a subset of points in [
0, 2 π ) d − . p p p llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll Figure 4: For thee different circular densities, G f ( t ) for t = t (first column), t = t (second column) and t = t (third column) verifying 0 < t < t < t . Equivalently, L ( f τ ) for τ = τ = τ = The nature of different level sets is shown in Figure 4, wich represents G f ( t ) in grey color7or three different circular densities and three different values of the level t . The threshold t isrepresented through a dotted grey line. Note that, if large values of t are considered (bottomrow in Figure 4), G f ( t ) coincides with the greatest modes. However, for small values of t , thelevel set G f ( t ) is virtually equal to the support of the distribution.It is important to noticed that, following [34], the concept of cluster in directional setting canbe established as the connected components of the level set G f ( t ) . With this view in mind, notethat the density represented in the second row of Figure 4 presents four connected componentsfor all of the considered values for t , determining four population clusters.Plug-in estimation is the most natural and common choice for reconstructing density levelsets in the Euclidean space. A review of other existing estimation alternatives can be seen in [58].Plug-in methods are devised to reconstruct (1) asˆ G g ( t ) = { x ∈ R d : g n ( x ) ≥ t } where g n usually denotes the classical kernel estimator for euclidean data (see [51] and [59]).This methodology, which has received considerable attention (see, for instance, [69], [6], [46],[55], [41], [53] or [13]) can be easily generalized to the directional setting. Given a randomsample X n = { X , · · · , X n } ∈ S d − of the unknown directional density f , G f ( t ) in (2) can bereconstructed as ˆ G f ( t ) = { x ∈ S d − : f n ( x ) ≥ t } (3)where f n denotes a nonparametric directional density estimator. Following the ideas of theclassical linear (for real-valued random variables) kernel estimator, a kernel estimator on S d − is provided in [5]. Strong pointwise consistency, uniform consistency, and L − norm consis-tency of the estimator are proved. Almost simultaneously, a similar kernel density estimationprocedure also on S d − is presented in [32]. Some of the results in [32] are later extended in [38].Following [5], from a random sample on a d -dimensional sphere, X n , the directional kerneldensity estimator at a point x ∈ S d − is defined as f n ( x ) = n n ∑ i = K vM ( x ; X i ; 1/ h ) , (4)where 1/ h > K vM denotes the von Mises-Fisher kernel den-sity. The von Mises-Fisher distribution plays the role of the normal distribution in directionalsetting ( [43]). Formally, its density function can be written as K vM ( x ; µ ; κ ) = C d ( κ ) exp { κ x T µ } , with C d ( κ ) = κ d − ( π ) d + I d − ( κ ) where µ ∈ S d − is the directional mean, κ > T I p is the modified Bessel function of order p , given by I p ( z ) = ( z ) p π Γ ( p + ) (cid:90) − ( − t ) p − e zt dt where Γ ( p ) = (cid:82) ∞ x p − e − x dx , with p > − h plays an analogous role to the bandwidth in theEuclidean case. For small values of 1/ h , the density estimator is oversmoothed. The oppositeeffect is obtained as 1/ h increases: with a large value of 1/ h , the estimator is clearly under-smoothing the underlying target density. Hence, the choice of h is a crucial issue. For simplicity,in what follows, we refer to h as bandwidth parameter. Several approaches for selecting h inpractice, in circular and even directional settings, have been proposed in the literature. How-ever, no one of these existing proposals was designed focusing on the problem of directionallevel set estimation. For real-valued random variables, this problem was already widely treated.See, for instance, [7], [62], [60], [54] and [24]. Figure 5 shows three plug-in estimators ˆ G f ( t ) for models (black colour) and levels t i , i =
1, 2, 3(dotted grey line) considered in Figure 4. Kernel density estimators (grey color) in (4) have beendetermined from samples of size 250 considering the proposal in [48] as smoothing parame-ter. Note that ˆ G f ( t ) in third column presents two connected components. However, Figure 4shows that the theoretical level set G f ( t ) has exactly three. Therefore, the estimation error isconsiderable and distances between sets should be used to measure it.Since ( S , d E ) is a metric space when d E denotes the metric induced by the Euclidean norm (cid:107) · (cid:107) in S , it is possible to write d E ( x , y ) = ( x − y ) T ( x − y ) = x T x + y T y − x T y = ( − x T y ) for x , y ∈ S .Let us recall that, if A and B are non-empty compact sets in ( S , d E ) , the Hausdorff distancebetween A and B is established as follows d H ( A , B ) = max (cid:40) sup x ∈ A d E ( { x } , B ) , sup y ∈ B d E ( { y } , A ) (cid:41) where d E ( { x } , B ) = inf y ∈ B { d E ( x , y ) } . The metric d H is not completely successful in detectingdifferences in shape properties. In other words, two sets can be very close in d H and still showquite different shapes. This typically happens where the boundaries ∂ A and ∂ B are far apart, nomatter the proximity of A and B . So a natural way to reinforce the notion of visual proximitybetween two sets provided by Hausdorff distance is to account also for the proximity of the9espective boundaries. In particular, this error criterion is considered in [19] in order to establishthe consistency in the sphere of the plug-in estimator defined in (3). p p p llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllll l l l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Figure 5: Plug-in density level sets ˆ G f ( t ) from X for three different circular densities with t (firstcolumn), t (second column) and t (third column) verifying 0 < t < t < t . For instance, for the sandhoppers example, Figure 6 shows the plug-in estimators obtainedfor the two samples of sandhoppers represented in Figure 2. Note that the value of the level t considered is large enough in order to detect the greatest modes of the two sample distributionscorresponding to April and October samples. These results allow us to confirm the differencesbetween the two populations. The largest cluster of April orientations is located around theangle 7 π /4. However, the pattern observed for October registries is completely different. Al-though an only cluster is identified around the angle 3 π /2, if the level t decreases slightly twoadditional groups can be detected around the angles 3 π /4 and 5 π /4, respectively. p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Figure 6: Plug-in density level sets ˆ G f ( t ) obtained from the orientation samples corresponding males ofthe specie Talitrus saltator registered in the noon in April (left) and October (right). Regarding the earthquakes illustration, Figure 3 (right) shows the plug-in density level setcontour in blue obtained from the selected sample of world earthquakes considered. Chosinga convenient value of the level t , the greatest mode of sample distribution is identified in theSoutheast of Europe. Countries such as Italy, Greece or Turkey (located withint this cluster) areclearly risky areas. 10 HDRs in the directional setting
As noted in the Itroduction, the level t is usually unknown and, for practical purposes, theinterest usually focus on the effective support reconstruction for the density f considering afixed probability content. Figure 7 (top) shows four different 50% circular regions (regions con-taining 50% of the probability, empirically approximated) for the kernel density estimator f n represented in grey. Although all of them have probability content equal to 50%, they are com-pletely different. Therefore, it is obvious that there exists an infinite number of ways to choosea region with given coverage probability. Depending on the specific problem, a certain regionmay be selected but in a general scenario, it may not be clear which region must be chosen.The same happens for real-valued random variables, and [36] suggests that HDRs are the bestsubset to summarize a probability distribution. The concept of HDRs will be extended to thedirectional setting in what follows.The usual purpose in summarizing a probability distribution by a region of the sample spaceis to delineate a comparatively small set which contains most of the probability, although thedensity may be nonzero over infinite regions of the sample space. Therefore, as in the linearcase, it is necessary to decide what properties the region has to verify. The following conditionsare natural:(C1) The region should occupy the smallest possible volume in the sample space.(C2) Every point inside the region should have probability density at least as large as everypoint outside the region.Following [11], conditions (C1) and (C2 are equivalent and lead to regions called HDRs.Definition 3.1 formalizes this concept in the directional context taking into account the secondcriterion. Definition 3.1.
Let f be a directional density function on S d − of a random vector X. Given τ ∈ (
0, 1 ) ,the ( − τ ) % HDR is the subsetL ( f τ ) = { x ∈ S d − : f ( x ) ≥ f τ } (5) where f τ can be seen as the largest constant such that P ( X ∈ L ( f τ )) ≥ − τ (6) with respect to the distribution induced by f . According to [52] and [27] in the Euclidean context, L ( f τ ) is the minimum volume level setwith probability content at least ( − τ ) . Figure 4 shows the HDR L ( f τ ) in grey for three differ-ent circular densities and three different values of τ . The threshold f τ is represented througha dotted grey line. Note that, if large values of τ are considered, L ( f τ ) is equal to the greatest11odes and, therefore, the most differentiated clusters can be easily identified. However, forsmall values of τ , L ( f τ ) is almost equal to the support of the distribution. The first step to reconstruct the HDR established in Definition 3.1 for a given τ ∈ (
0, 1 ) isto estimate the threshold f τ . As in the Euclidean case, numerical integration methods couldbe also used in the directional setting in order to approximate its value. However, when thedimension increases, the computational cost becomes a major issue due to the complexity ofthe numerical integration algorithms considered on high diemsnional spaces. An alternativeapproach reducing the computational cost is described next.As before, let X be a random vector with directional density f and let Y = f ( X ) be therandom vector obtained by transforming X by its own density function. Since P ( f ( X ) ≥ f τ ) = − τ , f τ is exactly the τ − quantile of Y . Following [36] in the linear case, f τ can be estimatedas a sample quantile from a set of independent and identically distributed random vectors withthe same distribution as Y .In particular, if X n = { X , · · · , X n } denotes a set of independent observations in S d − from adensity f . Then, { f ( X ) , · · · , f ( X n ) } is a set of independent observations from the distributionof Y . Let f ( j ) be the j -th largest value of { f ( X i ) } ni = so that f ( j ) is the ( j / n ) sample quantile of Y .We shall use f ( j ) as an estimate of f τ . Specifically, we choose ˆ f τ = f ( j ) where j = (cid:98) τ n (cid:99) . Then, ˆ f τ converges to f τ as n tends to ∞ , and therefore L ( ˆ f τ ) converges to L ( f τ ) as n tends to ∞ .Of course, if f is a known function, the observations can be generated pseudorandomlyand the estimation of f τ could be made arbitrarily accurate by increasing n . In practice, as fordensity level set estimation, f is often unknown. In this case, we have as only information arandom sample of points X n from an unknown density f . From this sample, we propose firstto determine the kernel estimator f n established in (4). If n is large enough, we propose tocalculate the set { f n ( X ) , · · · , f n ( X n ) } in order to estimate f empirically. If n is moderate, itmay be preferable to generate observations X n = { X l , · · · , X N } of large size N from f n . Forsmall values of n it may not be possible to get a reasonable density estimate. Besides, withfew observations and no prior knowledge of the underlying density, there seems little pointin attempting to summarize the sample space. See [70] for some discussion on the number ofobservations needed for a reasonable linear density estimate. Note that the problem here is notwith the density quantile algorithm (that give results to an arbitrary degree of accuracy given adensity), but with estimating the density from insufficient data.Once the threshold f τ is estimated, plug-in methods reconstruct the 100 ( − τ ) % HDR L ( f τ ) in (5) as ˆ L ( ˆ f τ ) = { x ∈ S d − : f n ( x ) ≥ ˆ f τ } . (7)Figure 7 shows the circular kernel estimator f n (grey color) calculated from a sample X generated from the second model (black color) in Figure 4 and different empirically approxi-12ated 50% circular regions (grey color, top). The boxplot of the transformed values denotedby { f n ( X ) , · · · , f n ( X ) } is also shown (bottom). The dotted lines represent the quantiles thatdetermine the corresponding 50% (probability coverage) circular region. Note that only theestimated HDR (left), ˆ L ( ˆ f τ ) , is able to show the existence of the four existing modes. p p p llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll p p p llll lll lll l lll ll lll l ll lll l ll lll ll l ll l ll ll llll ll ll ll ll lll lll llllll lll ll l ll l llll lll l ll lllll llll l llll lll ll lll l lll ll ll ll llll llll lllllllllllllllllllllllllllllllllllllllllllllllll lllll llllll llllllllllllllll p p p lllllllllllllllllllllllllllll lllllllllllllll lllllllllllll _ p p p lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllll lllllllllllll llllllllllllllllllll . . . . . .
30 0 . . . . . .
30 0 . . . . . .
30 0 . . . . . . Figure 7: 50% circular regions obtained from the circular kernel estimator f n (Grey color) obtained froma sample X . Boxplots of { f n ( X ) , · · · , f n ( X ) } and quantiles (dotted lines) that determine the 50%regions (bottom). The construction of the kernel density estimator in (4) requires an appropriate selection of h .Although there exist several proposals in the literature for this task, none of them has beenspecifically designed for reconstructing a directional level set. This is the goal of this section.The previous bandwidth selectors (designed for density estimation) are briefly introduced inAppendix C, as supplementary information for the simulation study.A bootstrap bandwidth selector focused on the problem of reconstructing density level setsis introduced in what follows. The idea is to use an error criterion that quantifies the differencesbetween the theoretical set and its reconstructions instead of measuring the accuracy of kernelestimators. In the real-valued setting, these ideas are also considered in [60] for proposing oneof the first bandwidth selectors in level set estimation setting.In the directional case, the closed expression of d H ( ∂ L ( f τ ) , ∂ ˆ L ( ˆ f τ )) is not known. However,it could be estimated through a bootstrap procedure. Therefore, a new bandwidth selector canbe established as h = arg min h > E B (cid:104) d H ( ∂ L ∗ ( ˆ f ∗ τ ) , ∂ ˆ L ( ˆ f τ )) (cid:105) (8)13here E B denotes the bootstrap expectation with respect to random samples X n = { X ∗ , · · · , X ∗ n } generated from the directional kernel f n that, of course, is heavily dependent on a pilot band-width.This bandwidth selector is specifically designed for HDRs estimation, but it may be arguedthat a plug-in estimator may be computed just taking a kernel density estimator with a suit-able bandwidth for reproducing the curve, that is, minimizing some global error on the curveestimate. As mentioned before, there are other approaches for selecting the bandwidth param-eter in (4), such as the circular rule-of-thumb by [67] ( h in this work) or the improved versionby [48] (namely h ). Cross-validation methods (likelihood h and least squares h ) were intro-duced by [32] whereas a bootstrap bandwidth ( h ) was presented by [23]. From the previousproposals, cross-validation bandwidth selectors can be applied for data on a sphere S d − forany d . For spherical data, a plug-in bandwidth selector was also introduced by [28] ( h ).Figure 13 shows the theoretical HDR for model S3 (see Section 4.2) when τ = L ( ˆ f τ ) obtained from a sampleof size n = h when τ = h (fourth and fifthcolumns). A relevant issue appears when h is estimated from imprecise level set estimatorssuch as the obtained one from h . Remember that the minimization procedure considered fordetermining h involves the boundary of the set ˆ L ( ˆ f τ ) . If this set is poorly approximated theresulting bandwidth surely will not provide competitive results. Therefore, largest sample sizeswill be considered in this section for avoiding this problem. Additionally, the bandwidth h willbe used as pilot in order to determine the set ˆ L ( ˆ f τ ) . The performance of different bandwidth selectors (our proposal in Section 3 and other selec-tors for density estimation described in Appendix C) is checked through a simulation study.Circular and spherical HDRs are estimated considering the plug-in methods that arise of theconsideration of these bandwidths parameters. The code for computing h can be obtainedfrom the authors upon request. All the rest bandwidths are implemented in the R packages NPCirc and Directional . Sections 4.1 and 4.2 contain the results obtained in circular andspherical settings, respectively. A collection of 9 circular densities (models C1 to C9) have been considered in this simulationstudy. These models are mixture of different circular distributions and they correspond to den- https://CRAN.R-project.org/package=NPCirc https://CRAN.R-project.org/package=Directional f τ for τ = τ = τ = C1 (5) C2 (6) C3 (7)C4 (8) C5 (10) C6 (11)C7 (16) C8 (19) C9 (20) p p p
2+ 0 p p p
2+ 0 p p p p p p
2+ 0 p p p
2+ 0 p p p p p p
2+ 0 p p p
2+ 0 p p p Figure 8: Circular density models for simulations. Dotted circles represent the threshold f τ when τ = τ = τ = A total of 250 random samples of sizes n =
500 and n = τ = τ = τ = h , a pilot bandwidth is required. In thisstudy, h has been taken as a pilot, and B =
200 resamples are considered for obtaining h .15or each method and each sample, the estimation error is measured by computing the Haus-dorff and Euclidean distances ( d H and d E ) between the boundaries of estimated level set andthe frontier of theoretical set. Note that the Euclidean distance is not as informative as the Haus-dorff criterion to detect differences between sets. Therefore, just results for d H are shown. As areference, note that the maximum value of both criteria in S is 2. This upper bound coincidesexactly with the length of the diameter of the circle.Tables 1 and 2 show the means and the standard deviations of the 250 estimation errorsobtained when τ = n =
500 and n = h a competitive behavior in all cases. Thisfeature is also illustrated in Figure 9. Note that h presents a poor behavior for models C3, C6,C7, C8 and C9, and h performance is also unsatisfactory for models C2 and C9, although itimproves with sample size.Similar comments can be made for τ = n =
500 and n = h (being a competitive selector in all the scenarios) is thebest one for models C3, C5, C6 and C8 (with n = τ = n =
500 and n = h is the best selector for five models (C2, C6, C7, C8 and C9). Itis clear that the new selector improves its results when large values of τ are considered and,therefore, largest modes are identified. 16 able 1: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n =
500 and B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h Table 2: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n = B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h . . . . . h h h h h h l l l l l l . . . . . h h h h h h l l l l l l . . . . h h h h h h l l l l l l . . . . . h h h h h h l l l l l l Figure 9: Violin plots of Hausdorff errors for models C3, C5, C6 and C8 for τ = n = h , the scale of these figures is different.able 3: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n =
500 and B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h Table 4: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n = B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h . . . . . h h h h h h l l l l l l . . . . h h h h h h l l l l l l . . . . h h h h h h l l l l l l . . . . . . . . h h h h h h l l l l l l Figure 10: Violin plots of Hausdorff errors for models C1, C3, C6 and C8 when τ = n = h , the scale of these figures is different.able 5: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n =
500 and B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h Table 6: Means (M) and standard deviations (SD) of 250 errors in Hausdorff distance for τ = n = B = C1 C2 C3 C4 C5 C6 C7 C8 C9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h h h h . . . . . h h h h h h l l l l l l . . . . . . . . h h h h h h l l l l l l . . . . . h h h h h h l l l l l l Figure 11: Violin plots of Hausdorff errors for models C3, C8 and C9 when τ = n = igures 9, 10 and 11 show the violin plots of Hausdorff errors obtained for some of thesimulation models when τ = τ = τ = n = h is the selector that presents a worst behavior for the represented circular densities. If τ = h can be large although this selector provides competitivemean errors. As for the spherical scenario, 9 density models have been considered. These models, namely S1to S9, are mixtures of von Mises-Fisher densities on the sphere, allowing to represent complexstructures such as multimodality and/or asymetry. Parameters of mixtures are fully establishedin Table 16 in Appendix B. Figure 12 shows these densities and the corresponding HDRs for τ = τ = τ = S1 S2 S3S4 S5 S6S7 S8 S9
Figure 12: Finite mixtures of von Mises-Fisher spherical models for simulations. HDRs are represented for τ = τ = τ = n = n = n = τ = τ = τ = B = h , taking h as apilot bandwidth.For each method and each sample, the estimation error is again measured calculating theHausdorff distance between the boundaries of estimated level set and the frontier of theoreticalset. As reference, note that the maximum value of both criteria S is also 2. In this case, thisupper bound coincides exactly with the length of the diameter of the sphere. Figure 13: Theoretical HDR for model S3 when τ = n = τ = h (third column) and h (fourth and fifth columns) as smoothing parameters. Note that thelast two columns show two views of the sphere. Tables 7 and 8 show the means and the standard deviations of the 200 estimation errorsobtained when τ = n = n = h is the best or shows a competitive performance. For τ = n = n = n = h and h usually behave similarly and h is the worst selector for S3.Tables 12 and 13 show the means and the standard deviations of the 200 estimation errorsobtained when τ = n = n = h is considered, this selector is again the best or competitivewith h . As for h , results are remarkably poor in S2 and S6.Figure 14 contains the violin plots of Hausdorff errors for models S3, S4, S6 and S9 when τ = n = h is considerably good. Figure 15for τ = n = h and h usually present similar results, see densities S5and S9. However, h is clearly more competitive for models S1 and S8.21 able 7: Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h Table 8: Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h . . . . . . . h h h l l l . . . . . . h h h l l l . . . . . . h h h l l l . . . . . h h h l l l Figure 14: Violin plots of Hausdorff errors for models S3 and S4, S6 and S9 when τ = n = τ = n =
500 and B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h Table 10: Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h . . . . . h h h l l l . . . . h h h l l l . . . . . h h h l l l . . . . . h h h l l l Figure 15: Violin plots of Hausdorff errors for models S1, S5, S8 and S9 when τ = n = τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h Table 12: Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h Table 13: Means (M) and standard deviations (SD) of 200 errors in Hausdorff distance for τ = n = B = S1 S2 S3 S4 S5 S6 S7 S8 S9
M SD M SD M SD M SD M SD M SD M SD M SD M SD h h h Real data analysis
The proposed methodology is now applied to the two real datasets presented in the Introduc-tion exemplifying the aplicability of the method for circular and spherical data.
Adaptation to changing beach environments for the real example on sandhoppers introducedin Section 1.1 is analyzed from density level set estimation perspective.HDRs are estimated for τ = d H , the Euclidean distance d E between A and B defined as d E ( A , B ) = inf { d E ( x , y ) , x ∈ A , y ∈ B } is also used. Although d E ( A , B ) = A = B , this distance is useful to deter-mine large differences between orientations. In general, large distances between the boundariesof two sets indicate the existence of modes in different directions. If the categories of all vari-ables with the exception of one are fixed, it is possible to check if the different values of the non-fixing variable has some influence in sandhoppers orientation through the comparison of theestimated level sets. As a reference, note that the maximum value of Hausdorff and Euclideandistances between two points in S is 2. The upper triangular matrix in Table 14 contains theHausdorff distances between boundaries and the lower triangular matrix, the Euclidean ones.The largest distances are represented in blue color for both criteria. Grey color is used in orderto depict the next largest values. Furthermore, Table 14 (top) contains some of the estimatedHDRs that present the largest distances.In particular, Hausdorff distance between regions 5 and 11 is equal to 1.91. According toTable 15, the variable configuration 5 corresponds to the largest orientation modes for femalesof the specie Talitrus saltator when the orientation is measure in noon during October. Region11 refers to same measurements taken in April. Therefore, the month can be seen as variablethat has influence on the orientation for sandhoppers.Euclidean distance between regions 5 and 6 is equal to 1.57. According to Table 15, set 6 alsocorresponds to the level set for females of the specie Talitrus saltator but, in this case, when theorientation is registered in morning during October. Then, the moment of the day also seems afactor with influence on the sandhoppers behavior.25 able 14: Upper triangular matrix contains the Hausdorff distances between boundaries of sets from 1 to 24. Lower triangular matrix of Euclideandistances between boundaries of sets from 1 to 24 (bottom). HDRs representations for τ = p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + l p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll p p p + llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
11 14 15 17 18 20 . . . . Talitrus saltator Talorchestia brito l l . . . . . Talitrus saltator Talorchestia brito l l
Figure 16: Violin plots of Euclidean (left) and Hausdorff (right) distances for species Talitrus saltator andTalorchestia brito.
Finally, Figure 16 shows the violin plots for Hausdorff and Euclidean distances for the twospecies of sandhoppers. Note that the median of the Talitrus saltator in Hausdorff (Euclidean)distance is 0.76 (0.23), clearly bigger than the median of Talorchestia brito that is equal to 0.52(0.14). This shows that Talitrus saltator presents more differentiated orientations, depending onthe time of day, period of year and sex, with respect to Talorchestia brito. Therefore, conclusionsin [61] are corroborated from this perspective.
According to the theory of plate tectonics, Earth is an active planet. Its surface is composed ofabout 15 individual plates that move and interact, constantly changing and reshaping Earth’souter layer. These movements are usually the main cause of volcanoes and earthquakes. Infact, seismologists have related these natural phenomena to the boundaries of tectonic platesbecause they tend to occur there. In fact, the concentration of earthquake epicenters traces27he filamentary network of fault lines and, consequently, they could be analyzed alternativelyfrom the perspective of nonparametric filamentary structure estimation (see, for instance, [29]).Moreover, tectonic hazards can provoque important damages (destroy buildings, infrastruc-tures or even cause deaths). Therefore, it is important to detect which areas are specially risky.As an illustration, the recent world earthquakes distribution is analyzed next through HDRsestimation.
Figure 17: Contours of HDRs for τ = τ = τ = τ = τ = Figure 17 shows the margins of the tectonic plates (grey color) and the geographical coor-dinates (red points) of a total of 272 medium and strong earthquakes registered between 1 th October 2004 and 9 th April 2020 already introduced in Section 1.1. Note that most of events areexactly located on the plates boundaries.Our main goal is to detect which areas or countries are really problematic nowadays. InSection 1.1, we show that the largest mode is located on the Southeast Europe considering avalue of τ = τ . Specifically, they were estimatedchoosing τ = τ = τ = τ = τ = − τ = − τ = τ = τ = τ = − τ %) of earthquakes areregistered. If τ = The main goals of this work are to extend the definition of HDRs for directional data and pro-pose a plug-in estimator based on a new bootstrap bandwidth selector that is focused on theproblem of level set reconstruction. The route designed to reach this goal can be summarizedas follows: (1) Extending the definition of HDRs for directional data, (2) proposing the plug-inestimator, (3) introducing a suitable bootstrap selector of the bandwidth parameter, (3) study-ing the behavior of the plug-in estimators (using the new selector and other classical directionalbandwidths) and (4) applying the plug-in reconstruction of HDRs to the real data on sandhop-pers orientation and earthquakes.Finally, natural extensions of this work are discussed. First, other suitable bootstrap band-width selectors could be proposed in order to estimate HDRs by using, for instance, the Lebesguemeasure instead of the Hausdorff distance. Secondly, an estimator for the number of populationclusters can be proposed in the directional setting as the number of connected components ofthe HDRs plug-in estimators. Theoretical results on its consistency might probably be provedunder certain regularity conditions. Another important achievement would be to introduce anonparametric test for comparing two or more populations in general dimension using dis-tances between sets. The test statistic could measure the discrepancy (for example, boundarydistances) among the directional level set estimators of these populations. This test procedurecould use explicitly the distance between boundaries of the estimated level sets. The simplegeometric structure of estimators could be used to compute the procedure and calibrate the testusing re-sampling schemes. Finally, earthquakes on earth could be analyzed following a differ-ent approach of set estimation theory. Since the concentration of earthquake epicenters tracesthe filamentary network of fault lines, the performance of nonparametric filament estimatorscould be analyzed.
Acknowledgements
R.M. Crujeiras and P. Saavedra-Nieves acknowledge the financial sup-port of Ministerio de Economía y Competitividad of the Spanish government under grantsMTM2016-76969P and MTM2017-089422-P and ERDF. Authors also thank Prof. Felicita Scapinifor providing the sandhoppers data (collected under the support of the European Project ERBICI8-CT98-0270) and the computational resources of the CESGA Supercomputing Center.29 eferences [1] Ameijeiras-Alonso, J., Crujeiras, R.M. and Rodríguez-Casal, A., Mode testing, critical band-width and excess mass, Test, 28, 900-919 (2019)[2] Ameijeiras-Alonso, J., Benali, A., Crujeiras, R.M., Rodríguez-Casal, A. and Pereira, J. M.C., Fire seasonality identification with multimodality tests, Annals of Applied Statistics, 13,2120-2139 (2019)[3] Anderberg, M. R., Cluster Analysis for Applications, Academic Press, New York (1973)[4] Azzalini, A. and Torelli, N., Clustering via nonparametric density estimation, Statistics andComputing, 17(1), 71-80, (2007)[5] Bai, Z. D., Rao, C. R. and Zhao, L. C., Kernel estimators of density function of directionaldata, Multivariate Statistics and Probability, 24-39 (1989)[6] Baíllo, A., Total error in a plug-in estimator of level sets, Statist. Probab. Lett., 65, 411-417(2003)[7] Baíllo, A. and Cuevas, A., Parametric versus nonparametric tolerance regions in detectionproblems, Comput. Statist., 21, 523-536 (2006)[8] Baíllo, A., Cuevas, A. and Justel, A., Set estimation and nonparametric detection, Canad. J.Statist., 28, 765-782 (2000)[9] Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S., Clustering on the unit hypersphere usingvon Mises-Fisher distributions, J. Mach. Learn. Res., 6, 1345-1382 (2005)[10] Biau, G., Cadre, B. and Pelletier, B., A graph-based estimator of the number of clusters.ESAIM: Probability and Statistics, 11, 272-280 (2007)[11] Box, G. E. P., and Tiao, G. C., Bayesian Inference in Statistical Analysis, Reading, MA:Addison-Wesley (1973)[12] Burman, P. and Polonik, W., Multivariate mode hunting: Data analytic tools with measuresof significance, Journal of Multivariate Analysis, 100(6), 1198-1218 (2009)[13] Chen, Y. C., Genovese, C. R. and Wasserman, L., Density level sets: Asymptotics, inference,and visualization, J. Am. Stat. Assoc., 112, 1684-1696 (2017)[14] Cholaquidis, A., Fraiman, R. and Moreno, L., Level set and density estimation on mani-folds, arXiv preprint arXiv:2003.05814 (2020)[15] Cuevas, A., Febrero, M. and Fraiman, R., Estimating the number of clusters, Canad. J.Statist., 28, 367-382 (2000) 3016] Cuevas, A., Febrero, M. and Fraiman, R., Estimating the number of clusters. CanadianJournal of Statistics, 28, 367-382 (2000)[17] Cuevas, A., Febrero, M. and Fraiman, R., Cluster analysis: a further approach based ondensity estimation, Computational Statistics and Data Analysis, 36(4), 441-459 (2001)[18] Cuevas, A. and Fraiman, R., A plug-in approach to support estimation, Ann. Statist, 25,2300-2312 (1997)[19] Cuevas, A., González-Manteiga, W. and Rodríguez-Casal, A., Plug-in estimation of generallevel sets, Australian and New Zealand Journal of Statistics, 48(1), 7-19 (2006)[20] Dekker, W., Strandvlooien (Talitridae). Tabellenserie van de Strandwerkgemeenschap, 24(1978)[21] Devroye, L. and Wise, G., Detection of abnormal behavior via nonparametric estimation ofthe support, SIAM J. Appl. Math., 38, 480-488 (1980)[22] Di Marzio, M., Panzera, A., and Taylor, C. C., Local polynomial regression for circularpredictors, Statistics & Probability Letters, 79, 2066-2075 (2009)[23] Di Marzio, M., Panzera, A., Taylor, C.C., Kernel density estimation on the torus, Journal ofStatistical Planning and Inference, 141, 2156-2173 (2011)[24] Doss, C. R. and Weng, G., Bandwidth selection for kernel density estimators of multivariatelevel sets and highest density regions, Electron. J. Stat., 12. 4313-4376 (2018)[25] Fraiman, R. and Meloche, J., Counting bumps. Ann. Inst. Stat. Math. (1997)[26] Gardner, A.B., Krieger, A.M., Vachtsevanos, G and Litt, B., One-class novelty detection forseizure analysis from intracranial EEG, J. Mach. Learn. Res., 7, 1025-1044 (2006)[27] Garcia, J. N., Kutalik, Z., Cho, K. H. and Wolkenhauer, O., Level sets and minimum volumesets of probability density functions, Int. J. Approx. Reason., 34, 25-47 (2003)[28] García-Portugués, E., Exact risk improvement of bandwidth selectors for kernel densityestimation with directional data, Electronic Journal of Statistics, 7, 1655-1685 (2013)[29] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L., The geome-try of nonparametric filament estimation, Journal of the American Statistical Association,107(498), 788-799 (2012)[30] Everitt, B. S., Cluster Analysis. Arnold-Halsted, New York (1993)[31] Hall, P., Central limit theorem for integrated square error of multivariate nonparametricdensity estimators, J. Multivariate Anal., 14(1), 1-16 (1984)3132] Hall, P., Watson, G. S., and Cabrera, J., Kernel density estimation with spherical data,Biometrika, 74(4), 751-762 (1987)[33] Hall, P. and Wood, A. T., Approximations to distributions of statistics used for testing hy-potheses about the number of modes of a population, J. Statist. Plann. Inf., 55, 299-317 (1996)[34] Hartigan, J., Clustering algorithms, Wiley Series in Probability and Mathematical Statistics,John Wiley & Sons, New York-London-Sydney (1975)[35] Huo, X. and Lu, J.C., A network flow approach in finding maximum likelihood estimate ofhigh concentration regions, Comput. Statist. Data Anal., 46, 33-56 (2004)[36] Hyndman, R.J., Computing and graphing highest density regions, Am. Stat., 50, 120-126(1996)[37] Jang, W., Nonparametric density estimation and clustering in astronomical sky surveys,Comput. Statist. Data Anal., 50, 760-774 (2006)[38] Klemelä, J., Estimation of densities and derivatives of densities with directional data, J.Multivariate Anal., 73(1), 18-40 (2000)[39] Mammen, E., On qualitative smoothness of kernel density estimates, Statistics, 26, 253-267(1995)[40] Mammen, E., Marron, J. S. and Fisher, N. I., Some asymptotics for multimodality testsbased on kernel density estimates, Prob. Theory Relat. Fields, 91, 115-132 (1992)[41] Mammen, E. and Polonik, W., Confidence regions for level sets, J. Multivariate Anal., 122,202-214 (2013)[42] Mammen, E. and Tsybakov, A. B., Asymptotical minimax recovery of sets with smoothboundaries, Ann. Stat., 23, 502-524 (1995)[43] Mardia, K.V. and Jupp, P., Directional Statistics, Wiley (2009).[44] Markou, M. and Singh, S., Novelty detection: a reviewpart 1: statistical approaches, SignalProcessing, 83, 2481-2497 (2003)[45] Marchetti, G. M. and Scapini, F., Use of multiple regression models in the study of sand-hopper orientation under natural conditions, Estuarine, Coastal and Shelf Science, 58, 207-215 (2003)[46] Mason, D.M. and Polonik, W., Asymptotic normality of plug-in level set estimates, Ann.Appl. Probab., 19, 1108-1142 (2009)[47] Minnotte, C. M. and Scott, D. W., The mode tree: A tool for visualization of nonparametricdensity features, J. Comp. Graph. Stat., 2, 51-68 (1993)3248] Oliveira, M., Crujeiras, R.M. and Rodríguez-Casal, A., A plug-in rule for bandwidth se-lection in circular density estimation, Computational Statistics and Data Analysis, 56, 3898-3908 (2012)[49] Oliveira, M., Crujeiras R.M. and Rodríguez-Casal, A., Nonparametric circular methods forexploring environmental data, Environmental and Ecological Statistics, 20, 1-17 (2013)[50] Oliveira, M., Crujeiras R.M. and Rodríguez-Casal, A., NPCirc: an R package for nonpara-metric circular methods, Journal of Statistical Software, 61(9), 1-26 (2014)[51] Parzen, E., On estimation of a probability density function and mode, The Annals of Math-ematical Statistics, 33, 1065-1076 (1962)[52] Polonik, W., Minimum volume sets and generalized quantile processes, Stoch. Proc. Appl.,69, 1-24 (1997)[53] Polonik, W., Confidence regions for level sets, J. Multivar. Anal., 122, 202-214 (2013)[54] Qiao, W., Asymptotics and optimal bandwidth selection for nonparametric estimation ofdensity level sets, arXiv preprint arXiv:1707.09697 (2017)[55] Rigollet, P. and Vert, R., Optimal rates for plug-in estimators of density level sets, Bernoulli,15, 1154-1178 (2009)[56] Rinaldo, A. and Wasserman, L., Generalized density clustering, Ann. Stat., 38, 2678-2722(2010)[57] Roederer, M. and Hardy, R.R., Frequency difference gating: a multivariate method foridentifying subsets that differ between samples, Cytometry, 45, 56-64 (2001)[58] Rodríguez-Casal, A. and Saavedra-Nieves, P., Minimax Hausdorff estimation of densitylevel sets. arXiv preprint arXiv:1905.02897 (2019)[59] Rosenblatt, M., Remarks on some nonparametric estimate of a density function, The An-nals of Mathematical Statistics, 27, 832-837 (1956)[60] Samworth, R. and Wand, M., Asymptotics and optimal bandwidth selection for highestdensity region estimation, Ann. Statist., 38, 1767-1792 (2010)[61] Scapini, F., Aloia, A., Bouslama, M. F., Chelazzi, L., Colombini, I., ElGtari, M., Fallaci, M.and Marchetti, G. M. Multiple regression analysis of the sources of variation in orientationof two sympatric sandhoppers, Talitrus saltator and Talorchestia brito, from an exposedMediterranean beach, Behavioral Ecology and Sociobiology, 51(5), 403-414 (2002)[62] Singh, A., Scott, C. and Nowak, R., Adaptive Hausdorff estimation of density level sets,Ann. Statist., 37, 2760-2782 (2009) 3363] Silverman, B. W., Using kernel density estimates to investigate multimodality, J. Roy.Statist. Soc. B, 43, 97-99 (1981)[64] Silverman, B. W., Density Estimation for Statistics and Data Analysis, Chapman and Hall,London (1986)[65] Steinwart, I., Fully adaptive density-based clustering, Ann. Statist., 43, 2132-2167 (2015)[66] Stuetzle, W. and Nugent, R., A generalized single linkage method for estimating the clustertree of a density, Journal of Computational and Graphical Statistics, 19(2), 397-418 (2010)[67] Taylor, C. C., Automatic bandwidth selection for circular density estimation, Computa-tional Statistics & Data Analysis, 52, 3493-3500 (2008)[68] Tibshirani, R., Walther, G., and Hastie, T., Estimating the number of clusters in a data set viathe gap statistic, Journal of the Royal Statistical Society: Series B (Statistical Methodology),63(2), 411-423 (2001)[69] Tsybakov, A. B., On nonparametric estimation of density level sets, Ann. Statist., 25, 948-969 (1997)[70] Wand, M. P., and Jones, M. C., Kernel Smoothing, London: Chapman and Hall (1995)[71] Zhao, L. and Wu, C., Central limit theorem for integrated square error of kernel estimatorsof spherical density, Sci. China Ser. A, 44(4), 474-483 (2001)34
Further details on the datasets
A.1 Levels to the estimated HDRs disaggregating the sandhoppers variables
Table 15: Associated levels to the 24 estimated HDRs.
Variables Males Femaleslevels Afternoon Noon Morning Afternoon Noon MorningTalitrus saltator October E1 E2 E3 E4 E5 E6April E7 E8 E9 E10 E11 E12Talorchestia brito October E13 E14 E15 E16 E17 E18April E19 E20 E21 E22 E23 E24
A.2 Interactive representation of HDRs for eathquakes on Earth
Figure 18: Distribution of earthquakes around the world between October 2004 and April 2020 (red color).Contours of HDRs for τ = τ = τ = τ = τ = Simulated spherical models
Model µ κ
Mixture probabilitiesS1 (
0, 0, 1 )
10 1S2 (
0, 0, 1 ) ; (
0, 0, − )
1; 1 1/2; 1/2S3 (
0, 0, 1 ) ; (
0, 0, − )
10; 1 1/2; 1/2S4 (
0, 0, 1 ) ; (
0, 1/ √
2, 1/ √ )
10; 10 1/2; 1/2S5 (
0, 0, 1 ) ; (
0, 1/ √
2, 1/ √ )
10; 10 2/5; 3/5S6 (
0, 0, 1 ) ; (
0, 1/ √
2, 1/ √ )
10; 5 1/5; 4/5S7 (
0, 0, 1 ) ; (
0, 1, 0 ) ; (
1, 0, 0 )
5; 5; 5 1/3; 1/3; 1/3S8 (
0, 0, 1 ) ; (
0, 1, 0 ) ; (
1, 0, 0 )
5; 5; 5 2/3; 1/6; 1/6S9 (
0, 0, 1 ) ; (
0, 1/ √
2, 1/ √ ) ; (
0, 1, 0 )
10; 10; 10 1/3; 1/3; 1/3
Table 16: Finite mixtures of von Mises-Fisher spherical distributions considered as models for simulations. Some details on the directional bandwidth selectors
We briefly revise in this section some bandwidth selection methods designed for kernel den-sity estimation. Although these methods do not focus on HDRs, but on the reconstruction ofthe whole density curve, it may be argued that they could also be used for constructing theproposed plug-in estimator. The performance of our proposal is compared in all the simulatedscenarios with different bandwidth selectors for circular and spherical data.As in the Euclidean setting, most used techniques for selecting h are based on the minimiza-tion of some error criteria that quantify the accuracy of the kernel density estimator. One of themost simple errors to be considered is the mean integrated squared error that can be written asfollows: MISE ( h ) = E (cid:20) (cid:90) S d − ( f n ( x ) − f ( x )) ω d ( dx ) (cid:21) , (9)where ω d denotes the Lebesgue in S d − . Then, a possibility is to search for the bandwidththat minimizes (9). However, the asymptotic version of MISE , AMISE , is more commonlyused in literature. A rule of thumb proposed in [67] adapts the idea in [64] in kernel lineardensity estimation to the circular setting. The resulting plug-in selector assumes that the datafollow a von Mises distribution to determine the
AMISE . The bandwidth is chosen by firstobtaining an estimation ˆ κ of the concentration parameter κ in the reference density (for example,by maximum likelihood) through the formula h = (cid:34) π I ( ˆ κ ) κ I ( κ ) n (cid:35) .Remark that the parametrization in [67] has been adapted to the context of the estimator (4)by denoting by h the inverse of the squared concentration parameter employed in his paper.The poor performance of this rule is sometimes due to the non robust estimation by maximumlikelihood of the concentration parameter. An alternative and robustified estimation procedureis considered in [49].A new selector also devoted to the circular case is established in [48]. It improves the perfor-mance of the Taylor’s proposal allowing for more flexibility in the reference density, consideringa mixture of von Mises. This selector is mainly based on two elements. First, the AMISE expan-sion derived in [22] for the circular kernel density estimator by the use of Fourier expansions ofthe circular kernels. This expression has the following form when the kernel is a circular vonMises (the estimator is equivalent to consider L ( r ) = e − r and h as the inverse of the squaredconcentration parameter in (4): AMISE ( h ) = (cid:34) − I ( h − ) I ( h − ) (cid:35) (cid:90) π f (cid:48)(cid:48) ( θ ) d θ + I ( h − ) n π I ( h − ) . (10)37he second element is the Expectation-Maximization (EM) algorithm in [9] for fitting mixturesof directional von Mises. The selector, that is denoted by h , proceeds as follows: first, applythe EM algorithm to fit mixtures with different number of components; then, choose the fittedmixture with the lowest AIC. Finally, compute the curvature term in (10) using the fitted mixtureand seek for the h that minimizes this expression. This value of h is denoted by h .Of course, plug-in rules are not the only alternative to smoothing parameter selection. Someother data-driven directional procedures were already proposed in [32] using cross-validationideas. Specifically, Least Squares Cross-Validation (LSCV) and Likelihood Cross-Validation(LCV) bandwidth are introduced, arising as the minimizers of the cross-validated estimatesof the squared error loss and the Kullback-Leibler loss, respectively. The selectors have thefollowing expressions: h = arg max h > n − n ∑ i = f − in ( X i ) − (cid:90) S d − f n ( x ) ω q ( dx ) and h = arg max h > n ∑ i = log f − in ( X i ) ,where f − in represents the kernel estimator computed without the i − th observation.A bootstrap bandwidth selection procedure for data lying on a d − dimensional torus is pro-posed in [23]. If a von Mises kernel is used, then the bootstrap MISE has a closed expression.Then, h is selected as the value that minimizes (cid:90) S E B [ f n ∗ ( X ) − f n ( X )] ω d ( dx ) where E B denotes the bootstrap expectation with respect to random samples { X ∗ , · · · , X ∗ n } gen-erated from f n ( X ) . A common problem for small samples is that a local minimum may bechosen, as pointed out by [48].Apart from existing cross-validation procedures in the directional setting, [28] derives aplug-in directional analogue to the rule of thumb in [64] using the properties of the von Misesdensity. Moreover, it is the optimal AMISE bandwidth for normal reference density and nor-mal kernel. Concretely, if the von Mises kernel is considered and κ is estimated by maximumlikelihood, h = (cid:104) π I ( ˆ k ) ˆ k [ I ( k )+ k I ( k )] n (cid:105) in S (cid:104) ( ˆ k ) ˆ k [( + k ) sinh ( k ) − k cosh 2ˆ k ] n (cid:105) in S .