Directional quantile classifiers
By Alessio Farcomeni†, Marco Geraci‡ and Cinzia Viroli§
University of Rome "Tor Vergata"†, University of South Carolina‡, University of Bologna§

Abstract
We introduce classifiers based on directional quantiles. We derive theoretical results for selecting the optimal quantile level given a direction and, conversely, the optimal direction given a quantile level. We also show that the misclassification rate converges to zero if the population distributions differ by at most a location shift and if the number of directions is allowed to diverge at the same rate as the problem's dimension. We illustrate the satisfactory performance of our proposed classifiers in both small and high dimensional settings via a simulation study and a real data example. The code implementing the proposed methods is publicly available in the R package Qtools.
1. Introduction.
The idea of using quantiles in classification is relatively recent and largely unexplored. The median classifier for high-dimensional problems proposed by Hall, Titterington and Xue (2009), which calculates the L_1 distance of the coordinates of a multivariate data point from componentwise medians (rather than centroids), is particularly advantageous when data exhibit heavy-tailed or skewed distributions. Building on Hall, Titterington and Xue's (2009) idea, Hennig and Viroli (2016a) proposed quantile classifiers, which hinge on the sum of distances from componentwise quantiles at some generic level θ ∈ (0, 1).

† Corresponding author: Alessio Farcomeni, Department of Economics and Finance, University of Rome "Tor Vergata", Italy. E-mail: [email protected]
MSC 2010 subject classifications:
Primary 62G05; secondary 62G20
Keywords and phrases: classification, L_1 distance, machine learning, quantiles for multivariate data

Our proposal extends quantile-based classification in several directions. First, the possible interdependence among the variables is taken into account by computing linear combinations of the input variables. Second, directional quantiles have a simple interpretation, since the projections' weights embody the relative importance of the variables involved in the classification problem. Finally, in the special case of p canonical directions (with p equal to the number of variables), the use of directional quantiles leads to the componentwise quantile classifier (Hennig and Viroli, 2016a), and thus inherits its asymptotic optimality properties, as shown in the Appendix. Directional quantiles have already found application in risk classification problems (Geraci et al., 2020) and proved to be a worthwhile alternative to risk classification based on componentwise quantile thresholds.

In general, the application of our methods does not require any assumption on the shape of the population distributions. We derive asymptotic theoretical properties of the proposed classifier under the assumption that the distributions of the alternative populations differ by at most a location shift. While this assumption may be unrealistic in practice, empirical results support the merit of the proposed classifier also when the distributions differ by shape and not just by location.

The rest of the paper is organised as follows. In the next section, we introduce notation and basic definitions, followed by our proposal of directional quantile classifiers. Theoretical results are stated in Section 3. We report the results of a simulation study in Section 4 and of a real data analysis in Section 5. Concluding remarks are given in Section 6. All proofs of theoretical results are reported in the Appendix. A software implementation of our approach can be found in the package Qtools (Geraci, 2016), freely available on the Comprehensive R Archive Network (R Core Team, 2020).
2. Methods.
2.1. Notation and definitions.
Let X^{(1)} = (X^{(1)}_1, X^{(1)}_2, ..., X^{(1)}_p)^⊤ and X^{(2)} = (X^{(2)}_1, X^{(2)}_2, ..., X^{(2)}_p)^⊤ denote two p-variate random variables with absolutely continuous distributions F^{(1)} and F^{(2)} defined on the same space X ⊆ R^p for two populations Π^{(1)} and Π^{(2)}, respectively. The marginal distributions of the components of X^{(k)} are denoted by F^{(k)}_j, for j = 1, 2, ..., p and k = 1, 2. Further, I(·) denotes the indicator function, which is equal to 1 if its argument is true and 0 otherwise.

Our goal is to assign a new observation y = (y_1, y_2, ..., y_p)^⊤ to either Π^{(1)} or Π^{(2)} according to how close the point is to one or the other. In quantile-based classification (Hennig and Viroli, 2016a), the distance is first calculated for each component of y using the asymmetrically weighted loss function

(1)   Φ^{(k)}(θ; y_j) = {θ + (1 − θ) I(y_j − Q^{(k)}_{X_j}(θ) < 0)} |y_j − Q^{(k)}_{X_j}(θ)|,

for j = 1, 2, ..., p and k = 1, 2, where Q^{(k)}_{X_j}(θ) is the componentwise quantile at level θ ∈ (0, 1) for the kth population, which can be obtained by inversion of F^{(k)}_j. Subsequently, y is assigned to Π^{(1)} if the discrepancy

(2)   d(y, θ) = Σ_{j=1}^{p} {Φ^{(2)}(θ; y_j) − Φ^{(1)}(θ; y_j)}

is positive, and to Π^{(2)} otherwise. The quantile classifier reduces to the componentwise median classifier of Hall, Titterington and Xue (2009) for θ = 0.5. An extension of (2) to more than two populations is straightforward.

The classification rule based on (2) does not acknowledge the possible interdependence among the variables, since quantiles are obtained marginally for each variable. We address this limitation by using directional quantiles for multivariate data (Kong and Mizera, 2012). We now explain our idea informally and, in the next section, give a rigorous treatment.

Define u to be a vector with unit norm in R^p. Throughout this paper, our focus will be on the projected random variables u^⊤X^{(k)} ≡ Z^{(k)}, k = 1, 2, with values in Z ⊆ R. By assumption, the Z^{(k)}'s are continuous. We denote the corresponding distribution and density functions with G^{(k)}(·; u) and g^{(k)}(·; u), respectively. Our goal is to develop a classifier where the quantities in (1) are opportunely redefined on the corresponding projections along u to capture the multivariate nature of the distributions, namely

(3)   Φ^{(k)}(θ; u^⊤y) = {θ + (1 − θ) I(u^⊤y − Q^{(k)}_Z(θ; u) < 0)} |u^⊤y − Q^{(k)}_Z(θ; u)|,

for k = 1, 2, where Q^{(k)}_Z(θ; u) ≡ Q^{(k)}_{u^⊤X}(θ) is the θth quantile of Z^{(k)}. The latter is obtained by inverting G^{(k)} and can be recognised as the θth directional quantile of X^{(k)} in the direction u (Kong and Mizera, 2012).

By working with projections, we basically summarise a multivariate problem as a univariate one. Clearly, one difficulty to address is how many and which directions should be considered. To this end, we should note that not all directions are equally useful for classification. To exemplify, consider Figure 1, which depicts bivariate normal samples from two independent populations centred at (1,1) and (3,3), respectively, with the same variance. We want to assign a new observation y, marked by the red filled squares in Figure 1, to one of the two populations.

[Figure 1. Simulated data depicting bivariate normal samples from two independent distributions (black and grey dots). The red filled squares mark the new observation y, while dashed lines mark directions.]

The log-densities at y of two bivariate normal distributions, with sample means and covariance matrices separately estimated from the two samples, suggest that y is more likely to have been generated from F^{(2)} than from F^{(1)}. Now compute Φ^{(k)}(0.5; u^⊤y), k = 1, 2, as in (3) for four normalised directions. The results are reported in Table 1. Based on a principle of minimum distance, we assign y to F^{(2)}, consistently with the maximum likelihood principle, for three, though not all four, directions.
Table 1
Distances Φ^{(k)}(0.5; u^⊤y), k = 1, 2, calculated for simulated data using four different directions u.

u^⊤                Φ^{(1)}   Φ^{(2)}
(−0.59, −0.81)     0.27      0.01
(0.24, 0.97)       1.58      0.07
(−0.20, −0.98)     0.30      0.08
(1.00, 0.02)       0.03      0.23
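To make the computations in (1)-(3) concrete, the following minimal R sketch reproduces the logic of the example above. The centres (1,1) and (3,3) come from the text; the coordinates of y, the direction u and the seed are hypothetical choices made purely for illustration.

    # Asymmetrically weighted distance in (1) and (3): Phi(theta; z) for a
    # scalar z relative to the quantile q of a (projected) population
    phi <- function(z, q, theta) {
      (theta + (1 - theta) * (z - q < 0)) * abs(z - q)
    }

    set.seed(1)
    n <- 200
    # Bivariate normal samples centred at (1,1) and (3,3), as in Figure 1
    X1 <- cbind(rnorm(n, 1), rnorm(n, 1))
    X2 <- cbind(rnorm(n, 3), rnorm(n, 3))
    y <- c(1.5, 2.5)                 # hypothetical new observation
    u <- c(1, 1) / sqrt(2)           # one normalised direction
    theta <- 0.5

    # Directional quantiles of the projections, as in (3)
    q1 <- quantile(X1 %*% u, theta)
    q2 <- quantile(X2 %*% u, theta)

    # Assign y to the population with the smaller directional distance
    d <- phi(sum(u * y), q2, theta) - phi(sum(u * y), q1, theta)
    if (d > 0) "Pi(1)" else "Pi(2)"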
2.2. Directional quantile classifier.
Let ϑ = {θ_1, θ_2, ..., θ_R} be a set of R distinct quantile levels in (0, 1), let υ_r = {u_{r1}, u_{r2}, ..., u_{rS_r}} be a set containing S_r normalised directions associated with θ_r, r = 1, ..., R, and let υ = {υ_1, υ_2, ..., υ_R}. (Note that for convenience one may set S_r = S for r = 1, ..., R.) As mentioned in the previous section, we need to be wary of particular directions that may lead us to a classification error. Therefore, we introduce weights ω_{rs} associated with each direction u_{rs} to decrease (or increase) their relative importance. Let ω = (ω_{11}, ..., ω_{1S_1}, ..., ω_{RS_R})^⊤ denote the vector of all such weights. We propose the discrepancy

(4)   d(y, ϑ, υ, ω) = Σ_{r=1}^{R} Σ_{s=1}^{S_r} ω_{rs} {Φ^{(2)}(θ_r; u_{rs}^⊤y) − Φ^{(1)}(θ_r; u_{rs}^⊤y)},

where Φ^{(k)} is defined in (3). Then our directional quantile classifier (DQC) assigns the observation y to Π^{(1)} if d(y, ϑ, υ, ω) > 0, or to Π^{(2)} otherwise. Note that if R = 1, S_r = p, ω_{rs} = 1, and υ = {e_1, e_2, ..., e_p}, the standard basis in R^p, then (4) reduces to (2).

A difficulty associated with the calculation of (4) is the selection of quantile levels, directions, and weights in the training data, say x, that give the best performance on the test data, say y. For some prior probabilities π_1 and π_2, let

(5)   ψ(x, ϑ, υ, ω) = π_1 ∫_X I{d(x, ϑ, υ, ω) > 0} dF^{(1)}(x) + π_2 ∫_X I{d(x, ϑ, υ, ω) ≤ 0} dF^{(2)}(x)

denote the population probability of correct classification by the DQC. Note that maximising (5) is equivalent to minimising the theoretical misclassification rate. For any given level θ and direction u, the optimal misclassification rate is obtained when

π_1 ∫_X Φ^{(1)}(θ; u^⊤x) dF^{(1)}(x) < π_1 ∫_X Φ^{(2)}(θ; u^⊤x) dF^{(1)}(x)

and

π_2 ∫_X Φ^{(2)}(θ; u^⊤x) dF^{(2)}(x) < π_2 ∫_X Φ^{(1)}(θ; u^⊤x) dF^{(2)}(x),

which is equivalent to minimising

(6)   π_1 ∫_X {Φ^{(1)}(θ; u^⊤x) − Φ^{(2)}(θ; u^⊤x)} dF^{(1)}(x) + π_2 ∫_X {Φ^{(2)}(θ; u^⊤x) − Φ^{(1)}(θ; u^⊤x)} dF^{(2)}(x).

In the general problem with K populations, the minimum misclassification rate is obtained when

(7)   Σ_{k=1}^{K} π_k ∫_X Φ^{(k)}(θ; u^⊤x) dF^{(k)}(x) < Σ_{k=1}^{K} π_k ∫_X min_{k' ≠ k} Φ^{(k')}(θ; u^⊤x) dF^{(k)}(x).

Let ∆^{(k)}(x, θ, u) = Φ^{(k)}(θ; u^⊤x) − min_{k' ≠ k} Φ^{(k')}(θ; u^⊤x). Given a sample of n observations x_i and corresponding class labels ℓ_i ∈ {1, 2, ..., K}, we aim to solve

(8)   min_{ϑ, υ, ω} Σ_{k=1}^{K} Σ_{i: ℓ_i = k} Σ_{r=1}^{R} Σ_{s=1}^{S_r} ω_{rs} ∆^{(k)}(x_i, θ_r, u_{rs}).

Problem (8) may seem daunting, but luckily we can solve for ω rather easily. Given ϑ and υ, problem (8) is linear with a unit-norm constraint and can be minimised by using the Lagrange multiplier method. This problem has a closed-form solution given by ω̂ = (ω̂_{11}, ..., ω̂_{1S_1}, ..., ω̂_{RS_R})^⊤ with generic rs-th element

ω̂_{rs} = ∆̃_{rs} / √(Σ_{r=1}^{R} Σ_{s=1}^{S_r} ∆̃_{rs}^2),

where ∆̃_{rs} = Σ_{k=1}^{K} Σ_{i: ℓ_i = k} ∆^{(k)}(x_i, θ_r, u_{rs}).
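In R, the closed-form solution for the weights is a one-liner. The sketch below assumes the aggregated quantities ∆̃_{rs} have already been computed and arranged in an R × S matrix; names and toy values are hypothetical.

    # Closed-form weights solving (8) for fixed quantile levels and
    # directions: omega is proportional to Delta-tilde, normalised to
    # unit Euclidean norm
    omega_hat <- function(Delta_tilde) {
      Delta_tilde / sqrt(sum(Delta_tilde^2))
    }

    # Toy usage with R = 2 levels and S = 3 directions per level
    Delta_tilde <- matrix(c(-0.80, -0.20, 0.10,
                            -0.50, -0.10, 0.05), nrow = 2, byrow = TRUE)
    omega_hat(Delta_tilde)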
We now turn to how to choose directions and quantile levels. A crude solution would consist in doing a multidimensional grid search in p + 1 dimensions. However, such a solution would become computationally prohibitive even at modest values of p. Thankfully, we are able to mitigate the computational cost of a naïve numerical solution with some theoretical results (Section 3); in particular, with Theorem 1, which guarantees that for each projection there exists (at least) a quantile level that leads to the optimal Bayes misclassification probability, and Theorem 2, which, conversely, identifies the best direction for a given quantile level. Unfortunately, a theoretical result for the simultaneous optimisation with respect to θ and u does not exist. Nevertheless, we show that our DQC is asymptotically optimal (i.e., the misclassification rate goes to zero) when the number of directions increases with p and n (Theorem 3), under certain assumptions.

In summary, there are different possible approaches, including randomly selecting one or more directions and using the optimal quantile levels associated with those directions, or spanning a grid of quantile levels and using the optimal directions associated with those quantiles. After some empirical investigation, we found that the following strategy gives satisfactory results in different settings, as illustrated in the sketch below. First, we define a grid of θ values spanning the unit interval and, for each of these values, randomly draw a set of normalised directions from the hyperplane that is identified as optimal according to Theorem 2. The performance of a DQC based on each single θ value is evaluated using five-fold cross-validation. In the end, we use a single quantile level (optimal according to cross-validation), with the corresponding directions sampled from the optimal hyperplane. In particular, this strategy improves over the use of an asymptotically optimal quantile level when n is small. Moreover, when p is not too large, a similar strategy can be used to select an approximately optimal hyperplane.
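A bare-bones version of this strategy might look as follows. For brevity, the sketch samples directions from the whole unit sphere rather than from the optimal hyperplane of Theorem 2, uses equal weights, and all tuning constants and toy data are illustrative assumptions.

    set.seed(2)
    phi_mat <- function(Z, q, theta) {
      D <- sweep(Z, 2, q)                    # z - Q(theta; u_s), per direction
      (theta + (1 - theta) * (D < 0)) * abs(D)
    }

    # Five-fold cross-validated error of an equally weighted DQC at level theta
    cv_error <- function(X, cl, theta, U, folds = 5) {
      id <- sample(rep_len(1:folds, nrow(X)))
      miss <- 0
      for (f in 1:folds) {
        tr <- id != f
        q1 <- apply(X[tr & cl == 1, ] %*% U, 2, quantile, probs = theta)
        q2 <- apply(X[tr & cl == 2, ] %*% U, 2, quantile, probs = theta)
        Z <- X[!tr, , drop = FALSE] %*% U
        d <- rowSums(phi_mat(Z, q2, theta) - phi_mat(Z, q1, theta))
        miss <- miss + sum(ifelse(d > 0, 1, 2) != cl[!tr])
      }
      miss / nrow(X)
    }

    # Toy data: two location-shifted populations in p = 10 dimensions
    p <- 10; n <- 100
    X <- rbind(matrix(rnorm(n / 2 * p), ncol = p),
               matrix(rnorm(n / 2 * p, mean = 0.4), ncol = p))
    cl <- rep(1:2, each = n / 2)
    U <- matrix(rnorm(p * 30), p, 30)
    U <- sweep(U, 2, sqrt(colSums(U^2)), "/")   # 30 random unit directions

    thetas <- seq(0.1, 0.9, by = 0.1)
    errs <- sapply(thetas, function(th) cv_error(X, cl, th, U))
    thetas[which.min(errs)]                     # selected quantile level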
3. Theoretical results.
In this section, we present theoretical results concerning our DQC. The proofs of lemmas and theorems are reported in the Appendix.
3.1. Optimal quantile level θ. We derive the theoretical rate of correct classification as a function of θ, for given u. We assume K = 2 populations, although the results can be generalised to K > 2.

Lemma 1. For given u, let Q_α(θ; u) = min{Q^{(1)}_Z(θ; u), Q^{(2)}_Z(θ; u)} with corresponding distribution function G_α(·; u), density g_α(·; u), and prior probability π_α, and let Q_β(θ; u) = max{Q^{(1)}_Z(θ; u), Q^{(2)}_Z(θ; u)} with corresponding distribution function G_β(·; u), density g_β(·; u), and prior probability π_β. The probability of correct classification of the directional quantile classifier is

(9)   ψ(θ) = π_α G_α(Q̃(θ; u); u) + π_β {1 − G_β(Q̃(θ; u); u)},

where Q̃(θ; u) = θ Q_α(θ; u) + (1 − θ) Q_β(θ; u). Analogously, the theoretical misclassification rate is

(10)   1 − ψ(θ) = π_α {1 − G_α(Q̃(θ; u); u)} + π_β G_β(Q̃(θ; u); u).

[Figure 2. Misclassification probability (shaded grey area) with two location-shifted skewed distributions according to the median classifier (upper panel) and the optimal quantile classifier (lower panel).]
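The quantities in (9) and (10) are easy to evaluate numerically. The sketch below, in the spirit of Figure 2, uses two location-shifted gamma densities with equal priors; the gamma family and the size of the shift are assumptions made purely for illustration.

    # Theoretical misclassification rate (10) as a function of theta for
    # two right-skewed, location-shifted densities with equal priors
    theta <- seq(0.01, 0.99, by = 0.01)
    shift <- 1.5
    q_a <- qgamma(theta, shape = 2)             # Q_alpha(theta)
    q_b <- q_a + shift                          # Q_beta(theta), location shift
    q_tilde <- theta * q_a + (1 - theta) * q_b  # decision boundary
    mis <- 0.5 * (1 - pgamma(q_tilde, shape = 2)) +
      0.5 * pgamma(q_tilde - shift, shape = 2)
    c(median = mis[which.min(abs(theta - 0.5))],  # theta = 0.5 (median rule)
      optimal = min(mis))                         # optimal theta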
Theorem 1. Assume that the density functions g_α(z; u) and g_β(z; u) exist and are nonzero on the same compact domain Z. Further assume that there is a point z* with π_α g_α(z*; u) = π_β g_β(z*; u), so that π_α g_α(z; u) > π_β g_β(z; u) for z on one side of z* and π_α g_α(z; u) < π_β g_β(z; u) for z on the other side of z*. Then the quantile classifier using the quantile Q̃(θ*; u) that minimises the theoretical misclassification probability achieves the optimal Bayes misclassification probability.

The advantage of the optimal quantile classifier may be illustrated with an example. Consider a two-class decision problem where one population is a location-shifted version of the other. Figure 2 shows two distributions which have both the same right skewness. The quantiles Q_α(θ) and Q_β(θ) are marked by dashed lines. The median classifier (Hall, Titterington and Xue, 2009) in the upper panel leads to a non-optimal misclassification probability equal to 0.30. However, the misclassification probability is reduced to 0.28 by setting θ at its optimal level (lower panel).

3.2. Optimal direction u. The next lemma and theorem give the optimal direction that minimises the misclassification rate at a given θ.

Lemma 2. Let z be a realisation of either Z^{(1)} or Z^{(2)}; then

Φ^{(2)}(θ; z) − Φ^{(1)}(θ; z) ≤ Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ),

where Φ^{(k)}(θ; z) = θ max(η^{(k)}, 0) + (1 − θ) max(−η^{(k)}, 0) and η^{(k)} = z − Q^{(k)}_Z(θ), k = 1, 2.

Theorem 2. Let W = (W_1, W_2, ..., W_p)^⊤ be a p-variate random variable such that Q_{W_j}(θ) = 0 for j = 1, ..., p, and let µ^{(k)} = (µ^{(k)}_1, µ^{(k)}_2, ..., µ^{(k)}_p)^⊤ be a vector of constants, k = 1, 2. We assume that X^{(k)} = W + µ^{(k)}, with probability distribution function F^{(k)}, for k = 1, 2. Moreover, assume that Q^{(2)}_Z(θ; u) > Q^{(1)}_Z(θ; u), where Q^{(k)}_Z(θ; u) is the θ-quantile of Z^{(k)} ≡ u^⊤X^{(k)}. (Notice that there is no loss of generality with this assumption, since the case Q^{(2)}_Z(θ; u) ≤ Q^{(1)}_Z(θ; u) can be reformulated as Q^{(2)}_Z(θ; −u) > Q^{(1)}_Z(θ; −u).) Under these assumptions, the normalised direction u that minimises the misclassification error (6) is

(11)   û = (µ^{(2)} − µ^{(1)}) / ‖µ^{(2)} − µ^{(1)}‖.

The generalisation of Theorem 2 to K > 2 populations involves the K(K − 1)/2 pairwise comparisons between classes.
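In practice, µ^{(1)} and µ^{(2)} are unknown, and a natural plug-in version of (11) replaces them with location estimates from the training samples. A minimal sketch, using componentwise medians as the location estimates (one choice consistent with Q_{W_j}(0.5) = 0):

    # Plug-in estimate of the optimal direction (11)
    opt_direction <- function(X1, X2) {
      d <- apply(X2, 2, median) - apply(X1, 2, median)
      d / sqrt(sum(d^2))                      # normalise to unit length
    }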
3.3. Asymptotic misclassification rate. In this section, we show that, under certain assumptions, the correct classification probability converges to unity when the number of dimensions grows to infinity along with the sample size and the number of projections. The proof is built following a strategy similar to that used in Hall, Titterington and Xue (2009, Theorem 2), although our premises start from milder assumptions. In particular, the projections are not required to obey the "ψ-mixing condition" (Bradley, 2005), which is rather strict in practice. Our theorem is developed for any θ_r ∈ (0, 1), with ω_{rs} = 1 and R = 1. Thus, the asymptotic result holds for the sub-components of the summation in (8), which are then weighted and summed to minimise the misclassification rate. Hence, the overall criterion inherits the optimal properties of its additive components. As we did with the theorems in the previous sections, we present this theorem for K = 2 classes; its extension to K > 2 can be obtained along the same lines by means of K − 1 pairwise comparisons.
Theorem 3. Consider a set of directions υ = {u_1, ..., u_S} sampled from a unit p-sphere, and let n = max(n_1, n_2), with n_1 and n_2 denoting the sample sizes of the two groups in the training set. Assume:

(i) For a constant A_1 > 0, S ≥ A_1 n.
(ii) The p variables X^{(k)}_1, X^{(k)}_2, ..., X^{(k)}_p have each the same distribution as W_1 + µ^{(k)}_1, W_2 + µ^{(k)}_2, ..., W_p + µ^{(k)}_p, respectively. Moreover, Q_{W_j}(θ) = 0 for all j, and sup_{j ≥ 1} Var(W_j) = A_2 < +∞.
(iii) The first moments of the projections are uniformly bounded in a strong sense. This implies that for all c > 0 and all u_s, there exists v with |u_s^⊤v| > c such that inf_{s ≥ 1} inf_{|u_s^⊤v| > c} θ E|u_s^⊤W + u_s^⊤v| − (1 − θ) E|u_s^⊤W| > 0.
(iv) For some ε > 0, the proportion of values s ∈ {1, 2, ..., S} for which |θ u_s^⊤µ^{(2)} − (1 − θ) u_s^⊤µ^{(1)}| > ε, multiplied by n^{1/2}, say n^{1/2} ♯K_ε, is of larger order than S; that is, S (n^{1/2} ♯K_ε)^{−1} goes to zero as n and S increase.

Under the previous assumptions, the directional quantile classifier C based on

d(y, θ, υ, ω) = Σ_{s=1}^{S} {Φ^{(2)}(θ; u_s^⊤y) − Φ^{(1)}(θ; u_s^⊤y)}

makes the correct choice asymptotically. More specifically, as p → ∞, the classifier C makes the correct decision with probabilities P^{(1)}{C(Y) = 1} and P^{(2)}{C(Y) = 2} both converging to 1 if both n_1 and n_2 diverge with p, where P^{(k)}, k = 1, 2, denotes probability computed under the assumption that Y is drawn from population k.
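Theorem 3 requires directions sampled uniformly from the unit p-sphere, with S growing at least linearly with n. Normalised standard Gaussian vectors are uniform on the sphere, so the sampling step is straightforward; the constant A_1 = 1 and the toy sizes below are arbitrary illustrative choices.

    # Directions sampled uniformly from the unit p-sphere
    runif_sphere <- function(S, p) {
      U <- matrix(rnorm(p * S), p, S)
      sweep(U, 2, sqrt(colSums(U^2)), "/")
    }

    n1 <- 100; n2 <- 120; p <- 500        # toy sizes
    S <- max(n1, n2)                      # S >= A1 * n with A1 = 1
    U <- runif_sphere(S, p)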
4. Simulation study.
We assessed the performance of the proposed classifier in a simulation study under three scenarios with two populations. In the first scenario, observations were generated independently from a multivariate Student's t distribution with 3 degrees of freedom, with either uncorrelated or correlated variables. In the second scenario, observations were generated as in the first scenario, but each variable was subsequently transformed according to x ↦ log(|x|) to induce asymmetry. In both cases, the two populations differed by a location shift equal to 0.4. Finally, in the third scenario, observations were generated as in the first scenario, but each variable was subsequently transformed according to x ↦ log(|x|) or to x ↦ −log(|x|), depending on whether observations belonged to one or the other population, respectively.
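The three data-generating mechanisms can be sketched in a few lines of R; the helper below assumes the packages mvtnorm and clusterGeneration are available and that the two populations are balanced.

    library(mvtnorm)             # rmvt: multivariate Student's t
    library(clusterGeneration)   # rcorrmatrix: random correlation matrices

    gen_data <- function(n, p, scenario = 1, correlated = FALSE, delta = 0.4) {
      R <- if (correlated) rcorrmatrix(p) else diag(p)
      X1 <- rmvt(n / 2, sigma = R, df = 3)
      X2 <- rmvt(n / 2, sigma = R, df = 3) + delta   # location shift
      if (scenario == 2) { X1 <- log(abs(X1)); X2 <- log(abs(X2)) }
      if (scenario == 3) { X1 <- log(abs(X1)); X2 <- -log(abs(X2)) }
      list(X = rbind(X1, X2), cl = rep(1:2, each = n / 2))
    }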
Data were generated for each combination of overall sample size n ∈ {50, 100, 500} (with n/2 observations from each population) and dimension p ∈ {10, 50, 100, 500}. All in all, this resulted in 3 × 2 × 3 × 4 = 72 simulation cases (scenarios × correlation structures × sample sizes × dimensions). The correlation matrix of the t distribution with correlated variables was generated randomly for each p using the function rcorrmatrix with default settings as provided in the package clusterGeneration (Qiu and Joe, 2015; Joe, 2006). This resulted in non-constant pairwise correlations on the interval (−1, 1).

We compared the DQC with the centroid classifier (Tibshirani et al., 2002), the median classifier (Hall, Titterington and Xue, 2009), the componentwise quantile classifier (CQC) (Hennig and Viroli, 2016a), the ensemble quantile classifier (EQC) (Lai and McLeod, 2020), linear discriminant analysis (LDA), k-nearest neighbour (KNN) (Cover and Hart, 1967), penalised logistic regression (PLR) (Park and Hastie, 2008), support vector machines (SVM) (Cortes and Vapnik, 1995; Wang, Zhu and Zou, 2008), and the naïve Bayes classifier (Bayes) (Hand and Yu, 2001). Tuning parameters for PLR, KNN, and SVM were selected using cross-validation. For the CQC, the Galton correction was used to reduce skewness, and the optimal quantile was selected by minimising the error rate on the training set (Hennig and Viroli, 2016a).

We used the package Qtools (Geraci, 2016, 2020) for the directional quantile classifier; the package quantileDA (Hennig and Viroli, 2016b) for the centroid, median and componentwise quantile classifiers; the package eqc (Lai and McLeod, 2019) for the ensemble quantile classifier; the package MASS (Venables and Ripley, 2002) for linear discriminant analysis; the package class (Venables and Ripley, 2002) for k-nearest neighbour; the package e1071 (Meyer et al., 2019) for support vector machines and Bayes classifiers; and the package stepPlr (Park and Hastie, 2018) for penalised logistic regression. All analyses were carried out in R version 4.0.0 (R Core Team, 2020).

The misclassification rates averaged over 100 replications for all simulation cases are reported in Tables 2-4. The results indicate that the performance of our proposed classifier improves as n and p increase, in agreement with the theoretical results. In the first two scenarios, our classifier outperforms the competitors when variables are uncorrelated. When variables are correlated, the proposed classifier still performs very well, even if it is not uniformly the best. In the third scenario, where class distributions have different shapes, the performance of our classifier is often, but not always, the best.
5. Clinical trial on Crohn’s disease.
We analyse data from a matched case-control study in first-degree relatives (FDRs) of Crohn's disease (CD) patients, originally published by Sorrentino et al. (2014). The goal of the study was to identify asymptomatic FDRs with early CD signs using several intestinal inflammatory markers. The latter included hemoglobin, erythrocyte sedimentation rate, C-reactive protein, fecal calprotectin, and average mature ileum score. In our analysis, we grouped subjects into two classes, one with signs of inflammation (n = 9 subjects with early or frank CD) and one with normal values of the markers (n = 26 subjects with no signs of inflammation, including healthy controls). In a separate analysis, we augmented the dataset with 45 artificial markers generated from independent standard normal distributions to investigate the impact of uninformative noise on the performance of the DQC. We approached the data analysis with leave-one-out validation and evaluated the misclassification rate as the proportion of subjects that are misclassified when each is left out of the analysis.

We estimated the classification error for all the classifiers included in our simulation study (Section 4). The results are reported in Table 5. The proposed DQC outperforms its competitors in both the original (p = 5) and noisy (p = 50) versions of the dataset.
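The leave-one-out procedure used here is generic and can wrap any of the classifiers considered; in the sketch below, fit_predict is a hypothetical function that fits a classifier on the training data and returns the predicted class label of a single held-out observation.

    # Leave-one-out estimate of the misclassification rate
    loo_error <- function(X, cl, fit_predict) {
      pred <- vapply(seq_len(nrow(X)), function(i) {
        fit_predict(X[-i, , drop = FALSE], cl[-i], X[i, , drop = FALSE])
      }, numeric(1))
      mean(pred != cl)
    }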
6. Conclusions.
We proposed directional quantile classifiers whose predictive ability is consistently good in both simulation and real data studies, on small and large dimensional classification problems. In particular, the empirical results show that our approach either outperforms its competitors or, when this is not the case, its performance is still in the ballpark of that of the best classifiers. Such reliable behaviour across different scenarios is not shared by the other selected classifiers. Moreover, the directional quantile classifiers enjoy optimal theoretical properties under certain assumptions. A limitation of the approach is that the number of directions needed to span a p-sphere with a regular grid becomes prohibitive already at modest values of p. On the other hand, our theoretical results indicate that one can sample directions from an optimal hyperplane, thus reducing the computational burden, but not at the expense of the classifier's performance. Our strategy allows us to balance the importance of the quantile levels and directions used for classification by means of weights, which can be optimised using a convenient closed-form expression.
Table 2
Misclassification rates averaged over 100 replications for ten classifiers (DQC, directional quantile classifier; Centroid, centroid classifier; Median, median classifier; CQC, componentwise quantile classifier; EQC, ensemble quantile classifier; LDA, linear discriminant analysis; KNN, k-nearest neighbour; PLR, penalised logistic regression; SVM, support vector machines; Bayes, naïve Bayes) in the first scenario where populations have symmetric distributions.

                     Uncorrelated                    Correlated
Dimension p      10     50     100    500       10     50     100    500

Sample size n = 50
DQC             0.334  0.187  0.120  0.020     0.315  0.202  0.128  0.028
Centroid        0.355  0.232  0.168  0.049     0.349  0.277  0.189  0.059
Median          0.372  0.230  0.153  0.043     0.357  0.252  0.170  0.047
CQC             0.362  0.273  0.220  0.180     0.367  0.284  0.222  0.177
EQC             0.373  0.240  0.172  0.044     0.339  0.253  0.168  0.055
LDA             0.365  0.382  0.295  0.313     0.245  0.252  0.308  0.339
KNN             0.362  0.287  0.263  0.212     0.360  0.300  0.271  0.211
PLR             0.348  0.199  0.134  0.023     0.275  0.154  0.103  0.025
SVM             0.413  0.252  0.140  0.046     0.401  0.263  0.140  0.049
Bayes           0.390  0.333  0.302  0.225     0.395  0.327  0.287  0.237

Sample size n = 100
DQC             0.306  0.145  0.089  0.015     0.283  0.146  0.076  0.017
Centroid        0.325  0.181  0.114  0.025     0.331  0.210  0.132  0.036
Median          0.334  0.194  0.129  0.032     0.339  0.211  0.136  0.033
CQC             0.343  0.213  0.151  0.076     0.341  0.223  0.157  0.092
EQC             0.337  0.213  0.135  0.039     0.329  0.193  0.138  0.040
LDA             0.338  0.236  0.393  0.182     0.226  0.055  0.105  0.240
KNN             0.346  0.214  0.184  0.102     0.325  0.223  0.191  0.108
PLR             0.329  0.182  0.113  0.019     0.240  0.092  0.056  0.020
SVM             0.370  0.176  0.106  0.032     0.382  0.153  0.069  0.034
Bayes           0.367  0.284  0.227  0.179     0.371  0.291  0.242  0.179

Sample size n = 500
DQC             0.286  0.128  0.069  0.010     0.263  0.126  0.058  0.010
Centroid        0.300  0.145  0.080  0.014     0.288  0.154  0.077  0.016
Median          0.320  0.173  0.101  0.020     0.315  0.178  0.099  0.019
CQC             0.327  0.176  0.108  0.023     0.320  0.183  0.103  0.025
EQC             0.324  0.177  0.106  0.021     0.296  0.149  0.085  0.021
LDA             0.302  0.160  0.106  0.367     0.196  0.027  0.000  0.036
KNN             0.326  0.163  0.098  0.018     0.283  0.166  0.097  0.022
PLR             0.301  0.161  0.104  0.013     0.194  0.044  0.018  0.008
SVM             0.300  0.142  0.084  0.018     0.238  0.066  0.020  0.014
Bayes           0.329  0.198  0.145  0.077     0.326  0.199  0.140  0.076
Table 3
Misclassification rates averaged over 100 replications for ten classifiers (DQC, directional quantile classifier; Centroid, centroid classifier; Median, median classifier; CQC, componentwise quantile classifier; EQC, ensemble quantile classifier; LDA, linear discriminant analysis; KNN, k-nearest neighbour; PLR, penalised logistic regression; SVM, support vector machines; Bayes, naïve Bayes) in the second scenario where populations have distributions with the same skewness.

                     Uncorrelated                    Correlated
Dimension p      10     50     100    500       10     50     100    500

Sample size n = 50
DQC             0.313  0.170  0.096  0.052     0.306  0.169  0.095  0.059
Centroid        0.323  0.212  0.145  0.097     0.330  0.220  0.140  0.113
Median          0.334  0.206  0.147  0.105     0.334  0.215  0.140  0.106
CQC             0.350  0.235  0.187  0.234     0.360  0.245  0.193  0.248
EQC             0.340  0.180  0.089  0.015     0.333  0.178  0.102  0.019
LDA             0.317  0.383  0.238  0.228     0.315  0.397  0.233  0.237
KNN             0.382  0.275  0.210  0.064     0.364  0.282  0.213  0.078
PLR             0.313  0.183  0.095  0.002     0.322  0.186  0.098  0.004
SVM             0.330  0.224  0.150  0.021     0.332  0.240  0.150  0.021
Bayes           0.378  0.281  0.220  0.153     0.377  0.272  0.223  0.161

Sample size n = 100
DQC             0.293  0.129  0.060  0.012     0.280  0.118  0.058  0.010
Centroid        0.310  0.168  0.106  0.057     0.307  0.161  0.104  0.065
Median          0.328  0.177  0.110  0.067     0.316  0.171  0.106  0.071
CQC             0.317  0.173  0.110  0.135     0.314  0.160  0.113  0.137
EQC             0.310  0.135  0.071  0.006     0.296  0.126  0.062  0.006
LDA             0.301  0.218  0.374  0.084     0.281  0.203  0.395  0.089
KNN             0.358  0.242  0.188  0.047     0.353  0.256  0.186  0.045
PLR             0.300  0.163  0.079  0.001     0.284  0.153  0.078  0.001
SVM             0.318  0.177  0.089  0.005     0.333  0.168  0.079  0.005
Bayes           0.330  0.229  0.167  0.095     0.334  0.225  0.166  0.098

Sample size n = 500
DQC             0.273  0.097  0.038  0.000     0.265  0.097  0.035  0.000
Centroid        0.282  0.119  0.059  0.007     0.275  0.116  0.056  0.008
Median          0.295  0.128  0.074  0.017     0.286  0.124  0.069  0.019
CQC             0.272  0.099  0.053  0.019     0.261  0.092  0.050  0.018
EQC             0.267  0.088  0.035  0.001     0.244  0.079  0.032  0.001
LDA             0.279  0.116  0.060  0.374     0.266  0.114  0.057  0.372
KNN             0.323  0.206  0.140  0.016     0.310  0.207  0.140  0.015
PLR             0.279  0.121  0.060  0.000     0.266  0.119  0.056  0.000
SVM             0.283  0.109  0.046  0.000     0.274  0.107  0.044  0.000
Bayes           0.273  0.129  0.080  0.020     0.266  0.125  0.077  0.021

Table 4
Misclassification rates averaged over 100 replications for ten classifiers (DQC, directional quantile classifier; Centroid, centroid classifier; Median, median classifier; CQC, componentwise quantile classifier; EQC, ensemble quantile classifier; LDA, linear discriminant analysis; KNN, k-nearest neighbour; PLR, penalised logistic regression; SVM, support vector machines; Bayes, naïve Bayes) in the third scenario where populations have distributions with opposite skewness.

                     Uncorrelated                    Correlated
Dimension p      10     50     100    500       10     50     100    500

Sample size n = 50
DQC             0.199  0.171  0.166  0.159     0.237  0.172  0.110  0.023
Centroid        0.228  0.176  0.169  0.160     0.362  0.265  0.190  0.066
Median          0.321  0.283  0.273  0.264     0.359  0.240  0.166  0.045
CQC             0.236  0.112  0.087  0.073     0.371  0.279  0.215  0.181
EQC             0.315  0.279  0.256  0.234     0.349  0.239  0.162  0.051
LDA             0.277  0.450  0.253  0.161     0.248  0.270  0.298  0.349
KNN             0.277  0.213  0.192  0.173     0.365  0.284  0.214  0.074
PLR             0.259  0.252  0.213  0.173     0.317  0.189  0.100  0.003
SVM             0.231  0.175  0.170  0.159     0.338  0.240  0.157  0.018
Bayes           0.229  0.132  0.123  0.106     0.373  0.288  0.227  0.145

Sample size n = 100
DQC             0.188  0.162  0.165  0.166     0.195  0.133  0.071  0.016
Centroid        0.214  0.167  0.167  0.166     0.336  0.212  0.128  0.033
Median          0.314  0.287  0.285  0.275     0.341  0.215  0.132  0.032
CQC             0.214  0.086  0.071  0.058     0.346  0.226  0.159  0.091
EQC             0.296  0.254  0.256  0.241     0.334  0.198  0.132  0.039
LDA             0.237  0.300  0.456  0.174     0.222  0.056  0.110  0.238
KNN             0.246  0.204  0.191  0.184     0.351  0.251  0.189  0.044
PLR             0.234  0.252  0.235  0.187     0.284  0.156  0.079  0.001
SVM             0.221  0.169  0.170  0.166     0.323  0.174  0.083  0.004
Bayes           0.187  0.112  0.105  0.095     0.343  0.232  0.169  0.092

Sample size n = 500
DQC             0.182  0.166  0.162  0.159     0.177  0.111  0.053  0.010
Centroid        0.209  0.170  0.165  0.160     0.283  0.153  0.076  0.015
Median          0.312  0.288  0.283  0.279     0.312  0.178  0.099  0.020
CQC             0.203  0.069  0.052  0.041     0.316  0.182  0.104  0.024
EQC             0.282  0.249  0.241  0.236     0.293  0.148  0.085  0.021
LDA             0.212  0.194  0.212  0.474     0.194  0.027  0.000  0.032
KNN             0.193  0.173  0.176  0.178     0.311  0.206  0.138  0.015
PLR             0.213  0.201  0.226  0.237     0.266  0.118  0.057  0.001
SVM             0.209  0.172  0.167  0.163     0.274  0.107  0.043  0.001
Bayes           0.164  0.102  0.096  0.086     0.269  0.126  0.080  0.019
Table 5
Cross-validated estimates of the misclassification rates for the Crohn's disease dataset (p = 5) and its noisy version (p = 50) using ten classifiers (DQC, directional quantile classifier; Centroid, centroid classifier; Median, median classifier; CQC, componentwise quantile classifier; EQC, ensemble quantile classifier; LDA, linear discriminant analysis; KNN, k-nearest neighbour; PLR, penalised logistic regression; SVM, support vector machines; Bayes, naïve Bayes).

            p = 5    p = 50
DQC         0.229    0.229
Centroid    0.286    0.286
Median      0.400    0.400
CQC         0.314    0.343
EQC         0.314    0.314
LDA         0.257    0.543
KNN         0.371    0.343
PLR         0.286    0.343
SVM         0.257    0.257
Bayes       0.286    0.257

APPENDIX A - PROOFS OF THEOREMS
A.1. Proofs of Lemma 1 and Theorem 1.
Proof.
The proofs of Lemma 1 and Theorem 1 follow the arguments given in Hennig and Viroli (2016a, Supplementary Material). Here, we briefly sketch the main idea. The optimal value θ* that minimises the theoretical misclassification probability can be obtained by setting the first derivative of (10) to zero, from which

π_α g_α{Q̃(θ; u)} = π_β g_β{Q̃(θ; u)}.

By assumption, there exists θ* ∈ (0, 1) such that Q̃(θ*; u) = z*. Hence, the identity above is satisfied because Q_α(θ; u) and Q_β(θ; u) are continuous functions of θ that converge to the lower and upper bound of Z for θ approaching either 0 or 1, respectively. Furthermore, under the assumptions of Theorem 1, the optimal Bayes classifier has a single decision boundary at Q̃(θ*; u).

A.2. Proof of Lemma 2.
Proof.
Without loss of generality, assume Q^{(1)}_Z(θ) ≤ Q^{(2)}_Z(θ). Let ∆(θ; z) = Φ^{(2)}(θ; z) − Φ^{(1)}(θ; z) and consider three possible, distinct cases: z ≤ Q^{(1)}_Z(θ), Q^{(1)}_Z(θ) < z < Q^{(2)}_Z(θ), and Q^{(2)}_Z(θ) ≤ z.

If z ≤ Q^{(1)}_Z(θ), then

∆(θ; z) = (1 − θ){Q^{(2)}_Z(θ) − z} − (1 − θ){Q^{(1)}_Z(θ) − z} = (1 − θ){Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ)} ≤ Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ)

by definition. If Q^{(1)}_Z(θ) < z < Q^{(2)}_Z(θ), then

∆(θ; z) = (1 − θ){Q^{(2)}_Z(θ) − z} − θ{z − Q^{(1)}_Z(θ)} = θ{Q^{(1)}_Z(θ) − Q^{(2)}_Z(θ)} + Q^{(2)}_Z(θ) − z ≤ (1 − θ){Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ)} ≤ Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ),

where the first inequality follows from z > Q^{(1)}_Z(θ). Finally, if Q^{(2)}_Z(θ) ≤ z, then

∆(θ; z) = θ{z − Q^{(2)}_Z(θ)} − θ{z − Q^{(1)}_Z(θ)} = θ{Q^{(1)}_Z(θ) − Q^{(2)}_Z(θ)} ≤ 0 ≤ Q^{(2)}_Z(θ) − Q^{(1)}_Z(θ).
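A quick numerical check of the bound, under arbitrary illustrative values of θ and the two quantiles (an illustration, not a substitute for the proof):

    # Lemma 2: Phi^(2) - Phi^(1) never exceeds the quantile gap Q2 - Q1
    theta <- 0.3; Q1 <- 0; Q2 <- 2
    phi_k <- function(z, q) {
      eta <- z - q
      theta * pmax(eta, 0) + (1 - theta) * pmax(-eta, 0)
    }
    z <- seq(-5, 5, by = 0.01)
    max(phi_k(z, Q2) - phi_k(z, Q1)) <= Q2 - Q1   # TRUE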
A.3. Proof of Theorem 2.
Proof.
By Lemma 2, the differences Φ^{(1)}(θ; u^⊤x) − Φ^{(2)}(θ; u^⊤x) and Φ^{(2)}(θ; u^⊤x) − Φ^{(1)}(θ; u^⊤x) are bounded above by Q^{(2)}_Z(θ; u) − Q^{(1)}_Z(θ; u), since Q^{(2)}_Z(θ; u) > Q^{(1)}_Z(θ; u). Therefore the quantity in (6), which is to be minimised with respect to u subject to ‖u‖ = 1, is uniformly bounded above by

π_1 ∫_X {Φ^{(1)}(θ; u^⊤x) − Φ^{(2)}(θ; u^⊤x)} dF^{(1)}(x) + π_2 ∫_X {Φ^{(2)}(θ; u^⊤x) − Φ^{(1)}(θ; u^⊤x)} dF^{(2)}(x) + λ(u^⊤u − 1)
≤ Q^{(2)}_Z(θ; u) − Q^{(1)}_Z(θ; u) + λ(u^⊤u − 1)
= (Q_W(θ; u) + u^⊤µ^{(2)}) − (Q_W(θ; u) + u^⊤µ^{(1)}) + λ(u^⊤u − 1)
= u^⊤(µ^{(2)} − µ^{(1)}) + λ(u^⊤u − 1).

To find u, we optimise the Lagrangian function u^⊤(µ^{(2)} − µ^{(1)}) + λ(u^⊤u − 1) with respect to u and λ, which yields (11).

A.4. Proof of Theorem 3.
Proof.
Let Q̂^{(k)}_Z(θ; u_s) be the empirical quantile computed on the projected training data u_s^⊤X^{(k)}. We write

Φ^{(k)}(θ; u_s^⊤Y) = γ^{(k)}_s(θ) |u_s^⊤Y − Q̂^{(k)}_Z(θ; u_s)|,

where γ^{(k)}_s(θ) = θ + (1 − θ) I{u_s^⊤Y < Q̂^{(k)}_Z(θ; u_s)}. Let µ_y denote the vector of quantiles of Y, and put µ^{(k)}_y = µ^{(k)} − µ_y for k = 1, 2, and V = Y − µ_y. By the triangular inequality,

γ^{(2)}_s(θ)|u_s^⊤Y − Q̂^{(2)}_Z(θ; u_s)| − γ^{(1)}_s(θ)|u_s^⊤Y − Q̂^{(1)}_Z(θ; u_s)| = γ^{(2)}_s(θ)|u_s^⊤V − u_s^⊤µ^{(2)}_y| − γ^{(1)}_s(θ)|u_s^⊤V − u_s^⊤µ^{(1)}_y| + τ_2 |Q̂^{(2)}_Z(θ; u_s) − u_s^⊤µ^{(2)}| + τ_1 |Q̂^{(1)}_Z(θ; u_s) − u_s^⊤µ^{(1)}|,

where τ_1 and τ_2 satisfy |τ_k| ≤ 1, k = 1, 2. Hence

T ≡ Σ_{s=1}^{S} γ^{(2)}_s(θ)|u_s^⊤Y − Q̂^{(2)}_Z(θ; u_s)| − γ^{(1)}_s(θ)|u_s^⊤Y − Q̂^{(1)}_Z(θ; u_s)| = T_1 + τ_1 R_1 + τ_2 R_2,

where T_1 = Σ_{s=1}^{S} γ^{(2)}_s(θ)|u_s^⊤V − u_s^⊤µ^{(2)}_y| − γ^{(1)}_s(θ)|u_s^⊤V − u_s^⊤µ^{(1)}_y|, R_1 = Σ_{s=1}^{S} |Q̂^{(1)}_Z(θ; u_s) − u_s^⊤µ^{(1)}| and R_2 = Σ_{s=1}^{S} |Q̂^{(2)}_Z(θ; u_s) − u_s^⊤µ^{(2)}|. Given the convergence of the empirical quantiles to the respective population quantiles, it follows that

P^{(1)}(T > c − 2c_2 S n^{−1/2}) ≥ P^{(1)}(T_1 > c) − P(R_1 > c_2 S n^{−1/2}) − P(R_2 > c_2 S n^{−1/2}) ≥ P^{(1)}(T_1 > c) − 2 Σ_{s=1}^{S} e^{−2n(δ^{(1)}_s)^2} − 2 Σ_{s=1}^{S} e^{−2n(δ^{(2)}_s)^2}

for any c, c_2 > 0, where

δ^{(k)}_s = min{F^{(k)}(u_s^⊤µ^{(k)} + c_2 n^{−1/2}) − θ, θ − F^{(k)}(u_s^⊤µ^{(k)} − c_2 n^{−1/2})}.

Now define d_s = E{γ^{(2)}_s(θ)|u_s^⊤(V − µ^{(2)}_y)| − γ^{(1)}_s(θ)|u_s^⊤(V − µ^{(1)}_y)|}. Given ε > 0, let K_ε denote the set of indices s ∈ {1, 2, ..., S} such that |γ^{(2)}_s(θ) u_s^⊤µ^{(2)} − γ^{(1)}_s(θ) u_s^⊤µ^{(1)}| > ε for all θ ∈ (0, 1). When Y has distribution F^{(1)}, we have

d_s = E{γ^{(2)}_s(θ)|u_s^⊤(Y − µ^{(2)})| − γ^{(1)}_s(θ)|u_s^⊤(Y − µ^{(1)})|} = γ^{(2)}_s(θ) E|u_s^⊤(W + µ^{(1)} − µ^{(2)})| − γ^{(1)}_s(θ) E|u_s^⊤W|,

where E is the expectation under P^{(1)}. Therefore, by assumption (iii) and provided c ≥ ε, we have Σ_{s ∈ K_ε} d_s ≥ a(c)(♯K_c), where a(c) > 0, with a(c) = γ^{(2)}_s(θ) E|u_s^⊤(W + µ^{(1)} − µ^{(2)})| − γ^{(1)}_s(θ) E|u_s^⊤W| in view of (iii). As a consequence, for E(T_1) = Σ_{s=1}^{S} d_s and ε → 0, and for all c, we have

(A.1)   E(T_1) ≥ a(c)(♯K_c),

where ♯A denotes the cardinality of the set A.

By the Chebychev inequality, and provided that c < E(T_1), we have

(A.2)   P^{(1)}(T_1 > c) ≥ 1 − P^{(1)}(|T_1 − E(T_1)| > c) ≥ 1 − c^{−2} E{T_1 − E(T_1)}^2 = 1 − c^{−2} var(T_1) ≥ 1 − A_2 c^{−2} S,

where var denotes the variance under P^{(1)} and the last inequality follows from assumption (ii); more specifically,

var(T_1) = var{Σ_{s=1}^{S} (γ^{(2)}_s(θ)|u_s^⊤(V − µ^{(2)}_y)| − γ^{(1)}_s(θ)|u_s^⊤(V − µ^{(1)}_y)|)}
≤ var{Σ_{s=1}^{S} (γ^{(2)}_s(θ) u_s^⊤(V − µ^{(2)}_y) − γ^{(1)}_s(θ) u_s^⊤(V − µ^{(1)}_y))}
= var{Σ_{s=1}^{S} (γ^{(2)}_s(θ) u_s^⊤(W + µ^{(1)} − µ^{(2)}) − γ^{(1)}_s(θ) u_s^⊤W)}
≤ Σ_{s=1}^{S} A_2 u_s^⊤u_s + 2 Σ_{s=1}^{S−1} Σ_{s'=s+1}^{S} A_2 u_s^⊤u_{s'}.

Stam (1982) proved that a uniform random variable on the sphere, U ∈ R^p, converges to a standard Gaussian as p → ∞. Therefore, for S → ∞, by the strong law of large numbers we have

{2 Σ_{s=1}^{S−1} Σ_{s'=s+1}^{S} A_2 U_s^⊤U_{s'}} / {S(S − 1)} → A_2 E(Ξ_1^⊤Ξ_2) = 0 almost surely,

where Ξ_1 and Ξ_2 are two independent standard Gaussians. This explains why the covariances become negligible in the last part of (A.2) as p increases.

It remains to prove that c < E(T_1). Consider c = c_1 S n^{−1/2}, where c_1 is a positive constant. By (A.1), the latter holds if c_1 S n^{−1/2} < a(c)(♯K_c). But this is true because it implies that S(n^{1/2} ♯K_c)^{−1} < a(c) c_1^{−1}, where the term on the left goes to zero according to assumption (iv) while a(c) > 0. Thus, with c = c_1 S n^{−1/2}, we have

P^{(1)}(T > c_1 S n^{−1/2} − 2c_2 S n^{−1/2}) ≥ 1 − A_2 n / (c_1^2 S) − 2 Σ_{s=1}^{S} e^{−2n(δ^{(1)}_s)^2} − 2 Σ_{s=1}^{S} e^{−2n(δ^{(2)}_s)^2}.

We wish to choose c_1 and c_2 such that P^{(1)}(T > 0) ≥ 1 − ε. Therefore, we fix ε and choose c_1 such that A_2 / (c_1^2 A_1) ≤ ε, where A_1 is defined in assumption (i). It follows that

A_2 S c^{−2} = A_2 n / (c_1^2 S) ≤ A_2 / (c_1^2 A_1) ≤ ε.

Then we choose c_2 such that c_1 > 2c_2 and observe that 2 Σ_{s=1}^{S} e^{−2n(δ^{(1)}_s)^2} + 2 Σ_{s=1}^{S} e^{−2n(δ^{(2)}_s)^2} → 0 as n, S → ∞. Since this is true for each ε > 0, then P^{(1)}(T > 0) → 1, and similarly P^{(2)}(T < 0) → 1.
REFERENCES
Bradley, R. C. (2005). Basic properties of strong mixing conditions: A survey and some open questions. Probability Surveys.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.

Geraci, M. (2016). Qtools: A collection of models and other tools for quantile inference. R Journal.

Geraci, M. (2020). Qtools: Utilities for quantiles. R package version 1.5.2.

Geraci, M., Boghossian, N. S., Farcomeni, A. and Horbar, J. D. (2020). Quantile contours and allometric modelling for risk classification of abnormal ratios with an application to asymmetric growth-restriction in preterm infants. Statistical Methods in Medical Research.

Hall, P., Titterington, D. M. and Xue, J. H. (2009). Median-based classifiers for high-dimensional data. Journal of the American Statistical Association.

Hand, D. J. and Yu, K. (2001). Idiot's Bayes - Not so stupid after all? International Statistical Review.

Hennig, C. and Viroli, C. (2016a). Quantile-based classifiers. Biometrika.

Hennig, C. and Viroli, C. (2016b). quantileDA: Quantile classifier. R package version 1.1.

Joe, H. (2006). Generating random correlation matrices based on partial correlations. Journal of Multivariate Analysis.

Kong, L. and Mizera, I. (2012). Quantile tomography: Using quantiles with multivariate data. Statistica Sinica.

Lai, Y. and McLeod, I. (2019). eqc: Ensemble quantile classification. R package version 1.2-2.

Lai, Y. and McLeod, I. (2020). Ensemble quantile classifier. Computational Statistics & Data Analysis.

Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A. and Leisch, F. (2019). e1071: Misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.7-3.

Park, M. Y. and Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics.

Park, M. Y. and Hastie, T. (2018). stepPlr: L2 penalized logistic regression with stepwise variable selection. R package version 0.93.

Qiu, W. and Joe, H. (2015). clusterGeneration: Random cluster generation (with specified degree of separation). R package version 1.3.4.

Sorrentino, D., Avellini, C., Geraci, M., Dassopoulos, T., Zarifi, D., Vadalà di Prampero, S. F. and Benevento, G. (2014). Tissue studies in screened first-degree relatives reveal a distinct Crohn's disease phenotype. Inflammatory Bowel Diseases.

Stam, A. J. (1982). Limit theorems for uniform distributions on spheres in high-dimensional Euclidean spaces. Journal of Applied Probability.

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, Fourth ed. Springer, New York, NY.

Wang, L., Zhu, J. and Zou, H. (2008). Hybrid Huberized support vector machines for microarray classification and gene selection. Bioinformatics.