Distributionally robust halfspace depth
Jevgenijs Ivanovs
Department of Mathematics, Aarhus University
Pavlo Mozharovskyi
LTCI, Télécom Paris, Institut Polytechnique de Paris
January 3, 2021
Abstract
Tukey's halfspace depth can be seen as a stochastic program and as such it is not guarded against the optimizer's curse, so that a limited training sample may easily result in a poor out-of-sample performance. We propose a generalized halfspace depth concept relying on the recent advances in distributionally robust optimization, where every halfspace is examined using the respective worst-case distribution in the Wasserstein ball of radius $\delta \geq 0$ centered at the underlying law. The resulting robustified halfspace depth inherits most of the main properties of the latter and, additionally, enjoys various new attractive features such as continuity and strict positivity beyond the convex hull of the support. We provide numerical illustrations of the new depth and its advantages, and develop some fundamental theory. In particular, we study the upper level sets and the median region including their breakdown properties.
Keywords: Data depth, Tukey depth, optimal transport, Wasserstein distance, distributional robustness, multivariate median.
The earliest and arguably most popular depth concept for multivariate data is the halfspace depth introduced by Tukey (1975), see also the recent survey Nagy et al. (2019). For a random vector $X \in \mathbb{R}^d$, $d \geq 1$, with law $P$ it is defined by means of the stochastic program
$$D(z|P) = \inf_{u \in \mathbb{R}^d:\, \|u\|=1} P(\langle u, X - z\rangle \geq 0)$$
for all points $z \in \mathbb{R}^d$, where $\|u\| = \sqrt{\langle u, u\rangle}$ is the Euclidean norm. In words, one considers the probability mass in the closed halfspace $\{x \in \mathbb{R}^d : u^\top(x - z) \geq 0\}$ and minimizes over all possible directions $u$. One of the major problems in applications is that the law $P$ is never known exactly, and so it must be inferred from data. However, optimization in the model calibrated to a given dataset often results in a rather poor out-of-sample performance. This optimizer's curse is a well-known phenomenon in stochastic programming (Esfahani and Kuhn, 2018), and it may roughly be compared to overfitting.

An increasingly popular way of dealing with the optimizer's curse is to explicitly allow for some ambiguity in the approximate model $P_n$ and to find the decision optimizing the worst-case expected cost, see Esfahani and Kuhn (2018); Scarf (1958). The worst case is computed over the ambiguity set of laws, which commonly is a ball of probability measures centered at $P_n$ with respect to some dissimilarity or distance. The empirical law $P_n$ corresponding to the given training sample is a basic example in this context, and this meaning of $P_n$ is used in the rest of this paper.
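For an empirical law the stochastic program above can be approximated by minimizing the halfspace mass over finitely many random directions. The sketch below is our own helper (not part of the paper) and only upper-bounds the exact infimum, since the minimum is taken over a finite set of directions:

```python
import numpy as np

def tukey_depth(z, X, n_dir=2000, rng=None):
    """Approximate halfspace depth D(z | P_n): minimize, over random unit
    directions u, the fraction of sample points in the closed halfspace
    {x : <u, x - z> >= 0}.  The result upper-bounds the exact depth."""
    rng = np.random.default_rng(rng)
    U = rng.standard_normal((n_dir, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions
    # halfspace masses P_n(<u, X - z> >= 0) for every direction u
    masses = ((X - z) @ U.T >= 0).mean(axis=0)
    return masses.min()
```

In low dimensions exact algorithms based on sorting projection angles exist; the random-direction version is merely the most direct translation of the definition.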
Thus a natural candidate is the Wasserstein (earth mover's) distance between two laws on $\mathbb{R}^d$ (see, for example, Rachev and Rüschendorf (1998)):
$$d_W(P, P') = \inf\Big\{\int \|x - y\|\, \Pi(\mathrm{d}x, \mathrm{d}y) : \Pi \in \mathcal{P}(P, P')\Big\} \in [0, \infty],$$
where $\mathcal{P}(P, P')$ is the set of all joint probability laws on $\mathbb{R}^d \times \mathbb{R}^d$ (equipped with the Borel $\sigma$-algebra) with the marginal laws $P$ and $P'$, respectively. We refer to Rubner et al. (2000); Pele and Werman (2009); Cuturi and Doucet (2014) for applications of this distance to machine learning tasks.

According to the above described framework we arrive at the following minimax problem for an arbitrary law $P$ (an empirical law is just one example)
$$D_\delta(z|P) = \inf_{u \in \mathbb{R}^d:\, \|u\|=1} \Big\{ \sup_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle \geq 0) \Big\} \in [0, 1], \qquad (1)$$
where for every direction $u$ the worst-case probability measure $P'$ is considered. Importantly, the inner problem in (1) has a simple explicit solution, which follows from the theory of distributional model risk, see Wozabal (2012); Blanchet and Murthy (2019) and references therein. We argue that this is a natural way to define a smoothed and regularized version of the halfspace depth. Normally the focus is on the relative values of the depth function, and thus the fact that the robustified halfspace depth in (1) is at least as large as the traditional depth is of little concern.

In the sample version we take $P_n$ instead of $P$ and treat $\delta > 0$ as the smoothing parameter. The sets $\{z : D_\delta(z|P_n) = \alpha\}$ are the boundaries of convex sets for all positive $\alpha$ below the maximal depth. Moreover, $D_\delta(z|P_n)$ approaches the true halfspace depth $D(z|P)$ as $n \to \infty$ and $\delta \downarrow 0$.

Figure 1: Heat maps of the traditional Tukey depth $D(\cdot|P_n)$ (left) and the proposed robust Tukey depth $D_\delta(\cdot|P_n)$ (right) for a sample of 25 points drawn from a bivariate standard normal distribution. A few contours on the right plot (among them for depth levels 0.35 and 0.5) are presented to emphasize their convexity (later shown formally). With white color being for 0, the values span from the smallest (shades of yellow) to the highest (shades of red).

We note that in some other contexts such minimax problems can indeed be rewritten as regularized original problems, see Blanchet et al. (2019); Bousquet and Elisseeff (2002); Ghaoui and Lebret (1997); Xu et al. (2009).

In addition, we also briefly consider
$$\underline{D}_\delta(z|P) = \inf_{u \in \mathbb{R}^d:\, \|u\|=1} \Big\{ \inf_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle > 0) \Big\},$$
where the inner problem minimizes the mass in the open halfspace over the Wasserstein ball of measures around $P$. Assuming $d_W(P, P_n) \leq \delta$ for some $\delta > 0$ we obtain the bounds
$$\underline{D}_\delta(z|P_n) \leq D(z|P) \leq D_\delta(z|P_n),$$
see Figure 2. This can be used to construct asymptotic confidence intervals and finite sample guarantees (Esfahani and Kuhn, 2018). Finding the right threshold $\delta > 0$ is, however, a non-trivial problem in its own right.

Figure 2: Depth contours (upper-level sets of the depth function) at a fixed level $\alpha$ for a bivariate normal sample and the following depth functions: $D(\cdot|P)$ (black dashed), $D(\cdot|P_n)$ (black solid), $\underline{D}_\delta(\cdot|P_n)$ (red), and $D_\delta(\cdot|P_n)$ (blue). We use $\delta = 0.05$ (left), $\delta = 0.015$ (middle), and $\delta = 0.005$ (right).
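Since the inner problem will later be reduced to laws on the real line, it is worth recalling that for two empirical laws with equally many atoms the Wasserstein distance with cost $|x - y|$ is attained by matching order statistics (quantile coupling). A minimal illustration, with our own helper name:

```python
import numpy as np

def wasserstein1(xs, ys):
    """W1 distance between two empirical laws on the real line with equally
    many atoms: the optimal transport plan matches sorted samples."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))
```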
The main focus of this paper is on the robustified halfspace depth defined in (1). The inner problem is studied in §2, where we also establish its various useful properties and provide Algorithm 1 for the sample version. In §3 we study the depth function $D_\delta(\cdot|P)$, establish its decay at infinity, and show that the corresponding maximin problem does not necessarily coincide with our minimax formulation in (1). The finite sample version of the robustified depth is further studied in §4, where we prove asymptotic consistency and show that the breakdown point of an upper level set is similar to the case of the classical halfspace depth which, nevertheless, results in a more robust model. Furthermore, the median region may have the asymptotic breakdown point of $1/2$. Simple representations of the outer level sets and the median region are presented in §§5 and 6. Finally, the Appendix contains the remaining proofs.
Chernozhukov et al. (2017) proposed the Monge-Kantorovich depth based on optimal transport theory. Their basic construction consists of transforming $P$ into the spherical uniform distribution and applying the classical halfspace depth (based on this uniform distribution) to the transformed points. We employ optimal transport theory in a different way. Our new distribution is not some fixed well-chosen reference distribution: it is the worst-case distribution for the given direction $u$ and ambiguity radius $\delta > 0$. A related class of depths is based, for a direction $u$, on the quantities $\mathbb{E}\, w(\langle u, X - z\rangle)$ and $\mathbb{E}\, w(-\langle u, X - z\rangle)$ with a weight function $w$ satisfying $w(y) = 0$ for $y < 0$. In our case the problem (1) can be rewritten as $\inf_{\|u\|=1} f\big(\mathcal{L}(\langle u, X - z\rangle)\big)$, where the function $f$ is applied to the law of the projection on $u$, see (4) below, and this seems to be the closest representation.

Nagy and Dvořák (2020) introduced the illumination depth based on a fixed upper level set $R$ corresponding to the classical halfspace depth. More precisely, the illumination depth of a point $z$ outside of $R$ is defined as the volume of the convex hull of $\{z\} \cup R$ divided by the volume of $R$. Note, however, that this depth depends only on the shape of $R$ and the relative position of $z$. For the empirical law $P_n$ this leads to a non-trivial ordering beyond the convex hull of the set of observations. The similarity to our depth comes from the fact that both depths are asymptotically inversely proportional to the distance from $z$ to the convex hull of observations as $z$ escapes to infinity.

Einmahl et al. (2015) use extreme value theory to adjust the empirical halfspace depth when far away from the center. Assuming multivariate regular variation of $P$, they extrapolate from moderately remote regions into the remote regions with few or no observations. In the case of the robustified halfspace depth the extrapolation for the empirical law is simplistic: the depth decays as $\delta/\|z\|$, as will be shown in the following.

This section is devoted to showing that the inner problem in (1) has a simple explicit form, which readily follows from the general theory in Blanchet and Murthy (2019). Furthermore, we establish various useful properties of the solution and specialize to the case where the center of the ambiguity ball is given by the empirical law. It must be mentioned that the latter case can also be treated using the results from Esfahani and Kuhn (2018).
For any $z \in \mathbb{R}^d$ and any direction $u \in \mathbb{R}^d$, $\|u\| = 1$, we consider a random variable $Y \in \mathbb{R}$, which is the projection of $z - X$ onto the direction $u$:
$$Y = \langle u, z - X\rangle, \qquad \text{so that} \qquad \{\langle u, X - z\rangle \geq 0\} = \{Y \leq 0\}.$$
Importantly, the random variable $Y$ will be used to express the solution to the inner problem in (1). It will be shown that for a given direction $u$ this $d$-dimensional optimization problem reduces to a one-dimensional problem, where the ambiguity is specified for the random variable $Y$. In this regard, we write $P_Y$ for the law of $Y$ on $\mathbb{R}$ and $d_W$ for the Wasserstein distance between the laws on the real line with the cost function $c(x, y) = |x - y|$.

Consider the truncated expectation of $Y$ and its left-inverse:
$$h(y) = \mathbb{E}\big(Y \mathbb{1}_{\{Y \in (0, y]\}}\big), \qquad h^-(\delta) = \inf\{y \geq 0 : h(y) \geq \delta\}, \qquad (2)$$
where $y, \delta \geq 0$. Note that $h$ is a non-decreasing right-continuous function, and $h^-$ is a non-decreasing left-continuous function with values in $[0, \infty]$. Note that $h(0+) = 0$ and thus $h^-(\delta) > 0$ for $\delta > 0$. Finally, $a_+ = \max(a, 0)$ and $a_- = \max(-a, 0)$ denote the positive and the negative parts of $a$, respectively.

Proposition 1 (Inner problem). For $\delta \in (0, \infty)$ and $z, u \in \mathbb{R}^d$ with $\|u\| = 1$ it holds that
$$\sup_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle \geq 0) = \sup_{P':\, d_W(P', P_Y) \leq \delta} P'(Y \leq 0) \qquad (3)$$
$$= \inf_{\lambda' > 0}\big\{\delta/\lambda' + \mathbb{E}(1 - Y_+/\lambda')_+\big\} = \delta/\lambda + \mathbb{E}(1 - Y_+/\lambda)_+ \qquad (4)$$
$$= P(Y \leq \lambda) - (h(\lambda) - \delta)/\lambda = P(Y < \lambda) + (\delta - h(\lambda-))/\lambda, \qquad (5)$$
where $\lambda = h^-(\delta) \in (0, \infty]$ and (5) is not used when $\lambda = \infty$. Furthermore, (4) equals 1 iff $\mathbb{E} Y_+ \leq \delta$.

The minimization problem has the solution:
$$\inf_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle > 0) = \inf_{P':\, d_W(P', P_Y) \leq \delta} P'(Y < 0) = P(Y < -\bar\lambda) + (\bar h(\bar\lambda) - \delta)/\bar\lambda, \qquad (6)$$
where $\bar\lambda = \bar h^-(\delta) \in (0, \infty]$ and $\bar h(y) = \mathbb{E}\big(-Y \mathbb{1}_{\{-Y \in (0, y]\}}\big)$ is the analogue of $h$ for the $-Y$ variable. Furthermore, (6) equals 0 iff $\mathbb{E} Y_- \leq \delta$.

Proof. The result in (4) follows readily from (Blanchet and Murthy, 2019, Thm. 1 and Thm. 2(a)) by noting that the distance from $x \in \mathbb{R}^d$ to $A = \{x : \langle u, x - z\rangle \geq 0\}$ is given by $y_+$ with $y = \langle u, z - x\rangle$. This also shows the equivalence of the two optimization problems in (3), which can also be proven directly by relating the ambiguity sets. The representation in (5) for $\lambda < \infty$ follows by rewriting the expectation of the positive part. Suppose that the value is 1; then it cannot be that $\mathbb{E} Y_+ > \delta$, because then $\lambda < \infty$, $P(Y \leq \lambda) < 1$ and $h(\lambda) \geq \delta$. Assuming $\mathbb{E} Y_+ \leq \delta$ we consider the two cases $\lambda = \infty$ and $\lambda < \infty$ to find that the value is 1.

The minimization problem is analyzed by considering
$$1 - \sup_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle \leq 0),$$
and noting that the set $A = \{x : \langle u, x - z\rangle \leq 0\}$ is closed and the distance of $x$ to $A$ is given by $y_-$. We conclude by noting that $1 - (1 - a)_+ = \min(a, 1)$ and simplifying the resulting expression $\mathbb{E}\min(Y_-/\bar\lambda, 1) - \delta/\bar\lambda$.

With regard to the representation in (5) we note that for $\lambda < \infty$: $h(\lambda) \geq \delta \geq h(\lambda-)$, and the solution reads simply $P(Y \leq \lambda)$ when $Y$ has no mass at $\lambda$. The latter is always true if $Y$ has no atoms. In this case we may view the solution as the original probability, where the hyperplane defining the halfspace is shifted appropriately in the direction opposite to $u$.

Importantly, the optimal transport plans exist for both (3) and (6), see Blanchet and Murthy (2019, Lem. 3). That is, there is a random variable $X^*$ (not unique in general) on a possibly extended probability space $(\Omega, \mathcal{F}, P)$ such that
$$\sup_{P':\, d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle \geq 0) = P(\langle u, X^* - z\rangle \geq 0), \qquad \mathbb{E}\|X^* - X\| \leq \delta. \qquad (7)$$
Furthermore, the structure of $X^*$ is very simple: it coincides with $X$ when $\langle u, X - z\rangle = -Y \geq 0$, and it is given by the projection $X + u\langle u, z - X\rangle$ onto the hyperplane when $Y \in (0, \lambda)$, or when $Y = \lambda$ and $U \leq (\delta - h(\lambda-))/\lambda$ with some independent uniform $U$. The latter uniform is needed to transport only some part of the mass corresponding to the atom of $Y$ at $\lambda$, see also Figure 3 below. When the maximal value is 1 the above should be understood as transporting all the mass corresponding to $Y > 0$, in which case $\mathbb{E}\|X^* - X\|$ can be strictly below $\delta$. For the infimum all the mass corresponding to $Y \in (-\bar\lambda, 0)$ is moved to the hyperplane and additionally some mass corresponding to $Y = -\bar\lambda$ as well. Optimal transport plans of this kind are standard in various related problems, see Wozabal (2014); Pflug and Pohl (2017) and references therein.

Here we take $P = P_n = \frac{1}{n}\sum_{i \leq n} \delta_{\{x_i\}}$ to be the empirical measure corresponding to the observations $x_i \in \mathbb{R}^d$, and further simplify the expressions in Proposition 1. Let $y_i = \langle u, z - x_i\rangle$, $i = 1, \ldots, n$, be the projected observations, and denote by
$$y_{(-\bar m)} \leq \cdots \leq y_{(-1)} \leq 0 < y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(m)}, \qquad s_i = \sum_{j=1}^i y_{(j)}, \qquad \bar s_i = -\sum_{j=1}^i y_{(-j)}$$
the sorted values of $y_i$ and the partial sums of positive and non-positive projections, respectively, where $m, \bar m \geq 0$ and $m + \bar m = n$. For convenience we assume that $s_0 = \bar s_0 = 0$, $s_i = s_m$ for $i > m$, and similarly $\bar s_i = \bar s_{\bar m}$ for $i > \bar m$. In particular, we have
$$p = P_n(\langle u, X - z\rangle \geq 0) = (n - m)/n = \bar m/n.$$
It must be noted that the following result can also be obtained from Esfahani and Kuhn (2018, Cor. 5.3).
Corollary 1 (Sample version). Assume that $\|u\| = 1$.

If $1 \leq k \leq m$ is such that $s_{k-1} < \delta n \leq s_k$ then
$$\sup_{P':\, d_W(P', P_n) \leq \delta} P'(\langle u, X - z\rangle \geq 0) = p + \frac{k-1}{n} + \frac{\delta - s_{k-1}/n}{y_{(k)}} \in \Big(p + \frac{k-1}{n},\ p + \frac{k}{n}\Big]. \qquad (8)$$
Furthermore, $s_m \leq \delta n$ yields 1.

If $1 \leq k \leq \bar m$ is such that $\bar s_{k-1} < \delta n \leq \bar s_k$ then
$$\inf_{P':\, d_W(P', P_n) \leq \delta} P'(\langle u, X - z\rangle > 0) = p - \frac{k-1}{n} + \frac{\delta - \bar s_{k-1}/n}{y_{(-k)}} \in \Big[p - \frac{k}{n},\ p - \frac{k-1}{n}\Big).$$
Furthermore, $\bar s_{\bar m} \leq \delta n$ yields 0.

Proof. Note that $s_m < \delta n$ corresponds to $\hat{\mathbb{E}} Y_+ < \delta$ and thus we get 1. Otherwise, choose $1 \leq k \leq m$ as stated. Then $\lambda = y_{(k)}$ and according to Proposition 1 the supremum is given by
$$p + \frac{1}{n}\sum_{i=1}^{k-1}\big(1 - y_{(i)}/y_{(k)}\big) + \delta/y_{(k)},$$
because the indices corresponding to $y_{(i)} = y_{(k)}$ are irrelevant. The first result now follows, where the upper bound stems from the inequality $\delta n - s_{k-1} \leq s_k - s_{k-1} = y_{(k)}$. A similar argument for the second expression gives $\bar\lambda = -y_{(-k)}$ and then also
$$\frac{\bar m - k + 1}{n} - \frac{1}{n}\,\bar s_{k-1}/y_{(-k)} + \delta/y_{(-k)},$$
and the result follows.

For better clarity and visibility we provide the algorithm corresponding to Corollary 1. Its complexity derives from the complexity of sorting, and so it is $O(n \log n)$.

Algorithm 1: Compute $\sup_{P':\, d_W(P', P_n) \leq \delta} P'(\langle u, X - z\rangle \geq 0)$
  compute the projections $y_i = \langle u, z - x_i\rangle$;
  pick the positive values $y_i > 0$, sort them, and let $m \in \{0, \ldots, n\}$ be their number;
  compute the cumulative sums $s_i = \sum_{j=1}^i y_{(j)}$ with $s_0 = 0$;
  if $s_m \leq \delta n$ then
    Result: 1
  else
    find the smallest $k \in \{1, \ldots, m\}$ such that $s_k \geq \delta n$;
    Result: $\frac{n-m}{n} + \frac{k-1}{n} + \frac{\delta - s_{k-1}/n}{y_{(k)}}$

Finally, we note that the optimal transport plan for the supremum consists of moving the points $x_j$ corresponding to $y_{(i)}$, $i = 1, \ldots, k-1$, to the hyperplane, whereas mass $\leq 1/n$ is shifted from the next closest point to the hyperplane so that $\sum(\text{mass}_i \times \text{distance}_i) = \delta$, unless $s_m/n < \delta$, see Figure 3.

Figure 3: Illustration of the solution in (8): $p = 1/n$ and $k = 3$ with the result in $(3/n, 4/n]$. The optimal transport plan (7) is depicted by blue arrows, where the dashed arrow indicates that less than $1/n$ mass is moved.

This section is devoted to various properties of the solution to the inner problem, which form the basis for the following study of the robustified depth $D_\delta(z|P)$. For a fixed $\delta \geq 0$ and some random variable $Y_i \in \mathbb{R}$ we let $v_i = v_i(\delta) \in [0, 1]$ be the optimal value in (3) with $Y \overset{d}{=} Y_i$. In the case $\delta = 0$ we take $v_i = P(Y_i \leq 0)$. Note that $v_i$ is a non-decreasing function of $\delta$.

For further use we first consider a deterministic $Y = c$. Proposition 1 readily implies that
$$v = \begin{cases} \delta/c, & \text{if } c > \delta, \\ 1, & \text{otherwise}. \end{cases} \qquad (9)$$
Next, recall that $Y_1$ is stochastically smaller than $Y_2$ ($Y_1 \prec Y_2$) if $P(Y_1 \leq y) \geq P(Y_2 \leq y)$ for all $y \in \mathbb{R}$. We start with a monotonicity result. The proofs of all of these results are postponed to Appendix A.

Lemma 1 (Monotonicity). If $Y_1^+ \prec Y_2^+$ then $v_1 \geq v_2$ for any $\delta \geq 0$.

Note that $Y_1 \prec Y_2$ implies $Y_1^+ \prec Y_2^+$.

Lemma 2 (Mixture). Let $Y$ be a mixture of $Y_1$ and $Y_2$ with probabilities $p \in (0, 1)$ and $1 - p$, respectively. Then for $\delta \geq 0$ we have
$$v = \sup\big\{p\, v_1(\delta c_1) + (1 - p)\, v_2(\delta c_2) : c_1, c_2 \geq 0,\ p c_1 + (1 - p) c_2 \leq 1\big\}.$$
In particular, there is the sandwich bound:
$$p\, v_1(\delta/p) + (1 - p)\, v_2(0) \leq v \leq p\, v_1(\delta/p) + (1 - p)\, v_2(\delta/(1 - p)).$$

These bounds, for example, imply the following property which will be used later:
$$p\, v(\delta/p) \text{ is non-decreasing in } p \in [0, 1], \qquad (10)$$
where $p = 0$ yields 0. Take a trivial mixture with $Y_i \overset{d}{=} Y$ to find that $v(\delta) \geq p\, v(\delta/p)$. Now the inequality for $0 < p_1 < p_2 \leq 1$ follows from $v(\delta/p_2) \geq (p_1/p_2)\, v\big((\delta/p_2)/(p_1/p_2)\big) = (p_1/p_2)\, v(\delta/p_1)$.

Furthermore, we observe that if $P(Y \leq c) \geq p$ for some $c > 0$ then
$$v(\delta) \geq \min(p, \delta/c). \qquad (11)$$
This follows by regarding $Y$ as a mixture of $Y \,|\, Y \leq c$ and $Y \,|\, Y > c$, using the lower bound and monotonicity in (10), and then applying Lemma 1 and (9).
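Algorithm 1 above is straightforward to implement. The sketch below (function and variable names are ours) follows Corollary 1; in the single-observation case it reproduces the deterministic formula (9):

```python
import numpy as np

def worst_case_prob(u, z, X, delta):
    """Worst-case probability of the closed halfspace {x : <u, x - z> >= 0}
    over the Wasserstein ball of radius delta around the empirical law P_n,
    following Corollary 1 / Algorithm 1 (O(n log n) due to sorting)."""
    y = (z - X) @ u                     # projections y_i = <u, z - x_i>
    n = len(y)
    p = np.mean(y <= 0)                 # mass already in the halfspace
    ypos = np.sort(y[y > 0])            # positive projections, sorted
    m = len(ypos)
    s = np.concatenate(([0.0], np.cumsum(ypos)))   # partial sums s_0, ..., s_m
    if s[m] <= delta * n:               # enough budget to transport all mass
        return 1.0
    k = int(np.searchsorted(s, delta * n))  # smallest k with s_k >= delta * n
    return p + (k - 1) / n + (delta - s[k - 1] / n) / ypos[k - 1]
```

Minimizing this value over directions $u$ (e.g. over a grid of unit vectors) then yields an approximation of $D_\delta(z|P_n)$.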
Lemma 3 (Boundary values). The following is true for $\delta \geq 0$:
(i) if $P(Y_t > c) \to 1$ for every $c > 0$ then $v_t \to 0$;
(ii) if $P(Y_t \leq 0) \to 1$ then $v_t \to 1$.
Let $Y$ be a mixture of $Y_0$ and $Y_t$ with probabilities $p \in (0, 1)$ and $1 - p$, respectively. Then $v \to p\, v_0(\delta/p)$ in case (i) and $v \to p\, v_0(\delta/p) + (1 - p)$ in case (ii).

The next result is crucial for the property P6 of the depth below.
Lemma 4 (Strict monotonicity). Let $Y_2 = Y_1 + \eta$ for $\eta > 0$. Then $v_1 = v_2$ for $\delta > 0$ implies that both $v_i$ are 1.

We conclude with a continuity result.
Lemma 5 (Continuity). Consider $\delta_n \to \delta > 0$ and $Y_n \overset{d}{\to} Y$. Then $v_n(\delta_n) \to v(\delta)$.

Note that the boundary case $\delta = 0$ is excluded from the above result, but see Proposition 4 below.

In this section we show that the robustified halfspace depth defined in (1) satisfies the desirable properties of a depth function (Zuo and Serfling, 2000) with a relaxation of affine invariance, and that it has various further nice features. In addition, we study its decay at infinity and show that the corresponding maximin problem may give a strictly smaller value.
Throughout this section we assume that the law $P$ of the random vector $X \in \mathbb{R}^d$ is fixed. We use $P_{X'}$ to denote the law of the random variable $X' \in \mathbb{R}^d$. The first result shows that the robustified halfspace depth $D_\delta$ inherits all the main properties of the classical halfspace depth $D$, whereas the second result establishes various additional nice properties. We show that our depth function is continuous and strictly positive, with level sets being the boundaries of the convex upper level sets. This should be compared to the 'staircase' form of the empirical halfspace depth. Finally, we show that $D_\delta$ decreases to the halfspace depth as $\delta \downarrow 0$. Thus we regard $D_\delta$ as a smoothed/regularized version of the classical depth function.

It is noted that some further properties readily follow from the ones stated below as, for example, monotonicity relative to the deepest point and maximality at the center for a centrally symmetric distribution. We would like to emphasize that no assumptions are made on $P$ and hence all the properties hold also for the empirical law $P_n$.

Proposition 2 (Standard properties). For any $\delta > 0$ the robustified halfspace depth $D_\delta(\cdot|P)$ satisfies the following properties:
P1 Isometric invariance: $D_\delta(Az + b \,|\, P_{AX+b}) = D_\delta(z \,|\, P_X)$ for any $z, b \in \mathbb{R}^d$ and any $A \in \mathbb{R}^{d \times d}$ with $AA^\top = I$;
P2 Null at infinity: $D_\delta(z|P) \to 0$ as $\|z\| \to \infty$;
P3 Convex upper level sets: the sets $\{z \in \mathbb{R}^d : D_\delta(z|P) \geq \alpha\}$ are convex for every $\alpha \in (0, 1]$.

Proof. P1. Since every vector can be represented as $Ax + b$ we have
$$D_\delta(Az + b \,|\, P_{AX+b}) = \inf_{\|u\|=1} \sup_{P':\, d_W(P'_{AX+b}, P_{AX+b}) \leq \delta} P'\big(\langle u, (AX + b) - (Az + b)\rangle \geq 0\big).$$
From the definition of the Wasserstein distance and $\|(Ax + b) - (Ay + b)\| = \|x - y\|$ we readily deduce that $d_W(P'_{AX+b}, P_{AX+b}) = d_W(P'_X, P_X)$. It is left to note that $\langle u, (AX + b) - (Az + b)\rangle = \langle A^\top u, X - z\rangle$, where the vector $A^\top u$ runs over all unit vectors. The result now follows.

P2. For every $z \neq 0$ we pick $u = z/\|z\|$ and observe that
$$Y = \langle u, z - X\rangle = \|z\| - \langle z, X\rangle/\|z\| \geq \|z\| - \|X\|.$$
According to Lemma 1 the depth is upper-bounded by the value $\tilde v$ corresponding to $\tilde Y = \|z\| - \|X\| \to \infty$ a.s. Conclude by Lemma 3(i).

P3. It is sufficient to show that for any $z_1, z_2 \in \mathbb{R}^d$ and any $t \in [0, 1]$ we have
$$D_\delta\big(t z_1 + (1 - t) z_2 \,|\, P\big) \geq \min\big(D_\delta(z_1|P),\, D_\delta(z_2|P)\big).$$
Note that if $\langle u, z_1 - z_2\rangle \geq 0$ then $\langle u, X - z_1\rangle \leq \langle u, X - (t z_1 + (1 - t) z_2)\rangle$, whereas if $\langle u, z_1 - z_2\rangle \leq 0$ then $\langle u, X - z_2\rangle \leq \langle u, X - (t z_1 + (1 - t) z_2)\rangle$. The result now follows from (1).

Different from the traditional halfspace depth, its robustified version is isometrically invariant only. In other words, our depth is preserved under translation, rotation and reflection. Absence of general affine invariance comes as the price for distributional robustness in a natural way, due to weaker invariance properties of the Wasserstein distance. If affine invariance is a necessary requirement, a distribution dissimilarity satisfying it can be used instead in the inner problem. As another option, provided that the first two moments $\mu_X$ and $\Sigma_X = \Lambda_X \Lambda_X^\top$ are known or (which is more realistic) can be reliably estimated, affine invariance can be implemented as follows: consider the standardized $\tilde X = \Lambda_X^{-1}(X - \mu_X)$ and the respective $\tilde z = \Lambda_X^{-1}(z - \mu_X)$ and calculate the depth of $\tilde z$ w.r.t. the law of $\tilde X$.

Proposition 3 (Additional properties). For any $\delta > 0$ the function $D_\delta(\cdot|P)$ satisfies the following additional properties:
P4 Positivity: $D_\delta(z|P) > 0$ for all $z \in \mathbb{R}^d$;
P5 Joint continuity: $D_\delta(z|P)$ is continuous in $(z, \delta) \in \mathbb{R}^d \times (0, \infty)$;
P6 Level sets: for any $\alpha \in (0, 1)$ the set $\{z \in \mathbb{R}^d : D_\delta(z|P) = \alpha\}$ is the boundary of the respective upper level set $\{z \in \mathbb{R}^d : D_\delta(z|P) \geq \alpha\}$, or both are empty.

Proof. P4. Fix $z$ and note that $\langle u, z - X\rangle \leq \|z - X\|$ when $\|u\| = 1$. Choose $c > 0$ such that $\|z - X\| \leq c$ with probability at least $\epsilon > 0$. Thus for all directions $u$ we have $P(Y \leq c) \geq \epsilon$. According to (11) the depth $D_\delta(z|P)$ is lower bounded by $\min(\epsilon, \delta/c) > 0$.

P5. It suffices to establish continuity of the inner value in (3) jointly in $(u, z, \delta)$ for $\delta > 0$. Take $(u_n, z_n) \to (u, z)$ so that $Y_n = \langle u_n, z_n - X\rangle \to \langle u, z - X\rangle = Y$ a.s., and apply Lemma 5.

P6. Assume that the upper level set is non-empty for the given $\alpha$. By P2 it must be bounded. By continuity we see that the boundary of the upper level set corresponding to $\alpha \in (0, 1)$ must yield depth $\alpha$. Suppose there is a point $z_1$ strictly inside this convex set with $D_\delta(z_1|P) = \alpha$. Choose the corresponding optimal direction $u$, which must exist by continuity of the inner supremum in $u$. Find $z_2$ on the boundary such that $z_2 = z_1 + u\eta$ for some $\eta > 0$. Now we have $\langle u, z_2 - z_1\rangle = \eta > 0$, and so $Y_2 - Y_1 = \eta$ with $Y_i = \langle u, z_i - X\rangle$. Furthermore $\alpha = v_1 \leq v_2$, because the direction $u$ is not necessarily optimal for $z_2$, but it is for $z_1$. Lemma 1 shows that $v_1 = v_2$ and then Lemma 4 implies that $\alpha = 1$, a contradiction.

From the definition (1) it is clear that the depth $D_\delta(z|P)$ is non-decreasing in $\delta \geq 0$. Next we show that $D_\delta(z|P)$ approaches the halfspace depth as $\delta \downarrow 0$. Note that uniform convergence fails in general, since $D_\delta(z|P)$ is continuous in $z$ for every $\delta > 0$, whereas $D(z|P)$ is not necessarily continuous.

Proposition 4 (Tukey depth approximation). For all $z \in \mathbb{R}^d$ it holds that $D_\delta(z|P) \downarrow D(z|P)$ as $\delta \downarrow 0$.

Proof. It is sufficient to show for any direction $u$ that
$$\sup_{d_W(P', P) \leq \delta} P'(\langle u, X - z\rangle \geq 0) - P(\langle u, X - z\rangle \geq 0) \downarrow 0 \qquad \text{as } \delta \downarrow 0.$$
We may assume that $P(Y > 0) > 0$, since otherwise the difference is 0. Then for small enough $\delta > 0$ we have $\mathbb{E} Y_+ > \delta$ and so $\lambda_\delta = h^-(\delta) < \infty$. Now the solution in (5) gives the representation of the above difference:
$$P(Y \in (0, \lambda_\delta]) - (h(\lambda_\delta) - \delta)/\lambda_\delta, \qquad (h(\lambda_\delta) - \delta)/\lambda_\delta \in [0, P(Y = \lambda_\delta)]. \qquad (12)$$
This indeed converges to 0 if $\lambda_\delta \to 0$. Assume that $\lambda_\delta \to \lambda_0 > 0$; then $P(Y \in (0, \lambda_0)) = 0$. If $Y$ has no mass at $\lambda_0$ then we are done, and if such mass is positive then it must be that $(h(\lambda_\delta) - \delta)/\lambda_\delta$ converges to this mass. Hence the expression in (12) goes to 0 and the proof is complete.

For any $\delta > 0$ we define the maximal depth
$$\alpha(\delta) = \max_{z \in \mathbb{R}^d} D_\delta(z|P) \in (0, 1],$$
which is achieved according to P2 and P5.

Lemma 6. The maximal depth $\alpha(\delta) \in (0, 1]$ is non-decreasing and continuous for $\delta > 0$.

Proof. Our proof relies on the joint continuity of the depth function, see P5. Consider $\delta_n \uparrow \delta$ and let $z$ be a point achieving the maximal depth for $\delta$. Now $D_{\delta_n}(z|P) \to D_\delta(z|P)$ and the latter cannot be exceeded for ambiguity radius $\delta_n$, showing left-continuity of $\alpha$. Let $\delta_n \downarrow \delta$ with $z_n$ being points with maximal depth. These $z_n$ must belong to a compact set, see P2, and use monotonicity of the depth in $\delta_n$. Thus $(z_n)$ contains a convergent subsequence, establishing right-continuity of $\alpha$.

The median
$$M_\delta = \{z \in \mathbb{R}^d : D_\delta(z|P) = \alpha(\delta)\} \qquad (13)$$
is a convex region which is, in fact, an interval (commonly a single point) if $\alpha < 1$, see P6. The case $\alpha = 1$ is special as it allows for median regions with a non-empty interior, and it also leads to a simple characterization:
$$M_\delta = \{z \in \mathbb{R}^d : \sup_{\|u\|=1} \mathbb{E}\langle u, z - X\rangle_+ \leq \delta\}, \qquad \text{when } \alpha(\delta) = 1, \qquad (14)$$
see Proposition 1. In words, such points $z$ have the property that the expected distance of $X$ to any halfspace anchored at $z$ is at most $\delta$. The median region for the empirical distribution is further studied in §5.

For a fixed $\delta > 0$, the convergence $X_n \overset{d}{\to} X$ does not in general imply that the median regions $M_\delta^n$ approximate $M_\delta$ well. In particular, $M_\delta$ may be a set with a non-empty interior, whereas all $M_\delta^n$ can be single points or intervals, which happens, for example, when $\langle u, X_n\rangle_+$ are non-integrable for all $u \neq 0$ and all $n$. It seems plausible that the upper level sets for $\alpha \in (0, \alpha(\delta))$ do converge with respect to the Hausdorff distance, say. Such approximation results are beyond the scope of this paper.

Let us now analyze the case of the standard normal vector $X$ in more detail.

Example 1 (Standard normal). Assume that $X$ is a standard $d$-dimensional normal vector. Consider $Y = \langle u, z - X\rangle \sim \mathcal{N}(\langle u, z\rangle, 1)$ when $\|u\| = 1$. Assume for the moment that $z \neq 0$ and note that $\langle u, z\rangle$ is maximal when $u = z/\|z\|$, and thus this is an optimal direction according to Lemma 1. For $z = 0$ any direction is optimal. It is left to analyze the random variable $Y = \|z\| + Z$ for a standard normal $Z$.

Let $\Phi, \varphi$ be the standard normal cumulative distribution function and the density, respectively. According to Proposition 1 the depth is given by
$$D_\delta(z|P) = \Phi(\lambda - \|z\|), \qquad \text{where } \int_0^\lambda x\, \varphi(x - \|z\|)\,\mathrm{d}x = \delta,$$
with $\lambda = \infty$ (leading to 1) when the solution does not exist; recall that the standard Tukey depth corresponds to $\lambda = 0$. In particular, the upper level sets are balls centered at the origin. The maximal depth attains 1 iff $\delta \geq \mathbb{E} Z_+ = 1/\sqrt{2\pi} \approx 0.4$. Otherwise, i.e. for $\delta < 1/\sqrt{2\pi}$, we have $\alpha(\delta) < 1$, which must be attained at $z = 0$. Thus we need to solve $\varphi(0) - \varphi(\lambda) = \delta$ according to the formula for the expectation of the truncated normal. This readily gives
$$\alpha(\delta) = \Phi\Big(\sqrt{-2\log\big(1 - \delta\sqrt{2\pi}\big)}\Big) \in (1/2, 1), \qquad \delta \in (0, 1/\sqrt{2\pi}),$$
see Figure 5 below for an illustration.

The property P2 states that the robustified depth becomes 0 at infinity. Here we study its rate of decay and compare it to the halfspace depth. First, we provide an upper bound.
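Before turning to the decay bounds, the closed form of Example 1 is easy to check numerically. The sketch below (the helper name is ours) evaluates $\alpha(\delta) = \Phi(\sqrt{-2\log(1 - \delta\sqrt{2\pi})})$ and can be used to verify the defining equation $\varphi(0) - \varphi(\lambda) = \delta$:

```python
import math

def max_robust_depth(delta):
    """alpha(delta) for the standard normal law, 0 < delta < 1/sqrt(2*pi):
    solve phi(0) - phi(lambda) = delta and return Phi(lambda)."""
    lam = math.sqrt(-2.0 * math.log1p(-delta * math.sqrt(2.0 * math.pi)))
    return 0.5 * (1.0 + math.erf(lam / math.sqrt(2.0)))  # Phi(lam)
```

As $\delta \downarrow 0$ this tends to the classical maximal Tukey depth $1/2$, and it increases towards 1 as $\delta \uparrow 1/\sqrt{2\pi}$.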
Lemma 7. If $\mathbb{E}\|X\|^p < \infty$ for some $p \in (0, 1]$ then for $\delta > 0$ there exists a constant $c > 0$ such that $D_\delta(z|P) \leq c\|z\|^{-p}$ for all $z \in \mathbb{R}^d \setminus \{0\}$.

Proof. Take $u = z/\|z\|$ assuming $z \neq 0$ and reconsider the optimal transport plan $X^*$ satisfying (7). Note that $\langle u, X^* - z\rangle \leq \|X^*\| - \|z\|$, and so we have the bound
$$D_\delta(z|P) \leq P(\|X^*\| - \|z\| \geq 0) \leq \frac{\mathbb{E}\|X^*\|^p}{\|z\|^p}, \qquad \mathbb{E}\|X^* - X\| \leq \delta;$$
here we have also used Markov's inequality. Using $(a + b)^p \leq a^p + b^p$ for $p \leq 1$ we get
$$\mathbb{E}\|X^*\|^p \leq \mathbb{E}\big(\|X^* - X\| + \|X\|\big)^p \leq \mathbb{E}\|X^* - X\|^p + \mathbb{E}\|X\|^p.$$
It is left to observe that $\mathbb{E}\|X^* - X\|^p \leq \mathbb{E}(1 + \|X^* - X\|) \leq 1 + \delta$.

Next, we show that the decay can never be faster than linear.

Lemma 8.
For $\delta > 0$ it holds that
$$\liminf_{\|z\| \to \infty} \|z\|\, D_\delta(z|P) \geq \delta.$$

Proof. Choose a constant $r > 0$ such that $P(\|X\| \leq r) \geq 1/2$, and let $B_r$ be the ball of radius $r$ centered at the origin. For any point $z$ and any direction $u$ we must have
$$\langle u, z - x\rangle \leq \|z\| + r, \qquad \forall x \in B_r.$$
Thus $Y = \langle u, z - X\rangle \leq \|z\| + r$ with probability $p \geq 1/2$, and according to (11) we find
$$D_\delta(z|P) \geq \min\big(1/2,\ \delta/(\|z\| + r)\big).$$
This lower bound behaves asymptotically as $\delta/\|z\|$ and the result follows.

Thus we find that for an integrable $X$ the depth decays linearly. It will be shown in Proposition 8 that the empirical depth (which is necessarily compactly supported) has the exact rate of decay $\delta/\|z\|$.

Corollary 2 (Linear decay). Assume $\mathbb{E}\|X\| < \infty$. Then for $\delta > 0$ there exist two constants $0 < c_1 < c_2$ such that
$$c_1 \|z\|^{-1} \leq D_\delta(z|P) \leq c_2 \|z\|^{-1}$$
when $\|z\|$ is large enough.

Proof. Combine Lemma 7 and Lemma 8.

In the following we discuss a simple class of examples where the decay is slower than linear. Necessarily, such $X$ must have infinite first moment $\mathbb{E}\|X\| = \infty$. Here we rely on some basic theory of regular variation (Bingham et al., 1989).

Lemma 9.
Assume that $X$ has a spherical distribution and $P(\langle u, X\rangle > t) = t^{-\alpha}\ell(t)$ with $\alpha \in (0, 1)$ and some slowly varying (at infinity) function $\ell$. Then $D_\delta(z\,|\,P) \sim \|z\|^{-\alpha}\ell(\|z\|)$ as $\|z\| \to \infty$ for any $\delta \ge 0$. In particular,

$$\lim_{\|z\|\to\infty} \|z\|^p D_\delta(z\,|\,P) = \begin{cases} 0, & p < \alpha, \\ \infty, & p > \alpha. \end{cases}$$

Proof. Consider $Y = \langle u, z - X\rangle \stackrel{d}{=} \langle u, z\rangle - Z$, where $Z$ has the distribution of $\langle u, X\rangle$; the same for all $u$ by assumption. The largest $\langle u, z\rangle$ equals $\|z\|$, and according to Lemma 1 it gives the smallest value, and hence the depth. According to (5) we then have

$$D_\delta(z\,|\,P) \ge P(\|z\| - Z < \lambda) = P(Z > \|z\| - \lambda) = (\|z\| - \lambda)^{-\alpha}\,\ell(\|z\| - \lambda),$$

where $\lambda = \lambda(\|z\|)$ is defined using the random variable $\|z\| - Z$. For the lower bound it is thus left to show that $\lambda/\|z\| \to 0$, see the uniform convergence theorem (Bingham et al., 1989, Thm. 1.2.1). For the upper bound we use $P(\|z\| - Z \le \lambda)$, resulting in the same asymptotic behavior.

On the contrary, assume that for some $\epsilon > 0$ we have $\lambda > \epsilon\|z\|$ for some $z$ with arbitrarily large norm. For simplicity we write $\eta = \|z\|$. From the definition of $\lambda$ we then must have

$$\delta \ge E\big((\eta - Z)\,1_{\{\eta - Z \in (0,\, \epsilon\eta]\}}\big) \ge \tfrac{\epsilon\eta}{2}\, P\big(Z \in [(1-\epsilon)\eta,\, (1-\epsilon/2)\eta)\big).$$

The latter probability is asymptotic to $\big((1-\epsilon)^{-\alpha} - (1-\epsilon/2)^{-\alpha}\big)\,\eta^{-\alpha}\ell(\eta)$, and so for $\alpha \in (0, 1)$ the right hand side in the display increases to $\infty$ as $\eta \to \infty$, which is a contradiction.

In conclusion, the main difference from the halfspace depth arises in the integrable case, where the traditional depth may decay at an arbitrarily fast rate. One particular example concerns empirical distributions, where the halfspace depth is 0 for large enough $\|z\|$, whereas the robustified depth decays as $\delta/\|z\|$. In the non-integrable case one may distinguish between the directions along which $z$ escapes to infinity, and then the decay may be drastically different for different directions.

One may be interested in swapping minimax in (1) for the maximin formulation. By the standard max-min inequality we have

$$\sup_{P': d_W(P', P) \le \delta}\; \inf_{u \in \mathbb{R}^d:\|u\|=1} P'\big(\langle u, X - z\rangle \ge 0\big) \;\le\; D_\delta(z\,|\,P), \qquad (15)$$

and the obvious question is whether the two formulations coincide. The answer is no in general, and we provide a simple example with strict inequality below. We take $d = 2$, $z = 0$ and consider

$$P = \tfrac12\, \delta_{\{(1,1)\}} + \tfrac12\, \delta_{\{(1,-1)\}}, \qquad (16)$$

putting half of the mass at the point $(1, 1)$ and the other half at $(1, -1)$. For $\delta \in (0, 1/2)$ the classical halfspace depth of the origin is 0, while the direction $u = (-1, 0)$ results in $D_\delta(0\,|\,P) = \delta$.

Figure 4: Illustration of the cones in the proof of Lemma 10 for $n = 1$.

Lemma 10.
For $\delta \in (0, 1/2)$ and $P$ in (16) it holds that

$$\sup_{P': d_W(P',P) \le \delta}\; \inf_{u \in \mathbb{R}^d:\|u\|=1} P'\big(\langle u, X\rangle \ge 0\big) \;=\; \frac{\delta}{\sqrt 2} \;<\; \delta = D_\delta(0\,|\,P).$$

Proof. It is only required to show that the supremum cannot exceed $\delta/\sqrt 2$, since by moving mass $\delta/\sqrt 2$ from an atom to the origin (each atom is at distance $\sqrt 2$ from the origin, so the transport cost is $\delta$) we obtain an admissible law achieving the value $\delta/\sqrt 2$. Note that the optimal direction $u$ in this case is not unique.

Fix $P'$ in the ambiguity ball centered at $P$, choose $n \ge 1$ and set $\theta = \pi/(4(n+1))$. Next, we consider the masses in the cones

$$m_i = P'\big(X_1 > 0,\; X_2/X_1 \in [\tan(\pi/2 - (i+1)\theta),\, \tan(\pi/2 - i\theta))\big), \qquad i = 1, \dots, n,$$
$$\hat m_i = P'\big(X_1 > 0,\; -X_2/X_1 \in [\tan(\pi/2 - (i+1)\theta),\, \tan(\pi/2 - i\theta))\big), \qquad i = 1, \dots, n,$$

and we let $m_0$ collect the remaining mass: the origin, the halfplane $\{X_1 < 0\}$, and the central cone between the two families; see Figure 4. Observe that

$$\inf_{u \in \mathbb{R}^d} P'\big(\langle u, X\rangle \ge 0\big) \le \min\big\{m_1 + \cdots + m_n + m_0,\; m_2 + \cdots + m_n + m_0 + \hat m_n,\; \dots,\; m_0 + \hat m_n + \cdots + \hat m_1\big\}, \qquad (17)$$

where the minimum runs over $2n+1$ terms corresponding to certain halfspaces. Moreover, the distance from the point $(1, 1)$ to the $i$th halfspace (the closest point therein) is $\sin(i\theta)\sqrt 2$, and the distances are reversed for the point $(1, -1)$. Hence $d_W(P', P) \le \delta$ implies the constraint

$$(m_1 + \hat m_1)\sin(\theta) + \cdots + (m_n + \hat m_n)\sin(2n\theta) + m_0\sin((2n+1)\theta) \le \delta/\sqrt 2. \qquad (18)$$

We now maximize the minimum in (17) under the constraint (18), which must yield an upper bound on the supremum of interest. Observe that the solution must make all the $2n+1$ terms in (17) equal, and so $m_i = \hat m_{n+1-i}$ for $i = 1, \dots, n$. This is so since otherwise the mass can be redistributed in such a way that the smallest terms become slightly larger. The underlying argument requires a little thought, which is left for the reader.

Letting $\tilde m_i = m_i + m_{n+1-i} = \hat m_{n+1-i} + \hat m_i$ for $i = 1, \dots, n$ we observe that we need to maximize $\tilde m_1 + \cdots + \tilde m_n + m_0$ subject to

$$\tilde m_1\big(\sin(\theta) + \sin(2n\theta)\big) + \cdots + \tilde m_n\big(\sin(n\theta) + \sin((n+1)\theta)\big) + m_0\sin((2n+1)\theta) \le \delta/\sqrt 2.$$

By concavity of $\sin$ on $[0, \pi/2]$ we see that $\sin((2n+1)\theta)$ is the smallest coefficient, and so the maximum is obtained by taking $m_0 = \frac{\delta}{\sqrt 2\, \sin((2n+1)\theta)}$ and all other variables being zero. It is left to note that

$$\sin((2n+1)\theta) = \sin\big((2n+1)\pi/(4(n+1))\big) \uparrow \sin(\pi/2) = 1 \quad \text{as } n \to \infty.$$

The main problem in using the maximin formulation is that it does not seem to yield an explicit solution in general, unlike the robustified depth suggested in this paper. Furthermore, our depth concept is in accordance with the modern approach of dealing with the optimizer's curse in stochastic programs (Esfahani and Kuhn, 2018).

Throughout this section we assume that $P = P_n = \frac1n \sum_{i=1}^n \delta_{\{x_i\}}$ is the empirical measure corresponding to the $n$ iid observations of $X$. We have a consistency result, uniform in both the point $z$ and the ambiguity radius $\delta$.

Proposition 5.
For $P_n$ obtained by independently sampling from $P$ it holds that

$$\sup_{z \in \mathbb{R}^d,\, \delta \ge 0} \big|D_\delta(z\,|\,P_n) - D_\delta(z\,|\,P)\big| \to 0 \quad \text{a.s. as } n \to \infty.$$

Proof. Firstly, the case $\delta = 0$ is classical (Donoho and Gasko, 1992). According to $|\inf f - \inf g| \le \sup|f - g|$ and the representation (4) of the depth for $\delta > 0$, it is sufficient to show that

$$\sup_{z, u \in \mathbb{R}^d} \big|E_n(1 - \langle u, z - X\rangle^+)^+ - E(1 - \langle u, z - X\rangle^+)^+\big| \to 0,$$

where the division by $\lambda'$ is incorporated into $u$. Now the result follows from empirical process theory (Pollard, 1984, Thm. II.24 and Lem. II.25). It is only needed to observe that the graph of the function $f_{u,z}(x) = (1 - \langle u, z - x\rangle^+)^+ \in [0, 1]$ can be constructed from three half-spaces in $\mathbb{R}^{d+1}$ using union and intersection operations, and so the respective class has polynomial discrimination (Pollard, 1984, Lem. II.15).

Upon recalling the continuity of the depth in the ambiguity radius we readily obtain the basic consistency result.

Corollary 3. For any $z \in \mathbb{R}^d$ and any sequence $\delta_n \downarrow 0$ it holds that $D_{\delta_n}(z\,|\,P_n) \to D(z\,|\,P)$ a.s. as $n \to \infty$.

Proof. Consider the bound

$$\big|D_{\delta_n}(z\,|\,P_n) - D(z\,|\,P)\big| \le \big|D_{\delta_n}(z\,|\,P_n) - D_{\delta_n}(z\,|\,P)\big| + \big|D_{\delta_n}(z\,|\,P) - D(z\,|\,P)\big|$$

and observe that both terms converge to 0 according to Proposition 5 and Proposition 4, respectively.

It must be noted that a similar result follows from the general theory in (Esfahani and Kuhn, 2018, Thm. 3.6) based on the measure concentration result in Fournier and Guillin (2015). That result, however, relies on the somewhat restrictive assumption $E e^{\|X\|^a} < \infty$ for some $a > 1$, which is slightly stronger than saying that $\|X\|$ is light-tailed. Furthermore, it is required there that $\delta_n$ decays sufficiently slowly. Additionally, Esfahani and Kuhn (2018) provide a finite sample guarantee.

We start by defining an important quantity $\alpha^* = \alpha^*(\delta) \in (0, 1/2]$ using the equation

$$\frac{\alpha^*}{1 - \alpha^*} = \alpha\Big(\frac{\delta}{1 - \alpha^*}\Big), \qquad (19)$$

where $\alpha$ denotes the maximal depth. Figure 5 illustrates the maximal depth $\alpha$ and also $\alpha^*$ for the standard normal distribution in an arbitrary dimension $d$, see Example 1. Let us summarize the basic properties of $\alpha^*$.

Lemma 11.
Let $\delta \ge 0$. The equation (19) has a unique solution $\alpha^*$ in $(0, 1)$. Moreover, $\alpha^* \in (0, 1/2]$ with $\alpha^* = 1/2$ iff $\alpha(2\delta) = 1$. For any $\alpha \in [0, 1)$ it holds that

$$\frac{\alpha}{1 - \alpha} < \alpha\Big(\frac{\delta}{1 - \alpha}\Big) \quad \text{iff} \quad \alpha < \alpha^*.$$

Furthermore, $\alpha^*(\delta) \le \alpha(\delta)$ and $\alpha^*(\delta)$ is non-decreasing in $\delta \ge 0$.

Figure 5: $\alpha(\delta)$ and $\alpha^*(\delta)$ (red) for the standard normal in an arbitrary dimension.

Proof. According to (10) the function $p\,\alpha(\delta/p)$ is non-decreasing in $p \in [0, 1]$, uniformly over $z$ and all directions $u$; it is assumed to be 0 at $p = 0$. Hence $p\,\alpha(\delta/p) - (1 - p)$ is strictly increasing with values $-1$ at $p = 0$ and $\alpha(\delta) > 0$ at $p = 1$, see Lemma 6. Hence there is a unique root, proving the first claim. The equivalence of the inequalities follows from the strict monotonicity. Finally, $\alpha^* = 1/[1 + 1/\alpha(\delta/(1 - \alpha^*))]$, which must be in $(0, 1/2]$ since $\alpha \le 1$, and the value $1/2$ is attained iff $\alpha(\delta/(1 - 1/2)) = \alpha(2\delta) = 1$. The fact that $\alpha^*(\delta)$ is non-decreasing is inherited from the same property of $\alpha(\delta)$. Note also that $(1 - \alpha^*)\,\alpha(\delta/(1 - \alpha^*)) \le \alpha(\delta)$ and hence $\alpha^* \le \alpha(\delta)$.

Consider the upper level set

$$U_\delta(\alpha) = U_\delta(\alpha\,|\,P_n) = \{z \in \mathbb{R}^d : D_\delta(z\,|\,P_n) \ge \alpha\}$$

and let $d_H(A, B)$ be the Hausdorff distance between two compact sets (the maximal distance from a point in one set to the closest point in the other set). We let $d_H(A, B) = \infty$ if $A$ or $B$ is empty. The breakdown point of $U_\delta(\alpha)$ for a given $P_n, \delta, \alpha$ is defined (Donoho and Gasko, 1992; Nagy et al., 2019) by

$$BP(U_\delta(\alpha), P_n) = \min_{m \in \mathbb{N}} \Big\{\frac{m}{m + n} : \sup_{y_1, \dots, y_m \in \mathbb{R}^d} d_H\big(U_\delta(\alpha\,|\,P_n), U_\delta(\alpha\,|\,P_{n+m})\big) = \infty\Big\},$$

where $P_{n+m} = \frac{1}{n+m}\big(\sum_{i=1}^n \delta_{\{x_i\}} + \sum_{i=1}^m \delta_{\{y_i\}}\big)$. In words, it is the minimal proportion of new (contaminating) observations which can make the upper level set arbitrarily different from the given one. In the case of the classical halfspace depth $\delta = 0$ the breakdown point is given by Nagy and Dvořák (2020); Donoho and Gasko (1992), and here we get a similar result. Note, however, that the robust depth is at least as large as the traditional Tukey depth and so the effective $\alpha$ in our case (resulting in a similar upper level set) is larger, making our depth more robust to contamination. In this regard, we also mention that our upper bound on $\alpha$ is larger. We write $\alpha^*(\delta, P_n)$ for the solution of (19) with respect to the measure $P_n$ and radius $\delta$.

Proposition 6.
For any $\delta \ge 0$, $P_n$ and $0 < \alpha \le \alpha^*(\delta, P_n)$ it holds that

$$BP(U_\delta(\alpha), P_n) = \frac{\lceil n\alpha/(1 - \alpha)\rceil}{n + \lceil n\alpha/(1 - \alpha)\rceil}.$$

Moreover, for $0 < \alpha < \alpha^*(\delta, P)$ we also have $BP(U_\delta(\alpha), P_n) \to \alpha$ as $n \to \infty$ on a set of measure one.

Proof. The proof follows the same ideas as in (Donoho and Gasko, 1992, Lem. 3.1), but here we rely on the properties of the optimal value $v$ (for a fixed direction) established in Section 2. Let $m$ be the smallest integer satisfying $m/(n + m) \ge \alpha$, that is, $m = \lceil n\alpha/(1 - \alpha)\rceil$. First, we show that $m$ new points are sufficient for the breakdown. We place all $y_i$ at the same location $z = t u$ and let $t \to \infty$. The value at $z$ along any direction is at least $m/(m + n)$ according to Lemma 2. Hence the depth at this far removed $z$ is at least $\alpha$ and we are done.

Next, suppose $k < m$ points are sufficient to achieve infinite distance while having a non-empty upper level set. Let $z$ be a point in this new upper level set. We denote its distance to the convex hull of the $x_i$ by $d$. By assumption we may choose $z$ such that $d$ is arbitrarily large. Take the direction $u$ along which this distance is computed, and such that it points in the direction of $z$. The value at $z$ along the direction $u$ converges to $k/(n + k) < \alpha$ as $d \to \infty$ according to Lemma 3. This yields a contradiction.

Now we show that for any $k < m$ additional points the upper level set is non-empty; the original level set is non-empty since $\alpha \le \alpha^* \le \alpha(\delta)$ according to Lemma 11. Let $p = n/(n + k)$ and take a point $z$ with the maximal depth $\alpha(\delta/p)$ in the original dataset for an inflated ambiguity ball. For any direction $u$ and arbitrary $y_i$ the new depth of $z$ is larger than $p\,\alpha(\delta/p)$, see Lemma 2. It is left to check that this lower bound is not smaller than $\alpha$. By assumption $k < n\alpha/(1 - \alpha)$ and thus $p > 1 - \alpha$. According to (10) we indeed have

$$p\,\alpha(\delta/p) \ge (1 - \alpha)\,\alpha\big(\delta/(1 - \alpha)\big) \ge \alpha,$$

where the latter follows from Lemma 11 and the assumption $\alpha \le \alpha^*(\delta, P_n)$.

The final statement, in view of Lemma 11, only requires checking that $\alpha(\delta, P_n) \to \alpha(\delta, P)$ a.s. for any $\delta \ge 0$, which is a consequence of the uniform consistency result in Proposition 5, see also the proof of Lemma 6.

Note that $\alpha^*(0) = \alpha(0)/(1 + \alpha(0))$ and we recover the upper bound on $\alpha$ in the classical setting. The only difference in the breakdown point of the upper level set for the robustified and classical halfspace depths is in the allowed range of the level $\alpha$, but see the above comment on the effective $\alpha$ implying that our depth is more robust; see also Figure 6 for an illustration. This result can be reformulated to show that for large $\alpha$ (exceeding the above bound) the breakdown occurs at $\alpha^*(\delta, P)$ due to the empty set problem. We note that the level set for $\alpha$ equal to the maximal depth breaks down with a single additional point, in the sense that such a level is not achieved in the new dataset. A more relevant statistic is the median $M_\delta$ itself, defined in (13), and we study it below.

Figure 6: Heat maps of the traditional Tukey depth $D(\cdot\,|\,P_{n+m})$ (left) and the proposed robust Tukey depth $D_\delta(\cdot\,|\,P_{n+m})$ (right) for a sample of $n + m = 120$ points drawn from a bivariate standard normal distribution, except for the $m = 30$ (top) and $m = 50$ (bottom) contaminating observations. Depth contours are plotted for depth values given by fixed fractions of $\alpha(\cdot, P_{n+m})$; one distinguished contour is plotted with a dashed green line.

Proposition 7.
For any $\delta \ge 0$ and $\alpha^*_n = \alpha^*(\delta, P_n)$ it holds that

$$BP(M_\delta, P_n) \ge \frac{\lceil n\alpha^*_n/(1 - \alpha^*_n)\rceil}{n + \lceil n\alpha^*_n/(1 - \alpha^*_n)\rceil}.$$

If $\alpha(2\delta, P_n) = 1$ then $BP(M_\delta, P_n) = 1/2$. Furthermore,

$$\liminf_{n \to \infty} BP(M_\delta, P_n) \ge \alpha^*(\delta, P)$$

on a set of measure one, where the limit is $1/2$ if $\alpha(2\delta, P) = 1$.

Proof. Suppose there exists $z$ outside of the convex hull of the $x_i$, and such that it belongs to the new median region. Its depth is no more than $p + \delta/d$ with $p = m/(n + m)$, where $d$ is the distance from the convex hull, see Lemma 2 as well as Lemma 1 with (9); alternatively one may use Lemma 3. The points in the old median region have new depth at least $(1 - p)\,\alpha(\delta/(1 - p), P_n)$, and hence, letting $d \to \infty$, we must have

$$p \ge (1 - p)\,\alpha\big(\delta/(1 - p), P_n\big).$$

According to Lemma 11 we must have $p \ge \alpha^*_n$ and then $m \ge n\alpha^*_n/(1 - \alpha^*_n)$.

Let us now show that $m = n$ new points are sufficient for the breakdown when $\alpha^*_n = 1/2$, so that the breakdown point is then exactly $1/2$. Suppose breakdown does not occur, so that the new median must be contained in some inflation of the given convex hull. Now we copy the given $n$ points and shift them sufficiently far away, so that the two inflations do not intersect. But we may regard the new set of points as the original dataset, which readily leads to a contradiction. The final statement follows from the convergence $\alpha^*_n \to \alpha^*(\delta, P)$ a.s.

Importantly, $\alpha^*(\delta)$ is non-decreasing and so the breakdown bound grows with $\delta$. Furthermore, any median region with non-empty interior has a breakdown point of $1/2$ when $\alpha(2\delta) = 1$; for the standard normal this occurs when $\delta \ge 2^{-3/2}\pi^{-1/2} \approx 0.2$, see Figure 5. In this case the median has a breakdown point of $1/2$, even though the maximal depth can be as small as 0.881.
Importantly, in the sample version quite a bit more can be said about the shape of the median region and the outer level sets, that is, those corresponding to a small level $\alpha$. Characterization of such sets is addressed in this section.

The convex hull of the observations $x_1, \dots, x_n$ is denoted by $H$. Let $d(z, H) = \inf\{\|z - y\| : y \in H\}$ be the distance from $z$ to the convex hull. The next result shows that for $\alpha \le 1/n$ the upper level set is the $\delta/\alpha$-thickened convex hull $H$, see Figure 7.

Proposition 8 (Outer level sets). The following is true for any $\delta > 0$ and $z \in \mathbb{R}^d$:

• If $d(z, H) \ge \delta n$ then $D_\delta(z\,|\,P_n) = \delta/d(z, H) \le 1/n$.
• For $\alpha \in (0, 1/n]$ the level set is given by $\{z : D_\delta(z\,|\,P_n) = \alpha\} = \{z : d(z, H) = \delta/\alpha\}$.
• $D_\delta(z\,|\,P_n) \sim \delta/\|z\|$ as $\|z\| \to \infty$.

Proof. Let $z^*$ be the unique point in $H$ such that $\|z - z^*\| = d(z, H) \ge \delta n$. Consider the direction $u = (z - z^*)/\|z - z^*\|$ and observe that this direction maximizes $\min_i \langle u, z - x_i\rangle$. This can be easily seen by examining the vertices of the face containing $z^*$. For such $u$ consider the solution to the inner problem as described in Corollary 1. Note that $m = n$ and $y_{(1)} = d(z, H) \ge \delta n$. Hence we get the solution $\delta/d(z, H) \le 1/n$. Any other direction yields a larger value, since either $y_{(1)}$ is smaller or $m < n$. This proves the first statement.

Since $\delta/\alpha \ge \delta n$, we see from the first statement that $\{z \in \mathbb{R}^d : d(z, H) = \delta/\alpha\}$ must have depth $\alpha$. The second result now follows from the property P6 of the depth.

For the third result we let $c$ be the maximal distance of the points in $H$ from the origin. Then $\|z\| - c \le d(z, H) \le \|z\| + c$, showing that $d(z, H) \sim \|z\|$, and the result follows easily from the first statement.

Figure 7: Illustration of the outer upper level sets for $\alpha \le 1/n$.
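The empirical depth itself can be computed from the representation $v = \inf_{\lambda' > 0}\{\delta/\lambda' + E(1 - Y^+/\lambda')^+\}$ with $Y = \langle u, z - X\rangle$, minimizing over a grid of directions as with Algorithm 1. Between consecutive order statistics of $Y^+$ the objective is monotone in $\lambda'$, so it suffices to scan the distinct positive values (cf. Corollary 1). The following sketch for $d = 2$ is ours (Python rather than the paper's R; the function names are hypothetical):

```python
import numpy as np

def inner_value(y, delta):
    # v(u; delta) = inf_{lam > 0} { delta/lam + mean((1 - y_plus/lam)_+) };
    # scan lam over distinct positive values of y_plus; lam -> inf gives 1
    yp = np.maximum(y, 0.0)
    best = 1.0
    for lam in np.unique(yp[yp > 0]):
        best = min(best, delta / lam + np.mean(np.maximum(1.0 - yp / lam, 0.0)))
    return best

def robust_depth(z, X, delta, n_dirs=720):
    # D_delta(z | P_n) approximated over a grid of unit directions (d = 2)
    depth = 1.0
    for t in np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False):
        u = np.array([np.cos(t), np.sin(t)])
        y = (z - X) @ u   # worst-case probability of <u, X - z> >= 0, i.e. Y <= 0
        depth = min(depth, inner_value(y, delta))
    return depth
```

For points far from the convex hull this reproduces the value $\delta/d(z, H)$ of Proposition 8 whenever an optimal direction lies on the grid; otherwise the grid only gives an upper bound on the depth.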
Here our focus is on the median region $M_\delta$ defined in (14) for the empirical law $P_n$ and $\delta > 0$ such that $\alpha(\delta) = 1$. Define the class of sets

$$\mathcal{C} = \Big\{I \subseteq \{1, \dots, n\} : \{x_i : i \in I\} \text{ and } \{x_i : i \notin I\} \text{ can be separated by a hyperplane}\Big\}, \qquad (20)$$

which will be crucial for the following. Importantly, the cardinality of $\mathcal{C}$ is $O(n^d)$, which should be compared to the $2^n$ possible subsets.

Proposition 9. For $\delta > 0$ such that $\alpha(\delta) = 1$ we have

$$M_\delta = \bigcap_{I \in \mathcal{C},\, I \neq \emptyset} B_I = \bigcap_{I \neq \emptyset} B_I,$$

where $B_I$ is the ball centered at $\sum_{i \in I} x_i/|I|$ with radius $\delta n/|I|$. Moreover, for $\delta$ large enough $M_\delta$ is a ball of radius $\delta$ centered at $\sum_i x_i/n$.

We illustrate this result in Figure 9. Our proof relies on the following basic geometric optimization problem, concerning the maximal total Euclidean distance of given points to an arbitrary half-space. Such a result does not seem to appear in the standard books such as Boyd and Vandenberghe (2004), and so we provide a proof in Appendix A.

Figure 8: The sample median region $M_\delta$ for $n = 3$ and $x_1, x_2, x_3$ depicted by black dots.

Lemma 12. Consider non-zero vectors $v_i \in \mathbb{R}^d$ for $i = 1, \dots, n$. Then

$$\operatorname*{argmax}_{u \in \mathbb{R}^d: \|u\| = 1} \Big\{\sum_{i=1}^n \langle u, v_i\rangle^+\Big\} \qquad (21)$$

is non-empty and all of its elements have the form $\sum_{i \in I} v_i / \big\|\sum_{i \in I} v_i\big\|$, where the set $I \neq \emptyset$ satisfies

$$I = \Big\{j : \sum_{i \in I} \langle v_i, v_j\rangle > 0\Big\}. \qquad (22)$$

The optimal value is given by $\max_I \big\{\|\sum_{i \in I} v_i\|\big\} = \max_J \big\{\|\sum_{i \in J} v_i\|\big\}$, where $I$ satisfies the above property and $J$ runs over all possible subsets of $\{1, \dots, n\}$.

This result will be used with $v_i = x_i - z$, in which case the sets $I$ in (22) necessarily belong to $\mathcal{C}$. For example, the hyperplane passing through $z + \epsilon\nu$ with the normal $\nu = \sum_{i \in I} v_i$ is such for a sufficiently small $\epsilon > 0$. Importantly, the class $\mathcal{C}$ does not depend on the chosen $z$. We are now ready to give a proof of Proposition 9.

Proof. From Corollary 1 we see that $D_\delta(z\,|\,P_n) = 1$ holds iff $\sum_i \langle u, x_i - z\rangle^+ \le \delta n$ for all directions $u$ (we take the reverse sign here for convenience). Thus according to Lemma 12 the depth is 1 iff $\|\sum_{i \in I} v_i\| \le \delta n$ with $v_i = x_i - z$ for all $I \neq \emptyset$ satisfying (22), and then also for all non-empty $I$. This inequality can be rewritten as

$$\Big\|z - \sum_{i \in I} x_i/|I|\Big\| \le \delta n/|I|.$$

Recall that an $I$ satisfying (22) for some $z$ is necessarily such that $\{x_i : i \in I\}$ can be separated from the rest by a hyperplane, and the stated form of $M_\delta$ follows. Compare the radius $\delta$ of the ball corresponding to all the observations with the radius $\delta n/m$ of the ball corresponding to some $I$ of size $m < n$: the difference of such radii is $\delta(n/m - 1) \to \infty$ as $\delta \to \infty$ for any $m \in \{1, \dots, n-1\}$. Hence the intersection in the representation of $M_\delta$ will eventually become the first ball.

Let us discuss the description of $M_\delta$ in Proposition 9. In practice, we may find the list of pairs $(\sum_{i \in I} x_i, |I|)$ for all $I$ corresponding to the separable points in $O(n^d \log n)$ time. Then for each $z$ we can verify if it belongs to the median in $O(N)$ time, where $N = O(n^d)$ is the length of the above list. Some algorithms concerning the intersection of balls can be found in Aurenhammer (1988). For a fixed $z$ it is sufficient to separate the $v_i$ by a hyperplane passing through the origin, and the corresponding algorithm in dimension $d = 2$ is given in the following.

R and a numerical example

Here we assume that $n \ge 2$ and $v_i = x_i - z \in \mathbb{R}^2$ are given. We present an algorithm computing the maximal Wasserstein cost of shifting the empirical distribution into an arbitrary half-space having $z$ on its boundary. In other words, it finds the minimal $\delta > 0$ such that $D_\delta(z) = 1$.
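By Lemma 12 this minimal $\delta$ equals $\max_{\|u\|=1} \sum_i \langle u, v_i\rangle^+ / n$, which in $d = 2$ can be approximated by a plain direction grid before resorting to the faster angular sweep of Algorithm 2. The sketch below is ours (Python rather than the paper's R; `min_delta_full_depth` is a hypothetical name, and the grid is exact only when an optimal direction lies on it):

```python
import numpy as np

def min_delta_full_depth(z, X, n_dirs=3600):
    # minimal delta with D_delta(z) = 1, via Lemma 12:
    # n * delta_min(z) = max over unit u of sum_i <u, x_i - z>_+   (d = 2 grid)
    V = X - z
    best = 0.0
    for t in np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False):
        u = np.array([np.cos(t), np.sin(t)])
        best = max(best, float(np.maximum(V @ u, 0.0).sum()))
    return best / len(X)
```

Consistently with Proposition 9, for $z$ far from the data the returned value approaches $\|z - \bar x\|$, the radius at which the median region becomes a single ball around the sample mean.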
We note that the complexity of Algorithm 2 is $O(n \log n)$, which comes from sorting the points $v_i$ according to their angles. We say that index $n$ is followed by index 1. The basic observation behind the algorithm is that the $2n$ proposed subsets necessarily contain all subsets $I$ such that $\{v_i, i \in I\}$ can be separated from the rest by a line passing through the origin; additionally, there are some non-separable subsets when some points $v_i$ lie on the same line passing through the origin. These, in turn, include all the subsets $I$ satisfying the property (22), as mentioned above. In order to understand the number of the latter subsets we perform the following experiment: generate 100 points from a bivariate normal with correlation a) 0 and b) 0.7, calculate the number of such sets, and replicate 1000 times. We find that the average number of sets $I$ satisfying (22) is approximately 37 in a) and 9 in b), out of $2n = 200$.

Next, we use Algorithm 2 to construct the median regions for various $\delta$ and a certain empirical distribution, see Figure 9. We discretize a certain central region and compute the minimal $\delta$ at every grid point.

Algorithm 2: Compute the minimal $\delta$ resulting in $D_\delta(z) = 1$ (assumption: $d = 2$). Sort the points $v_i$ by angle; initialize by finding the largest $j$ with $\det[v_1, v_j] \ge 0$ and forming the running sum $v = v_1 + \cdots + v_j$ of the points on the positive side of the corresponding line through the origin; then sweep over the $2n$ candidate half-planes, updating the running sum as points enter and leave, and return the maximal norm of this sum divided by $n$.
where we have used $1 - P^*(\langle u, X - z\rangle \ge 0) = P^*(\langle -u, X - z\rangle > 0)$ and then swapped $-u$ for $u$.

Suppose that some law $P_0$ is given together with the bound $d_W(P_0, P) \le \delta$ for a known $\delta > 0$, where $P$ is unknown. We would like to determine the points $z$ such that for any direction $u$ it is possible that $P(\langle u, X - z\rangle > 0) = 0$. That is, we cannot choose a direction guaranteeing to see some mass. Given the available information, such points correspond to $D_\delta(z\,|\,P_0) = 1$, i.e., to the median region of $P_0$ for the ambiguity radius $\delta$.
Figure 9: Minimal $\delta$'s delivering (maximal) depth 1, depicted in color (darker color means higher values). The three depicted contours correspond (from inside to outside) to increasing values of $\delta$ between 1 and 2. Left: 100 bivariate observations drawn from a standard normal distribution. Right: 100 bivariate observations drawn from a normal distribution with correlation 0.5. The filled red dot is the median for the smallest $\delta$ resulting in depth 1, namely $\delta = 0.398$ (left) and $\delta = 0.497$ (right).
Here we illustrate the orderings provided by the robust Tukey depth and by the traditional Tukey depth in the case of a relatively small sample. We use an elliptical Cauchy distribution in dimensions $d = 2$ and $d = 10$, where the scatter matrix is given by $(2^{-|i-j|})_{i,j=1}^d$. The population depth is used to define a reference ordering, which in this case coincides with the density-based ordering (Liu and Singh, 1993), and so it is found using the Mahalanobis distance from the center.

Firstly, we focus on the probability that a pair of points (sampled independently from the underlying Cauchy distribution) is correctly ordered by (a) the empirical robust depth with a fixed small $\delta$, and (b) the empirical Tukey depth. For every sample size $n$ we generate 1999 empirical laws and one pair of points with the required conditional law. As we can observe from Figure 10, the robust Tukey depth not only provides a ranking very similar to the traditional one, but also allows for a rather consistent ordering of the Tukey depth's ties, which is otherwise not available.

Figure 10: Conditional probability of the correct ordering of a pair of points stemming from an elliptical Cauchy distribution for dimension $d = 2$ (left) and $d = 10$ (right). The three lines correspond to (a) the robust Tukey depth (dotted) and (b) the Tukey depth (dashed), in the case where the latter is not tied, and (c) the robust Tukey depth where the Tukey depth is tied (solid). The rhombuses show the probability of drawing a pair tied by the empirical Tukey depth.

All the above plots are generated using Algorithm 1 for 1000 equally spaced directions. In the following we show that the robustified depth also allows for gradient-based optimization techniques, unlike the traditional depth.

Choose some smooth parameterization of the angles $u = u(\theta)$, where $\theta$ is $(d-1)$-dimensional. Then

$$\partial y_j/\partial\theta_i = \sum_k (z_k - x_{jk})\, \partial u_k/\partial\theta_i,$$

at least for $\theta$ not on the boundary of the domain. Consider the representation of the value $v(u; \delta)$ in Corollary 1 and assume that $u$ is such that

$$y_{(1)} \neq 0, \qquad y_{(k-1)} < y_{(k)} < y_{(k+1)}, \qquad s_k \neq \delta n, \qquad (23)$$

where $1 \le k \le m$ and $y_{(0)} = 0$, $y_{(m+1)} = \infty$. By continuity of the $y_i$ in the angle $u$ it must be that $p, k$ are constant for small perturbations of $u$, and also that $y_{(1)}, \dots, y_{(k)}$ correspond to the same $x_i$ up to the order of the first $k - 1$. Hence we readily find that

$$\frac{\partial v(u;\delta)}{\partial\theta_i} = -\frac{1}{y_{(k)}}\bigg(\frac{1}{n}\sum_{j=1}^{k-1} \partial y_{(j)}/\partial\theta_i + \Big(v(u;\delta) - \frac{n - m + k - 1}{n}\Big)\, \partial y_{(k)}/\partial\theta_i\bigg), \qquad (24)$$

where $\partial y_{(j)}/\partial\theta_i$ is given above for the specific index. It is noted that the above assumption is satisfied for Lebesgue-almost all directions in the $(d-1)$-dimensional parameter domain. For $d = 2$ we take $u = (\cos\theta, \sin\theta)$, $\theta \in [0, 2\pi]$, and write $v(\theta) = v(u(\theta); \delta)$. An illustration of the value function and its derivative (24) is given in Figure 11 in the case of a bivariate standard normal sample with $n = 100$ and $\delta = 0.1$. The minimum is achieved near $\theta = 1$.

Figure 11: The value function $v(\theta)$ and its derivative $v'(\theta)$ for $d = 2$, $\delta = 0.1$, $n = 100$, calculated at 1000 equally spaced $\theta \in [0, 2\pi]$.

Acknowledgments
J. Ivanovs gratefully acknowledges financial support of Sapere Aude Starting Grant8049-00021B “Distributional Robustness in Assessment of Extreme Risk”.
A Remaining proofs
Proof of Lemma 1.
Observe from (4) that $Y$ and $Y^+$ yield the same value. Hence we may focus on non-negative random variables and assume that $Y_1 \prec Y_2$. The case $\delta = 0$ follows immediately from the definition of stochastic ordering, and so we may assume $\delta > 0$. Our proof relies on the direct use of couplings instead of the representations in Proposition 1, which are difficult to handle in this context. Consider $Y_2$ and the corresponding $Y_2^* \in \mathbb{R}$ attaining the optimal value $v_2$ on a possibly extended probability space $(\Omega, \mathcal{F}, P)$ allowing for a further independent uniform random variable, see (7). Stochastic ordering implies by a standard argument that there is a random variable $Y_1 \le Y_2$ on this latter probability space with the given law. Furthermore, define $Y_1^* = Y_2^* - Y_2 + Y_1$ and note that $E|Y_1^* - Y_1| = E|Y_2^* - Y_2| \le \delta$. Finally, we observe that

$$v_2 = P(Y_2^* \le 0) \le P(Y_2^* \le Y_2 - Y_1) = P(Y_1^* \le 0) \le v_1,$$

since the law of $Y_1^*$ belongs to the ambiguity ball around the law of $Y_1$.

Proof of Lemma 2.
The result is obvious for $\delta = 0$ and so we assume $\delta > 0$. Choose any $c_1, c_2 \ge 0$ with $pc_1 + (1-p)c_2 \le 1$. Consider $(Y_1^*, Y_1)$ and $(Y_2^*, Y_2)$ on the same probability space, where the first pair represents the optimal transport plan for $Y_1$ with ambiguity radius $\delta c_1$, and the second for $Y_2$ with radius $\delta c_2$, see (7). If $c_i = 0$ we take $Y_i^* = Y_i$. Define $(Y^*, Y)$ as their mixture with probabilities $p$ and $1-p$, respectively. Note that

$$E|Y^* - Y| = p\,E|Y_1^* - Y_1| + (1-p)\,E|Y_2^* - Y_2| \le p\delta c_1 + (1-p)\delta c_2 \le \delta,$$

and also

$$v \ge P(Y^* \le 0) = p\,P(Y_1^* \le 0) + (1-p)\,P(Y_2^* \le 0) = p\,v_1(\delta c_1) + (1-p)\,v_2(\delta c_2).$$

Thus we have established the lower bound on $v$ in terms of the supremum over the legal $c_i$.

Now consider a probability space and random variables $Y$ and $I$, where $I$ is 1 with probability $p$ and 2 with probability $1-p$, and $Y$ given $I = i$ has the law of $Y_i$. We may extend this probability space so that there is a random variable $Y^*$ with $(Y^*, Y)$ being the optimal coupling. Observe that

$$v = P(Y^* \le 0) = p\,P(Y^* \le 0 \mid I = 1) + (1-p)\,P(Y^* \le 0 \mid I = 2).$$

Letting $c_i = E(|Y^* - Y| \mid I = i)/\delta$ we find that necessarily $pc_1 + (1-p)c_2 \le 1$. Note that $P(Y^* \le 0 \mid I = i) \le v_i(\delta c_i)$, because the law of $Y^* \mid I = i$ is in the ambiguity ball centered at $Y_i$ with radius $c_i\delta$. Hence $v \le p\,v_1(\delta c_1) + (1-p)\,v_2(\delta c_2)$ for the given above $c_1, c_2$.

Proof of Lemma 3.
Assume (i). Then for any small ε ∈ (0, 1/2) we have P(Yₜ < c) ≤ ε for large enough t. Regard Yₜ as the mixture corresponding to the events Yₜ < c and Yₜ ≥ c. The sandwich bounds in Lemma 2 imply that

vₜ ≤ ε + ṽₜ(2δ)

for large enough t, where ṽₜ corresponds to the second mixing component. According to Lemma 1 the latter must be smaller than the value corresponding to the deterministic variable at c. But that value diminishes to 0 as c → ∞ according to (9). For (ii) we take the mixture corresponding to Yₜ ≤ 0 and Yₜ > 0, and use the lower bound with regard to the mixture; the statement follows readily from the sandwich bounds.
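The sandwich bounds above can be checked numerically in the one-dimensional setting. The sketch below is not from the paper: the helper name worst_case_prob and the specific samples are illustrative assumptions. It computes v(δ) = sup{Q(Y ≤ 0) : W₁(Q, Pₙ) ≤ δ} for an empirical law via the dual representation v = inf_{λ′ > 0} {δ/λ′ + E(1 − Y⁺/λ′)⁺} used in the proof of Lemma 5, rewritten with λ = 1/λ′; for a finite sample the objective is convex piecewise linear in λ, so the infimum is attained at λ = 0 or at a kink λ = 1/yᵢ with yᵢ > 0.

```python
# Sanity check of the mixture (sandwich) bounds of Lemma 2 in dimension one.
# Illustrative sketch, not the authors' code: worst_case_prob is a hypothetical
# helper based on the dual representation from the proof of Lemma 5,
#   v(delta) = inf_{lam > 0} { lam*delta + E(1 - lam*Y^+)^+ },
# whose objective is piecewise linear in lam for an empirical sample.

def worst_case_prob(sample, delta):
    """v(delta) = sup { Q(Y <= 0) : W1(Q, empirical law of sample) <= delta }."""
    n = len(sample)

    def g(lam):
        # dual objective: lam*delta + mean of (1 - lam * y_i^+)^+
        return lam * delta + sum(max(1.0 - lam * max(y, 0.0), 0.0) for y in sample) / n

    # the convex piecewise-linear objective attains its minimum at lam = 0
    # or at a kink lam = 1/y_i for a strictly positive sample point y_i
    kinks = [1.0 / y for y in sample if y > 0]
    return min(g(lam) for lam in [0.0] + kinks)


# Lower sandwich bound for an equal-weight mixture (p = 1/2):
# p*v1(delta*c1) + (1-p)*v2(delta*c2) <= v(delta) whenever p*c1 + (1-p)*c2 <= 1.
y1, y2, delta = [1.0, -1.0], [2.0, -0.5], 0.3
v_mix = worst_case_prob(y1 + y2, delta)
for c1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    c2 = 2.0 - c1  # so that c1/2 + c2/2 = 1
    assert (0.5 * worst_case_prob(y1, delta * c1)
            + 0.5 * worst_case_prob(y2, delta * c2)) <= v_mix + 1e-9
```

For instance, worst_case_prob([1.0, -1.0], 0.25) returns 0.75: on top of the empirical value 1/2, the transport budget δ = 0.25 moves probability mass 0.25 from the atom at 1 to the origin.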
Proof of Lemma 4.
Assume v₁ = v₂ = v < 1, and let λᵢ = hᵢ⁻(δ) as specified in Proposition 1. Observe that λ₁ < ∞ and

P(Y₁ < λ₁) ≤ v ≤ P(Y₁ + η ≤ λ₂).

Necessarily λ₂ ≥ λ₁ + η, since otherwise Y₁ has no mass at some small interval (λ₁ − ε, λ₁) and we readily get a contradiction to the definition of λ₁. Now we write

h₂(λ₂−) = E(Y₁ 1{Y₁ ∈ (0, λ₂ − η)}) + η P(Y₁ ∈ (0, λ₂ − η)) + E(Y₂ 1{Y₂ ∈ (0, η]}) ≥ h₁(λ₂ − η −). (25)

For λ₂ > λ₁ + η we get h₁(λ₂ − η −) > h₁(λ₁) ≥ δ. The latter contradicts the choice of λ₁ and so we may assume λ₂ = λ₁ + η. According to (5) we thus must have

δ − h₂(λ₂−) = (λ₂/λ₁)(δ − h₁(λ₁−)) ≥ 0.

Hence either h₁(λ₁−) = h₂(λ₂−) = δ or h₂(λ₂−) < h₁(λ₁−), both contradicting (25). The former implies P(Y₁ ∈ (0, λ₁)) = 0 and so δ = h₁(λ₁−) = 0.

Proof of Lemma 5.
We use the representation

v = inf_{λ′ > 0} { δ/λ′ + E(1 − Y⁺/λ′)⁺ }.

Note that the expression under the infimum is jointly continuous in Y, δ > 0, λ′ > 0, where convergence of Y is understood in the weak sense. This readily follows by the dominated convergence theorem. Thus we get the required continuity for the modified problem where λ′ > 0 is replaced by λ′ ∈ [a, b].

Assume that EY⁺ > δ so that the optimal λ = h⁻(δ) ∈ (0, ∞). The optimal λₙ corresponding to Yₙ and δₙ may or may not converge to λ. Importantly, we have

h⁻(δ) ≤ lim inf λₙ ≤ lim sup λₙ ≤ h⁻(δ+),

see (Resnick, 2008, Prop. 0.1). Hence for large enough n we may restrict λ′ to a compact interval while preserving the infimum.

Now assume that EY⁺ ≤ δ and so v = 1. We need to prove that vₙ(δₙ) → 1. For any ε > 0 there exists δ′ ∈ (0, δ) such that 1 − ε < v(δ′) < 1 and EY⁺ > δ′, which follows from the left-continuity of h⁻ and the right expression in (4). But now we can apply the above proven fact to deduce that vₙ(δ′) → v(δ′). Eventually δₙ > δ′ and so vₙ(δₙ) ≥ vₙ(δ′). Hence lim inf vₙ(δₙ) ≥ 1 − ε and the proof is complete.

Proof of Lemma 12.
Note that we optimize over a compact set and that the objective function is continuous in u. Thus the argmax is non-empty.

We employ induction on n. For n = 1 the result is obvious with I = {1} and u = v₁/‖v₁‖. Suppose that our result is proven for some n ≥ 1, and consider n + 1 non-zero vectors vᵢ. Let u* be any of the optimal directions. If ⟨u*, vₖ⟩ ≤ 0 for some k ≤ n + 1 then

Σ_{i ≤ n+1} ⟨u*, vᵢ⟩⁺ = Σ_{i ≤ n+1, i ≠ k} ⟨u*, vᵢ⟩⁺ ≤ Σ_{i ≤ n+1, i ≠ k} ⟨uₖ*, vᵢ⟩⁺ ≤ Σ_{i ≤ n+1} ⟨uₖ*, vᵢ⟩⁺,

where uₖ* is any of the optimal directions for the problem with vₖ excluded. By maximality of the left hand side we must have equalities. Thus u* is an optimal direction for the problem with vₖ excluded, and by the inductive assumption u* has the stated form with I ⊂ {1, …, n+1}\{k}. It is left to recall that ⟨u*, vₖ⟩ ≤ 0, and so the form is as stated.

Finally, we assume that there is an optimal direction

u* ∈ S₊ = {u ∈ Rᵈ : ⟨u, vᵢ⟩ > 0 ∀ i ≤ n + 1}.

One readily checks that the closure of S₊ is a convex set with a non-empty interior by assumption. This leads to the convex optimization problem

sup_{u ∈ cl S₊, ‖u‖ ≤ 1} Σ_{i=1}^{n+1} ⟨u, vᵢ⟩

with a linear objective function, where the constraint ‖u‖ ≤ 1 replaces ‖u‖ = 1. Clearly there is a feasible solution u in the interior of cl S₊ which has ‖u‖ < 1, and hence the optimal u maximizes the Lagrangian

Σ_{i=1}^{n+1} ⟨u, vᵢ⟩ − λ(u⊤u − 1), u ∈ cl S₊,

for some λ ≥ 0. In the case λ > 0 we get u = Σ_{i ≤ n+1} vᵢ/(2λ), because the boundary of cl S₊ does not have common points with S₊. This solution must be in S₊ and so ⟨u, vᵢ⟩ > 0 for all i, yielding the stated form with I = {1, …, n + 1}. The case λ = 0 results in Σ vᵢ = 0, but then the optimal value is 0, contradicting our assumption on u*.

Considering the optimal value, we observe for I satisfying (22) that

Σ_{j ≤ n} ⟨Σ_{i∈I} vᵢ/c, vⱼ⟩⁺ = Σ_{j∈I} ⟨Σ_{i∈I} vᵢ, vⱼ⟩/c = ‖Σ_{i∈I} vᵢ‖,

where c = ‖Σ_{i∈I} vᵢ‖. Now take an arbitrary subset J and let u = Σ_{i∈J} vᵢ/c with c such that ‖u‖ = 1. Note that

Σⱼ ⟨u, vⱼ⟩⁺ ≥ Σ_{j∈J} ⟨u, vⱼ⟩ = ‖Σ_{i∈J} vᵢ‖,

because the second sum ignores some positive contribution and adds up some negative contribution. But the left hand side is upper bounded by the optimal value, and thus adding the additional terms ‖Σ_{i∈J} vᵢ‖ to the maximum does not change the result.

References

Aurenhammer, F. (1988). Improved algorithms for discs and balls using power diagrams.
Journal of Algorithms 9(2), 151–161.

Bingham, N. H., C. M. Goldie, and J. L. Teugels (1989). Regular Variation, Volume 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge.

Blanchet, J., Y. Kang, and K. Murthy (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability 56(3), 830–857.

Blanchet, J. and K. Murthy (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44(2), 565–600.

Bousquet, O. and A. Elisseeff (2002). Stability and generalization. Journal of Machine Learning Research 2, 499–526.

Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press, Cambridge.

Chernozhukov, V., A. Galichon, M. Hallin, and M. Henry (2017). Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics 45(1), 223–256.

Cuturi, M. and A. Doucet (2014). Fast computation of Wasserstein barycenters. In E. P. Xing and T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, Volume 32 of Proceedings of Machine Learning Research, Beijing, China, pp. 685–693. PMLR.

Donoho, D. L. and M. Gasko (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics 20(4), 1803–1827.

Einmahl, J. H. J., J. Li, and R. Y. Liu (2015). Bridging centrality and extremity: Refining empirical data depth using extreme value statistics. The Annals of Statistics 43(6), 2738–2765.

Esfahani, P. M. and D. Kuhn (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming 171(1–2), 115–166.

Fournier, N. and A. Guillin (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162(3–4), 707–738.

Ghaoui, L. E. and H. Lebret (1997). Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications 18(4), 1035–1064.

Hlubinka, D., L. Kotík, and O. Vencálek (2010). Weighted halfspace depth. Kybernetika 46(1), 125–148.

Liu, R. Y. and K. Singh (1993). A quality index based on data depth and multivariate rank tests. Journal of the American Statistical Association 88(421), 252–260.

Nagy, S. and J. Dvořák (2020). Illumination depth. Journal of Computational and Graphical Statistics (just-accepted).

Nagy, S., C. Schütt, and E. M. Werner (2019). Halfspace depth and floating body. Statistics Surveys 13, 52–118.

Pele, O. and M. Werman (2009). Fast and robust earth mover's distances. In , pp. 460–467. IEEE.

Pflug, G. C. and M. Pohl (2017). A review on ambiguity in stochastic portfolio optimization. Set-Valued and Variational Analysis.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer Series in Statistics. Springer-Verlag, New York.

Rachev, S. T. and L. Rüschendorf (1998). Mass Transportation Problems. Vol. I. Probability and its Applications. Springer-Verlag, New York.

Resnick, S. I. (2008). Extreme Values, Regular Variation and Point Processes. Springer Series in Operations Research and Financial Engineering. Springer, New York. Reprint of the 1987 original.

Rockafellar, R. T. (1970). Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J.

Rubner, Y., C. Tomasi, and L. J. Guibas (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121.

Scarf, H. (1958). A min max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production.

Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, Volume 2, pp. 523–531.

Wozabal, D. (2012). A framework for optimization under ambiguity. Annals of Operations Research 193(1), 21–47.

Wozabal, D. (2014). Robustifying convex risk measures for linear portfolios: A nonparametric approach. Operations Research 62(6), 1302–1315.

Xu, H., C. Caramanis, and S. Mannor (2009). Robustness and regularization of support vector machines. Journal of Machine Learning Research 10, 1485–1510.

Zuo, Y. and R. Serfling (2000). General notions of statistical depth function.