Finite sample breakdown point of multivariate regression depth median
Yijun Zuo
Department of Statistics and Probability, Michigan State University
East Lansing, MI 48824, [email protected]
September 3, 2020

Abstract
Depth induced multivariate medians (multi-dimensional maximum depth estimators) in regression serve as robust alternatives to the traditional least squares and least absolute deviations estimators. The induced median (β∗_RD) from the regression depth (RD) of Rousseeuw and Hubert (1999) (RH99) is one of the most prevailing estimators in regression.

The maximum regression depth median possesses outstanding robustness similar to that of its univariate location counterpart. Indeed, the β∗_RD can, asymptotically, resist up to 33% contamination without breakdown, in contrast to the 0% for the traditional estimators (see Van Aelst and Rousseeuw, 2000) (VAR00). The results from VAR00 are pioneering and innovative, yet they are limited to regression symmetric populations and the ε-contamination and maximum bias model.

In finite fixed sample size practice, the most prevailing measure of robustness for estimators is the finite sample breakdown point (FSBP) (Donoho (1982), Donoho and Huber (1983)). A lower bound of the FSBP for the β∗_RD was given in RH99 (in a corollary of a conjecture).

This article establishes a sharper lower bound and an upper bound of the FSBP for the β∗_RD, revealing an intrinsic connection between the regression depth of the β∗_RD and its FSBP, justifying the employment of the β∗_RD as a robust alternative to the traditional estimators, and demonstrating the necessity and the merit of using the FSBP in finite sample real practice instead of an asymptotic breakdown value.

AMS 2000 Classification:
Primary 62G35; Secondary 62G08, 62F35.
Key words and phrases: finite sample breakdown point, regression median, regression depth, maximum depth estimator, robustness.
Running title:
Finite sample breakdown point of regression depth median.
Introduction
The notion of depth in regression was introduced and investigated two decades ago. Regression depth (RD) of Rousseeuw and Hubert (1999) (RH99) is the most popular example in the literature.

One of the primary advantages of a depth notion in regression is that it can be utilized to directly introduce a median-type maximum depth estimator (aka depth median), which can serve as a robust alternative to the traditional least squares and least absolute deviations estimators.

Robustness of the regression depth induced median (β∗_RD) was investigated two decades ago by Van Aelst and Rousseeuw (2000) (VAR00). It turns out that the regression median can, asymptotically, resist up to 33% contamination without breakdown, in contrast to the 0% for the traditional estimators.

The results in VAR00 were established for population (regression) symmetric distributions, under the ε-contamination model and the maximum bias framework in the asymptotic sense, and are therefore not directly applicable to realistic fixed finite sample size practice.

In the finite sample scenario, the most prevailing robustness measure is the finite sample breakdown point (FSBP), introduced by Donoho and Huber (1983) and popularized and promoted by Rousseeuw (1984), Rousseeuw and Leroy (1987) (RL87) and Donoho and Gasko (1992), among others.

The FSBP of the β∗_RD was briefly addressed in RH99, and one lower bound (LB) of the FSBP was given in the Corollary of Conjecture 1 (CC1). Theorem 8 (TH8) of RH99 listed, albeit with no published proof available, the limiting value of the FSBP, 1/3, for data from regression symmetric populations with a strictly positive density.

The LB of the FSBP in CC1 of RH99 depends on a generic LB of the maximum RD value in part (a) of Conjecture 1 of RH99. It is approximately 1/(p + 1) for a large n, and it can never approach the limiting value 1/3 listed in TH8 of RH99 and in the literature as the asymptotic breakdown point of the β∗_RD. This implies that the LB is not sharp. Furthermore, RH99 ingeniously proved part (a) of Conjecture 1 for just the p = 2 case; this article presents a complete proof for any p ≥ 2. The exact FSBP of the β∗_RD has remained open for the last two decades. This article tries to fill that gap, complementing the complete robustness spectrum of the β∗_RD.

A sharper LB given in this article, which utilizes the maximum RD value, reveals an intrinsic connection between the maximum depth and the FSBP for the β∗_RD: the higher the depth value of the β∗_RD, the more robust the β∗_RD is. The LB can have the limiting value 1/3. The upper bound of the FSBP for the β∗_RD given in this article indicates that in finite sample cases the β∗_RD can actually resist much less than the asymptotic result of 33% (1/3) contamination. These results justify (i) the employment of the β∗_RD as an alternative to the traditional estimators and (ii) the necessity and merits of investigation of the finite sample breakdown point of the β∗_RD.

Throughout, we are concerned with the FSBP of the multi-dimensional maximum depth estimator β∗_RD for the multivariate candidate regression parameter β = (β1, β2′)′ ∈ R^p (p ≥ 2) in the model

y = (1, x′)β + e,    (1)

where ′ denotes the transpose of a vector, the random vector x = (x1, ···, x_{p−1})′ is in R^{p−1}, and the random variables y and e are in R. The β1 is the intercept term in the model (1).

Section 2 briefly reviews the history behind the notion of the breakdown point and introduces (i) two versions of the FSBP and (ii) the notion of regression depth and the depth induced median. Section 3 establishes a general upper bound of the FSBP for any regression equivariant (see the definition in Section 3) estimator and the lower and upper bounds of the FSBP for the regression depth median β∗_RD; furthermore, it rigorously certifies the truth of part (a) of Conjecture 1 in RH99. The article ends with some concluding remarks supported by substantial empirical evidence in Section 4.
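As a tiny numeric illustration of the 0% asymptotic breakdown of the traditional estimators mentioned above (my own sketch, not code from the paper): a single corrupted response can move the least squares fit arbitrarily far.

```python
# Simple least squares fit of y = b0 + b1*x (stdlib only), to show that
# one gross outlier in the responses ruins the fit.
def lsq(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1          # (intercept, slope)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = list(xs)                        # data exactly on y = x
print(lsq(xs, ys))                   # (0.0, 1.0)

ys[-1] = 1e9                         # corrupt one single response
b0, b1 = lsq(xs, ys)
print(b1)                            # slope blown up by the one outlier
```

Sending the corrupted response to infinity sends the slope to infinity as well, which is exactly the 1/n (hence asymptotically 0%) breakdown of least squares.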
The notion of the breakdown point first appeared in Hodges (1967) and later was generalized by Hampel (1968, 1971). The finite sample versions of the breakdown point, including the addition breakdown point (ABP) and the replacement breakdown point (RBP), were introduced in Donoho and Huber (1983) (DH83). They have become the most prevalent quantitative assessments of the global robustness of estimators, complementing the assessment of (i) the local robustness of estimators captured by the influence function approach (see Hampel, et al. (1986)) and (ii) the global robustness of estimators assessed by the asymptotic breakdown point via the maximum bias approach (see Hampel, et al. (1986) and Huber (1981)).

Stimulating and intriguing discussions on the notion of the breakdown point include Donoho (1982), Rousseeuw (1984), Rousseeuw and Leroy (1987), Lopuhaä and Rousseeuw (1991), Maronna and Yohai (1991), Donoho and Gasko (1992), Tyler (1994), Müller (1995), Ghosh and Sengupta (1999), Davies (1987, 1990, 1993), Davies and Gather (2005), Maronna, et al. (2006), and Liu, et al. (2017), among others.

Some authors favor the ABP in discussions of the robustness property of estimators, whereas others prefer the RBP, which they believe is simpler, more realistic and generally more applicable. Zuo (2001) presented some quantitative relationships between the two versions of the finite sample breakdown point, rendering the arguments on the preference (precedence) between the two versions void. Nevertheless, for a given estimator, sometimes one version can be more convenient for the derivation of a desired result.

We now present the definitions of the two versions of the FSBP. Let X^n = {X_1, ..., X_n} be a sample of size n in R^p (in the regression setting, X^n is the Z^n = {(x′_i, y_i), i = 1, ···, n} in R^p). A location estimator T in R^p is called translation equivariant if T(X^n + b) = T(X^n) + b for any b ∈ R^p, where X^n + b = {X_1 + b, ···, X_n + b}. Translation equivariance is a desirable property for any reasonable location estimator.

Definition 2.1 [DH83] The finite sample addition breakdown point of a translation equivariant estimator T at X^n in R^p is defined as

ABP(T, X^n) = min{ m/(n + m) : sup_{Y^m} ‖T(X^n ∪ Y^m) − T(X^n)‖ = ∞ },

where Y^m denotes a data set of size m with arbitrary values and X^n ∪ Y^m denotes the contaminated sample obtained by adjoining Y^m to X^n.

It is apparent that the ABP of the univariate sample median is 1/2, whereas that of the mean is 1/(n + 1).

Definition 2.2 [DH83] The finite sample replacement breakdown point of a translation equivariant estimator T at X^n in R^p is defined as

RBP(T, X^n) = min{ m/n : sup_{X^n_m} ‖T(X^n_m) − T(X^n)‖ = ∞ },

where X^n_m denotes the corrupted sample obtained from X^n by replacing m points of X^n with arbitrary values.

It is apparent that the RBP of the univariate sample median is ⌊(n + 1)/2⌋/n, whereas that of the mean is 1/n, where ⌊·⌋ is the floor function.

In other words, the ABP and RBP of an estimator are, respectively, the minimum addition fraction and the minimum replacement fraction of contamination which could drive the estimator beyond any bound.

Definition 2.3
For any β ∈ R^p and joint distribution P of (x′, y) in (1), RH99 defined the regression depth of β, denoted by RD(β; P), to be the minimum probability mass that needs to be passed when tilting (the hyperplane induced from) β in any way until it is vertical.

The definition of RD(β; P) above is rather abstract and not easy to comprehend. Some characterizations, or equivalent definitions, were given in the literature, e.g. in Remark 5.1 of Zuo (2018) and in Lemma 2.1 of Zuo (2020); also see (3) below.

The maximum regression depth estimating functional T∗_RD (also denoted by β∗_RD) is defined as

T∗_RD(P) = argmax_{β ∈ R^p} RD(β; P).    (2)

If there are several βs that attain the maximum depth value on the right hand side (RHS) of (2), then the average of all those βs is taken.

We obtain the sample versions of RD(β; P) and T∗_RD(P) for a given sample Z^n = {(x′_i, y_i), i = 1, ···, n} in R^p, or equivalently a given P_n, the empirical distribution based on Z^n, by replacing P with P_n. (In the empirical case, the RD discussed originally in RH99, divided by n, is identical to Definition 2.3 above.)

The notions of breakdown point and regression depth seem unrelated and to have nothing to do with each other. But in the next section, it is shown that they are actually closely connected through T∗_RD(P_n).

For a given β = (β1, β2′)′ ∈ R^p, denote by H_β the unique hyperplane determined by y = (1, x′)β. Denote the angle between the hyperplane H_β and the horizontal hyperplane H_h (determined by y = 0) by θ_β (hereafter consider only the acute one). That is, θ_β is the angle between the normal vector (−β2′, 1)′ and the normal vector (0′, 1)′ in the (x′, y)′-space. Hence,

cos(θ_β) = 1/√(‖β2‖² + 1).

Therefore, it is not hard to see that |tan(θ_β)| = ‖β2‖.

Figure 1: A two-dimensional vertical cross-section of a figure in R^p. There are two ways to tilt H_β to a vertical position H_v (which does not necessarily contain the origin in the definition; it does in this figure) along the hyperline l_v(β) (which passes (0, 2) in the figure). One way is crossing the two wedges each with an acute angle (the shaded double wedge); the other way is passing through the other two wedges each with an obtuse angle (the unshaded double wedge). That is, tilting H_β to H_v counter-clockwise or clockwise.

Tilting β to a vertical position (some vertical hyperplane H_v) in Definition 2.3 means tilting H_β along a hyperline l_v(β), which is the intersection hyperline of H_β with the H_v.

Let min fr(l_v(β), P_n) be the minimum of the two fractions of data points touched by tilting H_β, in the definition of RD, to a vertical position H_v along l_v(β) in the two ways (one way is by crossing the double wedge formed by the two single wedges with an acute angle between H_β and H_v (the two shaded regions in Figure 1), and the other way is by passing through the double wedge formed by the two single wedges with an obtuse angle between H_β and H_v (the other two regions in Figure 1)).

In other words, within a two-dimensional plane that is perpendicular to the horizontal hyperplane H_h (a vertical cross-section), one tilts H_β in either a clockwise or a counter-clockwise manner. Then it is readily apparent that we have

Proposition 3.1
For a given data set Z^n, the regression depth of β defined in Definition 2.3 can be characterized as

RD(β; P_n) = inf_{l_v(β)} min fr(l_v(β), P_n),    (3)

where the infimum is taken over all possible l_v(β)s, or equivalently all possible H_vs.

Proof: "The minimum probability mass that needs to be passed when tilting (the hyperplane induced from) β in any way until it is vertical" in Definition 2.3 means that (i) H_v can be arbitrary, which is equivalent to the arbitrariness of the intersection hyperline l_v(β) of the H_β with the H_v; (ii) tilting H_β along the hyperline l_v(β) to the vertical position H_v can be done in two ways (clockwise or counter-clockwise, see Figure 1); (iii) one takes the minimum of the two fractions of data points touched by tilting H_β in the two ways; and (iv) one calculates the overall minimum of the empirical probability mass w.r.t. all possible H_vs. In light of the discussions before the Proposition, all of (i), (ii), (iii) and (iv) are exactly captured in the RHS of (3). □

For a given sample Z^n in R^p (Z^n and P_n are used interchangeably hereafter), write

k∗(P_n) = max_{β ∈ R^p} n RD(β; P_n).    (4)

Remarks 3.1
(1) The RHS of (3) actually implicitly involves all possible H_vs and the fixed H_β.
(2) n RD(β; P_n) in (4) is, in light of (3), the least number of data points touched by tilting H_β to a vertical position. The k∗(P_n) then is the maximum (w.r.t. βs) of the least number of data points touched by tilting an H_β to a vertical position. That is, it is the least number of data points touched by tilting H_{β_m} in any way to a vertical position, where β_m attains the maximum RD (i.e. β_m is an RD maximizer).
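To make the characterization (3) concrete in the simple regression case p = 2, the following brute-force sketch (my own illustration, not code from the paper) computes the sample depth count n RD(β; P_n) of a candidate line y = a + bx: for each candidate vertical line l_v (taken, as in the discussion around (3), to contain no data points), it counts the points touched by the two ways of tilting to vertical and keeps the overall minimum; points with zero residual are counted in both directions.

```python
def rdepth(a, b, pts):
    """Depth count (i.e. n*RD) of the line y = a + b*x for data pts in R^2."""
    resid = [(x, y - a - b * x) for x, y in pts]
    xs = sorted({x for x, _ in resid})
    # candidate vertical tilting lines l_v: outside the data range and
    # between consecutive distinct x-values
    cands = [xs[0] - 1.0, xs[-1] + 1.0]
    cands += [(u + w) / 2.0 for u, w in zip(xs, xs[1:])]
    best = len(pts)
    for v in cands:
        lp = sum(1 for x, r in resid if x < v and r >= 0)  # left, on/above line
        lm = sum(1 for x, r in resid if x < v and r <= 0)  # left, on/below line
        rp = sum(1 for x, r in resid if x > v and r >= 0)  # right, on/above
        rm = sum(1 for x, r in resid if x > v and r <= 0)  # right, on/below
        best = min(best, lp + rm, lm + rp)  # the two tilting directions
    return best

pts = [(0, 0), (1, 1), (2, 2)]
print(rdepth(0.0, 1.0, pts))  # line through all three points: depth 3
print(rdepth(0.0, 0.0, pts))  # line y = 0: depth 1
```

The two sums `lp + rm` and `lm + rp` are exactly the counts touched by the counter-clockwise and clockwise tilts about l_v in Figure 1.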
□

A regression estimator T is called regression equivariant (page 116 of RL87) if

T({(x′_i, y_i + (1, x′_i)b), i = 1, ···, n}) = T({(x′_i, y_i), i = 1, ···, n}) + b,  ∀ b ∈ R^p.    (5)

Regression equivariance is just as desirable as translation equivariance is for a multivariate location estimator. Because of regression equivariance, we often see in the literature the phrase "without loss of generality, assume that β = 0 in the following discussion". This statement is based on the application of (5).

We first establish a general ABP upper bound for any regression equivariant estimator (the RBP counterpart has been given in the literature, e.g. page 125 of RL87).

Proposition 3.2 For a given Z^n in R^p, any regression equivariant estimator T in R^p satisfies

ABP(T, Z^n) ≤ m/(n + m),

where m = n − p + 1 (throughout we assume that n ≥ p).

Proof:
Assume that u ∈ R^p is a unit vector from the null space of the (p − 1) × p matrix (w_{m+1}, ···, w_n)′, where w′_i = (1, x′_i). Now construct two contaminated data sets,

Z^n ∪ Y^m_1 = {(x′_1, y_1), ···, (x′_m, y_m), (x′_{m+1}, y_{m+1}), ···, (x′_n, y_n), (x′_1, y_1 + w′_1 b), ···, (x′_m, y_m + w′_m b)},

Y^m_2 ∪ Z^n = {(x′_1, y_1 − w′_1 b), ···, (x′_m, y_m − w′_m b), (x′_{m+1}, y_{m+1}), ···, (x′_n, y_n), (x′_1, y_1), ···, (x′_m, y_m)},

where b = λu, λ > 0. It is not hard to see that

Y^m_1 = {(x′_1, y_1 + w′_1 b), ···, (x′_m, y_m + w′_m b)},  Y^m_2 = {(x′_1, y_1 − w′_1 b), ···, (x′_m, y_m − w′_m b)}.

Notice that w′_j b = 0 for m < j ≤ n. Subtracting from the y component of each point in Z^n ∪ Y^m_1 the inner product of the w′ = (1, x′) (where x′ is the x′ component of the point) with the vector b, we obtain Y^m_2 ∪ Z^n. This, in conjunction with (5), immediately leads to

λ = ‖b‖ = ‖T(Z^n ∪ Y^m_1) − T(Y^m_2 ∪ Z^n)‖ = ‖(T(Z^n ∪ Y^m_1) − T(Z^n)) + (T(Z^n) − T(Y^m_2 ∪ Z^n))‖ ≤ 2 sup_{Y^m} ‖T(Z^n ∪ Y^m) − T(Z^n)‖.

Letting λ → ∞, we immediately obtain the desired result. □

We shall say Z^n is in general position (IGP) when any p of the observations in Z^n give a unique determination of β. In other words, any (p − 1) dimensional affine subspace of the (x′, y) space contains at most p observations of Z^n. When the observations come from continuous distributions, the event (Z^n being in general position) happens with probability one.

Proposition 3.3 For a given IGP Z^n (or P_n) in R^p, we have

(A) ABP(T∗_RD, Z^n) ≥ (k∗(Z^n) − p + 1)/(n + k∗(Z^n) − p + 1),    (6)

(B) ABP(T∗_RD, Z^n) ≤ (n − p + 1)/(2n − p + 1).    (7)

Proof:
Since T∗_RD is regression equivariant (see Zuo (2018)), (B) follows from Proposition 3.2 immediately. It suffices to prove (A). We claim that m < k∗(Z^n) − p + 1 contaminating points are not enough to break down T∗_RD.

Assume, otherwise, that m < k∗(Z^n) − p + 1 contaminating points are enough to break down T∗_RD. That is, sup_{Y^m} ‖T∗_RD(Z^n ∪ Y^m)‖ = ∞.

Denote by β∗_RD(Z^n ∪ Y^m) (a slight abuse of notation) the maximizer of RD at Z^n ∪ Y^m which has the maximum norm among all the RD maximizers. There are at most finitely many RD maximizers for a fixed finite sample size n + m.

Notice that T∗_RD(Z^n ∪ Y^m) is not necessarily identical to β∗_RD(Z^n ∪ Y^m) in the non-unique maximizer case, and the RD of the former could be smaller than that of the latter and less than the maximum depth value, due to the averaging. For example, for the four-point data set in R² of Figure 2, the RD of the former is 0 whereas that of the latter is 1/2; see Figure 2. This is one of the reasons we will treat β∗_RD(Z^n ∪ Y^m) in the sequel instead of T∗_RD(Z^n ∪ Y^m), in order to compensate for the weakness of the latter. The other reason, which is the important one, is that

sup_{Y^m} ‖T∗_RD(Z^n ∪ Y^m)‖ = ∞ if and only if sup_{Y^m} ‖β∗_RD(Z^n ∪ Y^m)‖ = ∞.

The above implies that

‖β∗_RD(Z^n ∪ (Y^m)_j)_2‖ = |tan(θ_{β∗_RD(Z^n ∪ (Y^m)_j)})| → ∞,

along a sequence of (Y^m)_j as j → ∞, where the subscript 2 on the LHS means that it is the non-intercept component of the β∗_RD(Z^n ∪ (Y^m)_j) (remember we write β = (β1, β2′)′).

If there exists a finite j such that θ_{β∗_RD(Z^n ∪ (Y^m)_j)} = π/2, then at most m contaminating points from (Y^m)_j are on the hyperplane H_{β∗_RD(Z^n ∪ (Y^m)_j)}, which contains at most p − 1 points of Z^n. The latter is due to the fact that Z^n is IGP and the intersection hyperline between H_{β∗_RD(Z^n ∪ (Y^m)_j)} and the horizontal hyperplane y = 0 is a (p − 2) dimensional subspace of the p dimensional space (x′, y). It is not hard to see that
Figure 2: Four points in general position (represented by four filled circles) in R². Six lines are formed, each connecting two data points. Each line attains the maximum regression depth 1/2. However, the average of these deepest lines, T∗_RD, has regression depth 0.

m + (p − 1) ≥ (m + n) RD(β∗_RD(Z^n ∪ (Y^m)_j), Z^n ∪ (Y^m)_j) = k∗(Z^n ∪ (Y^m)_j) ≥ k∗(Z^n),    (8)

where the first inequality follows from the fact that the vertical hyperplane contains at most m + (p − 1) points from Z^n ∪ (Y^m)_j, the second equality follows from (4) and the definition of β∗_RD(Z^n ∪ (Y^m)_j) above, and the last inequality follows from the fact that (m + n) RD(β, Z^n ∪ Y^m) ≥ n RD(β, Z^n) in light of (3).

Otherwise (i.e. θ_{β∗_RD(Z^n ∪ (Y^m)_j)} < π/2 for any j), consider, without loss of generality (and in light of (3)), for any large j and (Y^m)_j and H_{β∗_RD(Z^n ∪ (Y^m)_j)}, a vertical hyperplane H_v (which depends on j) that contains no data points from Z^n and intersects H_{β∗_RD(Z^n ∪ (Y^m)_j)} at l_v(β∗_RD(Z^n ∪ (Y^m)_j)).

Clearly, there exists a narrow vertical hyperstrip centered at H_v (with its two boundary hyperplanes parallel to H_v) within which there are no data points from Z^n. Now, when tilting the hyperplane H_{β∗_RD(Z^n ∪ (Y^m)_j)} (which is already almost vertical for a large enough j) along l_v(β∗_RD(Z^n ∪ (Y^m)_j)) to its eventual vertical position H_v, it is readily apparent that

m + p − 1 ≥ (m + n) min fr(l_v(β∗_RD(Z^n ∪ (Y^m)_j)), P_{n+m}) ≥ (m + n) RD(β∗_RD(Z^n ∪ (Y^m)_j), Z^n ∪ (Y^m)_j) = k∗(Z^n ∪ (Y^m)_j) ≥ k∗(Z^n),    (9)

where P_{n+m} stands for the empirical distribution based on Z^n ∪ (Y^m)_j. The first inequality follows from the definition of min fr(l_v(β), P_n), since there is one way of tilting the hyperplane H_{β∗_RD(Z^n ∪ (Y^m)_j)} along l_v(β∗_RD(Z^n ∪ (Y^m)_j)) so that no original data points from Z^n are touched during the movement (except the at most p − 1 points of Z^n that are already on the hyperplane), and the points it can touch are at most all the m contaminating points. The second inequality follows from (3), and the third equality follows from the introduction of β∗_RD(Z^n ∪ Y^m) at the beginning and from (4). Finally, the fourth one is trivial since (m + n) RD(β, Z^n ∪ Y^m) ≥ n RD(β, Z^n) in light of (3).

The inequalities in (8) and (9) yield m + p − 1 ≥ k∗(Z^n), i.e. m ≥ k∗(Z^n) − p + 1, a contradiction. Thus, m < k∗(Z^n) − p + 1 contaminating points are not enough to break down T∗_RD. We obtain (A). □
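The bounds (6) and (7) just established can be illustrated with a quick numeric sketch (mine, not the paper's). Plugging in the maximum possible depth count k∗(Z^n) = ⌊(n + p)/2⌋ for IGP data (see Remarks 3.2 below) shows that the lower bound (6) can reach the asymptotic value 1/3, while the general upper bound (7) tends to 1/2.

```python
# Numeric illustration of the bounds in Proposition 3.3 (assumed forms
# taken directly from (6) and (7)).
def abp_lower(n, p, kstar):
    return (kstar - p + 1) / (n + kstar - p + 1)   # bound (6)

def abp_upper(n, p):
    return (n - p + 1) / (2 * n - p + 1)           # bound (7)

for n in (10, 100, 10_000):
    p = 2
    kmax = (n + p) // 2        # maximum possible depth count for IGP data
    print(n, abp_lower(n, p, kmax), abp_upper(n, p))
```

For n = 10, p = 2 and k∗ = 6 the lower bound is already exactly 1/3; as n grows, the lower bound with maximal k∗ stays near 1/3 while the upper bound approaches 1/2.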
Remarks 3.2
It is unfortunate that the upper bound in the Proposition cannot be further improved to m = k∗(Z^n) − p + 1 at this point. Note that for an IGP Z^n, k∗(Z^n) ≤ ⌊(n + p)/2⌋. The latter is the maximum possible RD value, which is still less than n if n > p. We expect that this sharper upper bound (with m = k∗(Z^n) − p + 1) holds true. In that case, we would have a unified ABP, that is, ABP(T∗_RD, Z^n) = (k∗(Z^n) − p + 1)/(n + k∗(Z^n) − p + 1). □

RH99 also presented, in their Corollary of Conjecture 1 (CC1), a lower bound (LB) of the RBP for the T∗_RD,

RBP(T∗_RD, Z^n) ≥ (1/n)(⌈n/(p + 1)⌉ − p + 1),    (10)

under the assumptions that (i) part (a) of Conjecture 1 of RH99 holds true and (ii) the x_i are IGP.

If k∗(Z^n) did not depend on Z^n and were a constant, then by part (b) of Theorem 2.1 of Zuo (2001), the LB of the ABP in Proposition 3.3 could be converted into a LB of the RBP for the T∗_RD; it would be (k∗(Z^n) − p + 1)/n, which is larger than the LB of the RBP in (10) above if part (a) of Conjecture 1 holds true. Unfortunately, k∗(Z^n) depends on Z^n.

Part (a) of Conjecture 1 of RH99 was proved elegantly and creatively by RH99 for just the p = 2 case. In the sequel we prove part (a) of Conjecture 1 of RH99 for any p ≥ 2.

Proposition 3.4
The (a) of Conjecture 1 of RH99 holds true. That is, for Z^n in R^p (p ≥ 2),

⌈n/(p + 1)⌉ ≤ k∗(Z^n) = max_{β ∈ R^p} n RD(β; Z^n).    (11)

Proof: The case p = 2 has been treated in Theorem 1 of RH99. So in the following, we extend the idea from the proof of RH99 and take care of the p > 2 case. Assume n = (p + 1)k − r for some positive integers k and r with 0 < r < (p + 1); that is, ⌈n/(p + 1)⌉ = k.

We need to consider only k > 3, since if k ≤ 3 the Proposition already holds (simply by selecting a β such that the hyperplane H_β contains p points of Z^n; then the RHS of (11) is obviously no less than k). It is not hard to show that n ≥ 3k + 1.

Sort Z^n = {(x_i, y_i), i = 1, ···, n} according to the first coordinate of x_i and divide the points into three groups (called G_1, G_2 and G_3) such that G_1 contains the left-most k points in Z^n and G_3 contains the right-most k points in Z^n (note that n ≥ 3k + 1); all the other points form the G_2 group.

If there are ties in the first coordinate of x_i (i.e. there are multiple points that have the same kth or (n − k + 1)th first x coordinate), then sort those with identical kth or (n − k + 1)th first x coordinate using their second x coordinate, and, if ties remain, using their third x coordinate, ..., until using their y coordinate. If there are still ties (at the kth or (n − k + 1)th points) for dividing the groups over the boundaries, then label the overlapping points over the boundary between G_1 and G_2 as G_12, and those over the boundary between G_2 and G_3 as G_23. In this situation, both S_1 := G_1 ∪ G_12 and S_3 := G_3 ∪ G_23 contain at least k points of Z^n.
Now S_1, S_2 := G_2, S_3 are the three groups we will work on instead of the original G_1, G_2 and G_3. That is, we have divided the original data set Z^n into three groups, separated by two vertical hyperplanes which contain no data points from Z^n, with the G_1 and G_3 groups both containing at least k = ⌈n/(p + 1)⌉ points.

For a given arbitrary data set W^n of size n in R^p, a hyperplane is said to bisect it if the number n_+(W^n) of points that lie in the open half-space on one side of the hyperplane is the same as the number n_−(W^n) of points that lie in the open half-space on the other side of the hyperplane, where n_+(W^n) + n_−(W^n) + n_0(W^n) = n and n_0(W^n) is the number of points on the hyperplane.

By the ham-sandwich theorem (see Stone and Tukey (1942) or Edelsbrunner (1987)), there exists a hyperplane H_β that simultaneously bisects G_1, G_2 and G_3 (or S_1, S_2 and S_3), where β is the unique parameter determined by the hyperplane.

Now for any vertical hyperplane H_v, which is implicitly included in the RHS of (3), assume (w.l.o.g., in light of (3)) that it contains no data points from Z^n and intersects H_β at the hyperline l_v(β). Denote the number of points with positive (resp. negative, zero) residuals on the LHS of H_v, i.e. above (resp. below, on) the H_β, by L_v(Z^n)^+ (resp. L_v(Z^n)^−, L_v(Z^n)^0). Similarly, define R_v(Z^n)^+, R_v(Z^n)^−, R_v(Z^n)^0 for the residuals on the RHS of H_v above (resp. below, on) the H_β.

In light of (3), it suffices to consider H_v only in (i) G_1 or the open region to the left of G_1, (ii) the G_2 region, and (iii) G_3 or the open region to the right of G_3. We show that RD(β, Z^n) ≥ ⌈n/(p + 1)⌉.
It suffices to treat (i) and (ii) only.

(i) Tilting H_β to the vertical position of the H_v in G_1 or the open region to the left of G_1, it is apparent that

n min fr(l_v(β), Z^n) ≥ min{ R_v(G_2 ∪ G_3)^− + R_v(G_2 ∪ G_3)^0, R_v(G_2 ∪ G_3)^+ + R_v(G_2 ∪ G_3)^0 } = R_v(G_2 ∪ G_3)^− + R_v(G_2 ∪ G_3)^0 ≥ n P_n(G_2 ∪ G_3)/2 > k,

where (1) the first inequality follows from the definition of min fr and the fact that we skip the positive (negative, zero) counts of residuals from the part of G_1; (2) the equality follows from the bisection property of H_β for G_2 and G_3, under which the number of non-positive residuals and the number of non-negative residuals in G_2 ∪ G_3 are the same; (3) the second inequality follows from the fact that R_v(G_2 ∪ G_3)^+ + R_v(G_2 ∪ G_3)^− + R_v(G_2 ∪ G_3)^0 = n P_n(G_2 ∪ G_3), with the first two terms identical due to the bisection by H_β, where n P_n(G_2 ∪ G_3) is just the count of the points of Z^n in G_2 ∪ G_3; and (4) the last inequality follows from a direct count of the points of Z^n in G_2 ∪ G_3 (here we can treat the case |G_1| = k and the case |S_1| > k separately; the former case is straightforward.
In the latter case we have to take into account R_v(G_12)^+, R_v(G_12)^− and R_v(G_12)^0 on the RHS of the above display; the inequality again follows trivially).

(ii) Tilting H_β to the vertical position of the H_v in the G_2 region, it is not hard to see that

n min fr(l_v(β), Z^n) ≥ min{ L_v(G_1)^+ + L_v(G_1)^0 + R_v(G_3)^− + R_v(G_3)^0, L_v(G_1)^− + L_v(G_1)^0 + R_v(G_3)^+ + R_v(G_3)^0 } = L_v(G_1)^+ + L_v(G_1)^0 + R_v(G_3)^− + R_v(G_3)^0 = ((k + L_v(G_1)^0) + (k + R_v(G_3)^0))/2 ≥ k,

where the first inequality follows from the definition of min fr and the fact that we skip the positive (negative, zero) counts of residuals from the part of G_2; the first equality follows from the bisection property of H_β for G_1 and G_3; and the second equality follows from the facts that (1) L_v(G_1)^− = L_v(G_1)^+ and L_v(G_1)^+ + L_v(G_1)^− + L_v(G_1)^0 = k and (2) R_v(G_3)^− = R_v(G_3)^+ and R_v(G_3)^+ + R_v(G_3)^− + R_v(G_3)^0 = k (these equalities become inequalities (≥) in the case of the S_i). The last inequality is trivial.

(iii) The case of H_v in G_3 or the open region to the right of G_3 can be treated similarly to the case of H_v in G_1 or the open region to the left of G_1 above.

In light of (3) and all the above inequalities, we see immediately that

k∗(Z^n) ≥ n RD(β, Z^n) ≥ k = ⌈n/(p + 1)⌉.

This completes the proof of the Proposition. □
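Proposition 3.4 can be sanity-checked empirically for p = 2 (a brute-force sketch of my own, not the paper's code): for random data sets, the maximum regression depth over candidate lines through pairs of data points is at least ⌈n/(p + 1)⌉ = ⌈n/3⌉. The sketch assumes the maximum depth k∗(Z^n) is attained by a line through two data points, which holds because sliding or rotating a line until it touches data points never decreases its depth count (touched points get zero residuals, which are counted in both tilting directions).

```python
import itertools
import math
import random

def rdepth(a, b, pts):
    # depth count of the line y = a + b*x via the tilting characterization (3)
    res = [(x, y - a - b * x) for x, y in pts]
    xs = sorted({x for x, _ in res})
    cands = [xs[0] - 1.0] + [(u + w) / 2.0 for u, w in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    best = len(pts)
    for v in cands:
        lp = sum(1 for x, r in res if x < v and r >= 0)
        lm = sum(1 for x, r in res if x < v and r <= 0)
        rp = sum(1 for x, r in res if x > v and r >= 0)
        rm = sum(1 for x, r in res if x > v and r <= 0)
        best = min(best, lp + rm, lm + rp)
    return best

def max_depth(pts):
    # k*(Z^n): maximize the depth count over lines through pairs of data points
    best = 0
    for (x1, y1), (x2, y2) in itertools.combinations(pts, 2):
        if x1 != x2:
            b = (y2 - y1) / (x2 - x1)
            best = max(best, rdepth(y1 - b * x1, b, pts))
    return best

random.seed(0)
n = 15
for _ in range(20):
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
    assert max_depth(pts) >= math.ceil(n / 3)   # bound (11) for p = 2
```

The loop never trips the assertion, in agreement with (11); in fact, for Gaussian samples the maximum depth is typically close to n/2, the maximum possible value noted in Remarks 3.2.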
Remarks 3.3
(i) Amenta, et al. (2000) also proved part (a) of Conjecture 1 in RH99. Their proof, however, is much more sophisticated, based on projective geometry and central projection, and is not easy to follow and check. Mizera (2002) also gave a much more mathematically abstract and complicated proof covering part (b) of Conjecture 1 in RH99 (the population case); he claimed that it covers part (a) as well. RH99 treated the case of p = 2 elegantly and creatively. They did not, however, consider the overlapping boundary points between groups, nor the case of H_v in the middle group, where one cannot determine the exact number of data points in the non-middle groups.
(ii) The LB of the RBP in RH99, which utilizes the lower bound of the maximum depth value in Conjecture 1, does not depend on the configuration of the data set Z^n but only on p and n. The LB of the ABP in Proposition 3.3 utilizes the maximum depth value itself and does depend on the configuration of the data set Z^n. The difference between the two will be revealed more clearly in the next section. □

(I) The state of the art on the FSBP of the T∗_RD and a sharper lower bound.

No existing result in the literature covers or implies Proposition 3.3. The most closely related results include the lower bound (LB) of the RBP for T∗_RD given in RH99 and Lemma 3 and Theorem 2 in Van Aelst and Rousseeuw (2000) (VAR00), which are limited to regression symmetric population distributions in an asymptotic bias sense, and the lower bounds (Theorems 4.1 and 4.2) in Mizera (2002) (M02), which are maximally abstract and general and also set in an ε-contamination and maximum bias framework.

Furthermore, all those lower bounds are not sharp (in light of Proposition 3.3). Let us focus on the LB of the RBP in RH99. The differences between this and the LB of the ABP in Proposition 3.3 include: (i) the former depends purely on n and p, whereas the latter depends on the configuration of Z^n; (ii) for a fixed p > 2, as n → ∞ the former approaches 1/(p + 1), which can never be 1/3. Note also that an ABP and an RBP are not directly comparable in magnitude: in the sample mean case, they are 1/(n + 1) vs. 1/n; in the sample median case, they are 1/2 vs. ⌊(n + 1)/2⌋/n. Directly comparing, in terms of their magnitudes, the ABP and the RBP is thus unfavorable (or unfair) to the ABP.

To better appreciate the sharpness of the LB of the ABP in Proposition 3.3, and to better understand the quantitative difference between the LB of the ABP in Proposition 3.3 and the LB of the RBP in RH99 for the same T∗_RD, we carry out a small scale simulation study to calculate the average differences of (k∗(Z^n) − p + 1)/(n + k∗(Z^n) − p + 1) (the LB of the ABP in Prop. 3.3) minus (1/n)(⌈n/(p + 1)⌉ − p + 1) (the LB of the RBP in RH99) in 1000 multivariate N(0, I) samples for different small n's and p's; the results are given in Table 1.

Average of (the LB of the ABP in Prop. 3.3 − the LB of the RBP in RH99) in 1000 samples

  n       10        20        30        50        100       200
  p=2   -3.725%   -1.776%   -0.913%   -2.237%   -2.456%   -1.805%
  p=3   10.38%     8.736%    5.235%    4.646%    5.139%    5.328%
  p=5   31.52%    16.85%    15.09%    12.02%    11.47%    11.15%

Table 1: Average differences between the LB of the ABP in Prop. 3.3 and the LB of the RBP in the Corollary of Conjecture 1 (CC1) of RH99, based on 1000 multivariate standard normal samples.

Keep in mind that the comparison in Table 1 is very unfavorable to the LB of the ABP in Proposition 3.3. Nevertheless, it is readily apparent from the table that the LB of the ABP in Proposition 3.3 is sharper than the LB of the RBP in RH99, because of all the positive entries (when p > 2) as well as the negative ones (when p = 2): all entries would be supposed to be negative if the number of contaminating points m were the same in the two contamination schemes.

The positive entries in the table support the statement above. They imply that the LB of the RBP in RH99 severely underestimates the contamination percentage that the T∗_RD can resist. Consequently, the LB in RH99 is not sharp. Note that when n increases, so eventually does the difference (e.g. when p = 2 and n = 500, the entry in the table would be -1.352%); the same is true w.r.t. the dimension parameter p. The LB of the RBP in RH99 becomes negative and uninformative if n ≤ (p + 1)(p − 2) (this explains the unusually large entry in the case p = 5, n = 10).

The results in the table demonstrate the merit of the LB of the ABP in Proposition 3.3. One question that might be raised about the results in the table is: are those results distribution-dependent? That is, if the underlying distribution of the samples changes, does the LB of the ABP in Proposition 3.3 still have any advantage over the one in RH99? To answer this question, we carried out another small scale simulation study; the results are reported in Table 2.

Average of (the LB of the ABP in Prop. 3.3 − the LB of the RBP in RH99) in 1000 samples

  n       10        20        30        50        100       200
  p=2   -3.687%   -1.089%   -0.929%   -2.261%   -2.604%   -2.042%
  p=3   10.45%     8.742%    5.214%    4.572%    4.888%    4.993%
  p=5   31.36%    16.92%    16.09%    11.85%    11.30%    10.78%

Table 2: Average differences between the LB of the ABP in Proposition 3.3 and the LB of the RBP in the CC1 of RH99, based on 1000 contaminated multivariate normal samples.

Here we generated 1000 samples Z^(n) = {(x′_i, y_i)′, i = 1, ···, n, x_i ∈ R^{p−1}} from the Gaussian distribution with zero mean vector and 1 to p as the diagonal entries of its diagonal covariance matrix, for various n's and p's. Each sample is contaminated by 5% i.i.d. normal p-dimensional points with individual mean 10 and variance 0.1. Thus, we no longer have a symmetric error and homoscedastic variance model.

Comparing the entries of Tables 1 and 2, we conclude that the sharpness of the LB of the ABP in Proposition 3.3 over the LB of the RBP in RH99 overall hardly depends on the underlying distributions (this is confirmed in the multivariate t-distribution case as well). However, k∗(Z^n) does depend on the configuration of the points in Z^n.

The lower bounds in VAR00 are not applicable when one replaces P by P_n or by a real data set Z^n. This is because the regression symmetry assumption does not hold in practice for fixed sample size data sets, and the depth region {β : RD(β, Z^n) ≥ η} for 0 < η < 1/3 is not necessarily bounded unless the x_i are IGP (as required in the CC1 of RH99). The boundedness is also assumed in Lemma 3 and Theorem 2 of VAR00.

(II) Finite sample versus asymptotic breakdown point, the merit of the FSBP.

The asymptotic breakdown value, or the limit of the finite sample breakdown value of T∗_RD, 1/3,
3, was given in AVR00 (Theorem 2) and RH99 (Theorem 8), respectively.Average of (cid:0) the LB of ABP in Prop. 3.1 – 1/3 (asymptotic BP) (cid:1) in 1000 samplesn 10 20 30 50 100 200p=2 -7.075% -5.168% -4.306% -3.562% -2.790% -2.136%p=3 -12.70% -9.517% -8.105% -6.689% -5.186% -4.017%p=5 -22.12% -16.43% -13.94% -11.34% -8.855% -7.167%Table 3: Average differences between the LB of ABP given in Proposition 3.3 and the asymp-totic breakdown point 1 /
3, based on 1000 multivariate standard normal samples.This limiting result can be obtained directly from Proposition 3.3 if samples come from aregression symmetric population distribution, in this case k ∗ ( Z n ) = n RD( β ∗ RD , Z n ) → n/ n → ∞ (see Theorems 6 and 7 of RH99). The T ∗ RD , however, has to be used in finitesample practice, and the limit 1 / / k ∗ ( Z n ) − p + 1) / ( n + k ∗ ( Z n ) − p + 1) in finite sample cases (especially small sample sizes)are given in table 3 or Figure 2 below.The table entries reveal once again the merit of the LB of the ABP for the finite samplebreakdown point of the regression median because it is smaller than the asymptotic break-down value 1/3 in all cases considered. For example, in small n and large p cases, it can beless than 20% or more ( p = 5 , n = 10). The asymptotic breakdown point, 1/3, is irrelevantfor these finite sample cases because it over-estimates the real breakdown value of the T ∗ RD infinite sample real practice. The LB of the ABP in Proposition 3.3 increases when n increasesfor fixed p and decreases as p increases for fixed n and is always less than 1 / =10 n=20 n=30 n=40 n=50 . . . lower bound of ABP in 1000 samples p=2 (a) p=2 n=10 n=20 n=30 n=40 n=50 . . . lower bound of ABP in 1000 samples p=3 (b) p=3 n=10 n=20 n=30 n=40 n=50 . . . . lower bound of ABP in 1000 samples p=5 (c) p=5 Figure 3: Boxplots for the LB of ABP in Proposition 3.3 for T ∗ RD based on 1000 multivariatestandard normal samples for three dimensions and five different sample sizes.Simulation results of the LB of ABP in Proposition 3.3 for the T ∗ RD in 1000 samples canalso be displayed graphically in terms of their distributions such as in Figure 3.Inspection of the figure reveals that (i) the LB of the ABP in Proposition 3.3 is alwayslower than the asymptotic breakdown point 1 /
3, (ii) it decreases as p increases for a fixed n and increases as n does for fixed p and (iii) outliers exist in various cases, including p =2 , n = 20 , , , p = 3 , n = 40 ,
50 and p = 5 , n = 20 , , ,
50. All these observationsand results demonstrate the merit of the FSBP and the relevance of the LB of the ABP inProposition 3.3 (and the irrelevance of the asymptotic breakdown point 1 /
3) in finite samplepractice.(III)
Justification of regression by the maximum depth estimator (median). Proposition 3.3 reveals the intrinsic connection between the breakdown point and the maximum depth value. This kind of connection was also discussed explicitly in M02. This intrinsic connection clearly justifies the employment of the maximum depth median as a robust alternative to the traditional regression estimators (the least squares or the least absolute deviations estimators), since the former is much more robust, both in a finite sample sense and in an asymptotic sense.

(IV)
Location counterpart and other related results. The location counterparts of RD and β∗RD are, respectively, the halfspace depth (Tukey (1975)) and the halfspace median (HM). The finite sample breakdown point of the latter has been investigated thoroughly in the literature, e.g., Donoho (1982), Donoho and Gasko (1992) (DG92), Chen (1995) (C95), Chen and Tyler (2002) (CT00), and Liu et al. (2017) (LZW17).

In summary, the asymptotic breakdown point of the HM can be as high as 1/3 when X^n is assumed to be IGP. The exact expression of the FSBP of the HM is given in LZW17 under two assumptions: (i) X^n is IGP, and (ii) a special contamination scheme in which all contaminating points lie at the same site. It seems that the idea of the proof of Proposition 3.3 could be extended to establish bounds on the FSBP of the HM under a more regular and general contamination scheme.

The exact FSBP of the projection regression depth median, T∗PRD, a major competitor of T∗RD, has been investigated and established in Zuo (2019). The asymptotic breakdown point of T∗PRD reaches the highest possible value of 50%.

(V)
Computation of the regression median. The computation of RD and T∗RD is challenging; it has been discussed briefly in RH99, in Rousseeuw and Struyf (1998), in Van Aelst, Rousseeuw, Hubert, and Struyf (2002), and in Liu and Zuo (2014). An R package, "mrfDepth", has been developed by Segaert, Hubert, Rousseeuw, Raymaekers, and Vakili (2020). Like most other high breakdown point methods, T∗RD has to be computed approximately, which might affect its actual finite sample breakdown value, as pointed out in the literature.

Acknowledgment
The author thanks Hanshi Zuo, Hanwen Zuo, and Dr. Wei Shao for their careful proofreading, which has led to improvements in the manuscript.
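Remark on computing the bounds. The finite sample lower bounds compared in Tables 1–3 are elementary to evaluate once k∗(Z^n) is available. The sketch below is purely illustrative: the function names are ours, and k∗(Z^n) (the maximum regression depth, in counts, k∗(Z^n) = n RD(β∗RD, Z^n)) must be supplied by a separate regression depth computation, e.g., via the mrfDepth R package mentioned above.

```python
from math import ceil

def rh99_rbp_lb(n: int, p: int) -> float:
    """LB of the replacement breakdown point (RBP) of T*_RD from the
    Corollary of Conjecture 1 of RH99: (ceil(n/(p+1)) - p + 1) / n."""
    return (ceil(n / (p + 1)) - p + 1) / n

def abp_lb(n: int, p: int, k_star: int) -> float:
    """LB of the addition breakdown point (ABP) of T*_RD from
    Proposition 3.3: (k* - p + 1) / (n + k* - p + 1), where
    k* = k*(Z^n) is the maximum regression depth in counts
    (it must be computed separately, e.g. via mrfDepth)."""
    return (k_star - p + 1) / (n + k_star - p + 1)

# The RH99 bound is negative (hence uninformative) whenever
# n <= (p + 1)(p - 2), e.g. p = 5, n = 10:
print(rh99_rbp_lb(10, 5))            # -0.2

# With k* ~ n/2 (the regression symmetric case), the bound of
# Proposition 3.3 approaches the asymptotic value 1/3 as n grows:
print(abp_lb(10**6, 2, 5 * 10**5))   # ~0.3333
```

The second call illustrates numerically the limiting behavior discussed in (II): with k∗(Z^n) ≈ n/2, the ABP bound of Proposition 3.3 tends to 1/3.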
References

[1] Amenta, N., Bern, M., Eppstein, D., and Teng, S.-H. (2000), "Regression depth and center points", Discrete Comput. Geom., 23, 305-323.

[2] Chen, Z. (1995), "Robustness of the half-space median", J. Statist. Plann. Inference, 46(2), 175-184.

[3] Chen, Z., and Tyler, D. E. (2002), "The influence function and maximum bias of Tukey's median", Ann. Statist., 30, 1737-1759.

[4] Davies, P. L. (1987), "Asymptotic behavior of S-estimates of multivariate location parameters and dispersion matrices", Ann. Statist., 1269-1292.

[5] Davies, P. L. (1990), "The asymptotics of S-estimators in the linear regression model", Ann. Statist., 18, 1651-1675.

[6] Davies, P. L. (1993), "Aspects of robust linear regression", Ann. Statist., 21, 1843-1899.

[7] Davies, P. L., and Gather, U. (2005), "Breakdown and groups", Ann. Statist., 33(3), 977-988.

[8] Donoho, D. L. (1982), "Breakdown properties of multivariate location estimators", Ph.D. qualifying paper, Dept. of Statistics, Harvard University.

[9] Donoho, D. L., and Gasko, M. (1992), "Breakdown properties of location estimates based on halfspace depth and projected outlyingness", Ann. Statist., 1803-1827.

[10] Donoho, D. L., and Huber, P. J. (1983), "The notion of breakdown point", in: P. J. Bickel, K. A. Doksum, and J. L. Hodges, Jr., eds., A Festschrift for Erich L. Lehmann, Wadsworth, Belmont, CA, pp. 157-184.

[11] Edelsbrunner, H. (1987), Algorithms in Combinatorial Geometry, Springer-Verlag, Berlin.

[12] Ghosh, S. K., and Sengupta, D. (1999), "On multivariate monotonic measures of location with high breakdown point", Sankhyā A, 362-380.

[13] Hampel, F. R. (1968), "Contributions to the theory of robust estimation", Ph.D. thesis, University of California, Berkeley.

[14] Hampel, F. R. (1971), "A general qualitative definition of robustness", Ann. Math. Statist., 1887-1896.

[15] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, New York.

[16] Hodges, J. L., Jr. (1967), "Efficiency in normal samples and tolerance of extreme values for some estimates of location", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 163-168.

[17] Huber, P. J. (1981), Robust Statistics, Wiley, New York.

[18] Liu, X., Zuo, Y., and Wang, Q. (2017), "Finite sample breakdown point of Tukey's halfspace median", Sci. China Math., 60, 861-874.

[19] Lopuhaä, H. P. (1992), "Highly efficient estimators of multivariate location with high breakdown point", Ann. Statist., 398-413.

[20] Lopuhaä, H. P., and Rousseeuw, P. J. (1991), "Breakdown points of affine equivariant estimators of multivariate location and covariance matrices", Ann. Statist., 229-248.

[21] Müller, C. H. (1995), "Breakdown points for designed experiments", J. Statist. Plann. Inference, 413-427.

[22] Maronna, R. A., and Yohai, V. J. (1991), "The breakdown point of simultaneous general M estimates of regression and scale", J. Amer. Statist. Assoc., 699-703.

[23] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006), Robust Statistics: Theory and Methods, John Wiley & Sons.

[24] Mizera, I. (2002), "On depth and deep points: a calculus", Ann. Statist., 30(6), 1681-1736.

[25] Rousseeuw, P. J. (1984), "Least median of squares regression", J. Amer. Statist. Assoc., 871-880.

[26] Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, Wiley, New York.

[27] Rousseeuw, P. J., and Hubert, M. (1999), "Regression depth (with discussion)", J. Amer. Statist. Assoc., 94, 388-433.

[28] Rousseeuw, P. J., and Struyf, A. (1998), "Computing location depth and regression depth in higher dimensions", Statistics and Computing, 8, 193-203.

[29] Rousseeuw, P. J., and Struyf, A. (2004), "Characterizing angular symmetry and regression symmetry", J. Statist. Plann. Inference, 122, 161-173.

[30] Stone, A. H., and Tukey, J. W. (1942), "Generalized 'sandwich' theorems", Duke Mathematical Journal, 9(2), 356-359.

[31] Tyler, D. E. (1994), "Finite sample breakdown points of projection based multivariate location and scatter statistics", Ann. Statist., 1024-1044.

[32] Tukey, J. W. (1975), "Mathematics and the picturing of data", in: James, R. D. (ed.), Proceedings of the International Congress of Mathematicians, Vancouver, 1974 (Vol. 2), Canadian Mathematical Congress, Montreal, 523-531.

[33] Van Aelst, S., and Rousseeuw, P. J. (2000), "Robustness of deepest regression", J. Multivariate Anal., 73, 82-106.

[34] Zuo, Y. (2001), "Some quantitative relationships between two types of finite sample breakdown point", Statistics and Probability Letters, 51(4), 369-375.

[35] Zuo, Y. (2018), "On general notions of depth in regression", Statistical Science (in press), arXiv:1805.02046.

[36] Zuo, Y. (2019), "Robustness of deepest projection regression depth functional", Statistical Papers, https://doi.org/10.1007/s00362-019-01129-4, arXiv:1806.09611.

[37] Zuo, Y. (2020), "Large sample properties of the regression depth induced median",