On minimum Bregman divergence inference
Soumik Purkayastha
Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA ([email protected])
Ayanendranath Basu
Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, WB, India ([email protected])
August 18, 2020

Abstract
In this paper a new family of minimum divergence estimators based on the Bregman divergence is proposed. The popular density power divergence (DPD) class of estimators is a sub-class of Bregman divergences. We propose and study a new sub-class of Bregman divergences called the exponentially weighted divergence (EWD). Like the minimum DPD estimator, the minimum EWD estimator is recognised as an M-estimator. This characterisation is useful when discussing the asymptotic behaviour as well as the robustness properties of this class of estimators. The performances of the two classes are compared, both through simulations and through real-life examples. We develop an estimation process not only for independent and homogeneous data, but also for non-homogeneous data. General tests of parametric hypotheses based on the Bregman divergences are also considered. We establish the asymptotic null distribution of our proposed test statistic and explore its behaviour when applied to real data. The inference procedures generated by the new EWD appear to be competitive with, or better than, the DPD based procedures.
Density based minimum divergence methods are popular tools in statistical inference. In the case of the estimation problem, this amounts to estimating the parameters of interest by minimising an empirical version of some suitably chosen divergence between the 'true' density underlying the data and the assumed model density. Many of these methods combine strong robustness properties with high asymptotic efficiency, which is one of the reasons for their popularity. An important class of density based divergences useful in this context is the class of φ divergences (see Csiszár (1963)). Under standard regularity conditions, all minimum φ divergence estimators have full asymptotic efficiency at the model (Lindsay (1994)); many of them also have attractive robustness properties. A seminal work by Beran (1977), who investigated the minimum Hellinger distance estimator (MHDE), appears to be the first which demonstrated that strong robustness properties may be achieved simultaneously with full asymptotic efficiency. Later, the same has been demonstrated for much of the φ divergence class (see, e.g., Basu et al. (2011)). The usefulness of the corresponding procedures in providing robust alternatives to the likelihood ratio test has also been explored in the literature (Simpson (1989); Lindsay (1994); Basu et al. (2011)). The extension of this approach to problems beyond the simple i.i.d. set-up has also been attempted by several later authors. On the whole, the utility of the minimum divergence procedures based on φ divergences is well established in the literature.

One of the major criticisms of this inferential procedure is that it involves the use of some form of non-parametric smoothing (such as kernel-based density estimation) to produce a continuous estimate of the true density (which is necessary to construct the divergence when the model density is continuous).
While kernel density estimation (or other suitable non-parametric smoothing techniques) represents a very important class of statistical procedures, it involves considerable complication, and the bandwidth selection issue can throw up many potential difficulties. The slow convergence of the kernel density estimator to the 'truth' for high dimensional data adds another facet to the problem. The complicated nature of the estimating equation also makes the theoretical derivations harder. The development of methods which eliminate these difficulties may be worthwhile even if they involve a small loss in asymptotic efficiency.
An alternative class of minimum divergence estimators which does not require non-parametric smoothing in the construction of the empirical divergence is the class of minimum Bregman divergence estimators. An important example of divergences in this class is the family of density power divergences (DPD(α), where α is the tuning parameter); the corresponding minimum density power divergence estimators (MDPDE(α)) have been shown to combine strong robustness properties with high asymptotic efficiency (see Basu et al. (1998)). Divergences within the Bregman class have been called decomposable divergences by Broniatowski et al. (2012) and non-kernel divergences by Jana and Basu (2019). These divergences have simple estimating equations, and much of their asymptotic theory can be obtained from M-estimation theory. The Kullback-Leibler divergence, which is a decomposable divergence, is the only common member of the φ divergence class and the Bregman divergence class.

In the context of robust parametric estimation, specifically in the context of density-based minimum divergence estimation with a view to robustness, we have several 'good' choices available. In order to justify the development of another family of estimators, one must demonstrate that the new estimators are competitive with, if not better than, the existing standard. Within the class of minimum divergence estimators which do not require any nonparametric smoothing, the MDPDE(α) is the current standard. In this paper, we will develop a family of divergences yielding minimum divergence estimators which appear to satisfy this requirement; at the least, this family provides a highly competitive standard. Our proposed class of divergences will be called the exponentially weighted divergence family, indexed by a tuning parameter β (henceforth referred to as EWD(β)).
The corresponding minimum exponentially weighted divergence estimator will be denoted by MEWDE(β).

Like estimation, hypothesis testing is another fundamental area of statistical inference. Although the likelihood ratio test is a well studied component of classical hypothesis testing theory, it is known to have very poor robustness under model misspecification and in the presence of outliers. Many density based minimum distance procedures have been observed to have strong robustness properties in estimation and testing together with high efficiency; see, e.g., Pardo (2006) and Basu et al. (2013, 2018). The last mentioned paper addresses the general problem of parametric hypothesis testing of composite null hypotheses based on the density power divergence alone. The theoretical results presented in the present paper extend the said testing procedure to the entire class of Bregman divergences, with special emphasis on our proposed EWD(β) class of divergences.

Originally defined in the context of convex programming by Bregman (1967), the
Bregman divergence for p, q ∈ R^d is defined as

D_B(p, q) = B(p) - B(q) - ⟨∇B(q), p - q⟩,

where ∇ denotes the gradient of a function with respect to its arguments and ⟨x, y⟩ denotes the inner product of x and y. The function B : R^d → R is strictly convex and, consequently, the measure D_B(p, q) is non-negative and equals zero only when p = q. Extending this formulation to the case of two probability density functions (pdfs) g and f, we define the divergence

D_B(g, f) = ∫ [B(g(x)) - B(f(x)) - (g(x) - f(x)) B'(f(x))] dx,  (2.1)

where B'(·) is the derivative of B with respect to its argument. Since the integrand is non-negative for each x, it follows that D_B(g, f) is non-negative. Moreover, the measure is zero when its arguments are identically equal. Csiszár et al. (1991) discuss this and similar measures in greater detail. We note that the convex functions B(y) and B*(y) = B(y) + ay + c generate identical divergences in Equation (2.1) for a, c ∈ R.

The minimum Bregman divergence estimation procedure based on a general convex B function may be described as follows. Given an i.i.d. random sample X_1, ..., X_n from the distribution G, we model these data by a parametric family (of densities f_θ, indexed by the parameter θ) F_θ := {f_θ : θ ∈ Ω ⊂ R^p}. Specifically, we wish to estimate the value of the model parameter θ by choosing the model density which gives the closest fit to the data in the minimum Bregman divergence sense. Let g and f_θ be the density functions associated with the distribution functions G and F_θ respectively. An empirical version of D_B(g, f_θ), given by the right side of Equation (2.1) with f replaced by the model element f_θ, may now be obtained as

∫ [B(g(x)) - B(f_θ(x))] dx - n^{-1} Σ_{i=1}^n B'(f_θ(X_i)) + ∫ B'(f_θ(x)) f_θ(x) dx.
Here we have replaced the theoretical mean ∫ B'(f_θ(x)) g(x) dx with the sample mean based on X_1, ..., X_n. Grouping the terms of the above expression into terms based only on f_θ, terms based on both g and f_θ, and terms based only
on g, the above objective function may be expressed as

∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - n^{-1} Σ_{i=1}^n B'(f_θ(X_i)) + ∫ B(g(x)) dx.  (2.2)

The last term in the above expression may be ignored as it has no role in the optimisation over θ ∈ Ω.

Let u_θ(x) = ∇_θ log f_θ(x) be the likelihood score function of the model being considered, where ∇_θ represents the gradient of a function with respect to θ. Under differentiability of the model with respect to θ, minimisation of the expression in Equation (2.2) leads to the estimating equation

n^{-1} Σ_{i=1}^n u_θ(X_i) B''(f_θ(X_i)) f_θ(X_i) = ∫ u_θ(x) B''(f_θ(x)) f_θ^2(x) dx,  (2.3)

where B''(·) (or B'''(·)) is the indicated second (or third) derivative. This may be viewed as a generalised likelihood equation, or a weighted likelihood equation having the form

n^{-1} Σ_{i=1}^n u_θ(X_i) w(f_θ(X_i)) = ∫ u_θ(x) w(f_θ(x)) f_θ(x) dx,  (2.4)

where the usual likelihood score equation is recovered for w(t) ≡ 1. Comparing Equations (2.3) and (2.4), we obtain

w(f_θ(x)) = B''(f_θ(x)) f_θ(x).  (2.5)

Using the convexity of B and the non-negativity of f_θ, it follows that w is non-negative. We see that the Kullback-Leibler divergence, the L_2 divergence and, more generally, the density power divergence DPD(α) are all special cases of the Bregman divergence; the corresponding B functions are x log x, x^2 and α^{-1} x^{1+α} respectively. The Kullback-Leibler divergence is

D_KL(g, f_θ) = ∫ g(x) log( g(x) / f_θ(x) ) dx,

while the (squared) L_2 distance is

D_{L_2}(g, f_θ) = ∫ [g(x) - f_θ(x)]^2 dx,

and the general form of DPD(α) is

DPD_α(g, f_θ) = ∫ [ f_θ^{1+α}(x) - (1 + 1/α) g(x) f_θ^α(x) + (1/α) g^{1+α}(x) ] dx,  α > 0.
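To make the correspondence concrete, the following Python sketch (our own, not from the paper; the helper names `npdf` and `bregman_div` are ours) numerically checks on a crude grid that the convex function B(x) = x^{1+α}/α reproduces the DPD(α) formula above for a pair of normal densities.

```python
import math

def npdf(x, mu, sd):
    # normal density, used only to build an example pair (g, f)
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bregman_div(B, dB, g, f, lo=-10.0, hi=10.0, h=1e-3):
    # crude Riemann-sum version of Equation (2.1)
    n = int((hi - lo) / h) + 1
    return h * sum(B(g(lo + h * i)) - B(f(lo + h * i))
                   - (g(lo + h * i) - f(lo + h * i)) * dB(f(lo + h * i))
                   for i in range(n))

alpha = 0.5
B = lambda t: t ** (1 + alpha) / alpha             # B function generating DPD(alpha)
dB = lambda t: ((1 + alpha) / alpha) * t ** alpha  # its derivative B'

g = lambda x: npdf(x, 0.0, 1.0)  # stand-in for the 'true' density
f = lambda x: npdf(x, 0.5, 1.2)  # stand-in for the model density

d_bregman = bregman_div(B, dB, g, f)

# the direct DPD(alpha) formula on the same grid
h = 1e-3
d_dpd = h * sum(f(x) ** (1 + alpha)
                - (1 + 1 / alpha) * g(x) * f(x) ** alpha
                + (1 / alpha) * g(x) ** (1 + alpha)
                for x in (-10.0 + h * i for i in range(20001)))
# d_bregman and d_dpd agree up to floating-point error, and both are positive
```

Since the two integrands are algebraically identical, the agreement holds term by term on the grid.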
In order to develop new estimation procedures based on Bregman divergences, one can do one of two things: (a) start with a specific convex function B and construct a weighted likelihood equation as given in Equation (2.3), or (b) begin with a suitable weight function (motivated by considerations of robustness) and the associated weighted likelihood representation as in Equation (2.4), and backtrack to recover the corresponding convex function B. We choose the latter approach. See Biswas et al. (2020) for a general discussion on Bregman divergences and weighted likelihood.

Philosophically, our treatment of outliers is probabilistic, in that an outlying point is one which has a small probability of occurrence under a given model f_θ ∈ F_θ. We choose to downweight those observations in the estimating equation for which the value of f_θ(x) is small. We plot the weight functions w(t) = t^α for some members of the DPD(α) family at different values of α in Figure 1. While the weight function is a constant (equal to 1) at α = 0, all positive values of α downweight observations having density f_θ(x) ≤ 1. The strength of downweighting increases with increasing α. For f_θ(x) > 1, the weights grow unboundedly for all α > 0 as the argument increases. The measure DPD(0) corresponds, in a limiting sense, to the Kullback-Leibler divergence, which is minimized by the MLE. From Figure 1, we note that the MLE gives equal weight to all observations, including outlying ones, leading to its poor robustness properties. The measure DPD(1) corresponds to the squared L_2 distance.

We propose a new class of divergences based on a different choice of the weight function:

w_β(t) = 1 - exp(-t/β) if β > 0, and w_β(t) = 1 if β = 0.  (2.6)

These weights drop smoothly to zero with decreasing values of the probability density function f_θ(x) for β > 0. However, unlike the DPD(α) weights, they are bounded above by 1. We plot the weight functions given by Equation (2.6) for
specific values of β in Figure 2.

Figure 1: Weight functions of some DPD(α) members.

The likelihood equation may be recovered at β = 0, where, to avoid the complications of division by zero, the weights have been defined by the corresponding limiting case as β → 0. Using Equation (2.5), we recover the divergence (or rather, the associated B function). This function is given by

B(x) = (x^2/β) Σ_{n=0}^∞ (-x/β)^n / [(n+2)! (n+1)].

In Appendix A, we show that this can be further simplified to

B(x) = -x + γx + β - β exp(-x/β) + x Γ(0, x/β) + x log(x/β),  (2.7)

where γ is the Euler-Mascheroni constant

γ = lim_{n→∞} ( Σ_{k=1}^n 1/k - log n ) = -∫_0^∞ log(t) exp(-t) dt,

and Γ(α, β) is the incomplete gamma integral defined as

Γ(α, β) = ∫_β^∞ y^{α-1} exp(-y) dy.

The associated Bregman divergence (which we will refer to as the exponentially weighted divergence EWD(β)) yields the empirical objective

∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - n^{-1} Σ_{i=1}^n B'(f_θ(X_i)),  (2.8)
Figure 2: Weight functions of some EWD(β) members.

where B is given by Equation (2.7). The resultant estimating equation has the form

n^{-1} Σ_{i=1}^n u_θ(X_i) [1 - exp(-f_θ(X_i)/β)] = ∫ u_θ(x) [1 - exp(-f_θ(x)/β)] f_θ(x) dx.

We note that the EWD(β) family can also be generated by the simplified B function

B(x) = -β exp(-x/β) + x Γ(0, x/β) + x log(x/β).

Given i.i.d. observations X_1, ..., X_n from a distribution modeled by the parametric family F_θ, an M-estimator of the target parameter θ may be obtained by solving an estimating equation of the form Σ_{i=1}^n ψ(X_i, θ) = 0 (see, e.g., Huber and Ronchetti (2009) and Hampel et al. (1986) for more details). Any minimum Bregman divergence estimator is also an M-estimator. The ψ function associated with the minimum Bregman divergence estimator is

ψ(x, θ) = u_θ(x) B''(f_θ(x)) f_θ(x) - ∫ u_θ(t) B''(f_θ(t)) f_θ^2(t) dt.  (3.1)

We note that the ψ function in this case makes explicit use of the form of the pdf of the model, unlike the location-scale form used commonly in M-estimation. For the MEWDE(β), in particular, the associated ψ function is

ψ(x, θ) = u_θ(x) [1 - exp(-f_θ(x)/β)] - ∫ u_θ(t) [1 - exp(-f_θ(t)/β)] f_θ(t) dt.
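The closed form in Equation (2.7) can be checked against the weight-recovery identity (2.5), B''(x) x = 1 - exp(-x/β). The sketch below is our own: `exp1` implements the standard power series for Γ(0, z) (adequate for the moderate arguments used here), and the second derivative of B is taken by central differences.

```python
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def exp1(z, terms=80):
    # Gamma(0, z) = -gamma - log z + sum_{k>=1} (-1)^{k+1} z^k / (k * k!),
    # a standard series, adequate for the moderate z used here
    s = sum((-1) ** (k + 1) * z ** k / (k * math.factorial(k))
            for k in range(1, terms))
    return -GAMMA - math.log(z) + s

def B_ewd(x, beta):
    # closed-form B function of EWD(beta), Equation (2.7)
    return (-x + GAMMA * x + beta - beta * math.exp(-x / beta)
            + x * exp1(x / beta) + x * math.log(x / beta))

beta, h = 0.5, 1e-4
errs = []
for x in (0.1, 0.4, 0.9):
    # numerical second derivative of B by central differences
    B2 = (B_ewd(x + h, beta) - 2 * B_ewd(x, beta) + B_ewd(x - h, beta)) / h ** 2
    # Equation (2.5): B''(x) * x should recover the weight 1 - exp(-x/beta)
    errs.append(abs(B2 * x - (1 - math.exp(-x / beta))))
# errs are all tiny, confirming the closed form against the weight function
```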
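As an illustration of the resulting M-estimation procedure, the sketch below (our own construction) solves the MEWDE(β) estimating equation by bisection for the normal location model with σ = 1 known, on a small sample containing one gross outlier; for this symmetric model the integral term of the estimating equation vanishes.

```python
import math

def npdf(x, mu):
    # N(mu, 1) density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mewde_mean(data, beta, lo=-3.0, hi=3.0):
    # For the symmetric location model the integral term of the estimating
    # equation vanishes, leaving sum_i (X_i - t)(1 - exp(-f_t(X_i)/beta)) = 0;
    # we solve it by bisection on a bracket around the bulk of the data.
    def score(t):
        return sum((x - t) * (1 - math.exp(-npdf(x, t) / beta)) for x in data)
    a, b = lo, hi
    for _ in range(60):
        m = 0.5 * (a + b)
        if score(a) * score(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

data = [-1.2, -0.7, -0.5, 0.1, 0.3, 0.4, 0.8, 1.1, 9.0]  # one gross outlier
mle = sum(data) / len(data)            # sample mean, dragged above 1 by the outlier
robust = mewde_mean(data, beta=0.05)   # stays near the bulk of the data
```

The outlier receives weight 1 - exp(-f_θ(9)/β) ≈ 0, so the solution remains close to the mean of the remaining eight observations.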
We present results related to the asymptotic distribution of any minimum Bregman divergence based estimator in general, and the MEWDE(β) in particular, when the true distribution G from which the data are generated is not necessarily in the model under study. The theoretical estimating equation is ∫ ψ(x, θ) dG(x) = 0, where ψ(x, θ) is given by Equation (3.1). We observe that the functional T_β(G) associated with the MEWDE(β) is Fisher consistent; it recovers the value θ when the true distribution G = F_θ is a member of the parametric family being used to model the given data (this is true for any minimum Bregman divergence based estimator in general). When G is not in the model, our best fitting parameter θ^g = T_β(G) will be the root of the theoretical estimating equation

∫ u_θ(x) B''(f_θ(x)) f_θ(x) dG(x) = ∫ u_θ(x) B''(f_θ(x)) f_θ(x) dF_θ(x).  (3.2)

Let X_1, ..., X_n be a random sample from the distribution G having density g. The minimum Bregman divergence estimator for this sample is obtained as a solution of Equation (2.3) via the minimization of the quantity given by Equation (2.2) for a given B function. We define

H_n(θ) = ∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - n^{-1} Σ_{i=1}^n B'(f_θ(X_i)) = n^{-1} Σ_{i=1}^n V_θ(X_i),  (3.3)

where V_θ(t) = ∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - B'(f_θ(t)). As the population analogue of Equation (3.3), we define

H(θ) = ∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - ∫ B'(f_θ(x)) dG(x).  (3.4)

We also define the following quantities.

1. The information function of the model: I_θ(x) = -∇_θ u_θ(x).

2.
The covariance function of T(X) = u_θ(X) B''(f_θ(X)) f_θ(X) under G, which has the form

K(θ) = ∫ u_θ(x) u_θ^T(x) [B''(f_θ(x)) f_θ(x)]^2 dG(x) - ξ(θ) ξ^T(θ), where ξ(θ) = ∫ u_θ(x) B''(f_θ(x)) f_θ(x) dG(x).  (3.5)

3. The function J(θ), where

J(θ) = ∫ u_θ(x) u_θ^T(x) B''(f_θ(x)) f_θ^2(x) dx + ∫ [I_θ(x) - u_θ(x) u_θ^T(x) h(x)] [g(x) - f_θ(x)] [B''(f_θ(x)) f_θ(x)] dx,  (3.6)

where w(t) = B''(t)·t, w'(t) = B''(t) + B'''(t)·t and h(t) = w'(f_θ(t)) f_θ(t) / w(f_θ(t)).

Theorem 3.1 is proved under the set of assumptions given below. These may be viewed as generalizations of the conditions presented in Basu et al. (2011) (which were designed specifically for the DPD class). The details of the proof are not presented here, as it mimics the approach of Theorem 9.2 of Basu et al. (2011) exactly.

(A1) The distributions F_θ of X have common support, so that the set χ = {x : f_θ(x) > 0} is independent of θ. The distribution G is also supported on χ, on which the corresponding density g is greater than zero.

(A2) There is an open subset ω of the s-dimensional parameter space Ω containing the best fitting parameter θ^g such that for almost all x ∈ χ and all θ ∈ ω, the density f_θ(x) is three times differentiable with respect to θ and the third partial derivatives are continuous with respect to θ.
(A3) The integrals ∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx and ∫ B'(f_θ(x)) g(x) dx can be differentiated three times with respect to θ, and the derivatives can be taken under the integral sign.

(A4) The s × s matrix J(θ), with its (k, l)-th entry defined as

J_kl(θ) = E_g[ ∇_kl { ∫ [f_θ(x) B'(f_θ(x)) - B(f_θ(x))] dx - B'(f_θ(X)) } ],  (3.7)

is positive definite. Here ∇_kl denotes the partial derivative of a function with respect to the k-th and l-th components of its argument, and E_g represents the expectation under the density g.

(A5) There exists a function M_jkl(x) such that |∇_jkl V_θ(x)| ≤ M_jkl(x) for all θ ∈ ω, where V_θ(x) = ∫ [f_θ(y) B'(f_θ(y)) - B(f_θ(y))] dy - B'(f_θ(x)) and E_g[M_jkl(X)] = m_jkl < ∞ for all j, k and l.

Theorem 3.1.
Assuming that conditions (A1)-(A5) hold:

1. The estimating equation given by Equation (2.3) has a consistent sequence of roots θ̂ = θ̂_n, and

2. √n (θ̂ - θ^g) has an asymptotic multivariate normal distribution with (vector) mean zero and covariance matrix J^{-1} K J^{-1}.

Recalling our formulation of any minimum Bregman divergence based estimator (say, T_B) as an M-estimator, we observe that its influence function is given by

IF(y, T_B, G) = J^{-1} [u_θ(y) B''(f_θ(y)) f_θ(y) - ξ],  (3.8)

where ξ and J are given in Equations (3.5) and (3.6) respectively. These quantities are simplified further in case the true distribution G belongs to the model under consideration. Assuming J and ξ to be finite, the influence function turns out to be bounded whenever the quantity u_θ(y) B''(f_θ(y)) f_θ(y) is bounded in y (or, equivalently, when u_θ(y) w(f_θ(y)) is bounded in y). This is true, for example, for all members of the DPD family with α > 0, and for most standard parametric families (including the normal location-scale family). In the case of the MEWDE(β), the influence function is immediately seen to be

IF(y, T_β, G) = J^{-1} [ u_θ(y) (1 - exp(-f_θ(y)/β)) - ξ ].

In Figure 3, the influence functions of the MEWDE(β) for the estimation of the normal mean when σ = 1 are plotted; for all β > 0 considered here, we note their bounded redescending nature.

Remark 1.
The asymptotic variance of √n times the MEWDE(β) can be consistently estimated in a sandwich fashion by using the above influence function, as in Huber and Ronchetti (2009). Let K_i = u_θ(X_i)(1 - exp(-f_θ(X_i)/β)) - ξ(θ), and let K̂_i be the corresponding quantity evaluated at θ̂, with the empirical distribution G_n plugged in place of G. Let K̂ = n^{-1} Σ_i K̂_i K̂_i^T. Similarly, we obtain Ĵ from J by replacing θ by θ̂, with G_n plugged in place of G. Then, the asymptotic variance of √n MEWDE(β) can be consistently estimated by Ĵ^{-1} K̂ Ĵ^{-1}. Consistent estimators of the asymptotic variance of this estimator can also be obtained by the jackknife and bootstrap techniques. Again, this technique can be extended to consistently estimate the asymptotic variance of any minimum Bregman divergence based estimator for a given B function.

When the true distribution G belongs to the model, i.e. G = F_θ for some θ ∈ Ω, the formulae for J, K and ξ in the case of the MEWDE(β) are as follows:

J = ∫ u_θ(x) u_θ^T(x) [1 - exp(-f_θ(x)/β)] f_θ(x) dx,
K = ∫ u_θ(x) u_θ^T(x) [1 - exp(-f_θ(x)/β)]^2 f_θ(x) dx - ξξ^T,
ξ = ∫ u_θ(x) [1 - exp(-f_θ(x)/β)] f_θ(x) dx.  (4.1)
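Remark 1 can be sketched in the scalar normal-mean case as follows. This is our own illustration: we use the pure-model forms of J and ξ from Equation (4.1) (appropriate when the model fits) together with the empirical K̂, and we plug in the sample mean as a simple stand-in for θ̂.

```python
import math

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def sandwich_avar(data, theta_hat, beta, lo=-8.0, hi=8.0, h=1e-3):
    # avar of sqrt(n) * MEWDE(beta) for the N(theta, 1) mean, J^{-1} K J^{-1};
    # J and xi use the pure-model forms of Equation (4.1), K-hat is empirical
    n = len(data)
    w = lambda x: 1 - math.exp(-npdf(x, theta_hat) / beta)
    u = lambda x: x - theta_hat  # likelihood score of the normal mean
    grid = [lo + h * i for i in range(int((hi - lo) / h) + 1)]
    xi = h * sum(u(x) * w(x) * npdf(x, theta_hat) for x in grid)  # ~0 by symmetry
    J = h * sum(u(x) ** 2 * w(x) * npdf(x, theta_hat) for x in grid)
    K = sum((u(x) * w(x) - xi) ** 2 for x in data) / n
    return K / J ** 2

data = [-1.1, -0.4, 0.2, 0.5, 0.9, -0.2, 0.3, 0.7]
# the sample mean is used here as a simple stand-in for the MEWDE itself
avar = sandwich_avar(data, theta_hat=sum(data) / len(data), beta=0.1)
```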
Figure 3: Influence functions of the MEWDE(β) for μ̂ of the N(μ, 1) family at the N(0, 1) distribution.

As β → 0, J and K both tend to the Fisher information matrix. We use Equation (4.1) to compute asymptotic relative efficiencies of the MEWDE(β), which indicate how much efficiency is lost, relative to the maximum likelihood estimator, under the pure model. Along the lines of Basu et al. (1998), we consider examples of some specific parametric families.

Here we consider different parametric families, compute the MEWDEs of the model parameters under different scenarios using simulated data, and compare them with the corresponding MDPDEs. At the first stage we compute, for a fixed parametric family of densities F_θ = {f_θ : θ ∈ Ω ⊂ R^p}, the empirical mean square errors (MSEs) of the parameter estimates, for several members of both the MDPDE and the MEWDE classes, under pure data generated from the given parametric model. Then we identify several combinations (α, β) of the tuning parameters of the two families for which the empirical MSEs of the MDPDE(α) and the MEWDE(β) are approximately equal. Subsequently we generate data from contaminated model distributions having densities of the form

h(x) = (1 - ε) f_θ(x) + ε v(x),

where ε is the contaminating proportion and v(x) is a suitable contaminating density, but θ is still the target parameter. We then compare the MSEs of the MDPDE(α) and the MEWDE(β), with the aim of determining which of these two estimators, close as they are in terms of model efficiency, has better outlier stability. Unless otherwise mentioned, we have used samples of size n = 200, and for each scenario we have replicated the sample r = 2000 times. The finite sample relative efficiency (FSRE) of the MDPDE is defined to be the ratio of MSE(MLE) to MSE(MDPDE); similarly for the MEWDE. The relevant R codes are presented in the Online Supplement.
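The ARE computation based on Equation (4.1) can be sketched for the normal mean with σ = 1, where ξ = 0 by symmetry and the Fisher information equals 1, so that the ARE is J^2/K. The grid integration below is our own crude choice, not the paper's implementation.

```python
import math

def npdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def are_mewde(beta, lo=-10.0, hi=10.0, h=1e-3):
    # ARE of the MEWDE(beta) of a normal mean relative to the MLE; for this
    # model the Fisher information is 1 and xi = 0 by symmetry, so the
    # asymptotic variance is K/J^2 and the ARE is J^2/K (Equation (4.1))
    grid = [lo + h * i for i in range(int((hi - lo) / h) + 1)]
    w = lambda x: 1 - math.exp(-npdf(x) / beta)
    J = h * sum(x * x * w(x) * npdf(x) for x in grid)
    K = h * sum(x * x * w(x) ** 2 * npdf(x) for x in grid)
    return J * J / K

eff_small = are_mewde(0.001)  # nearly fully efficient
eff_large = are_mewde(1.0)    # a noticeable efficiency loss
```

By the Cauchy-Schwarz inequality the resulting ARE can never exceed 1, and it decreases as β grows.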
Taking f_θ to be the density function of N(μ, σ^2) with σ known, using Theorem 3.1 and Equation (4.1) one can compute and compare the theoretical asymptotic relative efficiencies (AREs) of both the MEWDE(β) and the MDPDE(α) with respect to the MLE (see Table 1).

Table 1: AREs of the MDPDE and the MEWDE of μ for N(μ, 1).

Tuning par. (k)   ARE(MDPDE(k))   ARE(MEWDE(k))
0.001             1.000           0.996
0.004             1.000           0.987
0.016             1.000           0.955
0.062             0.995           0.867
0.250             0.941           0.741
1.000             0.650           0.676
4.000             0.216           0.656

As both α and β move away from zero, the efficiency of the MDPDE(α) decreases slowly for a brief initial period, but then drops much more rapidly than that of the MEWDE(β), as is seen in Table 1.

For a simulation-based comparison of the DPD and EWD classes, we follow the scheme outlined in Section 4.2, where the true distribution is N(0, 1) and the contaminating distribution is normal with mean μ_c. We have carried out simulation studies for μ_c = 3 and 5, and estimated the mean parameter under the N(μ, 1) model. Our findings are presented in Table 2. The first column presents the (α, β) combinations used in this example, and the second column indicates how close the corresponding MSEs are.

Table 2: FSREs of the MDPDE (denoted D(α)) and the MEWDE (denoted E(β)) with respect to the MLE, for the MLE, the pairs D(0.05)/E(0.001), D(0.1)/E(0.004), D(0.43)/E(0.063), D(0.74)/E(0.25) and D(0.98)/E(4), and the minimum L_2 distance estimator, under several contamination proportions ε at μ_c = 3 and μ_c = 5. Figures in bold denote the best FSRE in that contamination scheme.

The following important observations can be made from the figures of Table 2.

1. For uncontaminated data, the MLE is the most efficient estimator, as it should be.

2. Under even a slight contamination there is a severe degradation in the performance of the MLE, and the other two estimators quickly overtake it.

3. Generally, as the proportion of contamination increases, larger tuning parameters give better performance (on account of their stronger downweighting). However, this improvement is not absolute. With increasing tuning parameter, the performance of the estimators generally reaches a peak at some moderate value, and thereafter drops again.

4. If the contaminating distribution is far separated from the target distribution, smaller values of the tuning parameters are sufficient to provide good outlier stability.

5. However, the most important observation for us is that in all the pairs considered here having comparable MSEs under pure data, the MEWDE beats the MDPDE, sometimes quite soundly, under contaminated scenarios.
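A miniature version of the simulation scheme above can be sketched as follows. This is our own reduced-scale illustration (smaller sample size and replication count, a fixed seed, and β = 0.1 chosen for convenience); it compares empirical MSEs of the sample mean and the MEWDE(β) under 10% contamination at μ_c = 5.

```python
import math, random

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mewde_mean(data, beta, lo=-2.0, hi=2.0):
    # bisection on the MEWDE(beta) estimating equation for a normal mean
    # (sigma = 1 known; the integral term vanishes by symmetry)
    score = lambda t: sum((x - t) * (1 - math.exp(-npdf(x, t) / beta))
                          for x in data)
    a, b = lo, hi
    for _ in range(40):
        m = 0.5 * (a + b)
        if score(a) * score(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

random.seed(1)
n, reps, eps, mu_c, beta = 40, 100, 0.1, 5.0, 0.1
mse_mle = mse_ewd = 0.0
for _ in range(reps):
    sample = [random.gauss(mu_c if random.random() < eps else 0.0, 1.0)
              for _ in range(n)]
    mse_mle += (sum(sample) / n) ** 2 / reps          # target mean is 0
    mse_ewd += mewde_mean(sample, beta) ** 2 / reps
fsre = mse_mle / mse_ewd  # > 1: the MEWDE is more stable under contamination
```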
Figure 4: MSE of the MEWDE(β) of μ̂ under the N(μ, 1) model for pure N(0, 1) data.

In Figures 4 and 5 we graphically present the MSEs of different members of the MEWDE(β) class for the indicated pure normal data and contaminated normal data situations over a sequence of sample sizes. Figure 4 clearly shows the hierarchical relation between increasing β and increasing MSE. In Figure 5 it may be seen that the optimal MSE is attained at an intermediate value of β. If we brought the contaminating mean closer, or pushed up the contaminating proportion, a higher value of β would be required for the optimal solution.

We compare the robustness of the competing MDPDE(α) and MEWDE(β) classes in the context of estimating σ when data come from a contaminated normal distribution given by (1 - ε)N(0,
1) + εN(0, σ_c^2), where ε is the contamination proportion and σ_c^2 is the variance of the contaminating distribution; the model is the N(0, σ^2) model and the target parameter value is σ = 1. The observations from the results reported in Table 3 are very similar to those for Table 2. Once again, we note that within each (efficiency-wise) equivalent pair, the MEWDE beats the MDPDE in every single case.

We compare the robustness of the competing MDPDE(α) and MEWDE(β) classes in the context of estimating the mean parameter λ of the exponential model, when data come from a contaminated distribution given by (1 - ε)E(1) + εE(λ_c), where ε is the contamination proportion and E(λ) denotes an exponential distribution with mean λ. Here λ_c is the mean of the contaminating exponential distribution and the target parameter value is λ = 1. Again the observations
Figure 5: MSE of the MEWDE(β) for data from a contaminated normal distribution with contaminating mean 5.

Table 3: FSREs of the MDPDE (denoted D(α)) and the MEWDE (denoted E(β)) with respect to the MLE for the normal scale model, for the MLE, the pairs D(0.098)/E(0.001), D(0.177)/E(0.004), D(0.551)/E(0.063), D(0.884)/E(0.5) and D(0.983)/E(4), and the minimum L_2 distance estimator, under several contamination proportions ε at σ_c = 3 and σ_c = 5. Figures in bold denote the best FSRE in that contamination scheme.
Table 4: FSREs of the MDPDE (denoted D(α)) and the MEWDE (denoted E(β)) with respect to the MLE for the exponential model, for the MLE, the pairs D(0.153)/E(0.004), D(0.440)/E(0.063), D(0.844)/E(1.000) and D(0.989)/E(16.00), and the minimum L_2 distance estimator, under several contamination proportions ε at λ_c = 3 and λ_c = 5. Figures in bold denote the best FSRE in that contamination scheme.

are similar to those of Tables 2 and 3. Once again the MEWDE is equivalent to, or better than, the MDPDE in every single case.

Data on Shoshoni rectangles presented and analyzed by Hettmansperger and McKean (2010) are studied here. Data on twenty width-to-length ratios of beaded rectangles found in baskets used by the Shoshonis are given in Table 5.

Table 5: Width to Length Ratios of Rectangles
0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628
0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933

First, we examine the histogram of the observations (see Figure 6); it is seen that 3 observations (colored in red) are well-separated from the 'main body' made up of the remaining observations. We note from Figure 7 that the Q-Q plot generated by all twenty observations (suitably centered and scaled) gives us more evidence to claim that the 3 largest observations are (possibly) outliers. On the other hand, by studying the Q-Q plot generated from the 'outlier deleted' dataset (again suitably centered and scaled), we are motivated to model the dataset using a normal distribution with unknown mean μ and standard deviation σ (i.e., θ = (μ, σ)^T).

Based on the kernel based estimate of the density, we observe that there is a mildly bimodal structure present in the main body of the dataset. However, the Shapiro-Wilk test (Shapiro and Wilk (1965)), applied to the outlier deleted data, fails to reject the null hypothesis that the outlier-deleted data are generated by a normal distribution. Denoting the full data maximum likelihood estimates by ML and the outlier deleted ones by ML+D, we find under the normal model that the outliers have a moderate effect on the estimate of the mean, but a substantial effect on that of the scale parameter.
However, all the different tuning parameters for the DPD and EWD used in this case produce stable estimators and outlier-resistant fits to the full data. As the outliers are quite distant from the majority of the data, small values of the tuning parameters appear to be sufficient in either case.

For data generally well modeled by the Poisson distribution, we compare the performance of the MEWDE and MDPDE on data on fruit flies (see Woodruff et al. (1984)). In this experiment, male flies were sprayed with a certain level of a chemical to be screened, and then made to mate with unexposed females. The response, for each father fly, was the number of daughter flies having a recessive lethal
Figure 6: Histogram of values presented in Table 5 with (suspect) outliers colored red.

mutation in the X-chromosome. The frequencies of these responses (presented in the first row of Table 6) are modeled as Poisson variables, and the estimates of the Poisson mean parameter λ (as well as the estimated frequencies), obtained using several members of the DPD and EWD families, the MLE and the MLE+D (outlier-deleted MLE), are presented in Table 6. The single extreme value at 91 is treated as the obvious outlier. Both sets of estimators have comparable (satisfactory) performance.

It is clear that, in estimation using the EWD, small values of β provide greater model efficiency, while large values of β provide greater outlier stability and protection against small model violations. Given any real data set, we must choose an 'optimal', data-based tuning parameter β so that the procedure strikes the balance necessary for the data set in question. Here we follow the approach of Warwick and Jones (2005) to derive the optimal estimate of the tuning parameter. This approach constructs an empirical estimate of the mean squared error as a function of the tuning parameter (and a pilot estimator). The empirical estimate of the mean squared error, as a function of the tuning parameter β and a pilot estimator θ_P, is given by

MSE_β(θ_P) = (θ̂_β − θ_P)^T (θ̂_β − θ_P) + n^{-1} tr( J_β^{-1}(θ̂_β) K_β(θ̂_β) J_β^{-1}(θ̂_β) ),

where J and K are the terms defined in Equation (4.1), θ̂_β is the MEWDE(β) and tr(·) denotes the trace of a matrix. By minimizing this objective function over the tuning parameter, we get a data-driven 'optimal' estimate of the tuning
Figure 7: (L): Q-Q plot for complete data. (R): Q-Q plot for outlier deleted data.
Figure 8: Density estimates for Shoshoni rectangles using MEWDEs (curves: kernel, MLE, EWD(0.07), EWD(8), L2, MLE+D).
Figure 9: Density estimates for Shoshoni rectangles using MDPDEs (curves: kernel, MLE, DPD(0.3), DPD(0.6), L2, MLE+D).

Table 6: Fitted frequencies for the fruit fly data using the MLE, the MDPDE (D(α)) and the MEWDE (E(β)).

Count       0        1       2       3       4       ≥5        λ̂
Observed    23       7       3       0       0       1 (91)    –
MLE         1.596    4.882   7.467   7.613   5.822   6.620     3.06
D(·)        22.981   9.002   1.763   0.230   0.023   0.002     0.39
D(·)        23.375   8.759   1.641   0.205   0.019   0.002     0.37
D(·)        23.549   8.649   1.588   0.194   0.018   0.001     0.37
E(·)        22.894   9.055   1.791   0.236   0.023   0.002     0.40
E(·)        22.614   9.222   1.880   0.256   0.026   0.002     0.41
E(·)        23.712   8.545   1.540   0.185   0.017   0.001     0.36
L2          23.609   8.611   1.570   0.191   0.017   0.001     0.36
MLE+D       22.93    9.03    1.78    0.23    0.02    0.00      0.39

parameter. Warwick and Jones (2005) propose the minimum L2 estimator as the pilot estimator in the above calculation, as it has strong robustness properties.

For the data on Shoshoni rectangles presented in Section 4.6, we implement the tuning parameter selection algorithm detailed above, using the normal distribution with unknown mean and standard deviation parameters to model the data. The optimal tuning parameter β_OPT, the associated estimated parameters µ̂ and σ̂, and the corresponding (sample size-scaled) asymptotic mean-squared error are obtained in this way. A similar exercise is carried out using the Poisson distribution to model the data on Drosophila fruit flies presented in Section 4.7, yielding the optimal tuning parameter β_OPT and the corresponding estimated mean parameter λ̂. See Basak et al. (2020) for some other approaches to tuning parameter selection.
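The Warwick-Jones recipe can be sketched numerically. The snippet below is a simplified illustration for a normal location model with known scale: it is not the paper's implementation, the data are invented, and the sample median stands in for the minimum L2 pilot estimator used in the paper.

```python
import numpy as np

# Illustrative sample: a tight main body plus one gross outlier (invented data).
y = np.array([-0.42, -0.21, -0.05, 0.02, 0.08, 0.13, 0.19, 0.31, 0.44, 8.0])
n, sigma = len(y), 1.0                      # scale treated as known in this sketch
grid = np.linspace(-8.0, 12.0, 4001)        # quadrature grid for model integrals
dx = grid[1] - grid[0]

def f(t, mu):                               # N(mu, sigma^2) density
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mewde_mean(beta):
    """MEWDE(beta) of the mean: fixed point of sum (y_i - mu) w_i = 0, where
    w_i = 1 - exp(-f(y_i; mu)/beta); the integral term vanishes by symmetry."""
    mu = np.median(y)
    for _ in range(200):
        w = 1.0 - np.exp(-f(y, mu) / beta)
        mu = np.sum(w * y) / np.sum(w)
    return mu

def mse_hat(beta, mu_pilot):
    """Empirical MSE: squared bias against the pilot plus the model-based
    variance term n^{-1} J^{-2} K for this scalar parameter."""
    mu = mewde_mean(beta)
    fv = f(grid, mu)
    w = 1.0 - np.exp(-fv / beta)
    u2 = ((grid - mu) / sigma**2) ** 2      # squared likelihood score
    J = np.sum(u2 * w * fv) * dx
    K = np.sum(u2 * w**2 * fv) * dx         # xi = 0 by symmetry of the model
    return (mu - mu_pilot) ** 2 + K / (n * J**2)

pilot = np.median(y)                        # simple robust pilot (stand-in for min-L2)
betas = np.linspace(0.01, 1.0, 100)
beta_opt = betas[np.argmin([mse_hat(b, pilot) for b in betas])]
```

Minimising over a grid of β values trades efficiency (small β) against robustness (large β), which is exactly what the empirical MSE criterion is designed to balance.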
In this section, going beyond the i.i.d. situation, we extend our method to data which are independent and share common parameters in their distributions but are not identically distributed. Ghosh and Basu (2013) refer to such data as independent and non-homogeneous observations, and we adhere to that nomenclature. Exploiting the robustness of our minimum distance procedure, we develop a general estimation method for handling such data. We establish the asymptotic properties of the proposed estimator, and illustrate the benefits of our method in the case of linear regression.

We assume that our observed data Y_1, ..., Y_n are independent. For i = 1, 2, ..., n we have Y_i ∼ g_i, where the g_i are possibly different densities with respect to some common dominating measure. We want to model g_i by the family F_{i,θ} = {F_i(·;θ) | θ ∈ Ω} for each i = 1, 2, ..., n. An estimate of the Bregman divergence between the density corresponding to the i-th data point and the associated model density is given by d_B(ĝ_i(·), f_i(·;θ)). Since our aim is to reach some 'common' value of θ (if it exists) which can be used to model each g_i individually, it is intuitive to minimize the average divergence between the data points and the models. Consequently, we minimize

n^{-1} Σ_{i=1}^n d_B(ĝ_i(·), f_i(·;θ))

with respect to θ, where ĝ_i is a non-parametric density estimate of g_i. As in the approach suggested by Ghosh and Basu (2013), in the presence of only one data point Y_i from the density g_i, the best density estimate of g_i is taken to be the (degenerate) density which puts its entire mass on Y_i. Consequently, our objective function becomes H_n(θ) = n^{-1} Σ_{i=1}^n V_i(Y_i, θ), which can be written out as

n^{-1} Σ_{i=1}^n [ ∫ { f_i(y;θ) B'(f_i(y;θ)) − B(f_i(y;θ)) } dy − B'(f_i(Y_i;θ)) ].
(5.1)
In the case of the MEWDE(β), the B function is given by Equation (2.7). Taking partial derivatives of Equation (5.1) with respect to θ, we arrive at the estimating equation ∇_θ Σ_{i=1}^n V_i(Y_i, θ) = 0, which can be rewritten as

Σ_{i=1}^n [ u_i(Y_i) w(f_i(Y_i;θ)) − ∫ u_i(t) w(f_i(t;θ)) f_i(t;θ) dt ] = 0,   (5.2)

where u_i(x) = ∇_θ log f_i(x;θ) is the likelihood score function of the density f_i(x;θ) used to model the i-th data point, and w(t) = B''(t) × t. For the MEWDE(β), w(t) = 1 − exp(−t/β). We note that as β → 0, the corresponding objective function becomes

Σ_{i=1}^n [ −log f_i(Y_i;θ) ],

and the associated estimating equation becomes

Σ_{i=1}^n u_i(Y_i;θ) = 0.

Thus the objective function given by Equation (5.1) and the estimating equation given by Equation (5.2) are direct generalizations of the maximum likelihood objective and score equation for independent and non-homogeneous data.
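As a concrete illustration of Equation (5.2) outside the i.i.d. setting, consider independent counts Y_i ~ Poisson(θ t_i) with known exposures t_i and a common parameter θ; for this discrete model the integral in (5.2) becomes a sum. The sketch below (exposures, counts and tuning value are all invented for illustration) solves the estimating equation by bisection:

```python
import numpy as np

# Independent, non-homogeneous Poisson counts sharing a common parameter theta:
# Y_i ~ Poisson(theta * t_i), exposures t_i known. The last count (30) is a
# planted outlier; the clean counts are consistent with theta = 2.
t = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.0, 2.0])
y_obs = np.array([1, 2, 3, 4, 5, 6, 2, 30])
beta = 0.02                                   # EWD tuning parameter
ygrid = np.arange(101)                        # truncation of the infinite sum
logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, 101)))))

def pmf(y, lam):                              # Poisson pmf, stable via logs
    return np.exp(y * np.log(lam) - lam - logfact[y])

def w(f):                                     # EWD weight, w(t) = 1 - exp(-t/beta)
    return 1.0 - np.exp(-f / beta)

def psi(theta):
    """Left-hand side of estimating equation (5.2); u_i(y) = y/theta - t_i."""
    total = 0.0
    for ti, yi in zip(t, y_obs):
        lam = theta * ti
        fy = pmf(ygrid, lam)
        total += (yi / theta - ti) * w(pmf(yi, lam))        # data term
        total -= np.sum((ygrid / theta - ti) * w(fy) * fy)  # expectation term
    return total

# Bisection on a bracket around the main body of the data.
lo, hi = 0.5, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if psi(lo) * psi(mid) <= 0:
        hi = mid
    else:
        lo = mid
theta_hat = 0.5 * (lo + hi)
```

The exponentially small weight attached to the outlying count keeps the solution near the value supported by the clean observations, while the unweighted MLE Σy_i/Σt_i is pulled far away.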
Remark 2.
In terms of statistical functionals, the minimum Bregman divergence based functional T_B(G_1, ..., G_n) for non-homogeneous observations is given by the relation

T_B(G_1, ..., G_n) = argmin_{θ ∈ Ω} n^{-1} Σ_{i=1}^n d_B(g_i(·), f_i(·;θ)).

Since we have already established that the Bregman divergence is a genuine divergence (in the sense that it is non-negative and attains its minimum if and only if the two arguments are identical), it follows that the functional T_B(G_1, ..., G_n) is Fisher consistent under the assumption of the identifiability of the model.
We derive the asymptotic distribution of the minimum exponentially weighted divergence estimator θ̂_n defined by the relation

θ̂_n = argmin_{θ ∈ Ω} H_n(θ),

provided such a minimum exists, where H_n(θ) is as defined in Equation (5.1). We work under the framework discussed in Section 5.1. We also assume that there exists a best fitting parameter θ^g which is independent of the index i of the different densities. This assumption is satisfied, in particular, if all the true densities g_i belong to the model family, so that g_i = f_i(·;θ) for some common θ; in that case the best fitting parameter is that true parameter θ. The minimum Bregman divergence estimator θ̂_n is obtained as a solution of the estimating equation given by Equation (5.2); as per our definition, this equation is satisfied by the minimizer of H_n(θ) as defined in Equation (5.1). We now define, for i = 1, 2, ...,

H^(i)(θ) = ∫ { f_i(y;θ) B'(f_i(y;θ)) − B(f_i(y;θ)) } dy − ∫ B'(f_i(y;θ)) g_i(y) dy,   (5.3)

so that at the best fitting parameter (i.e., our target parameter value θ^g) we have ∇H^(i)(θ^g) = 0 for i = 1, 2, .... We also define, for each i = 1, 2, ..., the s × s matrix J^(i) whose (k,l)-th entry is given by

J^(i)_kl = E_{g_i}[ ∇_kl V_i(Y;θ) ],   (5.4)

where ∇_kl V_i(Y;θ) = ∂²V_i(Y;θ)/∂θ_k∂θ_l and E_{g_i}(·) denotes expectation under the distribution specified by g_i. We also define

Ψ_n = n^{-1} Σ_{i=1}^n J^(i),   (5.5)

Ω_n = n^{-1} Σ_{i=1}^n Var_{g_i}[ ∇V_i(Y;θ) ].
(5.6)
The matrix defined in Equation (5.5) has the expression

J^(i) = ∫ u_i(y;θ^g) u_i^T(y;θ^g) w[f_i(y;θ^g)] f_i(y;θ^g) dy
      + ∫ [ −∇u_i(y;θ^g) − u_i(y;θ^g) u_i^T(y;θ^g) h_i(y) ] ( g_i(y) − f_i(y;θ^g) ) w[f_i(y;θ^g)] dy,   (5.7)

where w(t) = B''(t) × t, w'(t) = B''(t) + B'''(t) · t, and

h_i(t) = w'(f_i(t;θ)) f_i(t;θ) / w(f_i(t;θ)).

Similarly, the matrix defined in Equation (5.6) has the expression

Ω_n = n^{-1} Σ_{i=1}^n [ ∫ u_i(y;θ^g) u_i^T(y;θ^g) w²[f_i(y;θ^g)] dG_i(y) − ξ_i ξ_i^T ],   (5.8)

where ξ_i = ∫ u_i(y;θ^g) w[f_i(y;θ^g)] dG_i(y). As in the case of i.i.d. data, we make the following assumptions to establish the asymptotic properties of the minimum EWD estimators. These are analogous to the assumptions given in Ghosh and Basu (2013); appropriate generalizations have been made to cover the entire Bregman divergence family.

(B1) The support χ = {y | f_i(y;θ) > 0} is independent of i and θ for all i; the true distributions G_i are also supported on χ for all i.

(B2) There is an open subset ω of the parameter space Ω, containing the best fitting parameter θ^g, such that for almost all y ∈ χ, all θ ∈ ω, and all i = 1, 2, ..., the density f_i(y;θ) is thrice differentiable with respect to θ and the third partial derivatives are continuous with respect to θ.
(B3) For i = 1, 2, ..., the integrals ∫ [f_i(y;θ) B'(f_i(y;θ)) − B(f_i(y;θ))] dy and ∫ [B'(f_i(y;θ)) g_i(y)] dy can be differentiated thrice with respect to θ, and the derivatives can be taken under the integral sign.

(B4) For each i = 1, 2, ..., the matrices J^(i) are positive definite, and

λ₀ = inf_n [ minimum eigenvalue of Ψ_n ] > 0.

(B5) There exists a function M^(i)_jkl(y) such that |∇_jkl V_i(y;θ)| ≤ M^(i)_jkl(y) for all θ ∈ ω and all i = 1, 2, ..., where

n^{-1} Σ_{i=1}^n E_{g_i}[ M^(i)_jkl(Y) ] = O(1) for all j, k, l.

(B6) For all j, k, we have

lim_{N→∞} sup_n { n^{-1} Σ_{i=1}^n E_{g_i}[ |∇_j V_i(Y;θ)| I( |∇_j V_i(Y;θ)| > N ) ] } = 0,

lim_{N→∞} sup_n { n^{-1} Σ_{i=1}^n E_{g_i}[ |∇_jk V_i(Y;θ) − E_{g_i}(∇_jk V_i(Y;θ))| × I( |∇_jk V_i(Y;θ) − E_{g_i}(∇_jk V_i(Y;θ))| > N ) ] } = 0.

(B7) For all ε > 0, we have

lim_{n→∞} { n^{-1} Σ_{i=1}^n E_{g_i}[ ‖Ω_n^{-1/2} ∇V_i(Y;θ)‖² I( ‖Ω_n^{-1/2} ∇V_i(Y;θ)‖ > ε√n ) ] } = 0.

Theorem 5.1.
If assumptions (B1)-(B7) hold, the following results are true.

1. There exists a consistent sequence θ̂_n of roots of the minimum Bregman divergence estimating equation given by Equation (5.2).

2. The asymptotic distribution of Ω_n^{-1/2} Ψ_n [ √n (θ̂_n − θ^g) ] is s-dimensional normal with mean vector 0 and covariance matrix I_s, the s-dimensional identity matrix.

Proof. The proof of this theorem follows exactly along the lines of the proof presented in Appendix 1 of Ghosh and Basu (2013).
Remark 3.
On assumptions (B1) - (B7).
The assumptions (B1)-(B5) are simple generalizations of the assumptions (A1)-(A5) presented in Section 3.2 of this manuscript. The assumptions (B6) and (B7) are similar in spirit to the corresponding assumptions required for maximum likelihood estimation under a similar independent non-homogeneous set-up, as discussed in Ibragimov and Has'Minskii (1981). These assumptions hold automatically for minimum Bregman divergence estimators in the i.i.d. case (see Remark 4 below). In subsequent sections we will see that these assumptions hold, for example, for normal linear regression models under some mild conditions on the regressor variables.
Remark 4.
For homogeneous data: a special case.
Note that, setting f_i = f for all i, we recover the asymptotic properties of the minimum Bregman divergence estimator for the i.i.d. case as given in Section 3.2. If f_i = f for i = 1, 2, ..., then J^(i) = J and ξ_i = ξ for all i; thus Ψ_n = J and Ω_n = K, where J, K and ξ are as defined in Section 3.2. In this case assumptions (B1)-(B5) are exactly the same as the assumptions (A1)-(A5) given in Section 3.2, while assumptions (B6) and (B7) are automatically satisfied by the dominated convergence theorem. Thus, Theorem 3.1, which establishes the consistency and asymptotic normality of the minimum Bregman divergence based estimator θ̂_n, with √n(θ̂_n − θ^g) having the asymptotic covariance matrix Ψ_n^{-1} Ω_n Ψ_n^{-1} = J^{-1} K J^{-1}, emerges as a special case of Theorem 5.1.
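A quick Monte Carlo check of this homogeneous special case is easy to run. The sketch below (assumed set-up: normal mean with σ = 1 known and a small β, so the MEWDE should be nearly fully efficient) compares n times the sampling variance of the estimator with the Cramer-Rao value 1:

```python
import numpy as np

# Monte Carlo sanity check of Theorem 5.1 in the homogeneous (i.i.d.) case:
# for the normal mean with small beta, the MEWDE is close to the MLE, so
# n * var(mu_hat) should be close to sigma^2 = 1.
rng = np.random.default_rng(12345)
n, reps, beta, sigma = 200, 500, 0.01, 1.0

def f(t, mu):                      # N(mu, 1) density
    return np.exp(-0.5 * (t - mu) ** 2) / np.sqrt(2.0 * np.pi)

def mewde_mean(y):
    """Fixed-point iteration on sum (y_i - mu) w(f(y_i; mu)) = 0."""
    mu = np.median(y)
    for _ in range(100):
        w = 1.0 - np.exp(-f(y, mu) / beta)
        mu = np.sum(w * y) / np.sum(w)
    return mu

estimates = np.array([mewde_mean(rng.normal(0.0, sigma, n)) for _ in range(reps)])
print(n * estimates.var())         # should be near 1 for small beta
```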
In this section we show that the theory proposed in Section 5.2 is immediately applicable to linear regression under some mild conditions on the regressor variables; we adopt the conditional approach to inference, given fixed values of the explanatory variables, in the linear regression set-up with normal errors. Consider the linear regression model

y_i = x_i^T γ + ε_i,  i = 1, 2, ..., n,

where the errors ε_i are i.i.d. normal variables with mean zero and variance σ², x_i^T = (x_{i1}, x_{i2}, ..., x_{is}) is the vector of independent variables corresponding to the i-th observation, and γ = (γ_1, ..., γ_s)^T represents the regression coefficients. We assume that the x_i are fixed. Then y_i ∼ N(x_i^T γ, σ²), and hence the y_i are independent but not identically distributed. Thus the y_i satisfy the set-up of Sections 5.1 and 5.2, and the minimum Bregman divergence estimator of the parameter θ = (γ^T, σ)^T can be obtained by minimizing the expression in Equation (5.1) with

f_i(y;θ) = (1/σ) φ( (y − x_i^T γ)/σ ),

where φ(·) is the pdf of a standard normal random variable. Following the notation of Equation (5.2), the score equation is

Σ_{i=1}^n [ u_i(Y_i;θ) w[f_i(Y_i;θ)] − ∫ u_i(t;θ) w[f_i(t;θ)] f_i(t;θ) dt ] = 0,

where w(t) = B''(t) × t and the likelihood score function is given by

u_i(Y_i;θ) = [ ((Y_i − x_i^T γ)/σ²) x_i^T ; ((Y_i − x_i^T γ)² − σ²)/σ³ ]^T.

Thus we get the set of s + 1 estimating equations

Σ_{i=1}^n x_ij (y_i − x_i^T γ) w(f_i(y_i;θ)) = 0, for all j = 1, ...
, s, together with

n^{-1} Σ_{i=1}^n [ ( ((y_i − x_i^T γ)² − σ²)/σ³ ) w(f_i(y_i;θ)) − ∫ ( ((y − x_i^T γ)² − σ²)/σ³ ) w(f_i(y;θ)) f_i(y;θ) dy ] = 0.

These s + 1 estimating equations can then be solved numerically to obtain the estimates of θ.

Remark 5.
On the asymptotic behaviour of the estimator θ̂ = (γ̂^T, σ̂)^T: For simplicity we assume that the true data generating density g_i also belongs to the model family of distributions, i.e., g_i(·) = f_i(·;θ) for all i = 1, 2, .... Then we can derive simplified forms of the matrices Ψ_n and Ω_n. We had previously defined Ψ_n = n^{-1} Σ_{i=1}^n J^(i) and Ω_n = n^{-1} Σ_{i=1}^n K^(i). Using Equations (5.7) and (5.8), and the fact that g_i(·) = f_i(·;θ) for all i, we have

J^(i) = ∫ u_i(y;θ) u_i^T(y;θ) w[f_i(y;θ)] f_i(y;θ) dy,

K^(i) = ∫ u_i(y;θ) u_i^T(y;θ) w²[f_i(y;θ)] f_i(y;θ) dy − ξ^(i) ξ^(i)T,

ξ^(i) = ∫ u_i(y;θ) w[f_i(y;θ)] f_i(y;θ) dy.

As in the previous sections, w(t) = B''(t) × t; in the case of the EWD(β), w(t) = 1 − exp(−t/β). It can be shown that θ̂_n is a consistent estimator of θ. Further, the asymptotic distribution of √n Ω_n^{-1/2} Ψ_n (θ̂_n − θ) is multivariate normal with mean vector zero and covariance matrix I_{s+1}; this follows from Theorem 5.1. In the next section, we will see how this method works in the context of some real life data sets.
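The γ-equations above are weighted least squares normal equations, which suggests an iteratively reweighted least squares scheme. A minimal sketch (our own construction, with σ treated as known and an invented dataset containing one gross outlier):

```python
import numpy as np

# Minimum EWD linear regression via iteratively reweighted least squares:
# the gamma-equations are weighted normal equations with weights
# w_i = 1 - exp(-f_i(y_i; theta)/beta). The error s.d. is treated as known
# here to keep the sketch short; the data are illustrative.
x = np.arange(10) / 10.0
X = np.column_stack([np.ones_like(x), x])     # design: intercept + slope
y = 1.0 + 2.0 * x
y[-1] += 10.0                                 # gross outlier at the last point
sigma, beta = 0.5, 0.05

def density(res):                             # N(0, sigma^2) density of residuals
    return np.exp(-0.5 * (res / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

gamma = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS start
for _ in range(200):
    w = 1.0 - np.exp(-density(y - X @ gamma) / beta)
    WX = X * w[:, None]
    gamma = np.linalg.solve(X.T @ WX, WX.T @ y)

ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS slope:", ols[1], " minimum-EWD slope:", gamma[1])
```

The outlying point receives an exponentially small weight, so the robust fit recovers the line supported by the remaining points while the OLS slope is pulled far upward.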
As an application of the robust regression method developed in Section 5.3, we consider modeling age-standardized national firearm-related homicide rates in 23 Western countries as a function of per-capita gross domestic product as of 2017. Information on GDP was obtained from The CIA World Factbook, and data on firearm-related homicide rates were obtained from Roser and Ritchie (2020). Figure 10 is a scatter-plot of this data set, with per-capita gross domestic product on the X-axis and firearm-related homicide rate on the Y-axis. The United States of America has an abnormally high firearm-related homicide rate in relation to its per-capita GDP, and this single outlier forces the least squares regression line to have a positive slope, which clearly contradicts the general configuration of points. In comparison, the two MEWDE fits show a clear reversal in slope, and give more satisfactory descriptions of the rest of the data, sacrificing the large outlier. Table 7 shows how the estimated coefficients vary as we change the tuning parameter β. We observe that even for the very small value β = 0.002, our MEWDE estimator almost mimics the MLE+D estimator, implying that the MEWDE fits the data well by automatically downweighting the outlier, even for very small values of β.

Table 7: Estimated regression parameters for homicide data.
Estimates     MLE    E(0.002)  E(0.02)  E(0.25)  E(1)    MLE+D
Intercept    −.293   0.356     0.356    0.404    0.359   0.
GDP (×10⁻)     .     −.        −.       −.       −.      −.
Error s.d.    .959   0.110     0.111    0.131    0.106   0.

Figure 10: Modeling firearm-related homicide rates in Western countries as a function of per-capita gross domestic product: fits with the ML and minimum EWD estimators (fitted lines: MLE, EWD(1/512), EWD(1/4)).
We have analyzed two other datasets, the Belgian telephone data and the Hertzsprung Russell star cluster data, bothavailable in Rousseeuw and Leroy (1987), using the minimum divergence procedures based on the EWD and DPD. Thedetails are provided in Appendix B.0.2.
We consider fitting a multiple linear regression model to a dataset concerning alcohol solubility in water (Maronna et al., 2019). The dataset gives, for 44 aliphatic alcohols, the logarithm of their solubility together with three physicochemical characteristics (namely, solvent accessible surface-bounded molecular volume (SAG), mass and volume). The interest is in predicting the solubility. Following the authors' suggestion of fitting an MM regression-based model to the data, we observe that four data points (roughly a tenth of the data set) are assigned much smaller 'robustness weights' compared to the remaining 40 data points. Treating these four observations as outliers, we obtain the outlier-deleted maximum likelihood estimates (denoted MLE+D) of the regression coefficients and error standard deviation. We also compute the robust LMS estimate; to estimate the error s.d. σ, we compute σ̂ = median|r_i − median(r_i)|/0.6745. Finally, we compute minimum EWD(β) regression parameter estimates for various values of β. Our findings are presented in Table 8.

Table 8: Estimated regression parameters for alcohol solubility data (Maronna et al. (2019)).
Estimates     MLE    LMS    E(0.1)  E(0.4)  E(0.7)  MLE+D
Intercept     .777   3.617  5.883   3.974   5.444   6.
SAG           .014   0.177  0.110   0.163   0.129   0.
Volume       −.     −.      −.      −.      −.      −.
Mass          .027   0.248  0.172   0.235   0.206   0.
Error s.d.    .504   0.405  0.372   0.145   0.221   0.

Unlike simple linear regression, where the fit can be plotted and its suitability visually examined, visual inspection is not possible for the fits of this multiple linear regression model, so the coefficients of Table 8 alone are not sufficient to convey how good, stable and outlier-resistant the fits are. We therefore look at the residuals of each of these fits and try to determine how well they fare in separating out the outliers. When there is a small number of outliers in the data, a robust and resistant fit is expected to model the majority of the data adequately and sacrifice the stray outliers, which then stand out in terms of fitted residuals. Non-robust fits, on the other hand, are highly affected by the outliers, whose residuals may then no longer stand out but get masked with the other residuals. In Figure 11 we present the boxplots of the residuals of the ML (LS) fit, the ML+D (outlier-deleted LS) fit, the LMS fit and the minimum EWD(0.66) fit; the optimal tuning parameter is found to be β_OPT = 0.66 (see Section 5.4 for more details).

It may be seen that the LMS and minimum EWD(0.66) procedures identify two and three outliers, respectively, by the basic boxplot method. For the ML method, the residuals of the outliers are masked with the good data, while for the ML+D method there are no outliers. We also present the residual plots (against fitted values) of these four fits, as well as the corresponding kernel density estimates, in Appendix B.0.3 for further substantiation of this description.

As an extension of Section 4.8, we refer to Ghosh and Basu (2013), where the problem of tuning parameter selection in the context of independent and non-homogeneous data was discussed.
The generalizations required in the case of minimum Bregman divergence estimation are relatively straightforward, so we do not dwell on them here. For the data on firearm-related homicide and GDP presented in Section 5.3.1, we obtain the optimal tuning parameter β_OPT and the corresponding estimated regression parameters (intercept, GDP, error standard deviation). Similarly, for the data on alcohol solubility presented in Section 5.3.3, we obtain the optimal tuning parameter β_OPT = 0.66 and the corresponding estimated regression parameters (intercept, SAG, Volume, Mass, error standard deviation).
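The boxplot-based outlier flagging used with the residual plots is just the standard 1.5 × IQR rule applied to fitted residuals; a small helper with invented residuals (the real alcohol-data residuals are not reproduced here):

```python
import numpy as np

# The residual boxplots flag outliers by the standard 1.5-IQR rule;
# the residuals below are illustrative, with the last three planted as outliers.
def boxplot_outliers(res):
    q1, q3 = np.percentile(res, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.where((res < lo) | (res > hi))[0]

residuals = np.array([-0.21, -0.15, -0.08, -0.03, 0.0, 0.02, 0.05, 0.09,
                      0.14, 0.19, 1.9, -2.3, 2.6])
print(boxplot_outliers(residuals))   # -> [10 11 12]
```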
Figure 11: Residual boxplots of the ML, LMS, ML+D and minimum EWD(0.66) fits for the alcohol solubility data (Maronna et al., 2019).

In the following subsections, we make use of the EWD in constructing robust tests of hypotheses based on the Bregman divergence and the corresponding minimum divergence estimators. Our work may be viewed as a generalization of the work presented in Basu et al. (2013) and Basu et al. (2018). We establish the asymptotic null distribution of the proposed test statistic and apply the theory developed to a real-life data set. As in the previous sections, our focus will remain on the exponentially weighted divergence.
We begin with {P_θ : θ ∈ Ω}, an identifiable parametric family of probability measures on a measurable space (χ, A), with an open parameter space Ω ⊂ R^p, p ≥ 1. The measures P_θ are described by densities f_θ = dP_θ/dµ, absolutely continuous with respect to a dominating σ-finite measure µ on χ. We have a sample X_1, X_2, ..., X_n of size n from a density belonging to the family F = {f_θ : θ ∈ Ω}. We will assume that the support of the distribution is independent of θ. Our aim is to test a general null hypothesis of the form

H_0: θ ∈ Ω_0 against H_1: θ ∉ Ω_0.   (6.1)

As in many practical hypothesis testing problems, we consider the set-up where the restricted parameter space specified by H_0 can be rewritten as a set of r < p restrictions of the form

m(θ) = 0_r   (6.2)

on Ω, where m: R^p → R^r is a vector valued function such that the p × r matrix

M(θ) = ∂m^T(θ)/∂θ

exists and is continuous in θ, and rank(M(θ)) = r. Given a sample, our approach to the hypothesis testing problem described in Equation (6.1) is to first obtain θ̂_{B_1}, the unrestricted minimum Bregman divergence estimator for a given B_1 function, and then obtain the restricted
minimum Bregman divergence estimator θ̃_{B_1}, subject to the constraints specified by Equation (6.2), for the same B_1 function. Finally, we consider the family of Bregman divergence test statistics (BDTS)

T_{B_2}(θ̂_{B_1}, θ̃_{B_1}) = 2n × D_{B_2}(f_{θ̂_{B_1}}, f_{θ̃_{B_1}}),   (6.3)

where D_{B_2}(g, f) is the Bregman divergence between two densities g and f, defined in Equation (2.1) with B_2 as the B function. The asymptotic distribution of the test statistic can be worked out for the case where the functions B_1 and B_2 are distinct, and, for maximum flexibility of the method, we establish the asymptotic results of our testing procedure for this general case. In practice, however, it is not easy to determine the benefits of having two different functions in these two roles, and a single, suitably chosen function will generally work well in most cases. We will consider the functions B_1 and B_2 to have the same parametric form, corresponding to exponentially weighted divergences, differing, if at all, only in the values of their tuning parameters.

In Section 3.2, we established the asymptotic behaviour of the unrestricted minimum Bregman divergence estimator. In order to establish the asymptotic behaviour of the Bregman divergence test statistic, we first obtain the asymptotic distribution of the restricted minimum Bregman divergence estimator θ̃_{B_1}, and then work out the asymptotic properties of the family of test statistics given by Equation (6.3). In addition to assumptions (A1) to (A5) in Section 3.2, we make the assumption

(A6) For all θ ∈ ω, the second partial derivatives ∂²m_l(θ)/∂θ_j∂θ_k are bounded for all j, k and l, where m_l(·) is the l-th element of m(·).

We also assume that the true distribution belongs to the model and θ_0 ∈ Ω_0 is the true parameter. Under this set-up, the minimum Bregman divergence estimator θ̃_{B_1} obtained under the constraints m(θ) = 0_r has the following asymptotic properties (Theorem 6.1).

1. The restricted minimum Bregman divergence estimating equation has a consistent sequence of roots, i.e., θ̃_{B_1} →^P θ_0 as n → ∞.

2. The null distribution of √n(θ̃_{B_1} − θ_0) is a p-dimensional multivariate normal distribution with zero mean vector and p × p dispersion matrix Σ_{B_1}(θ_0). This matrix is defined as

Σ_{B_1}(θ_0) = P_{B_1}(θ_0) K_{B_1}(θ_0) P_{B_1}(θ_0),   (6.4)

where K_{B_1}(θ) is defined by Equation (3.5) with B_1 as the relevant B function. The matrix P_{B_1}(θ) is defined as

P_{B_1}(θ) = J_{B_1}^{-1}(θ) − Q_{B_1}(θ) M^T(θ) J_{B_1}^{-1}(θ),   (6.5)

where J_{B_1}(θ) is defined by Equation (3.6) with B_1 as the associated B function, and the matrix Q_{B_1}(θ) is defined as

Q_{B_1}(θ) = J_{B_1}^{-1}(θ) M(θ) [ M^T(θ) J_{B_1}^{-1}(θ) M(θ) ]^{-1}.   (6.6)

Proof.
The proof of this theorem follows exactly along the lines of the proof presented in the Appendix of Basu et al. (2018).

It is interesting to note that Theorem 6.1 is an extension of Theorem 3.1: the former allows for estimation in a restricted parameter space, while the latter does not. When the parameter space is unrestricted, M becomes a null matrix; consequently P_{B_1}(θ) = J_{B_1}^{-1}(θ), and the asymptotic dispersion matrix of the unrestricted minimum Bregman divergence estimator reduces to the form specified by Theorem 3.1.

First, we fix a function B_1 and denote by θ̂_{B_1} the unconstrained minimum Bregman divergence estimator of θ, and by θ̃_{B_1} the restricted estimator under the null hypothesis specified by Equation (6.1). Next, we consider another function B_2 (for simplicity, assumed to have the same functional form as B_1, differing only in the value(s) of the tuning parameter(s)) and construct the BDTS, as defined in Equation (6.3). In the following theorem, we present the asymptotic distribution of the family of BDTS.
Theorem 6.2.
We assume that conditions (A1)-(A5) of Theorem 3.1 and condition (A6) of Theorem 6.1 hold. Under the null hypothesis specified in Equation (6.1), the asymptotic distribution of T_{B_2}(θ̂_{B_1}, θ̃_{B_1}) defined in Equation (6.3) coincides with the distribution of the random variable

Σ_{i=1}^k λ_i(B_1, B_2, θ_0) Z_i²,

where Z_1, ..., Z_k are independent standard normal variables, λ_i(B_1, B_2, θ_0) for i = 1, ..., k are the nonzero eigenvalues of the matrix A_{B_2}(θ_0) B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0), and k is the rank of the matrix

B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0) A_{B_2}(θ_0) B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0).   (6.7)

The matrix A_{B_2}(θ) is defined element-wise by

A_{B_2}(θ) = ( a^{B_2}_{ij}(θ) )_{p×p} = ( ∫ B_2''(f_θ(x)) (∂f_θ(x)/∂θ_i) (∂f_θ(x)/∂θ_j) dx ),   (6.8)

and B_{B_1}(θ) is the matrix

J_{B_1}^{-1}(θ) M(θ) [ M^T(θ) J_{B_1}^{-1}(θ) M(θ) ]^{-1} M^T(θ) J_{B_1}^{-1}(θ).   (6.9)

Proof.
The proof of this theorem resembles the proof of Theorem 6 of Basu et al. (2018).
Remark 6.
We observe that the ranks of B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0) and B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0) A_{B_2}(θ_0) B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0) are equal. Further, B_{B_1}(θ_0) K_{B_1}(θ_0) B_{B_1}(θ_0) and M(θ_0) have the same rank r. So k = r, and there are exactly r nonzero eigenvalues.

Remark 7.
The critical region of the BDTS needs to be determined so that the test can be carried out. An easy way to approximate the required critical region is outlined here. From Theorem 6.2 it is clear that the k eigenvalues described are functions of θ. Under the null, they can be estimated in a consistent manner by plugging in θ̃_{B_1} in place of θ. Let these estimated eigenvalues be λ̂_1, ..., λ̂_k. Repeatedly generating k independent observations Z_1, ..., Z_k from the N(0,1) distribution, one can obtain empirical estimates of the quantiles of Σ_i λ̂_i Z_i² by replicating this procedure a sufficient number of times.

Remark 8.
An approximate form of the power function of the BDTS can be obtained by following the steps outlined in Theorem 7 of Basu et al. (2018).
We will now focus on a special case of the Bregman divergence test statistic: the exponentially weighted divergence test statistic (EWDTS). Under the $N(\mu, \sigma^2)$ model, consider the problem of testing
$$ H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0, \tag{6.10} $$
where $\sigma$ is an unknown nuisance parameter. We note that the unrestricted parameter space is $\Omega = \{(\mu, \sigma)^T : \mu \in \mathbb{R}, \sigma \in \mathbb{R}^+\}$ and that the restricted parameter space is $\Omega_0 = \{(\mu_0, \sigma)^T : \sigma \in \mathbb{R}^+\}$. We consider the restriction $m(\theta) = \mu - \mu_0$, where $\theta = (\mu, \sigma)^T$, so that the null hypothesis can be rewritten as $H_0: m(\theta) = 0$. With this formulation, we are now ready to discuss the testing problem in the framework of Theorems 6.1 and 6.2. We have already discussed the unrestricted minimum exponentially weighted divergence estimation of the unknown mean and standard deviation parameters of the normal distribution in Section 4.6. We denote the unrestricted minimum EWD estimator by $\hat{\theta}_\beta$, $\beta$ being the tuning parameter associated with the $B$ function of the EWD.
We now turn our attention to the restricted minimum EWD estimation of $\theta = (\mu, \sigma)^T$ subject to the restriction $\mu = \mu_0$. The estimate is obtained by minimising the empirical estimate of the exponentially weighted divergence,
$$ \tilde{\theta}_\beta = \underset{\theta \in \Omega_0}{\operatorname{argmin}} \int \big[ B'(f_\theta(x)) f_\theta(x) - B(f_\theta(x)) \big] \, dx - \frac{1}{n} \sum_{i=1}^{n} B'(f_\theta(X_i)), \tag{6.11} $$
where the second derivative of $B$ obeys the relation $B''(t) \times t = 1 - \exp(-t/\beta)$. It should be noted that we use the same tuning parameter $\beta$ for the restricted as well as the unrestricted estimate of $\theta$. For the testing problem described by Equation (6.10), we take $f_\theta(x) = \phi((x - \mu)/\sigma)/\sigma$ and obtain the restricted estimate $\tilde{\theta}_\beta = (\mu_0, \tilde{\sigma}_\beta)^T$, where
$$ \tilde{\sigma}_\beta = \underset{\sigma > 0}{\operatorname{argmin}} \int \left[ B'\!\left( \frac{1}{\sigma} \phi\!\left( \frac{x - \mu_0}{\sigma} \right) \right) \frac{1}{\sigma} \phi\!\left( \frac{x - \mu_0}{\sigma} \right) - B\!\left( \frac{1}{\sigma} \phi\!\left( \frac{x - \mu_0}{\sigma} \right) \right) \right] dx - \frac{1}{n} \sum_{i=1}^{n} B'\!\left( \frac{1}{\sigma} \phi\!\left( \frac{X_i - \mu_0}{\sigma} \right) \right). $$
Now, for some tuning parameter $\gamma$, we construct the EWDTS required to test the hypothesis specified by Equation (6.10). The test statistic is given by
$$ T_\gamma(\hat{\theta}_\beta, \tilde{\theta}_\beta) = 2n \times D_\gamma(f_{\hat{\theta}_\beta}, f_{\tilde{\theta}_\beta}), \tag{6.12} $$
where $D_\gamma$ is the exponentially weighted divergence with tuning parameter $\gamma$. For notational convenience, here we index the divergence $D$ (in the subscript) by the tuning parameter $\gamma$ of the EWD, rather than by the corresponding convex function.

We have already established that the null hypothesis can be specified by one linear constraint on $\mu$. So, using Theorem 6.2, we can claim that the asymptotic null distribution of the EWDTS is characterised by $\lambda(\theta, \beta, \gamma) Z^2$, where $Z \sim N(0,1)$ and $\lambda(\theta, \beta, \gamma)$ is the nonzero eigenvalue of the matrix $A_\gamma(\theta) B_\beta(\theta) K_\beta(\theta) B_\beta(\theta)$. Since the value of $\theta$ is unknown, we can plug in $\tilde{\theta}_\beta$ in its place and claim
$$ \frac{T_\gamma(\hat{\theta}_\beta, \tilde{\theta}_\beta)}{\lambda(\tilde{\theta}_\beta, \beta, \gamma)} \xrightarrow[n \to \infty]{\mathcal{L}} Z^2. \tag{6.13} $$
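A numerical sketch of the restricted estimation in Equation (6.11) under the normal model is given below; it is an illustration under our own implementation choices, not the authors' code. The closed form of $B$ is taken from Appendix A, and $B'(t) = \gamma + \Gamma(0, t/\beta) + \log(t/\beta)$ follows by differentiating it; $\Gamma(0, z)$ is available as `scipy.special.exp1`. The function name `restricted_sigma` and the integration and optimisation settings are assumptions made for this sketch.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.special import exp1  # exp1(z) = Gamma(0, z), the exponential integral
from scipy.stats import norm

EULER = np.euler_gamma  # Euler-Mascheroni constant

def B(t, beta):
    """Closed form of the EWD convex function B (Appendix A); B(0) = 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    pos = t > 0
    tp = t[pos]
    out[pos] = (-tp + EULER * tp + beta - beta * np.exp(-tp / beta)
                + tp * exp1(tp / beta) + tp * np.log(tp / beta))
    return out if out.size > 1 else float(out[0])

def B_prime(t, beta):
    """B'(t) = gamma + Gamma(0, t/beta) + log(t/beta), with B'(0) = 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    pos = t > 0
    tp = t[pos]
    out[pos] = EULER + exp1(tp / beta) + np.log(tp / beta)
    return out if out.size > 1 else float(out[0])

def restricted_sigma(x, mu0, beta):
    """Minimise the empirical EWD objective of Equation (6.11) over sigma,
    with mu fixed at mu0 (normal model)."""
    def objective(sigma):
        f = lambda y: norm.pdf(y, loc=mu0, scale=sigma)
        integrand = lambda y: B_prime(f(y), beta) * f(y) - B(f(y), beta)
        integral, _ = quad(integrand, -np.inf, np.inf)
        return integral - np.mean(B_prime(f(np.asarray(x)), beta))
    s0 = np.std(x)
    res = minimize_scalar(objective, bounds=(0.2 * s0, 5.0 * s0),
                          method="bounded")
    return res.x
```

From `restricted_sigma` one obtains $\tilde{\sigma}_\beta$; the EWDTS of Equation (6.12) can then be assembled by computing the unrestricted estimate in the same way and evaluating the divergence between the two fitted densities.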
As we have noted before, $J_\beta$ and $K_\beta$ do not have neat closed-form expressions. This is true of $A_\gamma$ as well. As a result, instead of trying to find a more detailed form of $\lambda(\tilde{\theta}_\beta, \beta, \gamma)$, we will make use of numerical approximations.

We have previously examined the Shoshoni rectangles data in Section 4.6. The focal point of our discussion there was the minimum EWD estimation of $\theta = (\mu, \sigma)^T$ when the data are assumed to come from a $N(\mu, \sigma^2)$ distribution, both location and scale parameters being unknown. Hettmansperger and McKean (2010) note that if we were to implement the $t$-test for the hypothesis
$$ H_0: \mu = 0.618 \quad \text{versus} \quad H_1: \mu \neq 0.618, \tag{6.14} $$
we would get a $p$-value of 0.053, which is at the borderline of significance at the 5% level. On the other hand, the nonparametric one-sample sign test returns an entirely insignificant $p$-value of 0.823. It would be interesting to investigate the performance of the EWDTS in such a scenario.

In Figure 12, we present a graph of the $p$-values of the test statistic $T_\beta(\hat{\theta}_\beta, \tilde{\theta}_\beta)$ for testing the hypothesis in Equation (6.14) over a set of values of $\beta$. As $\beta \to 0$, the EWDTS becomes asymptotically (in $n$) equivalent to the ordinary likelihood ratio test under the null. This statistic returns a significant $p$-value of 0.045 for the hypothesis in Equation (6.14), but with increasing $\beta$ the significance turns to insignificance very quickly. It may be noted that the $p$-value of the $t$-test for the outlier-deleted data (the three large outliers removed) conforms to the $p$-values corresponding to the EWDTS for moderately large positive values of $\beta$.

Figure 12: $p$-values of the EWDTS over a range of values of $\beta$ for the Shoshoni rectangles data.

In this paper, we have presented an estimator based on a sub-class of density-based Bregman divergences which is seen to outperform the existing standard (i.e., the DPD based estimator). We have established several asymptotic and distributional properties of the proposed estimator, both in the context of i.i.d. data and of independent, non-homogeneous data. A special case of linear regression (both simple and multiple) has been explored in the context of real data. We have also discussed 'judicious' choices of the tuning parameter which, when made properly, yield highly robust and efficient estimators that can often dominate the MDPDE. We have also considered a hypothesis testing strategy for parametric models which may serve as a robust alternative to the classical likelihood ratio and other likelihood based tests. As we have noted, the weight function generated by the EWD converges to 1 as its argument, the value of the density function, increases. We feel that this is a more balanced way of weighting the observations than the weighting provided by the DPD, where the weights increase indefinitely with the value of the density. It may also be mentioned that the proposal based on the EWD has the potential to be useful in all the situations where the DPD has been successfully applied, such as generalized linear models, survival analysis and Bayesian inference, to name a few. We hope to pursue all of these in our future research.

In the case of the hypothesis testing problem, we have only investigated the analogues of the likelihood-ratio type tests. Other procedures, for example Wald-type tests based on the EWD, should also be studied. The DPD based Wald-type test has been used extensively in the literature, and comparisons with EWD based tests will be interesting.
A  The B function of EWD($\beta$)

$$
\begin{aligned}
B(x) &= \frac{x^2}{\beta} \sum_{n=0}^{\infty} \frac{(-x/\beta)^n}{(n+2)!\,(n+1)}
= \frac{x^2}{\beta} \sum_{n=2}^{\infty} \frac{(-x/\beta)^{n-2}}{n!\,(n-1)}
= \frac{x^2}{\beta} \sum_{n=2}^{\infty} \frac{(-x/\beta)^{n-2}}{n!} \int_0^1 t^{n-2} \, dt \\
&= \beta \int_0^1 t^{-2} \sum_{n=2}^{\infty} \frac{(-xt/\beta)^n}{n!} \, dt
= \beta \int_0^1 \Big( \exp(-xt/\beta) - 1 + \frac{xt}{\beta} \Big) \, t^{-2} \, dt \\
&= x \int_0^{x/\beta} \big( \exp(-t) - 1 + t \big) \, t^{-2} \, dt \\
&= x \left[ \, -\frac{\exp(-t) - 1 + t}{t} \, \bigg|_0^{x/\beta} - \int_0^{x/\beta} \frac{\exp(-t) - 1}{t} \, dt \, \right]
= x \, [\, I_1 - I_2 \,],
\end{aligned}
$$

where
$$ x \cdot I_1 = \beta - \beta \exp(-x/\beta) - x, $$
and
$$
\begin{aligned}
x \cdot I_2 &= x \int_0^{x/\beta} \frac{\exp(-t) - 1}{t} \, dt
= x \lim_{\Delta \to \infty} \left[ \int_0^{\Delta} \frac{\exp(-t) - 1}{t} \, dt - \int_{x/\beta}^{\Delta} \frac{\exp(-t) - 1}{t} \, dt \right] \\
&= x \lim_{\Delta \to \infty} \left[ \log(\Delta) \{ \exp(-\Delta) - 1 \} + \int_0^{\Delta} \log(t) \exp(-t) \, dt - \int_{x/\beta}^{\Delta} \frac{\exp(-t)}{t} \, dt + \log\Big( \frac{\Delta}{x/\beta} \Big) \right] \\
&= x \left[ \lim_{\Delta \to \infty} \log(\Delta) \exp(-\Delta) + \int_0^{\infty} \log(t) \exp(-t) \, dt - \int_{x/\beta}^{\infty} \frac{\exp(-t)}{t} \, dt - \log(x/\beta) \right] \\
&= x \cdot \big[ \, 0 - \gamma - \Gamma(0, x/\beta) - \log(x/\beta) \, \big]
= -x \, \big[ \gamma + \Gamma(0, x/\beta) + \log(x/\beta) \big].
\end{aligned}
$$

Here $\gamma$ is the Euler-Mascheroni constant, usually defined as
$$ \gamma = \lim_{n \to \infty} \left( \sum_{k=1}^{n} \frac{1}{k} - \log n \right) = -\int_0^{\infty} \exp(-x) \log(x) \, dx, $$
and $\Gamma(\alpha, \beta)$ is the incomplete Gamma integral defined as
$$ \Gamma(\alpha, \beta) = \int_{\beta}^{\infty} y^{\alpha - 1} \exp(-y) \, dy. $$

Finally, we can write
$$ B(x) = \frac{x^2}{\beta} \sum_{n=0}^{\infty} \frac{(-x/\beta)^n}{(n+2)!\,(n+1)} = x \, [\, I_1 - I_2 \,] = -x + \gamma x + \beta - \beta \exp(-x/\beta) + x \, \Gamma(0, x/\beta) + x \log(x/\beta). $$

B  Additional examples of simple linear regression
B.0.1 Hertzsprung-Russell Star Cluster data
We consider the data for the
Hertzsprung-Russell diagram of the star cluster CYG OB1, containing 47 stars in the direction of Cygnus (Rousseeuw and Leroy, 1987, Table 3, Chap. 2). For these data the independent variable $x$ is the logarithm of the effective temperature at the surface of the star ($T_e$), and the dependent variable $y$ is the logarithm of its light intensity. The data were thoroughly studied by Rousseeuw and Leroy (1987), who inferred that there are two groups of data points: four data points (in the top right corner of the scatter plot) clearly form a separate group from the rest of the data points. These data points are known as giants in astronomy. So these outliers are not recording errors but are actually leverage points, with the interpretation that the data come from two different groups. Estimates of the linear regression parameters obtained by the minimum DPD and minimum EWD methods are presented in Tables 9 and 10, respectively.

Table 9: $\hat{\theta}$ (intercept, slope and error standard deviation) for the Hertzsprung-Russell dataset using the MLE and the MDPDE (D($\alpha$)) over a range of values of $\alpha$.

Table 10: $\hat{\theta}$ (intercept, slope and error standard deviation) for the Hertzsprung-Russell dataset using the MLE and the MEWDE (E($\beta$)) over a range of values of $\beta$.

We observe the following.
1. The estimators corresponding to $\alpha = 0$ and $\beta = 0$ (which are identical and also coincide with the ordinary least squares estimators) are clearly pulled away significantly by the four leverage points, and hence it is not possible to separate the two groups of data by looking at the corresponding residuals.
2. The MDPDEs with sufficiently large $\alpha$ can successfully ignore the outliers to give excellent robust fits which are much closer to the fit generated by the LMS estimates.
3. The MEWDEs with sufficiently large $\beta$ are strongly robust with respect to the outliers, giving excellent fits to the remaining observations.
4. For both the MDPDE and MEWDE methods, the residuals also allow us to separate the two groups of observations: the four large residuals correspond to the four giant stars.

Thus the analyses based on the DPD and the EWD give stable and competitive inference in this case.

B.0.2  Number of international telephone calls in Belgium (1950-73)
We consider a segment of data obtained from the
Belgian Statistical Survey by the Ministry of Economy, Belgium (Rousseeuw and Leroy, 1987, Table 2, Chap. 2). Here, the total number (in tens of millions) of international phone calls made in a year is the dependent variable $y$. The independent variable is the year number $x = 50, 51, \ldots, 73$.

Figure 13: Data points and fitted regression lines for the Hertzsprung-Russell star cluster data using least squares and minimum DPD estimates.

Figure 14: Data points and fitted regression lines for the Hertzsprung-Russell star cluster data using least squares and minimum EWD estimates.
Figure 15: Plots of the data points and fitted regression lines for the Belgium telephone data using least squares and minimum DPD estimates.

However, due to the use of another recording system (giving the total number of minutes of these calls) from the year 1964 to 1969, the data contain heavy contamination in the $y$-direction in that range. The years 1963 and 1970 are also partially affected for the same reason. Estimates of the linear regression parameters obtained by the minimum DPD and minimum EWD methods are presented in Tables 11 and 12, respectively.

Table 11: $\hat{\theta}$ (intercept, slope and error standard deviation) for the Belgium telephone data using the MDPDE (D($\alpha$)) over a range of values of $\alpha$.

Table 12: $\hat{\theta}$ (intercept, slope and error standard deviation) for the Belgium telephone data using the MEWDE (E($\beta$)) over a range of values of $\beta$.

We make the following observations.
1. It is clear that the estimators corresponding to $\alpha = 0$ and $\beta = 0$ (which are identical and coincide with the ordinary LS estimators) are heavily affected by the outliers.
2. The MDPDEs with sufficiently large $\alpha$ are strongly robust with respect to the outliers, giving excellent fits to the remaining observations. While analyzing this dataset, Ghosh and Basu (2013) note that the slope parameter remains practically constant over this range of $\alpha$.
3. The MEWDEs with sufficiently large $\beta$ are strongly robust with respect to the outliers, giving excellent fits to the remaining observations. We note that the estimated regression parameters differ very little over this range of $\beta$, in sharp contrast with the outlier-influenced ML regression estimates.
4. The least squares estimates of the regression parameters, computed after deleting the outlying observations corresponding to the years 1964 to 1970, are quite close to all our robust estimates.

Clearly, the performances of the MDPDEs and the MEWDEs are quite competitive in this example.

Figure 16: Plots of the data points and fitted regression lines for the Belgium telephone data using least squares and minimum EWD estimates.

B.0.3  Residual analysis of certain fits for alcohol solubility data
In continuation of the example considered in Section 5.3.3 of the main article, we present, in Figure 17, the residual plots (against fitted values) of some fits (ML, LMS, ML+D (outlier deleted) and minimum EWD(0.66)) for the alcohol solubility data (Maronna et al., 2019). In Figure 18 we present the kernel density estimates of the residuals of the same fits. As noted in Section 5.3.3, we observe that the LMS and minimum EWD(0.66) procedures identify a few outliers. On the other hand, these observations remain masked in the case of the maximum likelihood method, while the ML+D method does not identify any outlier. This is indicated by the lack of long tails for the ML and ML+D methods.
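The regression fits reported in this appendix can be mimicked by a small numerical sketch. The block below is an illustrative, non-authoritative implementation of minimum EWD simple linear regression under normal errors, assuming (as in the non-homogeneous setting of Ghosh and Basu (2013) for the DPD) that each response is normal with mean $a + b x_i$ and common scale $\sigma$; the function names and optimiser settings are our assumptions. Because the normal family is a location family, the integral term of the objective depends only on $\sigma$ and is computed once per evaluation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.special import exp1  # exp1(z) = Gamma(0, z)
from scipy.stats import norm

EULER = np.euler_gamma  # Euler-Mascheroni constant

def B(t, beta):
    """Closed form of the EWD convex function B (Appendix A); B(0) = 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    pos = t > 0
    tp = t[pos]
    out[pos] = (-tp + EULER * tp + beta - beta * np.exp(-tp / beta)
                + tp * exp1(tp / beta) + tp * np.log(tp / beta))
    return out if out.size > 1 else float(out[0])

def B_prime(t, beta):
    """B'(t) = gamma + Gamma(0, t/beta) + log(t/beta), with B'(0) = 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    pos = t > 0
    tp = t[pos]
    out[pos] = EULER + exp1(tp / beta) + np.log(tp / beta)
    return out if out.size > 1 else float(out[0])

def ewd_regression(x, y, beta):
    """Minimum EWD fit of y = a + b*x with N(0, sigma^2) errors."""
    def objective(params):
        a, b, log_s = params
        s = np.exp(log_s)
        def integrand(u):
            # The integral term is location-invariant, so a mean of 0 is used.
            fu = norm.pdf(u, scale=s)
            return B_prime(fu, beta) * fu - B(fu, beta)
        integral, _ = quad(integrand, -np.inf, np.inf)
        dens = norm.pdf(y, loc=a + b * x, scale=s)
        return integral - np.mean(B_prime(dens, beta))
    # Start from the ordinary least squares fit.
    b0, a0 = np.polyfit(x, y, 1)
    s0 = np.std(y - (a0 + b0 * x))
    res = minimize(objective, x0=[a0, b0, np.log(s0)],
                   method="Nelder-Mead", options={"maxiter": 250})
    a_hat, b_hat, log_s_hat = res.x
    return a_hat, b_hat, np.exp(log_s_hat)
```

For small $\beta$ the fit stays close to the maximum likelihood solution, while larger $\beta$ progressively downweights observations with small model density, which is the mechanism behind the robust fits seen in Figures 13-16.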
References
Sancharee Basak, Ayanendranath Basu, and MC Jones. On the 'optimal' density power divergence tuning parameter. Journal of Applied Statistics, 2020. URL https://doi.org/10.1080/02664763.2020.1736524.

Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549-559, 1998.

Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical Inference: The Minimum Distance Approach. Chapman and Hall/CRC, 2011.

Ayanendranath Basu, Abhijit Mandal, N Martin, and L Pardo. Testing statistical hypotheses based on the density power divergence. Annals of the Institute of Statistical Mathematics, 65(2):319-348, 2013.

Figure 17: Scatter plots of the residuals against fitted values for the ML, LMS, ML+D and minimum EWD(0.66) fits for the alcohol solubility data (Maronna et al., 2019).

Figure 18: Kernel density estimates of the residuals arising from the ML, LMS, ML+D and minimum EWD(0.66) fits for the alcohol solubility data (Maronna et al., 2019). Vertical black lines correspond to selected percentiles of the corresponding density curves, while red rug-lines correspond to the actual residuals from which the kernel density estimates are obtained.

Ayanendranath Basu, Abhijit Mandal, Nirian Martin, and Leandro Pardo. Testing composite hypothesis based on the density power divergence. Sankhya B, 80(2):222-262, 2018.

Rudolf Beran. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, 5(3):445-463, 1977.

Adhidev Biswas, Abhik Ghosh, and Ayanendranath Basu. Minimum Bregman divergence and weighted likelihood: a comprehensive study. Indian Statistical Institute, 2020.

Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

Michel Broniatowski, Aida Toma, and Igor Vajda. Decomposable pseudodistances and applications in statistical estimation. Journal of Statistical Planning and Inference, 142(9):2574-2585, 2012.

Imre Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85-108, 1963.

Imre Csiszár et al. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics, 19(4):2032-2066, 1991.

Abhik Ghosh and Ayanendranath Basu. Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression. Electronic Journal of Statistics, 7:2420-2456, 2013.

Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.

Thomas P Hettmansperger and Joseph W McKean. Robust Nonparametric Statistical Methods. CRC Press, 2010.

Peter J Huber and Elvezio M Ronchetti. Robust Statistics. John Wiley & Sons, 2009.

Il'dar Abdulovich Ibragimov and Rafail Zalmanovich Has'minskii. Statistical Estimation: Asymptotic Theory, volume 16. Springer Science & Business Media, 1981.

Soham Jana and Ayanendranath Basu. A characterization of all single-integral, non-kernel divergence estimators. IEEE Transactions on Information Theory, 65(12):7976-7984, 2019.

B. G. Lindsay. Efficiency versus robustness: the case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081-1114, 1994.

Ricardo A Maronna, R Douglas Martin, Victor J Yohai, and Matías Salibián-Barrera. Robust Statistics: Theory and Methods (with R). John Wiley & Sons, 2019.

Leandro Pardo. Statistical Inference Based on Divergence Measures. CRC Press, 2006.

Max Roser and Hannah Ritchie. Homicides. Our World in Data, 2020. https://ourworldindata.org/homicides.

Peter J Rousseeuw and Annick M Leroy. Robust Regression and Outlier Detection, volume 1. Wiley Online Library, 1987.

Douglas G Simpson. Hellinger deviance tests: efficiency, breakdown points, and examples. Journal of the American Statistical Association, 84(405):107-113, 1989.

The CIA World Factbook, Central Intelligence Agency. Country comparison: GDP per capita (PPP). Central Intelligence Agency.

J Warwick and MC Jones. Choosing a robustness tuning parameter. Journal of Statistical Computation and Simulation, 75(7):581-588, 2005.

RC Woodruff, JM Mason, R Valencia, and S Zimmering. Chemical mutagenesis testing in Drosophila: I. Comparison of positive and negative control data for sex-linked recessive lethal mutations and reciprocal translocations in three laboratories.