Improvements in the Small Sample Efficiency of the Minimum S-Divergence Estimators under Discrete Models
Abhik Ghosh and Ayanendranath Basu†
Indian Statistical Institute, Kolkata, India
November 11, 2018
Abstract
This paper considers the problem of inliers and empty cells, and the resulting issue of relative inefficiency in estimation, under pure samples from a discrete population when the sample size is small. Many minimum divergence estimators in the S-divergence family, although possessing very strong outlier stability properties, often have very poor small sample efficiency in the presence of inliers, and some are not even defined in the presence of a single empty cell; this limits the practical applicability of these estimators in spite of their otherwise sound robustness properties and high asymptotic efficiency. Here we study a penalized version of the S-divergences such that the resulting minimum divergence estimators are free from these issues without altering their robustness properties and asymptotic efficiencies. We give a general proof of the asymptotic properties of these minimum penalized S-divergence estimators. This provides a significant addition to the literature, as the asymptotics of penalized divergences which are not finitely defined are currently unavailable in the literature. The small sample advantages of the minimum penalized S-divergence estimators are examined through an extensive simulation study, and some empirical suggestions regarding the choice of the relevant underlying tuning parameters are also provided.

∗ This is a part of the Ph.D. dissertation of the first author.
† Corresponding author. Email: [email protected]

1 Introduction

Minimum divergence inference provides an excellent theoretical alternative to the classical maximum likelihood approach in the presence of contamination in the observed data. Many minimum divergence estimators are highly robust in the presence of outliers and have asymptotic efficiencies close to that of the maximum likelihood estimator under the pure model. Indeed, some minimum divergence estimators, along with their high robustness, provide full asymptotic efficiency under the true model (e.g., those based on the class of disparities). Although this is a very desirable large sample asymptotic property, the results are not always so spectacular when applied to practical real-life data sets of small sizes. Some of the robust minimum divergence estimators have very poor performance compared to the maximum likelihood estimator in small samples under pure data. Examples of such divergences include the celebrated Hellinger distance along with other Cressie–Read power divergences (Cressie and Read, 1984) with large negative values of the tuning parameter. The mean square error of these estimators at small sample sizes often turns out to be substantially higher than that of the maximum likelihood estimator under the pure model. This limits the use of such estimators in spite of their demonstrated strong robustness properties and good asymptotic performance.

The issue of small sample efficiency of the robust minimum divergence estimators has received some attention in the recent literature. The root of this problem appears to be the presence of the so-called "inliers" in the data. Inliers are those values in the sample space where fewer observations are available compared to what is expected under the model. An empty cell is the most extreme case of an inlier. The inlier problem becomes more acute as the sample size becomes smaller.
Since most of the robust density-based minimum divergence estimators successfully deal with outliers by down-weighting those observations by the model density, they in turn magnify the effect of inliers. Hence the weights attached to the inliers or empty cells play a crucial role in the poor performance of the estimators in small samples. Lindsay (1994) observed this phenomenon in the case of the popular Hellinger distance. The problem of inliers can be further understood by noting that minimum divergence estimators with suitable treatments of inliers provide competitive small sample performance compared to the maximum likelihood estimator under the true model. Examples of such divergences include, among others, the Cressie–Read power divergence with positive values of the tuning parameter, the negative exponential disparity (Lindsay, 1994; Basu et al., 1997) and the generalizations (Bhandari et al., 2006) of the negative exponential disparity.

Although the concept of the inlier is relatively new compared to that of the outlier, there has been a fair bit of recent activity leading to several methods for inlier correction, without compromising the robustness properties of the corresponding minimum divergence estimators. Basu et al. (2011) and Mandal and Basu (2013) provide a comprehensive description of the concept and the relevant approaches to solving the problem of inliers. Among all the available methods, in this paper we consider one particular technique based on the method of penalized divergences and use it to improve the minimum S-divergence estimators. The S-divergence family has been developed in Ghosh et al. (2013, 2016) and generates many robust estimators without any significant loss in efficiency. This large family includes the popular Cressie–Read power divergence and the density power divergence (Basu et al., 1998) measures as special cases. Ghosh (2015) and Ghosh and Basu (2015) have also derived the asymptotic distribution of the minimum S-divergence estimators under discrete and continuous models, respectively. However, just like many other density-based divergences, the S-divergences also use the model density to down-weight the outliers, and their small sample performance becomes worse in the presence of inliers under the pure model, as will be demonstrated later in the paper. We will provide a modification of the minimum S-divergence estimators using the concept of penalized divergences and prove their asymptotic equivalence to the original minimum S-divergence estimators. The corresponding estimators will also be robust under data contamination, with improved efficiency in small samples.

It is important to clearly spell out what is new in the present paper. Mandal et al. (2010) established the asymptotic equivalence of the minimum divergence estimators corresponding to ordinary and penalized disparities. However, their proof was restricted to the cases where the ordinary divergences are finitely defined with probability one. This excludes all divergences within the Cressie–Read family of disparities for which the tuning parameter λ ≤ −1, such as the Kullback–Leibler divergence (λ = −1)
or the Neyman's chi-square (λ = −2), and consequently many members of the S-divergence family. Our proof can be easily generalized to all disparities, and also accommodates the class of density power divergences. Thus, not only do we allow inlier control for all disparities, including those which are not finitely defined, we also add another dimension to this exercise by including the divergences within the S-divergence family, and in particular the members of the density power divergence family. We will, however, restrict our attention to discrete models throughout the paper, as this is the case where the empty cells are more relevant.

Another major contribution of the present paper is the study of the small sample behavior of different minimum S-divergence estimators and their penalized versions, to be introduced here, through extensive simulations under the Poisson model. The study of the MSDEs in small samples indicates the necessity of inlier correction for many robust members of the S-divergence family. As a solution, we then consider a penalized version of the S-divergence measure and empirically illustrate its small sample superiority in inlier control. Indeed, for this purpose, we define the penalized S-divergence by replacing the weights attached to the empty cells in the S-divergence by a suitably chosen penalty factor. The choice of this penalty factor becomes crucial for the improvement of the small sample efficiency. A large scale simulation exercise studies this problem in great detail and attempts to find the optimum value of the penalty factor separately for each member of the divergence family over different (small) sample sizes and different model parameters. Some overall practical suggestions and guidelines are also provided for practitioners through proper empirical evidence. Another possible intuitive extension of the penalty scheme is proposed at the end of the paper with some brief suggestions. A real data illustration is also provided.

The rest of the paper is organized as follows. We begin with a brief description of the minimum S-divergence estimators in Section 2 and show how the members of the S-divergence family are affected by the inliers in terms of their small sample efficiency compared to the maximum likelihood estimator. We then introduce the concept of the "penalized S-divergence" and the corresponding minimum divergence estimators in Section 3 and prove their asymptotic equivalence to the original minimum S-divergence estimators in Section 4. We illustrate the performance of the minimum penalized S-divergence estimators in Section 5 through an extensive simulation study, where we suggest suitable optimum choices of the penalty factor for practical application of the different penalized estimators at small sample sizes. A real data example is considered in Section 6. Finally, we end the paper with some conclusions, recommendations and discussions of possible future extensions in Section 7.

2 The Minimum S-Divergence Estimators (MSDE) under Discrete Models and their Small Sample Efficiency

The S-divergence family has been defined as a general family of divergence measures including the famous Cressie–Read power divergence family and the density power divergence family as its subclasses (Ghosh et al., 2013, 2016). It is defined in terms of two parameters α ≥ 0 and λ ∈ ℝ as

S_{(\alpha,\lambda)}(g, f) = \frac{1}{A}\int f^{1+\alpha} - \frac{1+\alpha}{AB}\int f^{B} g^{A} + \frac{1}{B}\int g^{1+\alpha},   (1)

where A = 1 + λ(1 − α) and B = α − λ(1 − α), so that A + B = 1 + α.
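As a quick computational illustration of (1), the following is a minimal sketch of our own (the function name and interface are our assumptions, not from the paper) that evaluates the S-divergence between two probability vectors on a common finite discrete support:

```python
import numpy as np

def s_divergence(g, f, alpha, lam):
    """S-divergence S_{(alpha,lam)}(g, f) of (1), for A != 0 and B != 0.
    g, f: probability vectors on a common finite discrete support."""
    g, f = np.asarray(g, float), np.asarray(f, float)
    A = 1.0 + lam * (1.0 - alpha)
    B = alpha - lam * (1.0 - alpha)
    if A == 0.0 or B == 0.0:
        raise ValueError("A = 0 or B = 0: use the limiting forms (2)/(3) below")
    # g**A diverges on cells with g = 0 when A < 0, which is exactly the
    # empty cell problem discussed later in the paper.
    with np.errstate(divide="ignore"):
        terms = (f**(1.0 + alpha) / A
                 - (1.0 + alpha) * f**B * g**A / (A * B)
                 + g**(1.0 + alpha) / B)
    return terms.sum()

# alpha = 0, lam = -0.5 recovers twice the squared Hellinger distance:
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.25, 0.25, 0.5])
print(s_divergence(p, q, alpha=0.0, lam=-0.5))
print(2 * ((np.sqrt(p) - np.sqrt(q))**2).sum())  # same value
```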
For A = 0, the S-divergence measure may be re-defined by its continuous limit as A → 0:

S_{(\alpha,\lambda: A=0)}(g, f) = \lim_{A \to 0} S_{(\alpha,\lambda)}(g, f) = \int f^{1+\alpha} \log\left(\frac{f}{g}\right) - \frac{\int (f^{1+\alpha} - g^{1+\alpha})}{1+\alpha}.   (2)

Similarly, for B = 0, we have

S_{(\alpha,\lambda: B=0)}(g, f) = \lim_{B \to 0} S_{(\alpha,\lambda)}(g, f) = \int g^{1+\alpha} \log\left(\frac{g}{f}\right) - \frac{\int (g^{1+\alpha} - f^{1+\alpha})}{1+\alpha}.   (3)

Note that at α = 0 the S-divergence family reduces to the Cressie–Read family with parameter λ, and at α = 1 it becomes independent of λ, coinciding with the L2 divergence. On the other hand, at λ = 0 it generates the density power divergence with parameter α. The members of the S-divergence family are indeed genuine statistical divergence measures provided λ ∈ ℝ and α ≥ 0.

Now suppose we have n independent and identically distributed observations X_1, ..., X_n from the true distribution G having probability mass function (pmf) g. Without loss of generality, the support of g is assumed to be χ = {0, 1, 2, ...}. We want to model it by a parametric family of model pmfs F = {f_θ : θ ∈ Θ ⊆ ℝ^p}. The S-divergence measure between the data and the model is then defined through the relative frequency vector r_n = (r_n(0), r_n(1), ...)^T and the model probability vector f_θ = (f_θ(0), f_θ(1), ...)^T; here, for any x ∈ χ, we define r_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = x), with I(E) being the indicator function of the event E. The minimum S-divergence estimator is the parameter value which minimizes the S-divergence measure between the data r_n and the model f_θ. Hence, the estimating equation for the minimum S-divergence estimator is given by

\sum_{x=0}^{\infty} f_\theta^{1+\alpha}(x) u_\theta(x) - \sum_{x=0}^{\infty} f_\theta^{B}(x) r_n^{A}(x) u_\theta(x) = 0,   (4)

or,

\sum_{x=0}^{\infty} K(\delta(x)) f_\theta^{1+\alpha}(x) u_\theta(x) = 0,   (5)

where δ(x) = δ_n(x) = r_n(x)/f_θ(x) − 1, K(δ) = ((δ+1)^A − 1)/A, and u_θ(x) = ∇ ln f_θ(x) is the likelihood score function. Here ∇ = (∇_1, ..., ∇_p)^T denotes the derivative with respect to θ = (θ_1, ..., θ_p)^T. See Ghosh (2015) and Ghosh et al. (2016) for detailed properties of the minimum S-divergence estimator under discrete models, including its asymptotic distribution and influence function.

Let us now examine the small sample performance of the minimum S-divergence estimators (MSDE), say θ̂_{α,λ}, for different values of the tuning parameters α and λ. We consider the discrete Poisson model with mean θ and perform a simulation study to examine the empirical MSE of the MSDEs under several small sample sizes. Figure 1 shows the MSE of √n θ̂_{α,λ} (= n × MSE of θ̂_{α,λ}) over the sample size n for different values of the Poisson parameter θ.

Figure 1: MSE of √n θ̂_{α,λ} over sample size n for different λ, α and θ (panels for θ = 3, 5, 8 and λ = 0, −0.5, −1) [dotted line: α = 0.1; dashed line: α = 0.25; dot-dashed line: α = 0.5; solid line: MLE (λ = 0, α = 0) in the first column, minimum Hellinger distance estimator (λ = −0.5, α = 0) in the second column, minimum L2 distance estimator (α = 1) in the third column].

Note that, according to the asymptotic theory of the MSDEs, the MSE of √n times the MSDE converges to a constant limit depending only on the tuning parameter α under the (pure) model, and this limit increases as α increases. However, interestingly, it is observed from Figure 1 that the MSE of √n θ̂_{α,λ} increases significantly as the sample size n decreases for λ = −0.5, −1 and each α; this implies that the MSDEs corresponding to those tuning parameters (negative λ and smaller α) become unstable at small sample sizes. The main reason behind this instability is the presence of inliers and empty cells in samples of smaller sizes; this is also corroborated by the fact that the instability of those MSDEs increases for large values of θ, where the chance of an inlier or empty cell is higher. However, it has already been observed in Ghosh et al. (2016) that these MSDEs with negative λ are highly robust in the presence of outliers, and they cannot be ignored given their strong robustness and good asymptotic properties; their application to small samples, however, becomes restricted due to the inlier problem. Thus, we need a small sample correction on the MSDEs which controls the inliers while keeping all the good robustness and asymptotic properties intact.
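To make the simulation design concrete, here is a schematic sketch (our own illustration, with hypothetical function names) of how a single MSDE can be computed for the Poisson model by directly minimizing (1) in θ; repeating this over replicated samples gives the empirical MSE curves of Figure 1. For A < 0, the objective is infinite whenever the sample has an empty cell, which is precisely the instability discussed above.

```python
import numpy as np
from math import lgamma
from scipy.optimize import minimize_scalar

def poisson_pmf(x, theta):
    x = np.asarray(x, float)
    return np.exp(x * np.log(theta) - theta
                  - np.array([lgamma(v + 1.0) for v in x]))

def msde_poisson(sample, alpha, lam, xmax=None):
    """Minimum S-divergence estimate of the Poisson mean (A, B != 0 assumed):
    minimize S_{(alpha,lam)}(r_n, f_theta) of (1) over theta."""
    sample = np.asarray(sample)
    xmax = int(sample.max()) + 25 if xmax is None else xmax
    x = np.arange(xmax + 1)
    r = np.bincount(sample, minlength=xmax + 1) / len(sample)   # r_n
    A = 1.0 + lam * (1.0 - alpha)
    B = alpha - lam * (1.0 - alpha)

    def objective(theta):
        f = poisson_pmf(x, theta)
        with np.errstate(divide="ignore", invalid="ignore"):
            # r**A = +inf on empty cells when A < 0: divergence undefined
            val = (f**(1 + alpha) / A
                   - (1 + alpha) * f**B * r**A / (A * B)
                   + r**(1 + alpha) / B).sum()
        return val

    return minimize_scalar(objective, bounds=(1e-3, 50.0), method="bounded").x

rng = np.random.default_rng(0)
theta_hat = msde_poisson(rng.poisson(3.0, size=20), alpha=0.25, lam=0.0)
```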
Further, all the S-divergence measures with A < 0, i.e., with λ < −1/(1 − α) for any 0 ≤ α < 1, as well as those with A = 0 (e.g., λ = −1, α = 0), in fact become undefined if there is an empty cell in the sample from a discrete population. This can be seen by looking at the explicit form of the corresponding S-divergence, given by

S_{(\alpha,\lambda)}(r_n, f_\theta) = \frac{1}{A} \sum_{x=0}^{\infty} f_\theta^{1+\alpha}(x) - \frac{1+\alpha}{AB} \sum_{x=0}^{\infty} f_\theta^{B}(x) r_n^{A}(x) + \frac{1}{B} \sum_{x=0}^{\infty} r_n^{1+\alpha}(x),   (6)

where A and B are non-zero. Now, whenever A < 0 (i.e., λ < −1/(1 − α); see Table 1), the term containing r_n(x)^A (which also contains f_θ, and so cannot be neglected in the objective function or the estimating equation) becomes undefined at those points x in the sample space for which r_n(x) = 0 (empty cells), and hence the corresponding divergence also becomes undefined in the presence of even a single empty cell. Thus using such divergences for the derivation of the minimum divergence estimator becomes a fruitless exercise. The same can be observed for the case A = 0 also, since then the S-divergence measure contains a term involving f_θ and log(r_n), which becomes undefined at r_n(x) = 0. All of this motivates a suitable inlier correction in the minimum S-divergence estimator.

3 The Penalized S-Divergence (PSD) and Minimum Divergence Estimation

The concept of penalized divergence was used in the context of successful inlier correction by Mandal et al. (2010) and Mandal and Basu (2013) to modify the Cressie–Read power divergence family, a particular subfamily of the S-divergences. We now extend it to the general family of S-divergences and examine its performance with respect to efficiency, robustness and inlier control.

Consider the set-up of the discrete parametric model as described in the previous section. The S-divergence measure between the data and the model for A ≠ 0 and B ≠ 0 is given by (6), which can be further re-written as

S_{(\alpha,\lambda)}(r_n, f_\theta) = \sum_{x: r_n(x) \neq 0} \left[ \frac{1}{A} f_\theta^{1+\alpha}(x) - \frac{1+\alpha}{AB} f_\theta^{B}(x) r_n^{A}(x) + \frac{1}{B} r_n^{1+\alpha}(x) \right] + \sum_{x: r_n(x) = 0} \left[ \frac{1}{A} f_\theta^{1+\alpha}(x) - \frac{1+\alpha}{AB} f_\theta^{B}(x) r_n^{A}(x) + \frac{1}{B} r_n^{1+\alpha}(x) \right].   (7)

Note that the first term is always defined, whereas the second term is not defined for negative values of A. For positive values of A, the second term further simplifies to \frac{1}{A} \sum_{x: r_n(x)=0} f_\theta^{1+\alpha}(x), an expression which remains well defined even for A < 0.
Thus, motivated by this fact and retaining a similar consistency in the expression of the divergence, we define the penalized version of the S-divergence under discrete models as

PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta) = \sum_{x: r_n(x) \neq 0} \left[ \frac{1}{A} f_\theta^{1+\alpha}(x) - \frac{1+\alpha}{AB} f_\theta^{B}(x) r_n^{A}(x) + \frac{1}{B} r_n^{1+\alpha}(x) \right] + h \sum_{x: r_n(x) = 0} f_\theta^{1+\alpha}(x),   (8)

where h can be thought of as a penalty factor. In the absence of any empty cell, the penalized S-divergence (PSD) coincides with the ordinary S-divergence; in the presence of empty cells it coincides with the ordinary S-divergence only for the particular choice h = 1/A, provided A > 0. The penalized form makes the S-divergence finitely defined for all A ∈ ℝ and adjusts the weight attached to the empty cells by the factor h for the cases A > 0.
Note that the function PSD^h_{(α,λ)}(r_n, f_θ) as defined above is a genuine statistical divergence for all (α, λ) with A ≠ 0 and B ≠ 0 and all h > 0.

Table 1: Values of A (with the corresponding empty cell weight 1/A in parentheses) for different α and λ.

For a discrete model family as considered above, the minimum PSD estimating equation is then given by

\sum_{x: r_n(x) \neq 0} \left[ f_\theta^{1+\alpha}(x) - f_\theta^{B}(x) r_n^{A}(x) \right] u_\theta(x) + hA \sum_{x: r_n(x) = 0} f_\theta^{1+\alpha}(x) u_\theta(x) = 0,

or,

\sum_{x} K_h(\delta(x)) f_\theta^{1+\alpha}(x) u_\theta(x) = 0,   (9)

where δ(x) = δ_n(x) = r_n(x)/f_θ(x) − 1 and

K_h(\delta) = \begin{cases} \left((\delta+1)^{A} - 1\right)/A & \text{if } \delta \neq -1, \\ -h & \text{if } \delta = -1. \end{cases}   (10)

Note that the estimating equation for the minimum penalized S-divergence estimator has the same form as that for the S-divergence case (Ghosh et al., 2016), except that the continuous function K(δ) has now been transformed to K_h(δ), which is discontinuous at the lower end-point δ = −1.
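In computational terms, the penalization only changes how the empty cells enter the objective. The following minimal sketch (again a hedged illustration of our own, under the Poisson model) modifies the earlier MSDE routine accordingly; with h = 1/A and A > 0 it reproduces the ordinary MSDE.

```python
import numpy as np
from math import lgamma
from scipy.optimize import minimize_scalar

def poisson_pmf(x, theta):
    x = np.asarray(x, float)
    return np.exp(x * np.log(theta) - theta
                  - np.array([lgamma(v + 1.0) for v in x]))

def mpsde_poisson(sample, alpha, lam, h, xmax=None):
    """Minimum penalized S-divergence estimate (8) of the Poisson mean:
    empty cells contribute h * f_theta^{1+alpha} instead of the
    (possibly undefined) r_n^A term."""
    sample = np.asarray(sample)
    xmax = int(sample.max()) + 25 if xmax is None else xmax
    x = np.arange(xmax + 1)
    r = np.bincount(sample, minlength=xmax + 1) / len(sample)
    A = 1.0 + lam * (1.0 - alpha)
    B = alpha - lam * (1.0 - alpha)
    nz = r > 0                            # non-empty cells

    def objective(theta):
        f = poisson_pmf(x, theta)
        filled = (f[nz]**(1 + alpha) / A
                  - (1 + alpha) * f[nz]**B * r[nz]**A / (A * B)
                  + r[nz]**(1 + alpha) / B).sum()
        empty = h * (f[~nz]**(1 + alpha)).sum()
        return filled + empty

    return minimize_scalar(objective, bounds=(1e-3, 50.0), method="bounded").x

rng = np.random.default_rng(0)
sample = rng.poisson(3.0, size=10)
# Hellinger-type member (alpha = 0, lam = -0.5): natural weight is 1/A = 2,
# but a smaller h often works better in small samples (see Section 5).
theta_hat = mpsde_poisson(sample, alpha=0.0, lam=-0.5, h=0.5)
```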
However, due to this different structure of the functions K(δ) and K_h(δ), the asymptotic properties of the minimum penalized S-divergence estimators (MPSDE) cannot be obtained directly from those of the MSDEs. We rigorously derive the asymptotics of the MPSDEs in the next section.

4 Asymptotic Properties of the MPSDE under Discrete Models
Consider the set-up of discrete models as described above. Note that the estimating equations of the MPSDE and the MSDE differ only in terms of the functions K(δ) and K_h(δ). We will follow the approach of Ghosh (2015) in establishing the asymptotic properties of the minimum divergence estimators, while clearly indicating the modifications required to extend the proof to the penalized version. Intuitively, this difference between the two estimating equations vanishes asymptotically under the true model, because the set of points x with r_n(x) = 0 should converge to a null set under the true distribution. Hence the asymptotic distribution of the MPSDEs may intuitively be expected to be the same as that of the MSDEs, so that there is no loss of asymptotic efficiency in using the penalized S-divergence over the S-divergence. The theorem below establishes this with a concrete proof, which extends the proof for the ordinary S-divergences (Theorem 1, Ghosh, 2015) to the present case of penalized S-divergences.

Let us start with some useful lemmas. Along with the notation of Ghosh (2015), consider the definitions a_n(x) = K(δ_n(x)) − K(δ_g(x)) and b_n(x) = (δ_n(x) − δ_g(x)) K′(δ_g(x)), where δ_g(x) = g(x)/f_θ(x) − 1. Further assume that Conditions (SA1)–(SA7) of Ghosh (2015) hold. The following two lemmas then help us to obtain the asymptotic distribution of

S^{*}_{1n} = \sqrt{n} \sum_{x: r_n(x) \neq 0} a_n(x) f_\theta^{1+\alpha}(x) u_\theta(x) \quad \text{and} \quad S^{*}_{2n} = \sqrt{n} \sum_{x: r_n(x) \neq 0} b_n(x) f_\theta^{1+\alpha}(x) u_\theta(x).
Lemma 4.1. Assume that Condition (SA5) holds. Then E|S^{*}_{1n} − S^{*}_{2n}| → 0 as n → ∞, and hence S^{*}_{1n} − S^{*}_{2n} →_P 0 as n → ∞.

Proof. Following the same lines as the proof of Lemma 3 in Ghosh (2015), we get

E|S^{*}_{1n} - S^{*}_{2n}| \leq \beta \sum_{x: r_n(x) \neq 0} g^{1/2}(x) f_\theta^{1+\alpha}(x) |u_\theta(x)| \leq \beta \sum_{x} g^{1/2}(x) f_\theta^{1+\alpha}(x) |u_\theta(x)| < \infty \quad \text{(by assumption (SA5))},

for some finite constant β. The proof then follows using the dominated convergence theorem (DCT) and the Markov inequality.
Lemma 4.2. Suppose the matrix V_g, as defined in Lemma 4 of Ghosh (2015), is finite. Then S^{*}_{2n} →_D N(0, V_g).

Proof. Note that

S^{*}_{2n} = \sqrt{n} \sum_{x: r_n(x) \neq 0} (\delta_n(x) - \delta_g(x)) K'(\delta_g(x)) f_\theta^{1+\alpha}(x) u_\theta(x)
 = \sqrt{n} \sum_{x} (r_n(x) - g(x)) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) + \sqrt{n} \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x)
 = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} \left[ K'(\delta_g(X_i)) f_\theta^{\alpha}(X_i) u_\theta(X_i) - E_g\{ K'(\delta_g(X)) f_\theta^{\alpha}(X) u_\theta(X) \} \right] \right) + \sqrt{n} \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x).   (11)

Now, the first term above converges in distribution to N(0, V_g). For the second term, we will show that

\sqrt{n} \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) \to_P 0.   (12)

Note that

\sqrt{n} \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) = \sqrt{n} \sum_{x} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) I(r_n(x)),   (13)

where I(y) = 1 if y = 0 and I(y) = 0 otherwise. Thus, since E_g[I(r_n(x))] = P(r_n(x) = 0) = \{1 - g(x)\}^n,

E_g \left[ \sqrt{n} \left| \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) \right| \right] \leq \sum_{x} \left| g^{1/2}(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) \right| \left[ \sqrt{n}\, g^{1/2}(x) \{1 - g(x)\}^n \right] \leq C \sum_{x} g^{1/2}(x) f_\theta^{\alpha}(x) |u_\theta(x)| \left[ \sqrt{n}\, g^{1/2}(x) \{1 - g(x)\}^n \right].

The last inequality follows from Assumption (SA7) and the strong law of large numbers (SLLN), under which

|K'(\delta)| = |(\delta + 1)^{A-1}| < C   (14)

for some finite constant C. Further, for any fixed x with 0 < g(x) < 1, we have \sqrt{n}\, g^{1/2}(x)\{1 - g(x)\}^n \to 0 as n \to \infty, and the maximum of \sqrt{n}\, t^{1/2}(1-t)^n over 0 < t < 1 is bounded above by 1/\sqrt{2}.
Hence, by assumption (SA5) and the DCT, it follows that

E_g \left[ \sqrt{n} \left| \sum_{x: r_n(x) = 0} g(x) K'(\delta_g(x)) f_\theta^{\alpha}(x) u_\theta(x) \right| \right] \to 0.

Then, by the Markov inequality, the second term in (11) goes to zero in probability as n → ∞, and so S^{*}_{2n} →_D N(0, V_g). Combining this with the previous lemma, we have the required result.
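For completeness, the elementary calculus behind the 1/√2 bound used in the proof above (our added verification) is:

\[
\frac{d}{dt}\log\left(\sqrt{n}\,t^{1/2}(1-t)^{n}\right) = \frac{1}{2t} - \frac{n}{1-t} = 0
\quad\Longrightarrow\quad t^{*} = \frac{1}{2n+1},
\]
\[
\max_{0<t<1}\sqrt{n}\,t^{1/2}(1-t)^{n}
 = \sqrt{\frac{n}{2n+1}}\left(\frac{2n}{2n+1}\right)^{n}
 \le \sqrt{\frac{n}{2n+1}} \le \frac{1}{\sqrt{2}}.
\]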
Theorem 4.3. Under Assumptions (SA1)–(SA7) of Ghosh (2015), there exists a consistent sequence θ̂_n of roots of the minimum penalized S-divergence estimating equation (9). Moreover, the asymptotic distribution of √n(θ̂_n − θ_g) is p-dimensional normal with mean 0 and variance J_g^{−1} V_g J_g^{−1}, where J_g and V_g are as defined in Ghosh (2015).

Proof. Consistency: Following the proof of Theorem 1 of Ghosh (2015), let us consider the behavior of
PSD^h_{(α,λ)}(r_n, f_θ) on a sphere Q_a of radius a and center at θ_g. We wish to show that, for sufficiently small a,

PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta) > PSD^{h}_{(\alpha,\lambda)}(r_n, f_{\theta_g}) \quad \text{for all } \theta \text{ on the surface of } Q_a,   (15)

with probability tending to one, so that the penalized S-divergence has a local minimum with respect to θ in the interior of Q_a. At a local minimum the estimating equations must be satisfied; therefore, for any sufficiently small a > 0, the minimum penalized S-divergence estimating equations have a solution θ_n within Q_a with probability tending to one as n → ∞.

Now, taking a Taylor series expansion of PSD^h_{(α,λ)}(r_n, f_θ) about θ = θ_g, we get

PSD^{h}_{(\alpha,\lambda)}(r_n, f_{\theta_g}) - PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)
 = - \sum_{j} (\theta_j - \theta_{gj}) \nabla_j PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)\big|_{\theta=\theta_g}
 - \frac{1}{2} \sum_{j,k} (\theta_j - \theta_{gj})(\theta_k - \theta_{gk}) \nabla_{jk} PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)\big|_{\theta=\theta_g}
 - \frac{1}{6} \sum_{j,k,l} (\theta_j - \theta_{gj})(\theta_k - \theta_{gk})(\theta_l - \theta_{gl}) \nabla_{jkl} PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)\big|_{\theta=\theta^*}
 = S_1 + S_2 + S_3, \ (\text{say}),

where θ^* lies between θ_g and θ.

For the linear term S_1, we consider

\nabla_j PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)\big|_{\theta=\theta_g} = -(1+\alpha) \sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) + h(1+\alpha) \sum_{x: r_n(x) = 0} f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x),   (16)

where δ_{gn}(x) is δ_n(x) evaluated at θ = θ_g. We now show that

\sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) \to_P \sum_{x} K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x)   (17)

as n → ∞, where δ_{gg}(x) is δ_g(x) evaluated at θ = θ_g; note that the right-hand side above is zero by the definition of the best fitting parameter θ_g. A one-term Taylor series expansion yields

\left| \sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) - \sum_{x} K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) \right|
 \leq \left| \sum_{x: r_n(x) \neq 0} \left[ K(\delta_{gn}(x)) - K(\delta_{gg}(x)) \right] f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) \right| + \sum_{x: r_n(x) = 0} \left| K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) \right|
 \leq C \sum_{x} |\delta_{gn}(x) - \delta_{gg}(x)| f_{\theta_g}^{1+\alpha}(x) |u_{j\theta_g}(x)| + \sum_{x: r_n(x) = 0} \left| K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) \right| \quad \text{(by Equation (14))}.

But it was proved in Theorem 1 of Ghosh (2015) that, as n → ∞,

\sum_{x} |\delta_{gn}(x) - \delta_{gg}(x)| f_{\theta_g}^{1+\alpha}(x) |u_{j\theta_g}(x)| \to_P 0.

Further, along the lines of the proof of the convergence result (12) given in Lemma 4.2, one can show that, as n → ∞,

\sum_{x: r_n(x) = 0} |K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x)| = o_p(n^{-1/2}), \quad \text{and} \quad \sum_{x: r_n(x) = 0} f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) = o_p(n^{-1/2}).

Combining these, we get \nabla_j PSD^{h}_{(\alpha,\lambda)}(r_n, f_\theta)\big|_{\theta=\theta_g} \to_P 0 as n → ∞.
Thus, with probability tending to one, |S_1| < p a^3, where p is the dimension of θ and a is the radius of Q_a. By a similar extension of the proof of Theorem 1 of Ghosh (2015), we can show that there exist positive constants c and b such that, for sufficiently small a, we have S_2 < −c a^2 with probability tending to one and |S_3| < b a^3 on the sphere Q_a with probability tending to one. This implies that (15) holds, completing the proof of the consistency part.

Asymptotic Normality:
For the asymptotic normality, let us rewrite the estimating equation of the MPSDE in (9) as

\sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) - h \sum_{x: r_n(x) = 0} f_{\theta_g}^{1+\alpha}(x) u_{j\theta_g}(x) = 0.   (18)

Now the second term on the left-hand side (LHS) of Equation (18) converges to zero in probability, as proved above in the consistency part; indeed, it is o_p(n^{−1/2}). We expand the first term on the LHS of (18) in a Taylor series about θ = θ_g to get

\sum_{x: r_n(x) \neq 0} K(\delta_n(x)) f_\theta^{1+\alpha}(x) u_\theta(x) = \sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{\theta_g}(x)
 + \sum_{k} (\theta_k - \theta_{gk}) \left[ \nabla_k \sum_{x: r_n(x) \neq 0} K(\delta_n(x)) f_\theta^{1+\alpha}(x) u_\theta(x) \right]_{\theta=\theta_g}
 + \frac{1}{2} \sum_{k,l} (\theta_k - \theta_{gk})(\theta_l - \theta_{gl}) \left[ \nabla_{kl} \sum_{x: r_n(x) \neq 0} K(\delta_n(x)) f_\theta^{1+\alpha}(x) u_\theta(x) \right]_{\theta=\theta'},   (19)

where θ′ lies between θ and θ_g. Now let θ_n be the solution of the minimum PSD estimating equation, which exists and is consistent by the previous part. Replacing θ by θ_n in Equation (19), its LHS becomes o_p(n^{−1/2}) by (18), yielding

-\sqrt{n} \sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{\theta_g}(x) + o_p(1)
 = \sqrt{n} \sum_{k} (\theta_{nk} - \theta_{gk}) \times \left[ \nabla_k \sum_{x: r_n(x) \neq 0} K(\delta_n(x)) f_\theta^{1+\alpha}(x) u_\theta(x)\Big|_{\theta=\theta_g} + \frac{1}{2} \sum_{l} (\theta_{nl} - \theta_{gl}) \nabla_{kl} \sum_{x: r_n(x) \neq 0} K(\delta_n(x)) f_\theta^{1+\alpha}(x) u_\theta(x)\Big|_{\theta=\theta'} \right].   (20)

Note that the first term within the bracketed quantity on the RHS of Equation (20) converges to J_g with probability tending to one (by a proof similar to that of Lemma 4.2), while the second bracketed term is o_p(1) (as shown in the proof of the consistency part). Also, since \sum_x K(\delta_{gg}(x)) f_{\theta_g}^{1+\alpha}(x) u_{\theta_g}(x) = 0 and the empty cell part is o_p(n^{−1/2}), using Lemma 4.2 we get

\sqrt{n} \sum_{x: r_n(x) \neq 0} K(\delta_{gn}(x)) f_{\theta_g}^{1+\alpha}(x) u_{\theta_g}(x) = \sqrt{n} \sum_{x: r_n(x) \neq 0} \left[ K(\delta_{gn}(x)) - K(\delta_{gg}(x)) \right] f_{\theta_g}^{1+\alpha}(x) u_{\theta_g}(x) + o_p(1) = S^{*}_{1n}\big|_{\theta=\theta_g} + o_p(1) \to_D N_p(0, V_g).   (21)

Therefore, the theorem follows by Lemma 4.1 of Lehmann (1983).

In the particular case when the true distribution G belongs to the model family with G = F_θ for some θ ∈ Θ, we have θ_g = θ, and √n(θ_n − θ) asymptotically has the distribution N_p(0, J^{−1} V J^{−1}), where J = M_α and V = M_{2α} − N_α N_α^T, with

M_t = \int u_\theta(x) u_\theta^T(x) f_\theta^{1+t}(x)\, dx, \qquad N_\alpha = \int u_\theta(x) f_\theta^{1+\alpha}(x)\, dx.

Note that, as in the case of the minimum S-divergence estimators, the penalized version also has an asymptotic distribution independent of the parameter λ under the model family.

We have therefore observed that the first order asymptotic properties of the minimum penalized S-divergence estimator at the model family are exactly the same as those of the minimum S-divergence estimator. This implies that the first order influence functions of the two estimators are also the same, so that the robustness properties of the two estimators are expected to be equivalent. The minimum penalized S-divergence estimators thus generalize the minimum S-divergence estimators with no loss in asymptotic efficiency and no degradation in robustness, while providing the extra facility of inlier correction at small sample sizes. Intuitively, it is quite clear that the performance of the penalized divergences, in terms of their ability to successfully handle the inliers and empty cells in a small sample, will depend on the choice of the penalty factor h.
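As a quick numerical check of the common asymptotic variance of the MSDE and the MPSDE at the model (a sketch of our own; the truncation point is an arbitrary assumption), the p = 1 Poisson case can be computed directly from J = M_α and V = M_{2α} − N_α²:

```python
import numpy as np
from math import lgamma

def poisson_pmf(x, theta):
    x = np.asarray(x, float)
    return np.exp(x * np.log(theta) - theta
                  - np.array([lgamma(v + 1.0) for v in x]))

def asymptotic_variance(theta, alpha, xmax=300):
    """J^{-1} V J^{-1} for the Poisson(theta) model (sums truncated at xmax);
    by Theorem 4.3 it depends on alpha but not on lambda or h."""
    x = np.arange(xmax + 1)
    f = poisson_pmf(x, theta)
    u = x / theta - 1.0                      # Poisson score function
    J = np.sum(u**2 * f**(1.0 + alpha))      # M_alpha
    V = np.sum(u**2 * f**(1.0 + 2.0 * alpha)) - np.sum(u * f**(1.0 + alpha))**2
    return V / J**2

print(asymptotic_variance(3.0, alpha=0.0))  # = 3.0, the usual MLE variance
print(asymptotic_variance(3.0, alpha=0.5))  # slightly larger: small efficiency loss
```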
In the next section, we examine this characteristic of the minimum penalized S-divergence estimators through an extensive simulation study under the Poisson model with small sample sizes.

5 Simulation Study: The Choice of the Penalty Factor h

In the previous section we have seen that the asymptotic properties of the MPSDEs are the same as those of the MSDEs. However, due to the special nature of their construction, the small sample properties of the MPSDEs are often significantly different; we expect that the MPSDEs will have substantially superior performance in the presence of inliers and empty cells, provided the penalty factor h, which has a crucial role in determining the small sample performance of the MPSDE, is chosen carefully. In this section, we empirically examine the small sample performance of the proposed penalized S-divergence estimators under the Poisson model family.
20) froma Poisson distribution with mean θ and compute the minimum penalized S -divergence estimatorof the parameter θ under the Poisson model for each given sample. The sample generation processis replicated 1000 times to generate the empirical MSE of the MPSDEs; we then compare theperformance of the MPSDEs based on the empirical MSE for different values of tuning parameters α , λ and h . We will restrict ourselves only to the range of positive h , which eliminates the possibilityof a negative divergence. Further, since the small sample properties usually depend on the meanof the Poisson model, we will consider several values of θ ranging from 3 to 9. For brevity inpresentation, we will only present some interesting cases in Figures 2 to 4.It is easy to note from the figures that the pattern of the MSE over different values of the penaltyfactor h and Poisson mean θ generally have a similar nature for all the sample sizes n = 10 ,
15 and20. In all the cases, the MSE increases with increasing θ and decreases slightly as the sample sizeincreases. However its behavior for different values of λ and α varies significantly and so does theoptimum range for the values of the penalty factor h generating the minimum MSE for differentmembers of the S -divergence family. Further, although it is not discernible from these graphs13 a) α = 0, λ = 0 (b) α = 0 . λ = 0 (c) α = 0 . λ = 0 (d) α = 0 . λ = 0(e) α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − Figure 2: MSE of (cid:98) θ α,λ over θ and h for different λ and α at sample size n = 10.14 a) α = 0, λ = 0 (b) α = 0 . λ = 0 (c) α = 0 . λ = 0 (d) α = 0 . λ = 0(e) α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − Figure 3: MSE of (cid:98) θ α,λ over θ and h for different λ and α at sample size n = 15.15 a) α = 0, λ = 0 (b) α = 0 . λ = 0 (c) α = 0 . λ = 0 (d) α = 0 . λ = 0(e) α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − α = 0, λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0 . λ = − . α = 0, λ = − α = 0 . λ = − α = 0 . λ = − α = 0 . λ = − Figure 4: MSE of (cid:98) θ α,λ over θ and h for different λ and α at sample size n = 20.16hemselves, the minimum value of the MSE of the MPSDE (over the different choices of of thepenalty factor h ) for any ( α , λ ) combination is generally much smaller compared to that of thecorresponding MSDE. This illustrates that a suitably chosen penalized version of the MSDE canlargely solve the problem of inliers and empty cells in small sample sizes. And in almost all thesituations studied, the optimal penalty factor is in the range [0 , . h . And in the overwhelming majority of the cases,the MSE surface shows very little change over h ∈ [0 . , .
Table 2: Optimum values of the penalty factor h for sample size n = 10 and different values of (α, λ), along with the true empty cell weight 1/A.

α      λ     1/A    θ=3    θ=4    θ=5    θ=6    θ=7    θ=8    θ=9
0      0     1      0.9    0.8    0.6    0.7    0.7    0.7    0.7
0.1    0     1      0.9    0.9    0.8    0.8    0.7    0.7    0.6
0.25   0     1      1      1      0.7    0.8    0.6    0.5    1
0.5    0     1      1      0.8    0.5    0.6    0.8    0.6    1.1
The optimum value of the penalty factor h evidently depends on the tuning parameters (α, λ), and possibly also on the sample size and the Poisson parameter θ. In Tables 2–4, we have presented the optimum values of the penalty factor h for different θ, α and λ at n = 10, 15 and 20, along with the true empty cell weight 1/A. In the following we give a structured description of what we observe in the tables.

• For the maximum likelihood estimator (corresponding to α = 0, λ = 0), the impact of the penalty factor is not substantial, and the optimal h is usually close to the natural factor 1/A (equal to 1 in this case).

• The findings are similar for the (α > 0, λ = 0) cases, although larger α and n usually require a somewhat larger h.

Table 3: Optimum values of the penalty factor h for sample size n = 15 and different values of (α, λ), along with the true empty cell weight 1/A.

α      λ     1/A    θ=3    θ=4    θ=5    θ=6    θ=7    θ=8    θ=9
0      0     1      0.8    1      0.9    1.1    0.8    0.8    0.8
0.1    0     1      0.8    0.9    0.9    0.9    0.8    0.9    0.9
0.25   0     1      1.3    1.2    1.1    1      0.7    0.6    0.9
0.5    0     1      1.2    0.9    1.1    1.4    1.5    0.9    0.9
• For the Hellinger distance (α = 0, λ = −0.5) there is significant improvement in the MSE due to the imposition of the penalty. The optimum h varies between 0.5 and 0.1 (the natural value is 2). Although we have not presented the corresponding values in our tables or figures, the improvement is even more spectacular for larger (in absolute magnitude) negative values of λ.

• A pattern similar to the one described in the last item is observed for α > 0 and λ = −0.5, with a larger optimal h as n increases.

• In general, smaller sample sizes require greater shrinking of the penalty factor towards zero. Very small sample sizes (such as 10) require penalty factors closer to 0.5, while for moderate to larger sample sizes a factor closer to 1 (or slightly higher) is more appropriate.

• In practically all the cases that we have investigated, the optimal penalty factor is smaller (in absolute magnitude) than the natural penalty weight.

• As the true mean parameter increases, the optimal penalty factor needed is, in general, smaller. This is because in this case there are a larger number of possibly inlying cells with non-negligible probabilities.

Table 4: Optimum values of the penalty factor h for sample size n = 20 and different values of (α, λ), along with the true empty cell weight 1/A.

α      λ     1/A    θ=3    θ=4    θ=5    θ=6    θ=7    θ=8    θ=9
0      0     1      1      1      0.9    0.9    0.9    0.9    0.9
0.1    0     1      1      1      1.2    1.1    1      1.2    0.9
0.25   0     1      1.4    1      1.4    1      1      1      1
0.5    0     1      1.5    1.4    1.3    1.5    1      0.9    1.1
Overall, the optimum value of h thus depends on the true parameter θ, the sample size n and the tuning parameters (α, λ). A completely automatic, case specific recommendation for the penalty factor to choose in a given situation may require additional research. However, as an overall recommendation, we observe that in general h ∈ [0.5, 1] works well in practically all situations, with the specific choice being guided, to the extent possible, by the discussion above.
As we have already observed, the MSE of the MPSDE does not vary appreciably in the interval [0.5, 1], so that h = 0.5 or h = 1 is often a "close to optimal" choice. Based on this observation, we next study the relative increase in MSE (and hence the loss in efficiency) of the estimators, for each of the cases considered previously, for the simpler choices h* = 0.5 and h* = 1 relative to the optimum choice of h, say h_opt; we define this measure as

RI = \frac{MSE(h^*) - MSE(h_{opt})}{MSE(h_{opt})}.

Figure 5 plots these measures of relative increase in MSE for the sample sizes n = 10 and 20; the results for n = 15 are similar, with MSE values in between these two, and are hence omitted to save space.

Figure 5: Relative increase (RI) in MSE due to the use of the simple choice h* over the optimum choice, for different λ and α, at sample sizes n = 10, 20 [solid line: α = 0; dotted line: α = 0.1; dashed line: α = 0.25; dash-dotted line: α = 0.5; panels for λ = 0, −0.5, −1, −1.5, −2 and h* = 0.5, 1].
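As a purely illustrative calculation of the RI measure (the numbers here are hypothetical, not from our simulations): if MSE(h*) = 0.66 and MSE(h_opt) = 0.60, then

RI = \frac{0.66 - 0.60}{0.60} = 0.10,

i.e., a 10% relative increase in MSE, or equivalently a 10% loss in efficiency, from using the simple choice h*.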
It can be clearly observed from the figure that, when λ = 0, the natural weight 1 (= 1/A) works well as the simple choice, while the choice of the penalty factor is more important for λ < 0. For λ < 0, in most cases, we generally incur no more than a 10% relative increase in MSE while using h* = 0.5 for n = 10 and h* = 1 for n = 20. These choices, therefore, give practitioners guidance for a simple primary application, which can be refined at a later stage through a more detailed exploration of the role of h. The cases where we have a larger percentage of relative increase correspond to very small values of the mean square error, both penalized and ordinary.

Table 5: Parameter estimates for the Drosophila data under different methods, along with the true empty cell weight 1/A.

λ     α      1/A    MSDE(a)   MPSDE(h = 0.5)   MPSDE(h = 1)
0     0      1      3.0588    3.3873           3.0588
0     0.1    1      0.3917    0.3998           0.3917
0     0.25   1      0.3858    0.3905           0.3858
0     0.5    1      0.3747    0.3763           0.3747

(a) The entry is '–' when the S-divergence is not defined due to the presence of empty cells.

6 Real Data Example

We now present a real life application of the proposed minimum penalized S-divergence estimators. We consider a segment of the Drosophila data (Woodruff et al., 1984), based on a chemical mutagenicity experiment. The dataset contains the number of daughters carrying a recessive lethal mutation on their X chromosome, among (roughly) 100 sampled daughter flies of each male Drosophila fly, when the males are exposed to a certain level of a chemical and mated with unexposed female flies in a particular (day 177) experimental run.
The observed frequencies of the male flies are r_n = (23, 7, 3, 1) at the values x = (0, 1, 2, 91) of the number of recessive lethal daughters; all other values of x have frequency zero. Clearly there is one large outlier in the data (at 91) and plenty of empty cells. The dataset can be modeled nicely by a Poisson model except for the outlying point, as described in Simpson (1987), Basu et al. (2011) and Ghosh (2015). The last paper presented the minimum S-divergence estimators of the Poisson mean parameter for these data, both with and without the outlier, where it was observed that the S-divergences with negative λ yield robust estimators. But some of these MSDEs (including the one based on the Hellinger distance) are substantially affected by the presence of empty cells in the data and hence differ significantly from the outlier deleted MLE (which is 0.3939). In fact, many robust members of the S-divergence family with λ < −1 are not even defined here due to the empty cells. In contrast, we can compute the minimum penalized S-divergence estimators of the Poisson parameter θ for this dataset at any value of the tuning parameters (α, λ). These MPSDEs are reported in Table 5 for the suggested simple choices h = 0.5 and h = 1, along with the corresponding MSDE whenever it is defined.
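The basic summary statistics quoted above are easy to verify, and the Table 5 entries can be reproduced (up to optimizer tolerance) with the hypothetical mpsde_poisson sketch of Section 3; a brief illustration of ours:

```python
import numpy as np

# Drosophila (day 177) data: frequencies at the observed counts;
# all other cells are empty.
x_obs = np.array([0, 1, 2, 91])
freq = np.array([23, 7, 3, 1])

mle_full = (freq * x_obs).sum() / freq.sum()                      # 104/34 = 3.0588
keep = x_obs != 91                                                # delete the outlier
mle_clean = (freq[keep] * x_obs[keep]).sum() / freq[keep].sum()   # 13/33 = 0.3939
print(mle_full, mle_clean)

# Raw sample for divergence-based fitting:
sample = np.repeat(x_obs, freq)
# est = mpsde_poisson(sample, alpha=0.1, lam=0.0, h=1.0)  # ~0.3917 (Table 5)
```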
In terms of the matching of the observed and expected frequencies (excluding the outlier), the estimates in the (0.38, 0.39) window appear to perform the best. However, the estimators with the natural penalty weight are sometimes shifted by a large amount from this region due to the empty cell effect.
A case in point is the (λ = −1, α = 0.1) combination, where the natural estimator is drastically affected by the empty cells, but the penalties put it in the desired zone. An even stronger effect of this phenomenon (not presented in Table 5) is observed for more negative values of λ at α = 0.

7 Conclusions

Many minimum divergence estimators, including those within the class of disparities and the class of S-divergences, have excellent robustness properties, but are often handicapped in small samples due to their poor inlier controlling properties, which may lead to substantially degraded model performance. An extreme form of inlier is the empty cell, and suitable empty cell corrections are useful in improving this small sample model performance. In the tradition of research on this topic, we believe that we have made some significant additions to the literature. Our achievements and recommendations are listed below.

• The divergences which have a natural empty cell penalty factor equal to ∞ cannot be ordinarily defined on an infinite sample space. With our penalized scheme there is no problem with their construction; our approach also fills in the theoretical convergence and distributional properties of such estimators, so far unavailable in the literature.

• Our approach generalizes the inlier controlling strategy beyond the class of disparities.

• We have provided actual (simulation based) figures of the optimal penalty factor for different values of the Poisson mean parameter θ, different (α, λ) combinations, and a few small to moderate sample sizes. While this is fairly detailed, a completely automatic, case specific recommendation for the penalty factor to choose in a given situation may require additional research. However, as an overall recommendation, we observe that in general h ∈ [0.5, 1] works well in practically all situations.

Finally, we propose a further intuitive extension of the penalty scheme, as follows; a brief computational sketch is given at the end of this discussion. One can use another tuning parameter β to replace the parameter α in the term corresponding to the empty cells, so that a modified penalized S-divergence measure can be defined as

PSD^{h,\beta}_{(\alpha,\lambda)}(r_n, f_\theta) = \sum_{x: r_n(x) \neq 0} \left[ \frac{1}{A} f_\theta^{1+\alpha}(x) - \frac{1+\alpha}{AB} f_\theta^{B}(x) r_n^{A}(x) + \frac{1}{B} r_n^{1+\alpha}(x) \right] + h \sum_{x: r_n(x) = 0} f_\theta^{1+\beta}(x).   (22)

One can again define the MPSDE based on this new definition (22) of the PSD, and it follows, along the same lines as the proof given in Section 4, that these modified MPSDEs also have the same asymptotic properties as the previous version given in Theorem 4.3. As the intuitive motivation suggests, one may possibly achieve better inlier control by varying both h and β simultaneously (for any given divergence with fixed α and λ), generating estimators with even smaller MSEs. For a brief illustration, in Figure 6 we have plotted the resulting optimum MSE (the minimum MSE over both h and β) along with the optimum MSE obtained under the formulation of (8) (the minimum MSE over h only) and the MSE obtained with the natural empty cell weight 1/A, for λ = −0.5 and different values of α, θ and n; the pattern is similar for other λ < 0.

Figure 6: Empirical MSEs of the MPSDEs with λ = −0.5 and different α, with the natural empty cell weight h = 1/A (solid line), with the optimum h obtained under the formulation of (8) (dotted line), and with the optimum h and β obtained under the formulation of (22) (dashed line), for different sample sizes n (10, 20) and model parameters θ (3, 5, 9).

We do not yet have any concrete recommendation regarding the optimum β parameter to use in a given situation.
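As with (8), the two-parameter form (22) is straightforward to implement; the following hedged sketch (our own construction, mirroring the earlier mpsde_poisson illustration) differs from it only in the exponent of the empty cell term:

```python
import numpy as np
from math import lgamma
from scipy.optimize import minimize_scalar

def poisson_pmf(x, theta):
    x = np.asarray(x, float)
    return np.exp(x * np.log(theta) - theta
                  - np.array([lgamma(v + 1.0) for v in x]))

def mpsde_poisson_hb(sample, alpha, lam, h, beta, xmax=None):
    """Minimizer of the two-parameter penalized S-divergence (22):
    empty cells are weighted by h and enter through f_theta^{1+beta}."""
    sample = np.asarray(sample)
    xmax = int(sample.max()) + 25 if xmax is None else xmax
    x = np.arange(xmax + 1)
    r = np.bincount(sample, minlength=xmax + 1) / len(sample)
    A = 1.0 + lam * (1.0 - alpha)
    B = alpha - lam * (1.0 - alpha)
    nz = r > 0

    def objective(theta):
        f = poisson_pmf(x, theta)
        filled = (f[nz]**(1 + alpha) / A
                  - (1 + alpha) * f[nz]**B * r[nz]**A / (A * B)
                  + r[nz]**(1 + alpha) / B).sum()
        return filled + h * (f[~nz]**(1 + beta)).sum()   # only this term changes

    return minimize_scalar(objective, bounds=(1e-3, 50.0), method="bounded").x
```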
However, Figure 6 gives ample evidence of the fact that there is the possibility of further improving the small sample performance of the MSDEs, and it may be worthwhile to further pursue the role of the β parameter.

As a final word, we point out that, as all modifications involving the h and β parameters relate to the inliers, the improvement that is obtained in either case is achieved without compromising the outlier stability properties of the divergence. This is not just a technical observation; we have noticed it repeatedly in our simulations. However, we have not actually included tables in the paper that illustrate the robust behavior of the MPSDEs, as our interest here is in improving small sample model efficiency, rather than exploring the robustness of the estimators.

References
Basu, A., I. R. Harris, N. L. Hjort, and M. C. Jones (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549–559.

Basu, A., S. Sarkar, and A. N. Vidyashankar (1997). Minimum negative exponential disparity estimation in parametric models. Journal of Statistical Planning and Inference 58, 349–370.

Basu, A., H. Shioya, and C. Park (2011). Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC.

Beran, R. J. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5, 445–463.

Bhandari, S., A. Basu, and S. Sarkar (2006). Robust inference in parametric models using the family of generalized negative exponential disparities. Australian and New Zealand Journal of Statistics 48, 95–114.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7, 200–217.

Burbea, J. and C. R. Rao (1982). Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. Journal of Multivariate Analysis 12, 575–596.

Cressie, N. and T. R. C. Read (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society B 46, 440–464.

Csiszár, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 3, 85–107.

Ghosh, A. (2015). Asymptotic properties of minimum S-divergence estimator for discrete models. Sankhya A: The Indian Journal of Statistics 77, 380–407.

Ghosh, A. and A. Basu (2015). The minimum S-divergence estimator under continuous models: the Basu–Lindsay approach. Statistical Papers, doi: 10.1007/s00362-…

Ghosh, A., I. R. Harris, A. Maji, A. Basu, and L. Pardo (2016). A generalized divergence for statistical inference. Bernoulli, to appear. Pre-print: Technical Report BIRU/2013/3, Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, Kolkata, India.

Ghosh, A., A. Maji, and A. Basu (2013). Robust inference based on divergences in reliability systems. In Applied Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference (I. Frenkel, A. Karagrigoriou, A. Lisnianski and A. Kleyner, Eds.), dedicated to the centennial of the birth of Boris Gnedenko. Wiley, New York, USA.

Jones, M. C., N. L. Hjort, I. R. Harris, and A. Basu (2001). A comparison of related density based minimum divergence estimators. Biometrika 88, 865–873.

Kullback, S. and R. A. Leibler (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.

Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley & Sons.

Lindsay, B. G. (1994). Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Annals of Statistics 22, 1081–1114.

Mandal, A. and A. Basu (2013). Minimum distance estimation: Improved efficiency through inlier modification. Computational Statistics & Data Analysis 64, 71–86.

Mandal, A., A. Basu, and L. Pardo (2010). Minimum Hellinger distance inference and the empty cell penalty: Asymptotic results. Sankhya A 72, 376–406.

Patra, S., A. Maji, A. Basu, and L. Pardo (2013). The power divergence and the density power divergence families: the mathematical connection. Sankhya B 75, 16–28.

Read, T. R. C. and N. Cressie (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. New York, USA: Springer-Verlag.

Simpson, D. G. (1987). Minimum Hellinger distance estimation for the analysis of count data. Journal of the American Statistical Association 82, 802–807.

Simpson, D. G. (1989). Hellinger deviance test: efficiency, breakdown points, and examples. Journal of the American Statistical Association 84, 107–113.

Tamura, R. N. and D. D. Boos (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association 81, 223–229.

Woodruff, R. C., J. M. Mason, R. Valencia, and A. Zimmering (1984). Chemical mutagenesis testing in Drosophila, I: Comparison of positive and negative control data for sex-linked recessive lethal mutations and reciprocal translocations in three laboratories.